Overview of Annotations

The RUEG corpus is a multi-layer corpus of both written and spoken language. Several formats are used during annotation, but all annotations except the dependency annotations are stored in the EXMARaLDA file in the exb directory. In addition to the editable EXMARaLDA format, the corpus is converted to the ANNIS format (annis directory) for search and visualization.

Dependencies between annotation layers

Most annotation layers depend on other annotations. This can lead to complex dependencies, as visualized by the following graph:

[Figure: Graph of the dependencies between the annotations]

Meta data fields

In addition to the annotation layers, each document also has meta data fields, which are stored in the .meta file next to each EXMARaLDA file. The meta data is also included in the ANNIS format.

| field name | type | description |
|---|---|---|
| speaker-id | String | |
| formality | String | informal/formal |
| mode | String | spoken/written |
| speaker-bilingual | Boolean | yes/no |
| elicitation-session | Number | 1 (monolinguals, bilinguals in first session); 2 (bilinguals in second session) |
| elicitation-language | String | Language that is elicited from the speaker |
| elicitation-country | String | |
| elicitation-order | Number | 1-8 |
| elicitator-good-id | String | project- and people-number of "good cop" |
| elicitator-bad-id | String | project- and people-number of "bad cop" |
| elicitation-date | String | 2018-XX-XX |
| transcriber-id | String | comma-separated list of project- and person-number XX-XX |
| normalizer-id | String | comma-separated list of project- and person-number XX-XX |
| annotator-id | String | comma-separated list of project- and person-number XX-XX |
| speaker-language-s | String | Languages as given by the participants, separated by commas |
| speaker-age-group | String | children/adolescents/adults |
| speaker-gender | String | m/f/d |
| speaker-age | Number | Age in years (two-digit number) |
| speaker-AoO | Number | Age of Onset in years (two digits) |
| speaker-AoO-answer | Number | complete, but anonymized answer string |
| speaker-personality-score-X | Number | Personality score (1-7) for each of the questions 1-6 of the personality test |
| speaker-extravert-score | Number | aggregated extravert score |
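
Several of the fields above have small, closed value sets, so they can be checked mechanically. The following is a minimal sketch, assuming the meta data has already been loaded into a Python dictionary of strings; the on-disk layout of the .meta file is not described here, so no parser is shown. The names ALLOWED_VALUES and check_meta are hypothetical and not part of the corpus tooling, and treating a missing field as a problem is an assumption.

```python
# Hypothetical helper, not part of the RUEG tooling: checks a few meta data
# fields against the value sets listed in the meta data table.
ALLOWED_VALUES = {
    "formality": {"informal", "formal"},
    "mode": {"spoken", "written"},
    "speaker-bilingual": {"yes", "no"},
    "elicitation-session": {"1", "2"},
    "elicitation-order": {str(n) for n in range(1, 9)},  # 1-8
    "speaker-age-group": {"children", "adolescents", "adults"},
    "speaker-gender": {"m", "f", "d"},
}


def check_meta(meta):
    """Return a list of human-readable problems found in a meta data dict."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        if field not in meta:
            # Assumption: every field listed above is expected to be present.
            problems.append(f"missing field: {field}")
        elif meta[field] not in allowed:
            problems.append(f"unexpected value for {field}: {meta[field]!r}")
    return problems


example = {
    "formality": "formal",
    "mode": "spoken",
    "speaker-bilingual": "yes",
    "elicitation-session": "1",
    "elicitation-order": "9",  # deliberately out of range
    "speaker-age-group": "adults",
    "speaker-gender": "f",
}
print(check_meta(example))
```

Only fields with an explicitly listed value set are checked; free-text fields such as speaker-id are left alone.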

cu (Communication Unit)

Value set: open

Segmentation and transcription of Communication Units. For spoken data, the start and the end of each CU are manually aligned with the audio.

See the transcription guidelines for details.

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | manual | Transcription | exb |

dipl (Tokenization)

Value set: open

Automatic tokenization of the text into words.

  • as defined by the TreeTagger tokenization script
  • extra handling for emojis and pauses

Language-specific differences

  • language specific abbreviations

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | automatic | TreeTagger | exb |
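
To illustrate what "extra handling" on top of a generic tokenizer can look like, here is a simplified regex-based sketch. It is not the TreeTagger tokenization script and does not reproduce its rules. The function name tokenize, the emoji ranges, and the pause marker "(-)" are assumptions made for the example; the actual pause notation is defined in the transcription guidelines.

```python
import re

# Illustrative emoji ranges only; real emoji coverage is wider.
EMOJI = r"[\U0001F300-\U0001FAFF\u2600-\u27BF]"


def tokenize(text, abbreviations=(), pause_markers=()):
    """Split text into tokens, keeping protected items in one piece."""
    # Protected items are tried first, longest first, so that an abbreviation
    # or a pause marker is not broken up by the generic rules below.
    protected = sorted(set(abbreviations) | set(pause_markers),
                       key=len, reverse=True)
    alternatives = [re.escape(item) for item in protected]
    # Generic rules: a single emoji, a run of word characters, or one
    # punctuation character.
    alternatives += [EMOJI, r"\w+", r"[^\w\s]"]
    return re.findall("|".join(alternatives), text)


tokens = tokenize(
    "I saw it, e.g. today 😀 (-) ok!",
    abbreviations=["e.g."],   # language-specific list, here for English
    pause_markers=["(-)"],    # hypothetical pause notation
)
print(tokens)
```

Passing the abbreviation list as a parameter mirrors the language-specific differences noted above: the generic rules stay the same and only the protected items change per language.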

norm (corpus-wide normalization)

Value set: open

A common normalization that is the same for written and spoken data. This allows a search across registers.

  • segmented into graphemic words
  • emojis are a single word
  • text message acronyms are treated as a single word
  • punctuation is considered a token if not part of an emoji
  • following standard orthography
  • no word order corrections
  • no grammatical corrections

Language-specific differences

  • script is normalized to language standard
  • each language decided on
    • orthographic standard
    • clitics
    • script

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | automatic | Copy base text | exb |
| 2 | manual | Normalize | exb |

lemma (Lemmatization)

Value set: open

Lemmatization based on the normalization (norm).

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | automatic | lemmatization (part of the POS-tagging) | exb |
| 2 | manual | correction | exb |

pos (Universal part of speech)

Value set: closed

Part of speech annotation using the Universal POS tags.

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | automatic | Automatic POS tagging | exb |
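
Because the value set of this layer is closed, tagger output can be checked against it. The sketch below lists the 17 tags of the Universal POS tag set and reports any value outside it; the names UPOS_TAGS and invalid_pos_tags are hypothetical and not part of the corpus tooling.

```python
# The 17 Universal POS tags defined by Universal Dependencies.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def invalid_pos_tags(tags):
    """Return (position, tag) pairs for every tag outside the closed set."""
    return [(i, tag) for i, tag in enumerate(tags) if tag not in UPOS_TAGS]


# "PUNKT" is a deliberate typo to show what gets reported.
print(invalid_pos_tags(["PRON", "VERB", "DET", "NOUN", "PUNKT"]))
```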

pos_lang (Language specific Part of speech)

Value set: closed

Part of speech annotation with a tag-set for each language.

  • there is one common tag-set for each language
  • text message acronyms get their own tag manually (or if the tagger supports it, automatically)

A different tag set is used for each language:

| language | tag set | reference |
|---|---|---|
| English | British National Corpus / Claws 4 | Leech et al. 1994 [1] |
| German | STTS 2.0 | Westpfahl 2014 [2] |
| Russian | MyStem tag set | Segalovich 2003 [3] |
| Turkish | MULTILIT tag set | Schroeder et al. 2015 [4] |

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | automatic | Automatic POS tagging with tool | exb |
| 2 | manual | correction | exb |

language (Language/Foreign Material)

Value set: closed

Describes the language of each token, which allows foreign material to be marked.

  • per-token
  • ISO three letter language code
  • every token has this category assigned
  • no dialects

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | automatic | Fill out default language | exb |
| 2 | manual | Mark foreign material | exb |
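
The two steps can be pictured as a default fill followed by targeted overrides. This is a minimal sketch under that reading, not the project's actual tooling; the function name fill_language and the index-based override format are assumptions, and the codes deu and eng are used only for illustration.

```python
def fill_language(tokens, default, foreign=None):
    """Assign a language code to every token.

    tokens:  list of token strings
    default: three-letter code given to every token in step 1
    foreign: optional dict mapping token index -> code, standing in for the
             manual marking of foreign material in step 2
    """
    foreign = foreign or {}
    codes = [default] * len(tokens)       # step 1: default language everywhere
    for index, code in foreign.items():   # step 2: override foreign material
        codes[index] = code
    return list(zip(tokens, codes))


annotated = fill_language(["das", "war", "so", "nice"],
                          default="deu", foreign={3: "eng"})
print(annotated)
```

Filling the default first guarantees that every token carries a value, so the manual step only has to touch the exceptions.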

message (Chat Message span)

Value set: natural numbers

Span annotation for each message in the chat. Contains its consecutive number.

line (Chat Message line)

Value set: open

Span annotation with the chat message text as content.

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | automatic | | exb |
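
As an illustration of how the message and line spans relate to the tokens, the sketch below derives both from a list of chat messages. It is a simplified model, not the conversion used for the corpus: tokens are split on whitespace, numbering is assumed to start at 1, and the function name chat_spans and the dictionary keys start and end are hypothetical.

```python
def chat_spans(messages, first_number=1):
    """Return one span per chat message.

    Each span covers the token offsets [start, end) of its message and
    carries the consecutive message number and the message text.
    """
    spans = []
    offset = 0
    for number, text in enumerate(messages, start=first_number):
        n_tokens = len(text.split())  # assumption: whitespace tokenization
        spans.append({
            "start": offset,
            "end": offset + n_tokens,
            "message": number,   # value of the message layer
            "line": text,        # value of the line layer
        })
        offset += n_tokens
    return spans


spans = chat_spans(["hi there", "how are you ?"])
print(spans)
```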

dep (Universal Dependencies)

Value set: closed

Automatic Universal Dependency parsing.
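
The processing steps list CoNLL as the output format of this layer. Assuming the files follow the CoNLL-U layout commonly used for Universal Dependencies (ten tab-separated columns per token, blank lines between sentences, comment lines starting with "#"), a minimal reader could look like the sketch below. The function name read_conllu and the sample sentence are made up for illustration and do not come from the corpus.

```python
# Column names of the CoNLL-U format, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue  # sentence-level comment, ignored in this sketch
        else:
            current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "# text = She sleeps.\n"
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
for token in read_conllu(sample)[0]:
    print(token["form"], token["head"], token["deprel"])
```

All values are kept as strings; a head of "0" marks the root of the sentence.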

Processing steps

| # | type | step | output format |
|---|---|---|---|
| 1 | automatic | UD Parsing | CoNLL |

[1] Leech, Geoffrey, Roger Garside, and Michael Bryant. 1994. “CLAWS4: The Tagging of the British National Corpus.” In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics. Vol. 1.

[2] Westpfahl, Swantje. 2014. “STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data.” In Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop, 1–10.

[3] Segalovich, Ilya. 2003. “A Fast Morphological Algorithm with Unknown Word Guessing Induced by a Dictionary for a Web Search Engine.” In MLMTA, 273–80. Citeseer.

[4] Schroeder, Christoph, Christin Schellhardt, Mehmet-Ali Akinci, Meral Dollnick, Ginesa Dux, Esin Işil Gülbeyaz, Anne Jähnert, et al. 2015. “MULTILIT.” Universität Potsdam. https://publishup.uni-potsdam.de/opus4-ubp/frontdoor/index/index/docId/8039.