Overview of Annotations
The RUEG corpus is a multi-layer corpus of both written and spoken language.
We use several annotation formats in the process of annotation, but all annotations, except for the dependency annotations, are part of the
EXMARaLDA file in the exb
directory.
In addition to the editable EXMARaLDA format, the corpus is also converted to the ANNIS format (annis
directory) for search and visualization.
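The .exb files are XML in EXMARaLDA's basic-transcription format, so they can be inspected with standard tools. The following minimal Python sketch lists the annotation tiers and their events; the file name is hypothetical, and the element and attribute names should be verified against your own files.

```python
# A minimal sketch of reading tiers and events from an EXMARaLDA .exb file;
# element and attribute names follow the basic-transcription format,
# but verify them against your own exports.
import xml.etree.ElementTree as ET

tree = ET.parse("example.exb")          # hypothetical file name
for tier in tree.iter("tier"):
    category = tier.get("category")     # e.g. "cu", "norm", "pos"
    events = [(e.get("start"), e.get("end"), e.text) for e in tier.iter("event")]
    print(category, len(events), "events")
```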
Dependencies between annotation layers
Most annotation layers depend on other annotations. This can lead to complex dependencies, as visualized by the following graph:
Meta data fields
In addition to the annotation layers, each document also has meta data fields, which are stored in the .meta
file next to each EXMARaLDA file.
The meta data is also included in the ANNIS format.
field name | type | description |
---|---|---|
speaker-id | String | |
formality | String | informal/formal |
mode | String | spoken/written |
speaker-bilingual | Boolean | yes/no |
elicitation-session | Number | 1 (monolinguals, bilinguals in first session) 2 (bilinguals in second session) |
elicitation-language | String | Language that is elicited from the speaker |
elicitation-country | String | |
elicitation-order | Number | 1-8 |
elicitator-good-id | String | project- and people-number of "good cop" |
elicitator-bad-id | String | project- and people-number of "bad cop" |
elicitation-date | String | 2018-XX-XX |
transcriber-id | String | comma-separated list of project- and person-number XX-XX |
normalizer-id | String | comma-separated list of project- and person-number XX-XX |
annotator-id | String | comma-separated list of project- and person-number XX-XX |
speaker-language-s | String | Languages as given by the participants and separated by comma |
speaker-age-group | String | children/adolescents/adults |
speaker-gender | String | m/f/d |
speaker-age | Number | two-digit number (age in years) |
speaker-AoO | Number | Age Of Onset in years (two-digits) |
speaker-AoO-answer | Number | complete, but anonymized answer string |
speaker-personality-score-X | Number | Personality score (1-7) for each of the questions 1-6 of the personality test |
speaker-extravert-score | Number | aggregated extravert score |
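As an illustration of how these fields can be consumed downstream, here is a small sketch that reads a .meta file into a dictionary and checks a few mandatory fields. The key = value line format and the file name are assumptions; adjust them to the actual layout of the .meta files.

```python
# Sketch: read a RUEG .meta file into a dict and check mandatory fields.
# Assumes one "key = value" pair per line; adjust if the real files differ.
from pathlib import Path

def read_meta(path):
    meta = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        meta[key.strip()] = value.strip()
    return meta

required = {"speaker-id", "formality", "mode", "speaker-bilingual"}
meta = read_meta("DEbi02FG_fsD.meta")   # hypothetical file name
missing = required - meta.keys()
if missing:
    print("missing meta fields:", sorted(missing))
```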
cu (Communication Unit)
Value set: open
Segmentation and transcription of Communication Units. For spoken data, the start and the end of the CUs are manually aligned with the audio.
See the transcription guidelines for details.
Processing steps
# | type | step | output format |
---|---|---|---|
1 | manual | Transcription | exb |
dipl (Tokenization)
Value set: open
Automatic tokenization of the text into words.
- as defined by the TreeTagger tokenization script
- extra handling for emojis and pauses
Language-specific differences
- language-specific abbreviations
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | TreeTagger | exb |
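The actual tokenization is done by the TreeTagger tokenization script; the sketch below only illustrates the kind of extra handling mentioned above (pauses and emoticons kept as single tokens) with a plain regular expression, and is not the script used in the pipeline.

```python
import re

# Illustrative tokenizer: keeps pause markers like (-), (--), (5.5) and simple
# emoticons such as :) as single tokens; = : % ' are allowed inside words
# (merged/lengthened forms), and a trailing / marks an unfinished word.
PAUSE = r"\((?:-{1,2}|\d+(?:\.\d+)?)\)"
EMOTICON = r"[:;=]-?[()DPp]|-_-"
WORD = r"\w+(?:[=%:']\w+)*/?"
TOKEN = re.compile(f"{PAUSE}|{EMOTICON}|{WORD}|\\S")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("the bl/ blue car (-) crashed :)"))
# ['the', 'bl/', 'blue', 'car', '(-)', 'crashed', ':)']
```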
norm (corpus-wide normalization)
Value set: open
A common normalization that is the same for written and spoken data. This allows a search across registers.
- segmented into graphemic words
- emojis are a single word
- text message acronyms are treated as a single word
- punctuation is considered a token if not part of an emoji
- following standard orthography
- no word order corrections
- no grammatical corrections
Language-specific differences
- script is normalized to language standard
- each language decided on
- orthographic standard
- clitics
- script
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | Copy base text | exb |
2 | manual | Normalize | exb |
lemma (Lemmatization)
Value set: open
Lemmatization based on the normalization (norm).
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | lemmatization (part of the POS-tagging) | exb |
2 | manual | correction | exb |
pos (Universal part of speech)
Value set: closed
Part of speech annotation using the Universal POS tags.
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | Automatic POS tagging | exb |
pos_lang (Language specific Part of speech)
Value set: closed
Part of speech annotation with a tag-set for each language.
- there is one common tag-set for each language
- text message acronyms get their own tag manually (or if the tagger supports it, automatically)
Different tagsets are used for each language:
language | tag set | reference |
---|---|---|
English | British National Corpus / Claws 4 | Leech et al. 1994 |
German | STTS 2.0 | Westpfahl 2014 |
Russian | MyStem tag set | Segalovich 2003 |
Turkish | MULTILIT tag set | Schroeder et al. 2015 |
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | Automatic POS tagging with tool | exb |
2 | manual | correction | exb |
language (Language/Foreign Material)
Value set: closed
Describes the language.
- per-token
- ISO three letter language code
- every token has this category assigned
- no dialects
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | Fill out default language | exb |
2 | manual | Mark foreign material | exb |
message (Chat Message span)
Value set: natural numbers
Span annotation for each message in the chat. Contains its consecutive number.
line (Chat Message line)
Value set: open
Span annotation with the chat message text as content.
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | | exb |
dep (Universal Dependencies)
Value set: closed
Automatic Universal Dependency parsing.
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | UD Parsing | CoNLL |
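Since the dependency layer is delivered in CoNLL format rather than in the exb file, it can be read with any CoNLL-U reader. A small sketch using the third-party conllu package (the file name is hypothetical, and the export is assumed to be standard CoNLL-U):

```python
# Sketch: inspect the automatically parsed dependency layer,
# assuming a standard CoNLL-U export (pip install conllu).
from conllu import parse

with open("DEbi02FG_fsD.conllu", encoding="utf-8") as f:   # hypothetical file name
    sentences = parse(f.read())

for sentence in sentences:
    for token in sentence:
        print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```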
Leech, Geoffrey, Roger Garside, and Michael Bryant. 1994. “CLAWS4: The Tagging of the British National Corpus.” In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics. Vol. 1.
Westpfahl, Swantje. 2014. “STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data.” In Proceedings of Law Viii-the 8th Linguistic Annotation Workshop, 1–10.
Segalovich, Ilya. 2003. “A Fast Morphological Algorithm with Unknown Word Guessing Induced by a Dictionary for a Web Search Engine.” In MLMTA, 273–80. Citeseer.
Schroeder, Christoph, Christin Schellhardt, Mehmet-Ali Akinci, Meral Dollnick, Ginesa Dux, Esin Işil Gülbeyaz, Anne Jähnert, et al. 2015. “MULTILIT.” Universität Potsdam. https://publishup.uni-potsdam.de/opus4-ubp/frontdoor/index/index/docId/8039.
Annotation Step 1: Transcription
Anonymisation
Anonymisation on Speaker tier
- Replace name of participant with the respective speaker code, e.g. USbi02FR
- If whole names or surnames of friends are mentioned, replace with the participant code + _P, e.g. USbi02FR_P
COMMENT: To our knowledge, we anonymize all parts of the speaker name, including cases where only the first name is mentioned.
- Places that could lead to the identification of a participant, such as street names, school names, etc.:
If the participant names, for example, "Friedrichstraße", you transcribe "{streetname}Straßezzz". You leave out the actual name of the street; in this example you replace "Friedrich". "zzz" has to be replaced by any inflectional suffixes you hear in the data, or erased if there are no inflectional suffixes. In this example you would transcribe "{streetname}Straße", because there are no inflectional suffixes. Accordingly, if schools are named, for example "Apple Highschools" ("s" at the end for an English plural form), you transcribe "{schoolname}schools". There should be no spaces following the {...}. The placeholder you put in between the {...} is in English, regardless of the language you elicit and transcribe in. Over time, a list of these placeholders should develop, so that all identifiers can be replaced and put between the {...} ("{streetname}Allee"). This list is the same for every project and language.
Segmentation
- Communication Unit (CU) is used as a segmentation unit
- No punctuation marks at all on the transcription layer
- No accents, no intonation patterns are marked
- In EXMARaLDA: blank space at the end of each event (no punctuation marks on the norm layer)
Spelling
- No capital letters
- Abbreviations/acronyms are transcribed as full words in the phonology of the language heard in the recording (e.g. German BMW = beemwe, English BMW = beemdoubleyou)
- speaker codes need to be partly capitalized to follow their correct pattern
Transcription
Adapted from KiDKo 2014
'Unwanted' material
- 'unwanted' material consists of questions from participants concerning the procedure and possible responses from the elicitor
- first, figure out whether you can exclude this kind of data containing 'unwanted' material and repeat the elicitation
- If this is not possible, mark those passages as:
<Q> communication with elicitor </Q>
- they get an extra-event
Merged forms
- Merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements
- Examples: so=ne (= so eine)
Reduced syllables
- reduced syllables are transcribed as articulated
- Examples: goin (= going), bi tane (= bir tane), hab ich ein Tadel bekommen (= einen Tadel)
Elisions, repetitions and interruption
- Do not leave anything out and do not add anything which is not there!
- Use / to mark unfinished words, e.g. “The bl/ blue car crashes um stops“
- word internal cancellations/corrections are transcribed as follows: dipl: "be$ha$ come" (norm: "become")
- Onomatopoeias/echoisms are separated tokens (e.g. gutschi gutschi gutschi), only transcribed as one single token if they are very short (e.g. eieiei)
Pauses
- always measured to the first decimal
- 0.2 - 1 sec: (-)
- 1 - 3 secs: (--)
- More than 3 secs: (5.5) to be measured
- Word-internal pauses are marked as follows: be(-)have
- keep in mind that there might be persons who talk really slowly (it makes no sense to put a pause after every word/token)
- pauses inside a CU do not get an extra-event on the CU tier
- pauses which occur between two CUs get an extra-event on the CU tier
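The thresholds above can be summarized in a few lines; the sketch below is only a restatement of the rules, not a tool used in the project.

```python
# Pause notation: 0.2-1 s -> (-), 1-3 s -> (--), above 3 s -> measured value,
# always to the first decimal; below 0.2 s nothing is marked.
def pause_marker(seconds):
    if seconds < 0.2:
        return ""
    if seconds <= 1.0:
        return "(-)"
    if seconds <= 3.0:
        return "(--)"
    return f"({seconds:.1f})"

print(pause_marker(0.6), pause_marker(2.0), pause_marker(5.5))   # (-) (--) (5.5)
```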
Long vowels & consonants
- vowels realized longer than normal (0.2 - 2 secs) are marked with : (e.g. so lo:ng)
- vowels realized longer than 2 seconds are marked with :: (e.g. so lo::ng)
- also possible for consonants (e.g. mum:)
- doubling of vowel syllables with % (e.g. by%ye, tschü%üß)
Non-verbal material
- non-verbal events such as a participant laughing or coughing are noted in square brackets on the transcription tier, e.g. [laughing]
- if participants speak and laugh at the same time, you note it as: [[laughing]speech]
List of meta comments used in the RUEG project
- [coughing]
- [gulping]
- [laughing]
- [pfing] for a sound like "pfff"
- [sighing]
- [throatclearing]
- [tongueclicking], including tsking as disapproval, while thinking and just mouth opening with a click
- [whispering]
- [stuttering]
- [imitating], for when they imitate a sound related to the story (e.g., car crash)
- [sniffing]
Uninterpretable material
- uninterpretable material is to be marked as (UNK) on Speaker-tier
- longer than 2secs: (UNK, 2.1)
- assumed content is placed in between brackets, each token separated: (assumed) (content)
- if the uninterpretable material can be identified as belonging to a CU, there is no separate event on the CU tier for it
Hesitation markers / Interjections / Reception markers
- For every language, we define a set of hesitation markers/interjections/reception markers
- create a list with those markers
- If heritage speakers use particles from their ‘other’ language, we transcribe them as they sound, consistent with the procedure for foreign language material
Foreign language material
- Choose a spelling for each item following one of those options:
- transcribe phonographically (e.g. engl. like = germ. leik) OR
- use orthographic spelling of the "other" language
- Create a list where you document the spelling of each item in alphabetic order
- put the file name that includes the word and the time of the appearance in the list
- each time you encounter foreign language material in your data, check the list to guarantee a consistent form for those items
Proper/Brand names from "foreign language"
- Keep conventionalized spelling (e.g. Renault = renault)
- document your decisions, create a list with those items
- Language-specific decisions:
  - Russian: put it in the spelling and script of the actually spoken language to avoid loss of phonetic/morphological/syntactic information
  - Turkish and Greek: use the Latin alphabet and conventionalized spelling
Table of symbols
Symbols | Meaning |
---|---|
<Q> communication with elicitor </Q> | instances of questions concerning the procedure and/or verbal interventions of elicitators |
(-) | pauses up to 1sec |
(--) | pauses 1-3secs |
(3.2) | pauses longer than 3secs |
(UNK) | uninterpretable material |
(UNK, 2.2) | uninterpretable material longer than 2secs |
(assumption) | assumed material |
[...] | non-verbal material |
[[...]...] | non-verbal & verbal event |
: | unusually long vowel or consonant (under 2 secs) |
:: | unusually long vowel or consonant (longer than 2 secs) |
= | merged forms |
/ | interruption of a word |
$...$ | word internal cancellations |
% | doubled syllables |
{...} | specification of an anonymised place |
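The symbol table lends itself to simple consistency checks over the transcription tier. The sketch below is a hypothetical helper, not part of the official tooling; it only looks at one line of text.

```python
import re

# Rough well-formedness checks based on the symbol table above.
def check_line(line):
    problems = []
    if line.count("<Q>") != line.count("</Q>"):
        problems.append("unbalanced <Q>...</Q>")
    if line.count("[") != line.count("]"):
        problems.append("unbalanced square brackets")
    if line.count("$") % 2 != 0:
        problems.append("unpaired $...$ markers")
    if re.search(r"\(-{3,}\)", line):
        problems.append("pause with three or more dashes (should be measured, e.g. (3.2))")
    return problems

print(check_line("and then (---) the car [[laughing]crashed"))
# ['unbalanced square brackets', 'pause with three or more dashes (should be measured, e.g. (3.2))']
```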
Annotation Step 2: Normalization
Segmentations in our corpus
Our corpus pipeline faces four types of input for each speaker who participated in the elicitation:
- formal written (fw) text files
- informal written (iw) chat exports
- formal spoken (fs) transcriptions
- informal spoken (is) transcriptions
All data types are divided into above-word-level ("phrasal") segments:
- fw: sentences indicated by punctuation used by the participant
- iw: messages and lines indicated by separators used by the participant (punctuation, new line, message separation)
- fs and is: communication units (CUs) indicated by transcriber (some projects might use intonation phrases additionally)
We annotate our CUs on the annotation tier "cu" (in lowercase letters). These segments are then automatically tokenized (roughly speaking, divided into character sequences between whitespaces), which provides our first word-level segmentation: the diplomatic layer, consisting of graphemic words represented in the phrasal segments.
This word-level segmentation is then to be normalized to facilitate search and automatic annotation.
What is normalization in our corpus and what is its purpose
The result of normalization is an additional word-level segmentation layer, similar to dipl. Starting from the diplomatic tokenization, explicit rules are applied to obtain orthographically normalized tokens on the norm layer. Starting off as a copy of the diplomatic token layer, the normalized tokens can (but do not necessarily) undergo operations such as editing, deletion, splitting or merging.
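One way to picture these operations is as alignments between spans of dipl tokens and spans of norm tokens; the data structure below is only an illustration (the examples are taken from the normalization guidelines further down), not the representation used in EXMARaLDA or ANNIS.

```python
from dataclasses import dataclass

# Illustration: each record maps a span of diplomatic tokens to a
# (possibly empty) span of normalized tokens.
@dataclass
class NormAlignment:
    dipl: list   # diplomatic tokens (never changed)
    norm: list   # normalized tokens after edit / delete / split / merge

examples = [
    NormAlignment(["cAr"], ["car"]),                     # 1:1 edit
    NormAlignment(["(-)"], []),                          # deletion: pause not represented on norm
    NormAlignment(["isnt"], ["is", "not"]),              # split: clitic separation
    NormAlignment(["Auto", "Unfall"], ["Autounfall"]),   # merge: written compound
]
```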
Underlying principles and things to keep in mind
- Our normalization is not the last step of analysis and does not remove information from the overall corpus
- Normalization takes place mainly on word-level to obtain standardized forms w.r.t. an orthography of reference
- We will agree on an orthography of reference for each language in our corpus
- The normalized layer uses the language's native script, whereas the diplomatic layer sticks to the script used by the participants for the written registers (fw, iw)
- We do not normalize syntax. This way we are able to learn about an underlying grammar more easily, rather than analysing the elicited language in terms of a standard grammar. Note that this comes with disadvantages but is more appropriate for the idea of our projects. We want to obtain a layer that is orthographically normalized but still allows syntactic analyses. This is why we do not normalize syntax and try to keep as many elements as possible.
- Normalization is also a technically necessary process. Orthographically standardized forms help all researchers, including those not familiar with our data, to search for linguistic phenomena. Furthermore, annotations to be integrated in our corpus can be pre-generated automatically, and correction efforts are reduced.
- The highest priority is to ensure that we will always be able to answer our research questions.
- We do not alter lexical choice when normalizing.
- The norm layer will enable us to identify grammatical units.
- A normalization decision is always influenced by the question of how controversial and/or lossy in terms of information that decision might be.
- Since this is our first layer of normalization, we normalize very carefully.
- There will be no grammatical normalization (adaptation of case, gender or other grammatical features that deviate from our understanding of "standard")
- We normalize to a single standardized form (if possible) to reduce noise in the tokenization
- We always normalize as far as we can go, which means we are more careful in some cases than in others. As an example, 're stays 're in the normalization, because the tense is unclear, whereas 'll can be normalized to will, since it is entirely uncontroversial what the diplomatic form expresses.
- When normalizing, we try to be as uninterpretative as possible.
- 1:1 mappings from diplomatic to normalized tokens are easier to decide on, because the alteration of the form is easy to trace.
- The standardized representation should be influenced by the underlying orthography explicitly and implicitly, e. g. for German clitics we do not use an apostrophe in the separated token, whereas we do for English.
- Normalization should not alter meaning (if possible).
Orthography of reference
language | orthography of reference | reference |
---|---|---|
English | American English | Merriam-Webster |
German | „Neue deutsche Rechtschreibung“ | Amtliche Regelung der dt. Rechtschreibung, Duden |
Greek | Standard Modern Greek | Λεξικό της κοινής νεοελληνικής and David Holton, Peter Mackridge, Irene Philippaki-Warburton (1997) Greek: A Comprehensive Grammar of the Modern Language, London: Routledge |
Russian | современный русский литературный язык | Малый академический словарь (МАС) https://rus-academic-dict.slovaronline.com/ |
Turkish | Turkish / Modern Standard Turkish (ISO 639-3 = tur) | Türk Dil Kurumu |
Additional tasks
Annotate CUs for written data
For the written registers (iw, fw), we currently lack a syntactic unit. Furthermore, CUs as segmentation units were, amongst other reasons, chosen due to their applicability for spoken and written data. Having CUs in all types of texts facilitates cross-mode / cross-register comparisons.
To do this in EXMARaLDA, you first need to create the respective annotation tier. After having opened your file in EXMARaLDA, add a new tier through the menu bar (Tier > Add Tier). Choose the following settings:
- Speaker: dipl [dipl]
- Type: A(nnotation)
- Category: cu
Because the tier depends on the diplomatic tokenization, it has one time slot per event on dipl. To annotate a CU, merge all time slots in the CU tier that overlap the tokens (dipl) belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "cu" to the newly created span.
The CU tier does not explicitly contain any text. By overlapping its respective tokens, the entire text of a CU is already defined and can be searched.
Please do not edit the CU layer of the transcriptions. They are final.
Adapt language annotation
For each dipl token there is a language annotation. When preparing your data, this is automatically set to the language of elicitation. Nevertheless, in some cases you might have to adapt that value. Please use the ISO 639-2 language code. A list of valid codes can be found here.
For cancellations, pauses, emojis and non-verbal material, please delete the language annotation.
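If you want to sanity-check the codes you enter, the third-party pycountry package can be used; note that its alpha_3 field follows ISO 639-3, which coincides with the ISO 639-2 terminological codes for the project languages, but bibliographic variants such as ger would need separate handling. This is only a convenience sketch, not part of the annotation workflow.

```python
# Sketch: validate three-letter language codes with pycountry (pip install pycountry).
import pycountry

def is_valid_code(code):
    return pycountry.languages.get(alpha_3=code) is not None

for code in ["deu", "eng", "rus", "tur", "ell"]:
    print(code, is_valid_code(code))
```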
Normalization guidelines
The rules are always applied on the normalized tokenization. The diplomatic tokenization remains as is.
The following lists and examples are planned to be extended and updated with more examples, especially from Greek, Russian and Turkish.
Notation: Normalization rules are formulated as diplomatic token(s) → normalized token(s). Token boundaries are represented with a slash (/).
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comments / Rationale / Criticism |
---|---|---|---|---|---|
(CU-initial) capitalization | do not correct. Only adapt if a word itself has to be capitalized. Do not capitalize merely because of a CU-initial position | am / i / in / london → am / I / in / London | das / auto / konnte / nicht / mehr / bremsen → das / Auto / konnte / nicht / mehr / bremsen |
Note: Rule of thumb: Capitalize, if the word needs to be capitalized in the orthography of reference. Only introduce capitalization if it is the correct spelling independent of whether it appears in an initial position or not. Also, do not undo CU-initial capitalization if provided by writer.
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comments / Rationale / Criticism |
---|---|---|---|---|---|
orthographic errors or phonetic assimilations / slip of the tongue | normalize orthography | orthographie → orthography | hunt → Hund fümf → fünf | ~ | 1 |
orthographic errors or phonetic assimilations / slip of the tongue (within vocabulary) | correct to target hypothesis if meaning of corrected form is much more likely AND the normalization implies a change in category | I / was / their → I / was / there | ihr / seit / dort → ihr / seid / dort Umfall → Unfall | ~ | 2 |
written data: composed forms | refer to above mentioned reference and normalize | inter-national → international caraccident → car / accident | Auto / Unfall → Autounfall |
This is the prototypical case of normalization.
Both corrections are much more likely, since it is very hard to derive any meaning from the uncorrected examples if one assumes they were intended as written. Additionally, since it is a 1:1 normalization in terms of words, the normalization is not particularly invasive. Last but not least, since the elicitations take place in a more or less closed setting, it is much easier to make a judgement on the likelihood of an expression in a case of potential within-vocabulary errors.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
missing words | do not add them, we do not normalize syntax | you / there / ? → you / there / ? | bin / unterwegs → bin / unterwegs |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
repetitions | keep, normalize orthography | the / the / cAr → the / the / car | ein / ein / Autounfall |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
self-corrections | see repetitions / cancellations | the / hatchback / (eh) / car → the / hatchback / car (filled pause deleted, rest remains) | ein / Typ / ein / Fußgänger / hat / dann remains as is, whereas ein / Fußg/ / ein / Typ → ein / ein / Typ, since Fußg/ was explicitly marked as cancelled by the transcriber |
Note: The proposed procedure for self-corrections is the least theory-driven, thus, avoids making any grammatical assumptions when normalizing. Nevertheless, or therefore, it also might be very controversial. It definitely does not facilitate later syntactic annotation.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
spoken data: cancellations | drop the incomplete element | car / acc/ → car | Autounf/ → |
Note: Sometimes it is difficult to tell whether something is a cancellation or not, because the incomplete part might already be a word on its own. In these cases we rely on the transcriber's decision: If something is marked as cancellation in the transcript, we stick to that interpretation. Nevertheless, that should not keep your project from revising transcriptions (given the current state of corpus creation still allows for such revisions).
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
spoken data: phonetic markers | remove and normalize to orthography | a::nd → and | und / da::nn → und / dann |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
reduced forms | if reconstructable, normalize orthographically | ca → car | nich → nicht wolln → wollen |
Note: reduction has to be out of vocabulary. If not, check with the guidelines on within-vocabulary errors above.
Note: There might be (even in the dictionary) widely accepted variants of standardized writings that are more or less the result of reduction (or other processes). Examples are cross instead of across in English, grade instead of gerade in German. We need to focus on the goal of normalization: to have a text segmentation where orthographic variation is reduced / removed, i. e. where multiple expressions are mapped to one standard representation. Thus, as long as there is no argument that normalizing an item alters meaning, these reductions should be normalized. We only deal with mappings from one token to one token, i. e. retrieving the originally chosen expression is very easy and we do not lose information. A counterexample is the case of drauf in German. As a pronominal adverb it is often a reduction of darauf and should be normalized as such. Nevertheless, it can also be the separated particle of the particle verb draufgehen (to die, fig.), and normalizing it to darauf would not be appropriate/"correct". When normalizing, we will most likely encounter much more subtle cases of difference in meaning. We need to discuss and document our decisions.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
colloquial forms, dialect | normalize | coz → because rulz → rules cha → you | nix → nichts ick → ich jetze → jetzt itzt → jetzt |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
merged forms | split into normalized items as long as both items are visibly represented in the merged form | wanna → want / to gonna → going / to | kannste → kannst / du |
Note on the English examples: what / cha / gonna / do → what / you / going / to / do
Note on the German example: kannst / mir / mal / helfen remains as is, since there is no overt material for du without assuming a reduction of kannst, i. e. this is a case of missing material, not a merge (see above)
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
merges / clitics | separate and normalize, if possible | isnt → is / n't withe / ball → with / e / ball he's / guilty → he / 's / guilty | auf=m → auf / m binsch → bin / ich |
Note: We normalize a clitic if there is no ambiguity about the standard form. If the standard form cannot be determined unambiguously, the clitic is represented as a clitic, but in a standardized way (see table below for examples).
English clitics
We always separate the clitics, i. e. divide a single diplomatic token into two norm tokens. See the following list of examples:
Diplomatic | Normalized |
---|---|
n't | not |
'll | will |
's | 's |
're | 're |
isn't or isnt | is / not |
aren't or arent (or even arnt) | are / not |
cannot or can't or cant | can / not |
won't or wont | will / not |
didn't or didnt | did / not |
don't or dont | do / not |
doesn't or doesnt (or even dosnt) | does / not |
ain't or aint | ai / not |
couldn't or couldnt | could / not |
shouldn't or shouldnt | should / not |
wouldn't or wouldnt | would / not |
she'll | she / will |
he's | he / 's |
you're | you / 're |
Thomas' / computer | Thomas / ' / computer |
Anke's / mail or Ankes / mail | Anke / 's / mail |
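Because the separation is purely lookup-based for the unambiguous cases, it can be sketched as a small mapping built from the examples above (the table is illustrative and not exhaustive):

```python
# Sketch: lookup-based clitic separation for English, built from the examples above.
CLITIC_MAP = {
    "isn't": ["is", "not"], "isnt": ["is", "not"],
    "aren't": ["are", "not"], "arent": ["are", "not"], "arnt": ["are", "not"],
    "cannot": ["can", "not"], "can't": ["can", "not"], "cant": ["can", "not"],
    "won't": ["will", "not"], "wont": ["will", "not"],
    "don't": ["do", "not"], "dont": ["do", "not"],
    "she'll": ["she", "will"],
    "he's": ["he", "'s"],      # ambiguous 's stays a clitic
    "you're": ["you", "'re"],  # ambiguous 're stays a clitic
}

def normalize_clitics(token):
    """Return the normalized token sequence for one diplomatic token."""
    return CLITIC_MAP.get(token.lower(), [token])

print(normalize_clitics("isnt"))    # ['is', 'not']
print(normalize_clitics("he's"))    # ['he', "'s"]
```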
German clitics
Separate clitics analogous to English clitics, but do not use an apostrophe.
Greek and Russian clitics
Represent them according to the chosen orthographic standard.
Turkish clitics
Concatenated morphological elements can remain a unit, since this is part of the language. If you prefer to separate them anyway, make sure you work consistently.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
lexicalized merged forms | + do not normalize + work with list? + refer to orthography reference? | Rock'n'Roll → Rock'n'Roll | im → im zum → zum beim → beim ins → ins |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
non-standard variation within a paradigm | normalize | ~ | helf / mir → hilf / mir |
Note: Non-standardized but maybe widely established and accepted forms can be understood as alternative spellings. For alternative spellings our rules guide us to choose a standard and always use that to guarantee consistency. Therefore, normalizing to the true standard form is a good way to go. It is also a safe procedure, since we are dealing with single word to single word mappings, i. e. the original diplomatic form is always retrievable.
Special material / cases
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
speaker codes | do not normalize |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
foreign material / proper names in a foreign language | normalize according to agreement | leik → like |
Note: The transcription of foreign material is always phonographic. The normalization sticks to the conventions of the respective language. The phonographic leik (dipl) might be like (en) on the German norm layer, whereas for a Russian speaker who says Autowerkstatt in German in a Russian context, both the dipl and the norm form will be phonographic representations (аутоверкштат).
Also: Remember to adapt the language annotation for these cases.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
Abbreviations | (merge and) normalize according to standard orthography | pekawe → PKW |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
cardinal numbers | normalize according to regulations in orthography of reference |
Note: In the written data, it is enough to correct the spelling of numeral words; there is no need to convert actual numbers to words or vice versa.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
floating point numbers | normalize as numbers | one / point / three → 1.3 | eins / Komma / drei → 1,3 |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
times and dates | are a composition of elements for which there are normalization rules to be applied individually |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
fractions | normalize/keep as words | one / third | ein / Drittel |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
pauses, noise, turn-holders, laughing, ... | not represented, i. e. delete token |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
filled pause vs. particle, etc. | If you decide to consider certain filled pauses lexical units, decide on a standardized representation and represent that unit always the same way in the normalization layer. If you are currently undecided on the status of such an element and this is actually part of your research, delete that element on the normalization layer. |
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
interaction/communication with elicitor | not represented | ... | ... | There is a transcription standard for these events. These events will not be normalized; delete these events on the norm layer. |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
incomprehensible, no hypothesis (unknown material) | not represented, i. e. delete token on norm | (UNK) | (UNK) |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
incomprehensible with hypothesis | hypothesis is represented in a normalized version | (car) → car | (jetze) → jetzt |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
punctuation | do not add nor normalize, but separate on both dipl and norm | accident.Then / the → accident / . / Then / the | ,dass → , / dass |
NOTE: This is a phenomenon you may face in written data. Since there is no obligation for authors to stick to standard typing conventions and use spaces after punctuation, we treat cases like the above as tokenization errors caused by the pipeline. To remain within a certain frame of complexity, we cannot resolve these issues within automatic tokenization. Please perform the separation during normalization on both dipl and norm. Each element of punctuation should be its own token, unless it is a sequence of punctuation characters without whitespace in between. In these cases the characters remain a unit (e.g. "..." stays like this on dipl and norm).
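A small regex sketch of this separation rule (a hypothetical helper, to be applied only where punctuation is actually fused to a word; apostrophes inside clitics are covered by the clitic rules instead):

```python
import re

# Split punctuation fused to a word into separate tokens, but keep runs of
# punctuation without internal whitespace (e.g. "...") as one token.
SPLIT = re.compile(r"\w+|[^\w\s]+")

def separate_punctuation(token):
    return SPLIT.findall(token)

print(separate_punctuation("accident.Then"))   # ['accident', '.', 'Then']
print(separate_punctuation(",dass"))           # [',', 'dass']
print(separate_punctuation("..."))             # ['...']
```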
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
multiple representations allowed in orthography of reference | Make a decision, document it and stay consistent. | ~ | Albtraum vs. Alptraum | ~ | This also implies that if a participant uses the alternative form, you should normalize to the variant you chose as a standard in the corpus. |
Grammatical deviations
If there is a mismatch in case, gender or any other grammatical property/feature, do not normalize it. This has to remain open to investigation. You might, of course, find non-trivial cases between orthographic normalization and an actual grammatical target hypothesis. Document these cases and raise a discussion if necessary.
Annotation Step 3: POS and Lemma
Tools involved
- we prepare lemma, pos and feature annotations for you (automatically)
- you will correct these annotations in EXMARaLDA
- you will find the data in the same subfolder of
exb/
as during normalization
If you ever face the following error message when opening an EXMARaLDA file:
Tier ... is not stratified. Please choose a method for stratifying the tier:
Choose Stratify by deletion.
Lemmatization
- binds several occurrences within the corpus to a common type
- facilitates / enables searching with / for lexical items
- lemma: standardized form which might occur in different realizations / variants
- lemma usually (but not necessarily!) determined by paradigms, i. e.: inflected form → prototype = a single inflected form within the paradigm or the uninflected form = lemma
- lemmatization varies across annotation schemes and languages
- lemmatization is linked to part of speech
Part of Speech (POS)
- we would like to categorize several lemmata to more abstract categories
- factors for determining a category can be:
- syntactic / distributional
- semantic
- morphological
- graphical
- ...
Know the guidelines and document difficult cases
- part of speech is the baseline for many further annotations
- we need correct annotations
- we will measure agreement
- highly connected to lemmatization, thus subject to the same restrictions and parameters of variation
Rule of thumb in cases of doubt (Lemma & POS)
Always trust the guidelines more than your “grammatical intuition”, but in cases of doubt consult both.
Language-specific POS vs. UD-POS
- language specific: you might have to find new ways / rules for undescribed phenomena (Please document!)
- UD: strictly stick to UD guidelines for your language and please do not decide by what seems more logical to you
POS – Schemes by Language
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
British National Corpus Part of Speech Tagset | STTS 2.0 | Universal POS tags | MyStem Morphology | MULTILIT |
All languages
Universal Dependencies POS-tags, lemmas, and features
- https://universaldependencies.org/
- https://universaldependencies.org/u/feat/
- https://universaldependencies.org/u/pos/
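Pre-annotations of this kind can be generated with any off-the-shelf UD tool; the sketch below uses stanza purely as an example (it is not necessarily the tagger used in the RUEG pipeline) and feeds it the already tokenized norm layer.

```python
# Sketch: pre-generate UD POS tags, lemmas and features with stanza
# (pip install stanza); shown for English only.
import stanza

stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma",
                      tokenize_pretokenized=True)   # the norm layer is already tokenized

doc = nlp([["the", "car", "crashed", "into", "the", "tree"]])
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos, word.lemma, word.feats)
```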
POS – Who does what?
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
correct BNC-POS, lemma, features (?) | correct STTS-POS, lemma, features (?) | correct UD-POS, UD features, lemma | correct MyStem-POS, lemma, features, and UD-POS | correct MULTILIT-POS, lemma, features (?), and UD-POS (?) |
UD-POS by language
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
derivable | derivable | needs manual correction | needs manual correction (?) | derivable (?) |
Remarks
- delete tier norm [CU]
- feel free to move the tiers up and down
- rename pos_1_1 to pos_lang
- look out for errors in the data and report them immediately
General decisions for all languages:
- lemma "F16" is tagged as a proper noun with the respective tag from the specific tagset you are using (e.g. "PROPN" from Universal Postags)
- all kinds of greetings should be treated as interjections and tagged with the respective tag from the specific tagset you are using (e.g. "INTJ" from Universal Postags), unless there is a specific tag for greetings in the language specific tagset you are using
- regarding the lemmatization of informal greetings: you can just copy the word form from the norm/dipl-layer
Language Specific Decisions
Each language has additional transcription and annotation decisions. These are documented in the following sub-sections. We apologize for any parts of the documentation that are not yet available in English; these will be translated and included in future releases of the corpus.
Transcription Decisions English
Transcription decisions: English
Spelling lists for CU Layer.
1. Compounds with hyphens
- rear-end/rear-ended > rearend/rearended (if it is a verb, but if it is a noun, e.g. "rear end of the car", insert space)
2. Hesitation markers
- ɑ > ah
- ɛ > eh
- ɹ [syllabic] > er
- oʊ > oh
- ə > uh
- əm/ʌm > um
- em
- oh
- ha
- mm
3. Merged forms
- couldn't > couldn=t
- didn't > didn=t
- doesn't > doesn=t
- don't > don=t
- I'd > i=d
- I'm > i=m
- it'd > it=d
- shouldn't > shouldn=t
- that's > that=s
- there's > there=s
- they're > they=re
- Unclear determiner ["a" versus "the"] > (det) TBD
- wasn't > wasn=t
- we'd > we=d
- what's > what=s
- who's > who=s
- you're > you=re
- guy's ball -> guys, guy's (guy is) running -> guy=s
- The following are lexicalized forms that appear in the dictionary: gonna, kinda, wanna.
COMMENT: How do we handle ambiguous cases with determiners, e.g. with a vs. with the? Suggestion: with=e
4. Notations
- [coughing]
- [gulping]
- [laughing]
- [pfing] for a sound like "pfff"
- [tongueclicking], including tsking as disapproval, while thinking and just mouth opening with a click
- [throatclearing]
- [whispering]
- [sighing]
- [sniffing]
- [imitating] - when a person imitates a sound of a crash or other sounds related to the story
- [yawning]
5. Other
- basketball
- ɛf > ef (i.e., case number ef sixteen)
- soccer ball
- renault
- volkswagen
- aysap (for a.s.a.p. 'as soon as possible', if pronounced 'ay-sap', not 'ay ess ay pee')
6. Foreign language material
- hello - preevyet (привет)
- Olya (name), not Olia
English Normalization
Individual acronyms and letters are represented with lowercase letters.
- ef > f
- pe:em > pm
- be:em:doubleyou > bmw
- aysap > asap
Numbers less than twelve are written out, while numbers greater than twelve are represented as numerals.
- ef sixteen > F16 (one token)
- nine | one | one > 911 (one token)
- -_- emoticons to be left as one token
- consecutive punctuation such as "..." are to be left as one token, unless they are already separated on dipl layer.
- twelve > twelve
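The number rule can be stated compactly; the helper below is only an illustration of the threshold (mapping an integer to its normalized written form), not a script from the project.

```python
# Numbers up to twelve are written out; larger numbers become numerals.
SMALL = ["zero", "one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven", "twelve"]

def normalize_number(n):
    return SMALL[n] if 0 <= n <= 12 else str(n)

print(normalize_number(9), normalize_number(16))   # nine 16
```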
Foreign language material
- preevyet (hello) > privet
Abbreviations are to be written out in normalized form, even if they are in the dictionary
- sec > second
- prob > "probably" or "problem", depending on the context
Capitalizations
- I
- Renault, Volkswagen, Golf
Other
- rear end (if noun), rear-end (if adjective or verb) e.g. "rear-end collision", "rear-ended", "bumped into the rear end"
English POS and Lemma
BNC: Tag List
- Might be less compatible with American English spellings
- Greater number of tags for accuracy; the tags are highly specific, though not all are necessary for our purposes (i.e., four categories for punctuation). Regardless, researchers searching for broader categories in the corpus should be able to do so by filtering the data appropriately.
- Intuitive tag names
- Multiple codes for determiners
Decisions
- Hi/Hello/Hey : ITJ (Interjection)
- F16: NP0 (proper noun)
- I : PNP
- am -> be: VBB
- like: ITJ (interjection)
- okay (i.e. 'he is okay'): AJ0
- kind (of): AV0
- e (det): AT0
- same: AJ0
- as: CJS
- (in) front: PRP
- behind: PRP
- Police: NN0
- 911: NP0 (proper noun)
- no (AT0) one (PNI)
Transcription Decisions German
Basics
Transcription program EXMARaLDA:
There must be a blank space after each interval!
Tiers
- Speaker (named after the speaker code, e.g. DEmo09FD; type: transcription)
- Comment (communication among transcribers; type: description; information on this tier is removed later)
- Anonymisation (for marking events that are to be masked with noise; type: description)
Segmentation and accent marking
- Segmentation into Communication Units (CUs)
- No accent marking
- No punctuation marks
- Rule of thumb: independent clauses form a CU; dependent clauses form a CU together with their respective independent clause
Transcription
General
- Linguistic material relating to the task, such as questions by participants and possible interruptions by the elicitor, is marked as follows:
<Q> communication with elicitor </Q>
and gets its own event.
- Strict lower-case spelling
- Largely orthographic transcription according to German spelling rules
- BUT: transcribe as heard in cases that are NOT the general standard of spoken language! Do not omit or add anything.
- this includes e.g.: nich vs. nicht, kein vs. keinen, jetz vs. jetzt, n vs. ein, is vs. ist, ne vs. eine, ma vs. mal, was vs. etwas, brauch vs. braucht, rum vs. herum, ham vs. haben, isch vs. ich
- when the complete ending is dropped (-en, not just schwa), e.g. gesprung, angefang, dein
- dialectal, sociolectal and ethnolectal variation is written down (not overly precisely)
- Standard phenomena of spoken language that are NOT transcribed as heard but according to German orthography are:
- omission of unstressed syllable vowels / schwa deletion: always write orthographically (sehn = sehen, machn = machen); see open questions
- final devoicing (e.g. bald, not balt)
- word-final er (e.g. koffer, not koffa)
- word-final g as ch (e.g. König, not könich)
- ä remains ä (e.g. später, not speta)
- sch in the combinations sp/st is written as sp/st
- simplification of pf to f is not noted
- v and f are used according to the spelling rules
- diphthongs are written according to the spelling rules
- individual words: eigentlich, irgendwie
- the metalanguage is English (everything that is not produced by the speakers, e.g. non-verbal material)
A list of individual spelling decisions can be found here
Omissions/repetitions/interruptions
- Do not leave out anything that is spoken. Do not add anything that is not spoken.
- no apostrophes as omission marks
Onomatopoeia/echoisms:
- separate tokens (e.g. gutschi gutschi gutschi)
- unless they are very short, then as one token (e.g. eieiei)
Sequences, repetitions, interruptions
- transcribe as heard, do not leave anything out
- mark word interruptions with /
- word-internal cancellations are marked by enclosing the cancelled element in dollar signs, e.g. dipl: Kinder$gar$ wagen
Pronunciation variation
Merged forms/reduced forms/omissions
- merged forms (reduction + fast linking) are marked with an equals sign (e.g. is=er, ham=se)
- verb + personal pronoun with = (e.g. hättest=e, is=er)
- others with = (e.g. so=ne)
- preposition + article mergers are not specially marked but written as one word. We do not distinguish between different forms of preposition + article mergers. These include:
  - zum, beim, aufm, mitm, fürs, ins, ans etc.
  - doubled consonants at a syllable joint: vonner, inner, mitter etc.
  - BUT: auser, aufer, zuner etc.
- A list of spellings can be found here
Lengthening
- Orthographic spelling with lengthening h or vowel doubling is retained; the colons (see the following points) are added
- Lengthening with letter combinations: before the lengthening h (e.g. spä::ht), after the diphthong (e.g. polzei::), after the double letter (e.g. see:), after r-vocalization (e.g. über:)
- Longer than normal (0.2-2 sec) with a colon (e.g. so la:ng)
- Noticeably long (2 sec or more) with :: (e.g. so la::ng)
- Doubling of vocalic syllables with % (e.g. tschü%üss); can be combined with lengthening colons (e.g. tschü:%üss); the same applies to diphthongs (e.g. hei%ei)
- Drawn-out and long aspirated consonants are also marked with colons (as for vowels)
Complex compounds
- Generally written as one word (e.g. lehrerzimmer)
- Compounds of noun + adjective (saumüde), two adjectives (supertoll) or adjective + preposition (übergeil) and particle verbs (leidtun, weggehen) are transcribed as one token
- Abbreviations/words with a hyphen are not specially marked (e.g. "mathe emesa prüfung" for Mathe-MSA-Prüfung)
- Coordination of compounds WITHOUT a hyphen (e.g. mittag und abendessen)
Numbers and dates
- Write out in full
- Complex numbers are also written as one word (e.g. zweikommadrei)
- Fractions are written separately (e.g. zwei drittel)
- School grades are written separately (e.g. zwei minus)
- Dates are written separately (e.g. elfter dritter)
Abbreviations / acronyms
- One word (e.g. "beemwe" for BMW)
- Compounds of two abbreviations are not specially marked (e.g. "mercedes eselfka" for Mercedes-S11K)
Proper/brand names
Spelling is retained (e.g. renault)
Pauses
- transcribe on the speaker tier
- pauses between two CUs get their own event
- pauses within a CU are transcribed inside that CU, no separate event
- 0.2-1.0 seconds: (-)
- 1.1-3.0 seconds: (--)
- more than 3 seconds: enter the measured value in the parentheses, e.g. (5.5)
- word-internal pauses are marked the same way, e.g. auf(-)geschlossen (no spaces before or after the parentheses); exception: word-internal pause plus filled pause (ähm) with spaces, e.g. auf (-) äh geschlossen
Extra-linguistic/non-verbal actions
- non-verbal actions such as laughing or coughing are noted on the speaker tier in square brackets, e.g. [laughing]
- simultaneous laughing and speaking as follows: [[laughing]ball]
- CU assignment: as with hesitation markers, a non-verbal action is assigned to the following CU, e.g. "(-) | [laughing] (-) ähm ja das hat mir meine mama schon immer gesagt"
Noises/non-verbal material
- imitated noises are transcribed on the speaker tier (e.g. "dann hör ick nur so wuh")
- interruptions not tied to the speaker (e.g. loud traffic noise, computer beeping) are only noted as a pause; internal note on the comment tier
Unintelligible/hard to understand
- unintelligible material within a CU is transcribed inside the CU in parentheses, i.e. no separate interval is created: (UNK), or (UNK, 2.2) for more than 2 seconds
- if it is not clear which CU the utterance belongs to, a separate event is created
- if the content is assumed, put each lexeme in parentheses separately, e.g. (vermuteter) (Inhalt); assumptions refer to existing lexemes, i.e. if you only hear "ga", it must be marked either as a cancellation inside the parentheses, as "(ga/)", or as (UNK)
Foreign language material
- transcribe as heard in German orthography
- EXCEPT for English material, which is kept in English spelling
- AND: as in the Duden for "foreign language material" that is listed there (e.g. adieu)
- create a list with the spelling of foreign words and their occurrences so that they are always spelled the same way; this list can be found here
Anonymisation
Personal names
- Replace the speaker's name (first and last name) with the speaker code (e.g. DEmo02FD)
- A genitive s is attached to this code (e.g. DEmo02FDs)
- First names of mentioned persons who are not present do not have to be anonymized
- Last names or first plus last names of persons who are not present must be anonymized with speaker code_P (e.g. DEmo02FD_P)
Place references
- only anonymize if the place of residence could be inferred, e.g. when street names are mentioned that could correspond to the speaker's address
- school: {category}, e.g. Alexander-Puschkin-Schule as {schoolname}schule
- street: {streetname}straße
- if the place name carries inflection, e.g. plural, it is attached: {streetname}straßen
- unintelligible personal names without specifying the category
Particles/signals/interjections
Reception signals
- negation: 'hm'hm, nee, 'ä'ä
- affirmation/reception signal: 'hmhm
- reception signal: hm, mhmh
Hesitation signals (hesitation markers)
- fillers: äh, ähm, öh, öhm (even if produced with a glottal stop), hm (even if only m is produced)
Question particles (belong to the preceding CU)
- always: ne, gell
- sometimes: oder, ja, okay
Interjections
A list of all interjections that may occur can be found here and is continuously extended during the transcription process.
Questions / doubts / unclear points
Special characters on the transcription level
Symbols | Meaning |
---|---|
<Q>...</Q> | questions concerning the task |
(-) | pause 0.2-1.0 sec |
(--) | pause 1.1-3.0 sec |
(3.2) | pauses longer than 3 sec |
(UNK) | unintelligible material |
(UNK, 2.2) | unintelligible material longer than 2 sec |
(assumption) | assumed material |
[...] | non-verbal action |
[[...]...] | non-verbal action & verbal material |
: | noticeably long vowels/consonants (0.2-2 sec) |
:: | exceptionally long vowels/consonants (more than 2 sec) |
= | reduced forms with fast linking |
/ | word interruption |
$...$ | word-internal cancellation |
% | syllable doubling |
{...} | specification of an anonymized place |
' | glottal stop |
Footnotes:
Reasons: consistency, agreement, readability
Hyphen on the norm level
Disruptive noises that interrupt the flow of speech: do we need this information later, and where should it go?
Do we proceed the same way with universities? And should a list also be compiled for these places, specifying which abbreviations stand for what?
Only on the norm level
- ach
- ah
- ah ja
- ebend
- ey
- ha (articulated exhaling + sth)
- halt
- ja
- joa
- jo
- na ja
Interjections from the KiDKo guidelines:
- ach
- ah
- aha
- ey, e:y
- grr
- hossa
- ih, i:h
- lol
- na
- oh
- oah
- ohoh
- 'oh'oh
- tja
- uh
- ui
- ja doch
- ach so
- ach ja
- oh nein
Individual spelling decisions
Alphabetical list of individual spelling decisions
- baby (Duden spelling, when "bebi/be:bi" is produced)
- basecap (as in the Duden)
- beemwe
- bissn (when not bisschen but bissen or bissn is produced, e.g. DEbi02FG_fsD: 22.05)
- cap (as in the Duden) BUT käppi
- cops (as in the Duden)
- crash (as in the Duden)
- einskommafünf liter flasche (for 1,5l-Flasche)
- fauwe (for VW)
- genuch (if produced that way)
- ha u (for HU, i.e. the abbreviation of Humboldt-Universität)
- hey (Duden spelling, when "häi" is produced)
- hi (Duden spelling, when "hai" is produced)
- ick (Berlin dialect for ich)
- iks üpsilon straße (for xy straße)
- käppi
- limousine (even when pronounced limusine)
- nix (as in the Duden)
- pekawe (or rather pekawe, because a colon indicates a lengthening over 0.2 sec, which is usually not present in pekawe; if it is, then of course with :)
- pekawe fahrer (for PKW-Fahrer)
- revue passieren (as in the Duden)
- so was
- tschau (spelling recommended by the Duden)
- van (Van as in the Duden) (also minivan)
Alphabetical list of foreign language material (with occurrences)
English expressions are transcribed in English orthography:
- crashen (DEmo12MD_isD / 00:30, Ebi08MG_isD / 00:44), listed in the online Duden (accessed 15.04.2019, 9:57), value on lang: deu
- bystander (DEmo32FD_isD / 01:24), not listed in the online Duden (accessed 15.04.2019, 10:00), value on lang: eng
- dude (DEmo53FD_isD / 00:24), not listed in the online Duden (accessed 15.04.2019, 9:59), value on lang: eng
- plep [pleb, German: Prolo] (DEmo74MD_isD / 00:28), not listed in the online Duden (accessed 24.04.2019, 12:32), value on lang: eng
all others as heard:
- dawei (rus)
- vallah (ara)
other:
- kränk [from krank with English pronunciation] (DEmo71FD_isD / 01:35)
Merged forms
Preposition + article mergers without marking
- anner
- aufer
- aufm
- aufn
- auser
- ausm
- beim
- hinterm
- hinters
- im
- inner
- ins
- mitm
- mitn
- übern
- vonnem
- vonner
- vors
- zum
- zur
Without =, because they are listed in the Duden
- son, sone (for solch, solche), i.e. with a following noun in the plural
Spellings with =
- auch=n (auch ein)
- bist=e
- dass=es
- d=is (das ist)
- d=is=n (das ist ein)
- gab=s
- geht=s (geht es)
- gib=s
- grad=n
- hab=s
- ha=ich (hab ich)
- hat=ter (hat der / hat er)
- hat=n
- hat=s
- hättest=e
- ich=n
- ich=s
- is=er (ist er)
- is=es (ist es)
- is=n
- kennst=e (kennst du)
- muss=er
- noch=n
- nu=ma (nur mal)
- ob=s
- sa=ma (strongly contracted sag mal)
- so=m (so einem)
- so=n (so ein)
- so=ne (so eine)
- und=n
- un=an (und dann, produced entirely without a plosive)
- war=n
- war=s
- weil=et (weil es)
- wie=s
- wird=s (wird es)
- wollt=er (wollte er; written this way because it cannot be decided whether it is wollt er or wollte er)
- zwar=n
German Normalization
Basics
- Orthographic normalization
- Syntax remains as is
- Grammar is not changed, i.e. no normalization of gender or declension: "schön guten tag", "mit ein hund", "so ein klein hund"
- Cancellations and pauses are deleted and get an empty event (on the language level the value for cancellations and pauses is deleted --> empty event); EXCEPTION for word-internal cancellations (dipl: vorbeige$le$ rollt, norm: vorbeigerollt)
- Repetitions remain
- spoken: non-verbal material such as [laughing] is not carried over --> empty event
- Punctuation marks are not added
Hesitation markers / filled pauses
- Hesitation markers are uniformly normalized to "äh"; this covers äh, ähm, öh, hm etc.
dipl | norm |
---|---|
äh, öh, ähm, hm etc. | äh |
No lexical changes
- if the meaning clearly stays the same, define and document a standard, such as:
dipl | norm |
---|---|
aufgrund, auf Grund | aufgrund |
andren, anderen | anderen |
Dicker, Digger | Dicker (29.05.2019) |
etwas, was | etwas |
grad, grade, gerade | gerade |
gern, gerne | gerne |
habe, hab | habe |
hey, hi, hei (as a greeting, not as an exclamation) | hi |
langlaufen | entlanglaufen |
mache, mach (imperative) | mach |
nichts, nix | nichts |
noch mal, nochmal | nochmal (28.05.2019) |
rumspielen | herumspielen |
rum | herum |
runterfallen | herunterfallen |
sodass, so dass (as a conjunction) | sodass |
vorn, vorne | vorne |
- if there is a possible change of meaning or context restrictions, leave the lexemes as they are; the variants remain:
- daran, dran
- darin, drin, drinnen
- drauf, darauf
- sone (as in "sone autos", i.e. only with a noun in the plural), solche
- reinfahren, hereinfahren, hineinfahren
- auffahren, rauffahren, drauffahren
- reinpacken, einpacken, hineinpacken
Foreign language material (FM) and translingual elements
- FM with German inflection, e.g.:
dipl | norm | lang |
---|---|---|
gecrasht | gecrasht | eng/deu |
- material included in the Duden, such as sorry, Van etc., is marked as deu on the language level
Numbers...
- up to twelve (zwölf) are written out
- from 13 on they are written as digits
- in the written texts, keep the variant the participant chose
Individual decisions
dipl | norm |
---|---|
pekawe | PKW |
ef sechzehn | F16 |
-
'nen' zu 'ein' normalisieren in Fällen von Nominativ Maskulin und Nom./Akk. Neutrum, z. B.:
- dipl: "da is nen auto um die ecke gebogen". norm: "da ist ein auto..."
- dipl: "ich hab nen auto gesehen". norm: "ich habe ein auto..."
BUT not in: dipl: "nen ne vollbremsung" to norm: "einen eine vollbremsung"
-
so=n becomes either "so ein" or "so einen", depending on the context:
- dipl: "so=n typ hat mitm ball...". norm: "so ein Typ hat mitm Ball"
- dipl: "die frau hat so=n hund dabei". norm: "die Frau hat so einen Hund dabei"
BUT:
- dipl: "sie hatte so ein hund dabei" stays on norm: "sie hatte so ein Hund dabei"
-
gender gap
- dipl: Fußgänger innen; norm: Fußgänger_innen
Language Values
dipl | norm |
---|---|
deu | deutsch |
eng | englisch |
ara | arabisch |
tur | türkisch |
spa | spanisch |
Written texts
- add a CU layer
-
Punctuation marks:
- none are added and none are corrected, except where a space is missing:
dipl | norm |
---|---|
eingepackt.auf | eingepackt / . / auf |
- several punctuation marks in immediate succession without spaces, e.g. three dots (…), stay in one event:
dipl | norm |
---|---|
... | ... |
- if there are spaces in between, keep them that way, e.g. . / . / . :
dipl | norm |
---|---|
. . . | . / . / . |
Emojis
- emojis such as :) and acronyms such as lol are carried over to norm and kept in one segment
German POS and Lemma
(examples in German)
Model: STTS 2.0 (Westpfahl et al.)
You can find the guidelines here: Westpfahl_Schmidt_Jonietz_Borlinghaus_STTS_2_0_2017.pdf
Decisions for the POS tag:
- F16 as NE
- one-word greetings/farewells such as hi, hallo, tschüss are interjections (NGIRR)
- speaker codes, anonymized street names etc. become proper names (NE)
- names anonymized by the participants themselves, such as "Frau XX" or "XY-Straße", become non-words (XY)
- three dots count as sentence-final punctuation ($.)
- if, because of truncations or other issues, it cannot be decided which POS tag a token should receive, the value is left empty
- conventionalized abbreviations such as "d.h." are tagged as ADV (see guidelines p. 13)
- "also" is SEDM, ADV or KOUS depending on the context:
- "also" as SEDM in the pre-prefield, e.g. "also/SEDM ich heiße..."
- "also" as ADV, mostly for specifications without a verb, e.g. "...eine Familie, also/ADV Frau, Mann, Kind"
- "also" as KOUS, e.g. "also/KOUS ging ich die Straße entlang"
- a few examples
token | POS tag |
---|---|
/bis /später | /APPR /ADJD |
/gegenüber /von | /ADV /APPR |
/gegenüber /dem /Auto | /APPR /ART /NN |
-
Examples
- DEbi56FR_fsD: und der hat natürlich/ADJD auf den Ball reagiert
Decisions for the lemma:
- nominalizations are kept (Verletzte, Folgendes, Fahrer...); following the table below, the smallest common form is then taken for the different forms (Verletzte, Folgende, Fahrer), so that pos and lemma match (e.g. [norm]: "das Spielen", [pos_lang]: NN, [lemma]: Spielen)
- speaker codes stay as they are
- merged article+preposition forms are lemmatized to the preposition: aufm becomes auf, mitm mit, zum zu, im in
- dates become @card@
- reflexives become personal pronouns (sich to er|sie|es), possessives do not; see also the lemma forms table
- ordinal numbers are kept (so that they match [lemma]) and are not turned into cardinal numbers
- for forms that differ by gender we always take the smallest common form (see table); exception: NN denoting persons keep the form that appears on [norm], e.g. "Augenzeugin" stays and does not become "Augenzeug"
- der, die, das always becomes d, regardless of whether it is used as an article or a relative pronoun
- preposition+article contractions (vom, beim, aufm) are always lemmatized to the preposition (vom --> von, beim --> bei, aufm --> auf)
- plural forms become singular forms, i.e. Einkäufe becomes Einkauf on [lemma]
Different forms | lemma |
---|---|
all, alle, alles, aller | all |
andere, anderer, anderes | ander |
eine, einer, ein | ein |
der, die, das | d |
diese, dieser, dieses (attributive demonstrative pronoun) | diese |
dieser, dies, dieses (substituting demonstrative pronoun) | dies |
Folgendes, Folgende, Folgender | Folgende |
jener, jenes, jene | jene |
mein, meiner, meine, meins | mein |
welche, welcher, welches | welch |
vordere, vorderer, vorderes (ADJA) | vordere |
zweit, zweite, zweiter, zweites | zweit |
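The contraction rule and the forms table can be combined into a small lookup for checking tagger output. This is a sketch with a subset of the table entries, not the project's actual lemmatization pipeline:

```python
from typing import Optional

# Contractions are always lemmatized to the bare preposition; the forms below
# share one lemma according to the table above.
CONTRACTION_LEMMAS = {
    "vom": "von", "beim": "bei", "aufm": "auf", "mitm": "mit",
    "zum": "zu", "zur": "zu", "im": "in", "ins": "in",
}
FORM_LEMMAS = {
    "all": "all", "alle": "all", "alles": "all", "aller": "all",
    "der": "d", "die": "d", "das": "d",
    "mein": "mein", "meiner": "mein", "meine": "mein", "meins": "mein",
    "welche": "welch", "welcher": "welch", "welches": "welch",
}

def lemma_for(form: str) -> Optional[str]:
    """Return the documented lemma for a form, or None if a tagger/manual check is needed."""
    token = form.lower()
    return CONTRACTION_LEMMAS.get(token) or FORM_LEMMAS.get(token)

assert lemma_for("aufm") == "auf"
assert lemma_for("das") == "d"
```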
Transcription Decisions Russian
0. General information
-
no capital letters
-
abbreviations/acronyms are transcribed as pronounced, i.e. spelled out as full words (e.g. ДТП = дэтэпэ)
-
lower case for all words, even at the beginning of a sentence ==> exceptions: participant code, participant code + _P and symbols like (UNK) etc.
-
the transcription of the participant's speech shall generally follow the standard orthographic rules of Russian
-
but: if the participant produces utterances or words that are typical neither of standard Russian nor of vernacular spoken Russian (повседневный язык), transcribe them as the participant articulated them
Example from USbi52MR_fsR:
Participant: потому что они два два (-) не видели (-) ==> Standard and vernacular Russian: потому что они друг-друга не видели ==> два два is not typical for the standard or oral vernacular Russian ==> Transcription: потому что они два два (-) не видели (-)
Example:
Participant: мужик играл с футболом ==> standard and vernacular Russian: мужик играл с мячом ==> с футболом is in this context typical neither of standard nor of vernacular Russian ==> Transcription: мужик играл с футболом
-
phenomena that are typical of standard and vernacular spoken Russian and should not be reflected in the transcription:
-
reduced vowels (unless it is a specific dialect of Russian)
Example:
Participant: ана талкает каляску, а мужык играет смячикам ==> Transcription: она толкает коляску, а мужик играет с мячиком
-
so-called phonetic words (= word groups that are articulated as a single word ==> this often concerns a preposition and the following noun)
Example:
Participant: он вышел издому ==> Transcription: он вышел из дому
-
1. Tiers
- two tiers should be used
-
- speaker tier ==> used only for the transcription of the participant's speech; it is labelled with the participant code
-
- comment tier ==> an optional tier used for communication between transcribers; it is deleted after the transcription is finished
-
2. Segmentation
-
NB:
- 1 independent/main clause (главное предложение) = 1 simple sentence (простое предложение)
- 1 independent clause (главное предложение) + 1 or more dependent clauses (придаточное предложение) = complex sentence (сложноподчинённое предложение)
- 1 independent clause (главное предложение) + 1 or more independent clauses (главное предложение) = compound sentence (сложносочинённое предложение)
-
hint: an independent clause can always stand alone; a dependent clause should never stand alone, because without its independent clause the dependent clause wouldn't make sense
-
in addition: a compound sentence can be easily recognized by certain conjunctions, which connect the independent clauses in that type of sentence: these conjunctions are coordinating (соединительный), adversative (противительный) or disjunctive (разделительный) conjunctions (союзы), such as и, но, а, или, либо...либо etc.
-
dependent clauses in complex sentences can be recognized by conjunctions like потому что, когда, что, кто, который, чтобы, так как, но и etc.
-
the participant's speech is segmented into communication units (CUs)
-
1 CU corresponds to 1 simple sentence or to 1 complex sentence; sentences that consist of more than 1 independent clause (= compound sentences) always make up more than 1 CU:
-
Simple sentence
Example from DEbi52FR_isR:
я стала свидетельницей (-) а: (-) столкновение двух машин | ==> 1 CU
-
Complex sentence
Example from DEbi52FR_isR:
виноваты были не машины а: (-) один (-) эм мужчина который (-) ну кот/ ещё более такой (-) молодой | ==> 1 CU
-
Compound sentence
Example from DEbi03FR_isR:
хотела с тобой это поделить | но я была здесь на парковке у реве | ==> 2 CUs
-
-
if a compound sentence includes a subject or verb ellipsis, every elliptical clause is defined as an extra CU
Example:
она вышла из магазина | [subject ellipsis] уронила пакет | и [subject ellipsis] пошла дальше | ==> 3 CUs
-
a discourse marker (ну, ну там, вот, так, как бы, получается, эм, то есть etc.) and the following utterance are treated as 1 CU
Example from DEbi03FR_isR:
ну там я предполагаю там ребёнок внутри был | ==> 1 CU
-
discourse markers (ну, ну там, вот, так, как бы, получается, эм, то есть etc.) that specify the preceding utterance also form 1 CU together with it
Example from DEbi02FR_fsR:
хм их было трое то есть э маленький ребёнок э: женщина и мужчина | ==> 1 CU ("маленький ребёнок э: женщина и мужчина" is a specification of "трое")
-
greetings (привет, здравствуйте, здорово, здрасте etc.) are defined as an extra CU
Example from DEbi03FR_isR:
привет DEbi03FR_P | слушай я сейчас видела здесь такую ситуацию | ==> 2 CUs
-
question tags such as правда?, или?, правильно?, правильно понял?, не так ли? etc. belong to the previous CU
Example:
ты вася пупкин, правильно | ==> 1 CU
-
if you are not sure, make two or more CUs: merging CUs afterwards is possible, whereas splitting a CU afterwards is not possible at all
-
keep in mind that punctuation marks are not used at all ==> that means: no full stops, no commas etc.
3. Anonymization
-
replace the name of the participant with the respective code ==> e.g.: DEbi52FR
-
if whole names or surnames of the participant’s friends are mentioned, replace them with the participant code + _P
Example from DEbi52FR_isR:
привет DEbi52FR_P
-
places that could lead to the identification of the participant should be replaced as follows
Example:
я хожу в Leo-Tolstoi-Schule ==> я хожу в {schoolname}шуле
я живу на улице Шютценштрассе ==> я живу на улице {streetname}штрассе
-
anonymization in Audacity: the name of the participant should be masked with white noise
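The replacement rules above can be illustrated with a small helper. This is a sketch only; all names in the example are invented, and in the actual workflow the replacements are made manually during transcription.

```python
# Sketch of the text-side anonymization rules: participant name -> code,
# friends' names -> code + _P, identifying places -> placeholder + russified suffix.
def anonymize(text, participant_name, participant_code,
              friend_names=(), school_names=(), street_names=()):
    text = text.replace(participant_name, participant_code)
    for name in friend_names:
        text = text.replace(name, participant_code + "_P")
    for name in school_names:
        text = text.replace(name, "{schoolname}шуле")
    for name in street_names:
        text = text.replace(name, "{streetname}штрассе")
    return text

print(anonymize("я хожу в Leo-Tolstoi-Schule", "Иван", "DEbi52FR",
                school_names=["Leo-Tolstoi-Schule"]))
# -> я хожу в {schoolname}шуле
```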
4. Hesitation markers
-
hesitation markers do not form their own event ==> they belong to the CU in question
Example from DEbi52FR_isR:
я когда шла э: на автобусную остановку (-) эм: ==> 1 CU
-
general notation:
- m-hm (confirming) = угу
- ehm = эм or э:м
- hm = хм
- eh/uh = э or э:
- ɑha = ага
- ah = а:
5. Long vowels and consonants
- vowels pronounced longer than normal (under 2 seconds) are marked with a colon ==> e.g.: ну: да
- vowels that are pronounced extremely long (2 seconds and longer) are marked with two colons ==> e.g.: ну:: да
- long pronunciation is also possible for consonants ==> e.g.: тс: тише
- doubled vowels/syllables are marked with % ==> e.g.: ты точно сделал? да%а
6. Pauses
- are transcribed on speaker tier
- a pause between two CUs is marked as its own event ==> the pause gets two boundaries
- pauses within a CU are transcribed inside that CU ==> they do not form their own event
- word-internal pauses are marked inside the word, without a space between the parts of the word ==> e.g.: с э(-)тим мячиком ==> exception: pauses with эм inside a word ==> e.g.: они на (-) эм крыли стол
- general notation:
- 0.2-1 second ==> (-)
- 1-3 seconds ==> (--)
- longer than 3 seconds ==> time should be measured and noted in brackets ==> e.g.: (3.1), (5.5)
- background noise such as traffic noise, phone ringing or computer noise is noted as a pause
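The notation maps directly onto measured durations. A sketch with the thresholds from the list above (durations below 0.2 seconds are assumed not to be marked):

```python
# Choose the pause symbol from a measured duration in seconds.
def pause_symbol(seconds: float) -> str:
    if seconds < 0.2:
        return ""                  # below the threshold: not transcribed as a pause
    if seconds <= 1.0:
        return "(-)"
    if seconds <= 3.0:
        return "(--)"
    return f"({seconds:.1f})"      # longer than 3 s: measured time in brackets

assert pause_symbol(0.5) == "(-)"
assert pause_symbol(2.0) == "(--)"
assert pause_symbol(3.1) == "(3.1)"
```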
7. Merged forms
-
merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements
Example from USbi52MR_fsR:
с одной стороны (-) дороги (-), э, шли муж=женой
8. Reduced syllables
-
general rule: reduced syllables are transcribed in their full form, even if they were articulated differently
Example:
participant: она токо что шла на улице ==> transcription: она только что шла на улице
-
exception: if a word is listed with its reduced syllables in a dictionary (e.g. MAC ==> Link) and the participant articulated the reduced form, then the reduced form of the word should be preferred for the transcription
Example:
participant: здрасте ==> transcription: здрасте (link to the MAC)
-
use / to mark unfinished words
Example from DEbi52FR_isR:
сегодня (-) э когда я шла на авто/ астобв/ (-) а (-) автобусную остановку
9. Numerals and dates
-
numbers should be transcribed as words, since Russian numerals are inflected
Example:
я вижу двух* женщин*
-
dates should also be transcribed as words
Example:
я родился двадцать первого* января тысяча девятьсот девяносто пятого года*
10. Spelling for russified lexicals
-
general rule: foreign words are transcribed into Russian (Cyrillic) as they are articulated
-
in addition: there is a special list for such cases where you can look up such words or add new ones (Link) ==> important: all transcribers have to transcribe these words into Russian identically
- Autowerkstatt = аутоверкштат
- Truck = трак
- Ort = орт
- REWE = реве
- Renault = рено
- also = алзо
- OK_ = окe
- WhatsApp = воцап
- {schoolname}schule = {schoolname}шуле
- {streetname}straße = {streetname}штрассе
-
table for russified lexicals
German/English word | Russified word | Code | File | Second |
---|---|---|---|---|
accident | аксидент | USbi06FR | fsR | 6,94 |
accidentally | аксидальтально | USbi07MR | fsR | 15,1 |
Aldi | алди | DEbi64MR | fsR | 8 |
also | алзо | DEbi56FR | fsR | 36,85 |
Autowerkstatt | аутоверкштат | DEbi51FR | fsR | 93,75 |
bag | бег | USbi59FR | isR | 164,23 |
Ball (mit dem) | болом | DEbi12FR | fsR | 16,9 |
Ball | бол | DEbi12FR | fsR | 19 |
in the back | ин зе бэк | USbi74MR | isR | 53,92 |
ciao | чао | DEbi04MR | isR | 29,4 |
case | кэйз | USbi86FR | fsR | 4,62 |
crash | крэш | DEbi12FR | fsR | 47,78 |
crashed | крэшовали | DEbi15MR | isR | 37,56 |
hey | хей | USbi73FR | isR | 0,64 |
like | лайк | USbi86Fr | fsR | 73,13 |
message | месседж | USbi16FR | isR | 8,23 |
911/nine one one | найн уон уон | USbi59FR | isR | 83,64 |
911/nine eleven | найн элэвэн | USbi73FR | fsR | 59,29 |
OK | окей | USbi05FR | isR | 64,7 |
Ort | орт | DEbi53FR | fsR | 23,19 |
parking lot | паркинг лот | USbi74MR | isR | 7,48 |
Renault | рено | DEbi10MR | isR | 31,71 |
representative | рэпрезэнтэтиф | USbi74MR | fsR | 3,85 |
Rewe | реве | DEbi03FR | isR | 13,9 |
Schützenstraße | Шютценштрассе | DEbi04MR | fsR | 5,14 |
spilled/spilt | сплыть | USbi58FR | fsR | 36,41 |
stopped | стопт | USbi58FR | fsR | 25,22 |
stroller | строллер | USbi79MR | isR | 51,8 |
Truck | трак | USbi52MR | fsR | 77,39 |
turn | торн | USbi74MR | isR | 36,32 |
Vans | вэнс | USbi59FR | isR | 12,59 |
WhatsApp | воцап | USbi52MR | isR | |
-
if a conventionalized spelling in Russian already exists for a foreign word, the conventionalized spelling should be preferred
Example:
Messenger = мессенджер
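To keep the spellings identical across transcribers, the shared list can be read as a simple lookup. The sketch below contains only a few rows from the table above; мессенджер illustrates the preference for conventionalized Russian spellings stated above.

```python
# Lookup sketch for the shared list of russified lexicals (subset of the table above).
RUSSIFIED = {
    "Autowerkstatt": "аутоверкштат",
    "Truck": "трак",
    "Ort": "орт",
    "REWE": "реве",
    "Renault": "рено",
    "also": "алзо",
    "WhatsApp": "воцап",
    "Messenger": "мессенджер",   # conventionalized Russian spelling is preferred
}

def russify(word: str) -> str:
    """Return the agreed Cyrillic spelling, or the word itself if it is not listed yet."""
    return RUSSIFIED.get(word, word)

assert russify("Truck") == "трак"
```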
11. Notations of non-verbal material, uninterpretable material and background noise
-
non-verbal events like laughing or coughing are noted in square brackets on the speaker tier and always belong to the CU in question
-
general notation
- [throatclearing]
- [coughing]
- [laughing]
- [pfing] ==> for a sound like "pff"
- [sighing]
- [sniffing]
- [tongueclicking] ==> includes tsking as disapproval, tsking while thinking, and simply opening the mouth with a click
- [yawning]
- [gulping]
- [whispering]
- [breathing]
-
if the participant speaks and makes a non-verbal event at the same time, it is noted as:
- [[coughing]word]
- [[laughing]word]
- [[sighing]word]
- [[tisking]word]
- [[yawning]word]
- [[gulping]word]
- [[whispering]word]
Example from DEbi52FR_isR:
ты [[laughing]знаешь] что сегодня случилось
-
uninterpretable material is to be marked as (UNK) on speaker-tier
-
if it is not clear to which CU the UNK belongs, create a separate event ==> that means: write the UNK between two boundaries
-
if the UNK is longer than two seconds, measure the time and write the time together with UNK in one bracket ==> e.g.: (UNK, 2.1)
-
assumed content is noted in parentheses, each token separately ==> e.g.: (assumed) (content)
-
background noise such as traffic noise, phone ringing or computer noise should be noted as pauses
12. Table of symbols
Symbol | Meaning |
---|---|
<Q> speech </Q> ==> e.g.: <Q> можно я ещё раз </Q> | for questions to the procedure on the part of the participant or for verbal interventions on the part of the elicitor |
(-) | for pauses 0.2-1 second |
(--) | for pauses 1-3 seconds |
(time) ==> e.g.: (3.1) | for pauses longer than 3 seconds |
(UNK) | for uninterpretable material |
(UNK, time) ==> e.g.: (UNK, 2.1) | for uninterpretable material longer than 2 seconds |
(assumed word) | for assumed material |
[non-verbal action] ==> e.g.: [laughing] | for non-verbal material |
[[non-verbal action]word] ==> e.g.: [[laughing]знаешь] | for a non-verbal & verbal event |
: | for an unusually long vowel or consonant (under 2 seconds) |
:: | for an unusually long vowel or consonant (longer than 2 seconds) |
= | for merged words |
/ | for interruption of a word |
% | for doubled syllables |
{...} ==> e.g.: {schoolname}шуле | for anonymised places |
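The symbol table can also be used to sanity-check transcriptions semi-automatically. The following sketch classifies single tokens with regular expressions; the category names are shorthand introduced here, not part of the guidelines.

```python
import re

# Classify one transcription token according to the symbol table above.
PATTERNS = [
    ("pause",      re.compile(r"^\((-|--|\d+\.\d+)\)$")),
    ("unknown",    re.compile(r"^\(UNK(, ?\d+\.\d+)?\)$")),
    ("nonverbal",  re.compile(r"^\[\[?[a-z]+\]")),   # [laughing], [[laughing]знаешь]
    ("anonymized", re.compile(r"\{[a-z]+\}")),       # {schoolname}шуле
    ("truncated",  re.compile(r".+/$")),             # авто/
    ("merged",     re.compile(r".+=.+")),            # муж=женой
]

def classify(token: str) -> str:
    for label, pattern in PATTERNS:
        if pattern.search(token):
            return label
    return "word"

assert classify("(--)") == "pause"
assert classify("(UNK, 2.1)") == "unknown"
assert classify("муж=женой") == "merged"
```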
Russian Normalization
0. General information
-
the conversion of the participant's speech 1 into a correct orthographic form 2 that corresponds to the applicable linguistic norm of the respective language (in this case Russian) is called normalization
-
during the normalization of the participant's speech, morphosyntactic (grammatical) mistakes are not taken into account / are not changed into their grammatically correct forms
Example:
Participant: | Я калаской видел | ==> Normalization: | Я коляской видел | 3
-
the normalization is carried out manually in the application program EXMARaLDA
1. Structure of normalization in EXMARaLDA
- on the normalization level, EXMARaLDA is structured into the following layers:
- dipl-[dipl]-layer
- norm-[norm]-layer
- dipl-[language]-layer
- dipl-[CU]-layer 4
Layer | Function |
---|---|
dipl-[dipl] | shows the original speech of the participant, which never gets corrected - neither grammatically nor orthographically |
norm-[norm] | here the participant's speech is normalized manually (corrected orthographically - not grammatically) |
dipl-[language] | shows, to which language the concerned word belongs 5 |
dipl-[CU] | here the participant's speech is segmented into communication units (one of the results of the transcription); for written files, whoever normalizes the file first has to add the dipl-[CU]-layer (because it is absent) 6 and then segment the participant's speech into CUs. |
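Since the exb files are XML, the layer structure described above can be inspected programmatically. A minimal sketch, assuming the usual EXMARaLDA basic-transcription format with <tier category="..."> elements containing <event> children; the file name is a placeholder:

```python
import xml.etree.ElementTree as ET

# List which annotation layers (tiers) an exb file contains and how many events each has.
def list_tiers(exb_path):
    root = ET.parse(exb_path).getroot()
    for tier in root.iter("tier"):
        category = tier.get("category")
        events = len(tier.findall("event"))
        print(f"{category}: {events} events")

# list_tiers("DEbi52FR_isR.exb")   # e.g. prints lines such as: dipl: ..., norm: ..., CU: ...
```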
2. The subjects of normalization are ...
- ... files from DEbi---R; USbi---R and RUmo---R with the following symbols at the end:
- _fsR (formal spoken Russian)
- _fwR (formal written Russian)
- _isR (informal spoken Russian)
- _iwR (informal written Russian)
3. Steps of procedure
1. Push/Pull/Fetch in GitHub
2. Open the EXMARaLDA Partitur-Editor
3. File ==> Open ==> rueg repository ==> GitHub ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
4. For spoken files, verify that the CUs on the dipl-[CU]-layer follow the CU guidelines - if not, correct them; for written files, add an extra CU layer and segment the participant's speech into CUs according to the CU guidelines
5. Verify that every single word is assigned the correct language on the dipl-[language]-layer - if not, correct it
6. Normalize according to the orthographic rules of the Малый академический словарь (MAC) ==> transfer the participant's speech from the dipl-[dipl]-layer into its orthographically correct form on the norm-[norm]-layer and delete all phenomena that are not needed for the following annotation levels (lemmatization, POS tagging etc.) ==> which phenomena this concerns in practice can be taken from the table in 4. Normalization guidelines and problems
7. Save your results
8. Go to GitHub ==> submit your file ==> push/pull/fetch
4. Normalization guidelines and problems
Phenomenon/Problem | What to do on norm-[norm]-layer | Example dipl-[dipl]-layer | result on norm-[norm]-layer |
---|---|---|---|
pauses | delete them 7 | привет (-) ты не (--) ты не поверишь что случилось | привет ты не ты не поверишь что случилось |
hesitation markers | delete them 8 | там короче эм шла женщина с коляской и э | там короче шла женщина с коляской и |
phonetic markers | delete them 9 | мальчик играл с мячиком и:: и потом мячик покатился на дорогу | мальчик играл с мячиком и и потом мячик покатился на дорогу |
emojis | they are not removed and each receives its own event | / :-P привет / я / сегодня / увидел / аварию :-( / | / :-P / привет / я / сегодня / увидел / аварию / :-( / |
punctuation marks (in general) | they are not removed and each receives its own event 10 | / мячик / покатился / на / дорогу . / потом / собака / начала / лаять . / | / мячик / покатился / на / дорогу / . / потом / собака / начала / лаять / . / |
punctuation marks (several in immediate succession without spaces between them) | keep them in one event | / это / просто / был / кошмар / !!!!! | / это / просто / был / кошмар / !!!!! / |
punctuation marks (several in immediate succession with spaces between them) | every punctuation mark receives its own event | / это / просто / был / кошмар / ! ! ! ! ! | / это / просто / был / кошмар / ! / ! / ! / ! / ! / |
merges / clitics 11 | separate and normalize, if possible | муж=женой | муж с женой |
repetitions | they are not removed | / потом / мяч / покатился / на / на / дорогу / | / потом / мяч / покатился / на / на / дорогу / |
self-corrections of whole words | they are not removed and each receives its own event | / парень / играл / с / в / мячик / | / парень / играл / с / в / мячик / |
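Most of the deletions in the table are mechanical. The sketch below pre-computes a norm value for a single dipl event under these assumptions: the hesitation-marker list follows section 4 of the transcription guidelines, and lengthening colons are only stripped from Cyrillic tokens so that emojis such as :-P stay untouched. The manual check in EXMARaLDA is still required.

```python
import re

HESITATIONS = {"э", "э:", "эм", "э:м", "хм"}
PAUSE = re.compile(r"^\((-|--|\d+\.\d+)\)$")
LENGTHENING = re.compile(r"^([а-яё]+):{1,2}([а-яё]*)$", re.IGNORECASE)

def normalize_event(dipl_text):
    """Return the norm value for one dipl event, or None if the event is to be emptied."""
    if PAUSE.match(dipl_text) or dipl_text in HESITATIONS:
        return None                      # pauses and hesitation markers are deleted
    lengthened = LENGTHENING.match(dipl_text)
    if lengthened:                       # phonetic lengthening: и:: -> и
        return lengthened.group(1) + lengthened.group(2)
    return dipl_text                     # everything else is kept

assert normalize_event("(-)") is None
assert normalize_event("и::") == "и"
assert normalize_event(":-P") == ":-P"
```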
5. Language values
Each token on the dipl-[dipl]-layer has to be marked with an abbreviation indicating the language it belongs to (language value). In general this is done automatically, but in some cases you have to change the language value - for example for foreign words in the participant's speech on the dipl-[dipl]-layer. If this occurs, please change the language value (abbreviation). If you do not know the abbreviation of the respective language, use the ISO 639-2 language code. A list of valid codes can be found here.
Language | language value on dipl-[language]-layer |
---|---|
Russian | rus |
German | ger |
English | eng |
6. Comments
1 The original speech of the participant, which is never changed (neither during normalization nor during lemmatization or POS tagging), is located on the dipl-[dipl]-layer in EXMARaLDA.
2 The orthographically corrected speech of the participant is located on the norm-[norm]-layer in EXMARaLDA. In most cases the events (gaps) on this layer are already filled in, but the orthographic accuracy of the words should be checked manually each time.
3 This CU is transferred into its orthographically correct form (калаской ==> коляской). In contrast, the grammatical (morphological and syntactic) incorrectness of the CU (the correct government of the verb видеть is видеть кого/что? (Acc.) ==> Я видел коляску, and the more correct word order in this case would be Я вижу калаской) is not taken into account or changed at all.
4 For written files (_fwR, _iwR) a dipl-[CU]-layer has to be added (for spoken files the dipl-[CU]-layer already exists) and the participant's speech from the dipl-[dipl]-layer has to be segmented into CUs on the added dipl-[CU]-layer.
5 Keep in mind that foreign words are transliterated into the Russian alphabet but remain foreign words (even if they are written in Cyrillic). Therefore, foreign words have to be marked on the dipl-[language]-layer with the abbreviation of the language they originally come from (usually ger or eng).
6 You can add a new layer in EXMARaLDA through the menu bar (Tier ==> Add Tier). Then choose the following settings:
- Speaker: dipl [dipl]
- Type: A(nnotation)
- Category: CU
Because it depends on the diplomatic tokenization, the tier has one timeslot per event on the dipl-[dipl]-layer. To annotate a CU, merge all timeslots on the added dipl-[CU]-layer that overlap the tokens on the dipl-[dipl]-layer belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "CU" to the newly created span.
The CU tier does not explicitly contain any text. Because it overlaps its respective tokens, the entire text of a CU is already defined and can be searched.
7 If you delete pauses on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.
8 If you delete hesitation markers on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.
9 If you delete phonetic markers on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.
10 General rule: we do not add or correct any punctuation marks. Only when a punctuation mark is attached to a word do we have to correct it on the norm-[norm]-layer.
11 Clitics are synsemantic/function words (= words without a semantic meaning of their own [==> they therefore carry a grammatical meaning]) that phonologically "fuse" with their immediate "neighbour" words (which are in most cases nouns). That means that clitics (in most cases prepositions) and their immediate neighbour words (nouns) are articulated as one (phonological) word. Depending on their position (before or after the noun to which they belong), clitics are classified as proclitics (if the clitic stands in front of its noun) or enclitics (if the clitic stands behind its noun).
Example:
работать из (=clitic) дому (=noun) ==> articulation: издому, врач на (=clitic) дом (=noun) ==> articulation: надом
Russian POS and Lemma
0. General information
Lemmatization
- the term lemma may be defined as the base form of a word
- the base form of a word is the form you can usually find in a dictionary
- for verbs the base form correlates with the infinitive, for nouns with the nominative and for adjectives with the nominative in its masculine form
- the conversion of a word into its base form is called lemmatization
- the lemmatization is carried out semi-automatically in the application program EXMARaLDA using the two POS and lemma taggers U-POS and MyStem; however, the accuracy of the taggers should be checked manually each time
- the lemmas or base forms of the words can be found in MyStem on the norm[mystem_lex] and in U-POS on the norm[lemma] layer
POS-Tagging
- the term tagging means that each word of the participant is attributed with its part of speech (POS)
- the tagging is carried out in the application program EXMARaLDA by semi-automatic U-POS and MyStem taggers, but the accuracy of the taggers should be checked manually each time
- there are two taggers in EXMARaLDA which assume the task of POS-tagging - U-POS and MyStem
- keep in mind that these two taggers are similar to each other, but not absolutely identical 1
1. Structure of POS-Tagging in EXMARaLDA
U-POS-Layers
- the layers from norm[Animacy] to norm[voice] as well as the norm[lemma] and norm[pos] layers belong to the U-POS tagger
- each layer in U-POS (and MyStem) correlates with a grammatical category
- the meaning of each grammatical category in U-POS is explained in the following table:
Layer | Grammatical category | Grammeme | Part of speech |
---|---|---|---|
norm[Animacy] | Одушевлённость | Одушевлённость (Anim); Неодушевлённость (Inan) | concerns only nouns |
norm[Aspect] | Вид | Cовершенный вид [что сделать?] (Perf); Несовершенный вид [что делать?] (Imp) | concerns only verbs |
norm[Case] | Падеж | им.п. (Nom); род.п. (Gen); дат.п. (Dat); вин.п. (Acc); твор.п. (Ins); предл.п. (Loc); зват.п. (Voc) | concerns all nominal categories of POS |
norm[Degree] | Степень сравнения | положительная (Pos); сравнительная (Cmp); превосходная (Sup) | concerns adjectives and adverbs |
norm[Foreign] | Определение заимствования | (Yes) | concerns all parts of speech |
norm[Gender] | Род | муж.р. (Masc); жен.р. (Fem); сред.р. (Neut) | concerns only nouns, adjectives and pronouns |
norm[Mood] | Наклонение | изъяв.н. (Ind); услов.н. (Cnd); повел.н. (Imp) | concerns only verbs |
norm[Number] | Число | Единственное (Sing); Множественное (Plur) | concerns nouns, adjectives, personal pronouns and verbs |
norm[Person] | Лицо | Первое лицо (1); Второе лицо (2); Третье лицо (3) | concerns personal pronouns and verbs |
norm[Tense] | Время | Настоящее (Pres); Прошедшее (Past); Будущее (Fut) | concerns verbs and participles |
norm[VerbForm] | Форма глагола | Начальная форма (Inf); Спрягаемая форма (Fin) | concerns verbs |
norm[voice] | Залог | Действительный (Act); middle voice (Mid); Страдательный (Pas) | concerns verbs and participles |
norm [lemma] | Начальная форма слова | ------ | concerns all parts of speech |
norm[pos] | Определение части речи данного слова | существительное (NOUN); глагол (VERB); прилагательное (ADJ) ... | concerns all parts of speech |
MyStem-Layers
- the norm[mystem_gr] and norm[mystem_lex] layers belong to the MyStem tagger
- each layer in MyStem (and U-POS) correlates with a grammatical category
- the meaning of each grammatical category in MyStem is explained in the following table:
Layer | Grammatical category | Grammeme | Part of speech |
---|---|---|---|
norm[mystem_gr] | POS determination of the given word | every redundant grammeme gets deleted, except the first grammeme and - if they appear - the grammemes of transitivity (tran/intr) 2 and parenthesis (parenth) | concerns all parts of speech |
norm[mystem_lex] | Base form of a word | should conform with the base form in U-POS | concerns all parts of speech |
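MyStem output of this shape can be obtained, for example, with the pymystem3 package (a Python wrapper around Yandex MyStem). Whether this matches the project's actual tagging pipeline is an assumption of this sketch; in any case the values still have to be checked and pruned manually in EXMARaLDA as described below.

```python
from pymystem3 import Mystem

mystem = Mystem()

def mystem_tags(norm_text):
    """Yield (token, lemma, grammemes) for the analyzable tokens of a normalized CU."""
    for item in mystem.analyze(norm_text):
        analyses = item.get("analysis")
        if not analyses:                      # whitespace, punctuation, unknown tokens
            continue
        best = analyses[0]                    # take the first (most probable) analysis
        yield item["text"], best.get("lex"), best.get("gr")

# for token, lex, gr in mystem_tags("мальчик играл с мячиком"):
#     print(token, lex, gr)                   # e.g. мальчик / мальчик / S,муж,од=им,ед
```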
2. The subjects of lemmatization and POS-Tagging are ...
- ... files from DEbi---R; USbi---R and RUmo---R with following symbols at the end:
- _fsR (formal spoken Russian)
- _fwR (formal written Russian)
- _isR (informal spoken Russian)
- _iwR (informal written Russian)
3. Steps of procedure
1. Push/Pull/Fetch in GitHub
2. Open the EXMARaLDA Partitur-Editor
3. File ==> Open ==> rueg repository ==> GitHub ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
4. Verify that the CUs in every file follow the CU guidelines - if not, correct them
5. Verify that every word is assigned the correct language on the dipl[language]-layer - if not, correct it
6. POS tagging ==> verify the accuracy of the POS tagging software (U-POS and MyStem)
7. Delete all features from the norm[mystem_gr]-layer except the first one and - if available - the features of transitivity, parenthesis and other features which are not redundant with the U-POS features
8. Save your results
9. Go to GitHub ==> submit your file ==> push/pull/fetch
4. Tagging-Guidelines and problems
Phenomenon/Problem | Solution | Example |
---|---|---|
personal pronouns (личные местоимения) | norm[mystem_gr]: SPRO; norm[pos]: PRON | ----- |
ф | all grammemes on the U-POS layers are deleted; dipl[language]: rus; norm[mystem_gr]: S,persn; norm[pos]: PROPN | ф шестнадцать |
его, её, их as possessive pronouns | norm[case]: Gen | Он уронил её пакет |
тут, там, так | norm[mystem_gr]: ADVPRO; norm[pos]: ADV | ----- |
words with the letter ё | ё is written on all layers except dipl ==> norm[norm]: …ё…; norm[lemma]: …ё…; norm[mystem_lex]: …ё… | ----- |
значит | norm[mystem_gr]: ADV, parenth; norm[pos]: INTJ | значит, он уронил всё и пошёл |
короче | norm[mystem_gr]: ADV, parenth; norm[pos]: INTJ | Короче, там шла женщина с коляской |
окей | norm[mystem_gr]: PART; norm[pos]: PART | ----- |
всё (ещё) | norm[mystem_gr]: ADV; norm[pos]: ADV | ----- |
блин | norm[mystem_gr]: INTJ; norm[pos]: INTJ | ----- |
то есть | norm[mystem_gr]: СONJ (то) PART (есть); norm[pos]: СONJ (то) PART (есть) | ----- |
спасибо | norm[mystem_gr]: PART; norm[pos]: PART | ----- |
этот | norm[mystem_gr]: APRO; norm[pos]: DET | ----- |
свой | norm[mystem_gr]: APRO; norm[pos]: PRON | ----- |
привет, здравствуйте | norm[mystem_gr]: INTJ; norm[pos]: INTJ; norm[lemma]: привет, здравствуйте; norm[mystem_lex]: привет, здравствуйте | ----- |
играть | norm[mystem_gr]: tran | мальчик играл с мячом 3 |
5. Comments
1 U-POS and MyStem use partly different features for the POS-tagging of words.
Example: U-POS classifies the Russian personal pronoun я simply as a pronoun (PRON); no further specification is given in this context. MyStem, in contrast, specifies the pronoun further and classifies я as a noun-pronoun (SPRO).
4 In general, reflexive verbs in Russian can be identified by the verb postfix -ся. But not all verbs that end in -ся are reflexive: verbs with a transitive word stem and the postfix -ся are not reflexive verbs but verbs in the passive voice. When in doubt, check the Russian verb by translating it into German. If you can translate the Russian verb with sich..., it is very likely a reflexive verb. If that is not possible and you have to translate the verb with the passive construction wird/werden ... ge-..., it is very likely a transitive verb in its passive form.
Example: Книга читается.
Das Buch liest sich. ==> This translation would not make sense (except in fairy tales), because a book cannot usually read itself.
Das Buch wird gelesen. ==> This translation is more plausible (assuming the context is not a fairy tale), because the word stem is a transitive verb carrying the postfix -ся. The verb therefore expresses the passive and can be understood as: the book is read by someone who is unknown or who is not meant to be mentioned.
Example: Человек развивается.
Der Mensch wird entwickelt. ==> Развивать is a transitive verb, and the postfix -ся could lead to the conclusion that we are dealing with the passive voice. That is possible in principle and, without context, difficult to rule out. Since there is no context here, orient yourself on the general meaning of this sentence, which is the one used most often.
Der Mensch entwickelt sich. ==> This is the usual meaning of the sentence; in this reading the verb has a reflexive rather than a passive meaning. Prefer this reading in cases where the context is missing or unclear.
2 Transitive verbs are verbs that govern direct objects (objects in the accusative without a preposition); there is no preposition between the verb and the accusative object. Only transitive verbs can form the passive voice. The passive voice can be recognized by the word stem of a transitive verb + the postfix -ся.
Example: Мальчик читает книгу. Книга читается мальчиком.
Junge (Nom) liest (tran. verb) Buch (acc. obj. without preposition). Buch (Nom) wird gelesen (passive voice of a tran. verb) vom Jungen (Inst).
Intransitive verbs are verbs that govern indirect objects (objects in the accusative with a preposition or objects in other grammatical cases); a preposition can appear between verb and object. The objects can appear in the accusative with a preposition, in the dative, genitive or instrumental with or without a preposition, and in the locative with a preposition (objects in the locative always take a preposition, which is why the Russian locative is also called the prepositional case). Intransitive verbs cannot form the passive voice.
Example: Папа звонит маме. *Мама звонится папой.
Papa (Nom) ruft (intr. verb) Mama (Dat) an. *Mama wird von Papa angerufen.
3 In this context the verb играть is intransitive, because the Russian preposition с requires the instrumental here. However, there are cases in which играть can be used as a transitive verb.
Example: Вася играет дурака в этом спектакле.
Vasja (Nom) spielt (tran. verb) den Dummen (acc. object without a preposition between verb and object) in diesem Stück (Loc).
Therefore, all verbs that can have a transitive meaning in other contexts have to be marked as transitive on the MyStem layer, even if the verb is used intransitively in the current context! The reason is that a verb which can (in principle) be used transitively is always treated as a verb with a transitive basic meaning, regardless of whether this transitive meaning surfaces in the current situation or not.
6. Useful links
- If you have problems deciding which part of speech the current word belongs to, look the word up in the Национальный корпус русского языка and check their analysis. But keep in mind that they have analyzed the speech of their participants partly under different conditions and assumptions.
- All U-POS features are available here: Universal features part 1 and Universal features part 2
- If you have problems deciding whether the current word is transitive or intransitive, or if you simply do not know which grammatical case a word appears in, use Викисловарь
Transcription Decisions Turkish
Basics
Format
- create a TextGrid in Praat
- import the TextGrid into EXMARaLDA
Tiers
- speaker tier (e.g. TUmo01MT; type: transcription)
- optional tier for segmentation in Intonation Phrases (IP)
- Normalization in EXMARaLDA
Segmentation
- According to Communication Units (CU) Communication_unit__P4_10.12.2018.pdf
- No punctuation
Anonymisation
- Replace the name of the participant with the respective code (e.g. TUmo01MT)
- If whole names or surnames of friends are mentioned, replace them with the participant code + _P (e.g. TUmo02FT_P)
- Places that could lead to the identification of a participant are anonymized (e.g. Atatürk okulunda = Axxx{schoolname} okulunda, Kızılay caddesi = Kxxx{streetname} caddesi)
- if a phone number is mentioned, please anonymize it as {phonenumber}
Transcription
'Unwanted' material (if applicable)
- If this is not possible mark those passages as:
<Q> communication with elicitor </Q>
Merged forms
- Merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements
- Examples from TUmo10MT_isT: n=apıyorsun (= ne yapıyorsun), TUmo11MT_isT: n=aber (= ne haber)
Tag Questions
- tag questions (de mi) do not constitute a separate CU
Reduced syllables
- reduced syllables are transcribed as articulated
- Examples: bi tane (= bir tane), gidiyo (= gidiyorsun), yakıyosun (= yakıyorsun), içbiri (= hiçbiri)
- Use / to mark unfinished words, e.g. “Çarb/ çarptı derken oldu bitti“
Accents and dialects
- pronounced sounds are transcribed as articulated (e.g. gardaşım (= kardeşim)), but sounds which are not typical for Turkish are not represented.
Pauses
- 0.2 - 1 sec: (-)
- 1-3 secs: (--)
- More than 3 secs: (5.5) to be measured
- Word-internal pauses are marked as follows: top(-)la - no space between the parts.
Long vowels & consonants
- vowels pronounced longer than normal (under 2 sec) are marked with : (e.g. canı:m)
- vowels that are pronounced extremely long (2 sec and more) are marked with :: (e.g. canı::m)
- also possible for consonants (e.g. tamam:)
- doubling of vowels/syllables is marked with % (e.g. ba%ay)
Non-verbal material
-
non-verbal events such as a participant laughing or coughing are noted in square brackets on the speaker tier, e.g. [laughing], [whispering], [clears throat], [sighs], [sniffs], [snapsfingers]
-
if participants speak and laugh at the same time, it is noted as: [[laughing]speech]
Uninterpretable material
- uninterpretable material is to be marked as (UNK) on Speaker-tier
- longer than 2secs: (UNK, 2.1)
- assumed content in brackets, each token separated: (assumed) (content)
Hesitation markers / Interjections / Reception markers
- e (short "e"), ee (long "ee"), ı (short "ı"), ııı (long "ııı")
- thinking: "hmm, eem, ımm"
- agreement: "hıhı"
- negation: "ı ıh"
- disappointment: "tüh"
Foreign language material
- original spelling will be kept.
Proper/Brand names
- Keep conventionalized spelling (e.g. Renault = renault)
Numerals
- Numbers are spelled out (e.g. 155 = yüz elli beş)
Table of symbols
Symbols | Meaning |
---|---|
<Q> araştırmacıyla iletişim </Q> | instances of questions concerning the procedure and/or verbal interventions of elicitators |
(-) | pauses 0.2 - 1 sec |
(--) | pauses 1-3secs |
(3.2) | pauses longer than 3secs |
(UNK) | uninterpretable material |
(UNK, 2.2) | uninterpretable material longer than 2secs |
(assumption) | assumed material |
[gülüşmeler/fısıldaşmalar] | non-verbal material |
[[gülüşme]konuşma] | non-verbal & verbal event |
: | unusually long vowel or consonant (under 2 secs) |
:: | unusually long vowel or consonant (longer than 2 secs) |
= | merged forms |
/ | interruption of a word |
% | doubled syllables |
{...} | specification of an anonymised place |
Turkish Normalization
Basics
- Orthographic normalization.
- No changes in grammar (case, tense etc).
- Pauses are deleted on a norm layer.
- Non-verbal material (laughing, coughing) is deleted on a norm level.
Anonymized material
-
Participant codes are deleted on the norm layer.
-
Anonymized material should be left the same way as on a dipl layer (e.g., kxxx{streetname}da).
- if an anonymized token is followed only by an ending, the ending is kept in one box together with the token;
- if an anonymized token is followed by a word, like sokakta, mahallede etc, these words are placed in the next box.
Capitalization
- Proper names are capitalized on a norm layer.
- Names of countries, nations are capitalized (e.g., Almanya, Almanlar);
- Names of days, months are capitalized (e.g., Salı, Ekim);
- Nouns like sokak, cadde, mahalle are capitalized in case they are used together with proper names (e.g., Necatibey Sokağı'nda)
- German nouns are capitalized (e.g., Auto)
- For the capitalization norms of the word "Allah", check TDK Allah (Atasözü, deyim ve birleşik fiiller)
- If a common noun is capitalized although it is not at the beginning of a sentence (e.g., Futbol Topu), it is corrected on the norm layer (e.g., futbol topu).
- if the whole text or parts of it were written in CAPS, they are changed to lower case, except where Turkish standard orthography demands capitalization.
Apostrophe
- Proper names (e.g., Starbucks), country names (e.g., Almanya) and nouns like sokak, cadde, mahalle require an apostrophe when they are used together with proper names and appear in the dative, accusative or locative case (e.g., Necatibey Sokağı'nda);
Dates and Numbers
- Code 'ef on altı'/ 'fe on altı' is represented as F16 (written in one box);
- If a date is written as 15.10.2018, it stays as 15.10.2018 on a norm level;
- But if a date is spelled out (e.g., on beş ekim), the day is represented with numbers, but the month is spelled out and capitalized (e.g., 15 Ekim).
Cancellations, Repairs
- If a canceled/repaired token is a meaningful word, it stays on the norm layer, but without the slash (e.g., çarpma/ çarpıştı --> çarpma çarpıştı -- in separate boxes).
- if a canceled/repaired token is a word, but is not meaningful according to the context, we delete it on a norm layer (e.g., arı/ araba geliyordu --> araba geliyordu);
- if a canceled/repaired token is not a word (e.g., kö/ köpek havladı), we delete it on a norm layer (e.g., kö/ köpek havladı --> köpek havladı).
Assumed material
- if on a diplomatic layer, a word is placed in parentheses as assumed material, we leave the parentheses out and keep only the word itself (e.g., (muhtemelen) --> muhtemelen);
- if the assumed material is not a word (e.g., trafik kasağı), we try to guess the closest meaningful word and normalize it as trafik kazası.
- if the meaning of the assumed material cannot be guessed, we leave it empty on a norm layer.
- orthographic mistakes (especially in written data) are corrected (for example, qma --> ama, cüpike --> köpek).
Foreign Material
- Foreign material is normalized according to the rules of the language it is taken from.
Hesitation Markers
dipl layer | norm layer |
---|---|
e, ee, eee | e |
ı, ıı, ııı | e |
em, hm | e |
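The same collapse can be applied mechanically; a minimal sketch of the table above:

```python
# All Turkish filled-pause variants collapse to "e" on the norm layer.
TURKISH_HESITATIONS = {"e", "ee", "eee", "ı", "ıı", "ııı", "em", "hm"}

def normalize_hesitation_tr(token: str) -> str:
    return "e" if token.lower() in TURKISH_HESITATIONS else token

assert normalize_hesitation_tr("ııı") == "e"
```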
Punctuation
- Each punctuation mark is placed in a separate box.
- A triple-dot punctuation mark (...) is placed in one box.
Material which is not found in TDK (please update the list during the normalization)
dipl layer | norm layer |
---|---|
laylaylom/lay lay lom | laylaylom |
boooah | boah |
off | of |
Long Vowels & Consonants
- For long vowels & consonants (whether they are represented with a colon or with doubled letters), the words are spelled according to the orthographic norms (e.g., ya: --> ya)
Emojis
- Emojis are represented on a norm layer as they are on a dipl layer;
- One emoji is placed into one box.
Dipl[lang] Layer
- The code for Turkish is tur
- if we see UNKNOWN instead of tur on the dipl[lang] layer, we change it to tur manually.
- we delete the language code for pauses, non-verbal material (like laughing, coughing) and punctuation marks.
- we change the language code for foreign material, BUT only when the word has not been adopted into Turkish yet. E.g., hasar is still tur and tişört is tur, but bye bye is eng.
code | language |
---|---|
tur | Turkish |
deu | German |
eng | English |
ara | Arabic |
kur | Kurdish |
Dipl[line] and dipl[message] Layers
- disregard these layers.