RUEG Corpus Documentation

The RUEG corpus is a multi-layer corpus of both written and spoken language. We use several annotation formats in the process of annotation, but all annotations, except for the dependency annotations, are part of the EXMARaLDA file in the exb directory. In addition to the editable EXMARaLDA format, the corpus is also converted to the ANNIS format (annis directory) for search and visualization.

Most annotation layers depend on other annotations. This can lead to complex dependencies, as visualized by the following graph:

Graph of the dependencies between the annotations

DISCLAIMER: All sub-1.0 releases may feature inconsistencies in the formatting of the meta data values or show incompleteness of metadata.

In addition to the annotation layers, each document also has meta data fields, which are stored in the .meta file next to each EXMARaLDA file. The meta data is also included in the ANNIS format.

field name	type	description
speaker-id	String
formality	String	informal/formal
mode	String	spoken/written
speaker-bilingual	Boolean	yes/no
elicitation-session	Number	1 (monolinguals, bilinguals in first session) 2 (bilinguals in second session)
elicitation-language	String	Language that is elicited from the speaker
elicitation-country	String
elicitation-order	Number	1-8
elicitator-good-id	String	project- and people-number of "good cop"
elicitator-bad-id	String	project- and people-number of "bad cop"
elicitation-date	String	2018-XX-XX
transcriber-id	String	comma-separated list of project- and person-number XX-XX
normalizer-id	String	comma-separated list of project- and person-number XX-XX
annotator-id	String	comma-separated list of project- and person-number XX-XX
speaker-language-s	String	Languages as given by the participants and separated by comma
speaker-age-group	String	children/adolescents/adults
speaker-gender	String	m/f/d
speaker-age	Number	two-digit number year
speaker-AoO	Number	Age Of Onset in years (two-digits)
speaker-AoO-answer	Number	complete, but anonymized answer string
speaker-personality-score-X	Number	Personality score (1-7) for each of the questions 1-6 of the personality test
speaker-extravert-score	Number	aggregated extravert score
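Since sub-1.0 releases may contain inconsistently formatted values, a small consistency check over the fields with a closed value set can be helpful. The following is a minimal sketch, not part of the corpus tooling: it assumes the meta data of one document has already been loaded into a Python dict (the syntax of the .meta file is not described here), and the names `ALLOWED_VALUES` and `check_meta` are hypothetical.

```python
# Closed value sets copied from the field table above.
# Fields with open values (ids, dates, languages) are not checked here.
ALLOWED_VALUES = {
    "formality": {"informal", "formal"},
    "mode": {"spoken", "written"},
    "speaker-bilingual": {"yes", "no"},
    "speaker-age-group": {"children", "adolescents", "adults"},
    "speaker-gender": {"m", "f", "d"},
}


def check_meta(meta):
    """Return (field, value) pairs whose value is outside the documented set."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        if field in meta and meta[field] not in allowed:
            problems.append((field, meta[field]))
    return problems


# Hypothetical example document: "Spoken" is capitalized and therefore flagged.
example = {"formality": "formal", "mode": "Spoken", "speaker-gender": "f"}
print(check_meta(example))  # [('mode', 'Spoken')]
```

Missing fields are deliberately not reported, because the meta data of early releases may be incomplete.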

field name	type	automatically retrievable from questionnaire	description / comments
speaker-region-of-birth	String	text value only	This and the following meta key are retrieved as place of birth, from which you are supposed to extract the region (e. g. "Bavaria", "North Carolina", "Krasnoyarsk", "Aegean Islands", "Central Anatolia"; you might prefer the term federal state or province) and/or the country. For privacy reasons, please do not provide the city or any more detailed information.
speaker-country-of-birth	String	text value only	See above.
speaker-age-of-immigration	Number	yes	Age of arrival in country of majority language in years. Single digit. For the age in years and months, use one of the following options: for instance, for 3 years 6 months, you can write 3.5 or 3;6.
speaker-education-degree	String or Number	yes	Categorical values, provided by questionnaire. Please be careful with the adolescents: many of them selected high school as their highest degree completed but in fact they did not complete it yet. So we need to look at "grade. School year" to see if the adolescent is in high school or in college.
speaker-employment	String	yes	Categorical values, provided by questionnaire.
speaker-dialect-s	List of strings	yes	List of dialects spoken (comma-separated).
speaker-language-instructed-1	String	yes	A language the participant was instructed in. More languages possible (2, 3, ...).
speaker-language-instructed-1-duration	Number	no	Number of YEARS (other unit preferred?) the participant was instructed in language 1. As with the fields concerning age, you can write 3.5 or 3;6.
speaker-parent-1-...
speaker-parent-2-...
speaker-parent-3-...
speaker-parent-4-...
speaker-parent-1-name	String	yes	"Mother", "Father", "Sister", "Brother", etc. Capitalization does not matter, leave the words as they were originally written. Needs to be anonymized.
speaker-parent-1-country-of-birth	String	textual value	Please extract the name of the country from the given answer. Delete any more precise information.
speaker-parent-1-region-of-birth	String	textual value	Please extract the name of the region from the given answer. Delete any more precise information.
speaker-parent-1-degree	String or Number	yes	As above, highest degree, but for parent / adult.
speaker-parent-1-profession	String	yes	Profession of parent / adult.
speaker-parent-1-employment-institution	String	yes	Current employment (institution, category) of parent / adult. Might need anonymization. Note that sometimes only position or institution is derivable from the answer, so n/a should be used for unavailable meta values.
speaker-parent-1-employment-position	String	yes	Current employment (position, category) of parent / adult. Might need anonymization. Note that sometimes only position or institution is derivable from the answer, so n/a should be used for unavailable meta values.
speaker-parent-1-language-home-1	String	yes	Language spoken at home by parent / adult (to anybody). Capitalize the language!!
speaker-parent-1-language-home-2	String	yes	Language spoken at home by parent / adult (to anybody).
speaker-parent-1-language-home-3	String	yes	Language spoken at home by parent / adult (to anybody).
speaker-parent-1-dialect-s-home	List of strings	yes	Dialects spoken at home by parent / adult (to anybody). Capitalize the dialect!!
speaker-env-1-...			Those values are for adults in current environment, but also include the parents again.
speaker-env-2-...			Therefore we might not have to use all of them.
speaker-env-3-...			All meta fields for parents have to be repeated for adults in the environment.
speaker-env-4-...
speaker-shares-home-with-env-1	Boolean		Whether or not the speaker lives together with the respective adult in their environment.
speaker-shares-home-with-env-2	Boolean
speaker-shares-home-with-env-3	Boolean
speaker-shares-home-with-env-4	Boolean
speaker-frequency-of-visits	String	as text	How often the participant visits the country where the heritage language is spoken.
speaker-self-assessment-hl-oral-understanding	String or Number	yes	Self assessment by participant of oral understanding in heritage language.
speaker-self-assessment-hl-written-understanding	String or Number	yes	Self assessment by participant of understanding of written text in heritage language.
speaker-self-assessment-hl-oral-production	String or Number	yes	Self assessment by participant of oral production skills in heritage language.
speaker-self-assessment-hl-written-production	String or Number	yes	Self assessment by participant of written production in heritage language.
speaker-self-assessment-hl-native	Boolean	yes	Does the participant consider him-/herself a native speaker of the heritage language.
speaker-languages-used-regularly-to-env-1	List of strings	yes	Languages (comma-separated) the participant speaks regularly to adult 1 (environment).
speaker-languages-used-often-to-env-1	List of strings	yes	Languages (comma-separated) the participant speaks often to adult 1 (environment).
speaker-languages-used-rarely-to-env-1	List of strings	yes	Languages (comma-separated) the participant speaks rarely to adult 1 (environment).
...
speaker-languages-used-regularly-to-parent-1	List of strings	yes	Languages (comma-separated) the participant speaks regularly to parent 1.
speaker-languages-used-often-to-parent-1	List of strings	yes	Languages (comma-separated) the participant speaks often to parent 1.
speaker-languages-used-rarely-to-parent-1	List of strings	yes	Languages (comma-separated) the participant speaks rarely to parent 1.
...
parent-1-languages-used-regularly-to-speaker
parent-1-languages-used-often-to-speaker
parent-1-languages-used-rarely-to-speaker
env-1-languages-used-regularly-to-speaker
env-1-languages-used-often-to-speaker
env-1-languages-used-rarely-to-speaker
...
speaker-habits-video-consumption-hl	String	yes	How often does the participant consume videos in the heritage language. Can be delivered for other languages as well.
speaker-habits-audio-consumption-hl	String	yes	How often does the participant consume auditive media in the heritage language. Can be delivered for other languages as well.
speaker-habits-text-production-hl	String	yes	How frequently does the participant produce text in the heritage language. Can be delivered for other languages as well.
speaker-habits-text-consumption-hl	String	yes	How often does the participant read in the heritage language. Can be delivered for other languages as well.
speaker-habits-uses-native-script	String	as text	This needs some additional thought. A textual answer is delivered and we still need to think of a useful value set. P2: we are replacing unhelpful answers like "script", "keyboard", and "alphabet" with n/a.
speaker-habits-messenger	String	yes	Which text messenger does the participant mostly use.
speaker-habits-activities	String	yes	Which activities does the participant exercise. Needs privacy check, could maybe be dropped and only languages are kept.
speaker-habits-activity-language-s	String	yes	Languages used during those activities.
elicitation-ease-formal	Boolean	yes	Was it easy for the participant to imagine herself in the formal situation.
elicitation-ease-informal	Boolean	yes	Was it easy for the participant to imagine herself in the informal situation.
elicitation-issues-with-smartphone	Boolean	yes	Did the participant face any issues dealing with the smartphone during elicitation.
elicitation-issues-with-smartphone-text	Boolean	yes	Text answer describing the issues with the smartphone. Empty if no issues occurred.
speaker-habits-smartphone-type	String	yes	Not sure that is necessary, but we have it.
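The age and duration fields in the table above accept two notations for years and months (3.5 or 3;6). To compare such values, they can be converted to decimal years. This is only a sketch under the assumption that the part after the semicolon is a number of months; the function name `age_in_years` is hypothetical.

```python
def age_in_years(value):
    """Convert '3;6' (years;months) or '3.5' (decimal years) to a float."""
    text = value.strip()
    if ";" in text:
        # years;months notation, e.g. "3;6" = 3 years and 6 months
        years, months = text.split(";", 1)
        return int(years) + int(months) / 12
    # plain number, e.g. "3" or "3.5"
    return float(text)


print(age_in_years("3;6"))  # 3.5
print(age_in_years("3.5"))  # 3.5
```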

Value set: open

Segmentation and transcription of Communication Units. For spoken data, the start and the end of the CUs are manually aligned with the audio.

See the transcription guidelines for details.

#	type	step	output format
1	manual	Transcription	exb

Value set: open

Automatic tokenization of the text into words.

as defined by the TreeTagger tokenization script
extra handling for emojis and pauses

language specific abbreviations

#	type	step	output format
1	automatic	TreeTagger	exb

Value set: open

A common normalization that is the same for written and spoken data. This allows a search across registers.

segmented into graphemic words
emojis are a single word
text message acronyms are treated as a single word
punctuation is considered a token if not part of an emoji
following standard orthography
no word order corrections
no grammatical corrections

script is normalized to language standard
each language decided on
- orthographic standard
- clitics
- script

#	type	step	output format
1	automatic	Copy base text	exb
2	manual	Normalize	exb

Value set: open

Lemmatization based on the normalization (norm).

#	type	step	output format
1	automatic	lemmatization (part of the POS-tagging)	exb
2	manual	correction	exb

Value set: closed

Part of speech annotation using the Universal POS tags.

#	type	step	output format
1	automatic	Automatic POS tagging	exb

Value set: closed

Part of speech annotation with a tag-set for each language.

there is one common tag-set for each language
text message acronyms get their own tag manually (or if the tagger supports it, automatically)

Different tagsets are used for each language:

language	tag set	reference
English	British National Corpus / Claws 4	Leech et al. 1994¹
German	STTS 2.0	Westpfahl 2014²
Russian	MyStem tag set	Segalovich 2003³
Turkish	MULTILIT tag set	Schroeder et al. 2015⁴

#	type	step	output format
1	automatic	Automatic POS tagging with tool	exb
2	manual	correction	exb

Value set: closed

Describes the language.

per-token
ISO three letter language code
every token has this category assigned
no dialects

#	type	step	output format
1	automatic	Fill out default language	exb
2	manual	Mark foreign material	exb

message (Chat Message span)

Value set: natural numbers

Span annotation for each message in the chat. Contains its consecutive number.

line (Chat Message line)

Value set: open

Span annotation with the chat message text as content.

#	type	step	output format
1	automatic		exb

Value set: closed

Automatic Universal Dependency parsing.

#	type	step	output format
1	automatic	UD Parsing	CoNLL

¹ Leech, Geoffrey, Roger Garside, and Michael Bryant. 1994. “CLAWS4: The Tagging of the British National Corpus.” In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics. Vol. 1.

² Westpfahl, Swantje. 2014. “STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data.” In Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop, 1–10.

³ Segalovich, Ilya. 2003. “A Fast Morphological Algorithm with Unknown Word Guessing Induced by a Dictionary for a Web Search Engine.” In MLMTA, 273–80. Citeseer.

⁴ Schroeder, Christoph, Christin Schellhardt, Mehmet-Ali Akinci, Meral Dollnick, Ginesa Dux, Esin Işil Gülbeyaz, Anne Jähnert, et al. 2015. “MULTILIT.” Universität Potsdam. https://publishup.uni-potsdam.de/opus4-ubp/frontdoor/index/index/docId/8039.

Anonymisation on Speaker tier

Replace name of participant with the respective speaker code, e.g. USbi02FR
If whole names or surnames of friends are mentioned, replace with the participant code + _P, e.g. USbi02FR_P
Places that could lead to the identification of a participant, like streetnames, schoolnames, etc.:
- "Friedrichstraßezzz", you transcribe as "{streetname}Straße". "zzz" has to be replaced by any inflectional suffixes/erase it if there are no such suffixes
- "Apple Highschools" (with plural suffix), you transcribe as "{schoolname}schools".

Attention: There should be no spaces following the {...}.
Over time, every project should develop a list of these placeholders.
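The rule that no space may follow a {...} placeholder can be checked mechanically. The sketch below flags candidates for manual review; it is an illustration only, the function name is hypothetical, and it assumes placeholders never contain nested braces.

```python
import re

# A placeholder such as {streetname} that is directly followed by whitespace.
PLACEHOLDER_THEN_SPACE = re.compile(r"\{[^{}]+\}\s")


def placeholders_followed_by_space(text):
    """Return the placeholders in `text` that are followed by whitespace."""
    return [m.group(0).strip() for m in PLACEHOLDER_THEN_SPACE.finditer(text)]


print(placeholders_followed_by_space("in der {streetname} Straße"))  # ['{streetname}']
print(placeholders_followed_by_space("in der {streetname}Straße"))   # []
```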

Communication Unit (CU) is used as a segmentation unit
No punctuation marks at all on the transcription layer
No accents, no intonation patterns are marked
In Exmaralda: blank space at the end of each event (* no punctuation marks on norm layer)

Our decisions on CU segmentation can be found here: Decisions CU Segmentation

No capital letters
Abbreviations/acronyms are transcribed as full words in the phonology of the language heard in the recording (e.g. German BMW = beemwe, English BMW = beemdoubleyou)
speaker codes need to be partly capitalized to follow their correct pattern

Adapted from KiDKo2014

'unwanted' material are questions of participants concerning the procedure and eventual responses from the elicitator
first, figure out if you can exclude this kind of data with 'unwanted' material and repeat the elicitation
If this is not possible, mark those passages as: <Q> communication with elicitor </Q>
they get an extra-event

Merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements
Examples: so=ne (= so eine)

reduced syllables are transcribed as articulated
Examples: goin (= going), bi tane (= bir tane), hab ich ein Tadel bekommen (= einen Tadel)

Do not leave anything out and do not add anything which is not there!
Use / to mark unfinished words, e.g. "The bl/ blue car crashes um stops"
word internal cancellations/corrections are transcribed as follows: dipl: "be$ha$ come" (norm: "become")
Onomatopoeias/echoisms are separated tokens (e.g. gutschi gutschi gutschi), only transcribed as one single token if they are very short (e.g. eieiei)

always measured to the first decimal
0.2 - 1 sec: (-)
1 - 3 secs: (--)
More than 3 secs: (5.5) to be measured
Word-internal pauses are marked as follows: be(-)have ¹
keep in mind that there might be persons who talk really slowly (makes no sense to put a pause after every word/token)
pauses inside a CU do not get an extra-event on the CU tier
pauses which occur between two CUs get an extra-event on the CU tier
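The pause thresholds above can be written down as a small helper. This is a sketch only: the name `pause_marker` is hypothetical, a pause of exactly 1 or 3 seconds is assigned to the shorter category (following "pauses up to 1sec" in the symbol overview), and pauses under 0.2 seconds are assumed to remain unmarked.

```python
def pause_marker(seconds):
    """Return the transcription symbol for a pause of the given length."""
    duration = round(seconds, 1)  # pauses are measured to the first decimal
    if duration < 0.2:
        return ""  # assumption: shorter pauses are not transcribed
    if duration <= 1.0:
        return "(-)"
    if duration <= 3.0:
        return "(--)"
    return f"({duration:.1f})"  # longer pauses carry their measured length


print(pause_marker(0.5))  # (-)
print(pause_marker(2.0))  # (--)
print(pause_marker(5.5))  # (5.5)
```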

vowels realized longer than normal (0.2 - 2sec) are marked with : (e.g. so lo:ng)
vowels that are realized longer than 2 seconds are marked with :: (e.g. so lo::ng)
also possible for consonants (e.g. mum:)
doubling of vowel syllables with % (e.g. by%ye, tschü%üß)

non-verbal events such as a participant laughing or coughing are noted in square brackets on the transcription tier, e.g. [laughing]
if participants speak and laugh at the same time, you note it as: [[laughing]speech]

List of meta comments used in the RUEG project

[coughing]
[gulping]
[laughing]
[pfing] for a sound like "pfff"
[sighing]
[throatclearing]
[tongueclicking], including tsking as disapproval, while thinking and just mouth opening with a click
[whispering]
[stuttering]
[imitating], for when they imitate a sound related to the story (e.g., car crash)
[sniffing]

uninterpretable material is to be marked as (UNK) on Speaker-tier
longer than 2secs: (UNK, 2.1)
assumed content is placed in between brackets, each token separated: (assumed) (content)
if the uninterpretable material can be identified as belonging to a CU, there is no separate event on the CU tier for it

For every language, we define a set of hesitation markers/interjections/reception markers
create a list with those markers
If heritage speakers use particles from their ‘other’ language, we transcribe them as they sound, consistent with the procedure on foreign language material²

Choose a spelling for each item following one of those options:

transcribe phonographically (e.g. engl. like = germ. leik) OR
use orthographic spelling of the "other" language

Create a list where you document the spelling of each item in alphabetic order
put the file name that includes the word and the time of the appearance in the list
each time you encounter foreign language material in your data, check the list to guarantee a consistent form for those items³

Keep conventionalized spelling (e.g. Renault = renault)
document your decisions, create a list with those items
Language-specific decisions:
- Russian: put it in the spelling and script of the actually spoken language to avoid loss of phonetic/morphological/syntactic information
- Turkish and Greek: use Latin alphabet and conventionalized spelling

Symbols	Meaning
`<Q> communication with elicitor </Q>`	instances of questions concerning the procedure and/or verbal interventions of elicitators
(-)	pauses up to 1sec
(--)	pauses 1-3secs
(3.2)	pauses longer than 3secs
(UNK)	uninterpretable material
(UNK, 2.2)	uninterpretable material longer than 2secs
(assumption)	assumed material
[...]	non-verbal material
[[...]...]	non-verbal & verbal event
:	unusually long vowel or consonant (under 2secs)
::	unusually long vowel or consonant (longer than 2secs)
=	merged forms
/	interruption of a word
$...$	word internal cancellations
%	doubled syllables
{...}	specification of an anonymised place

Our corpus pipeline faces 4 types of input for each speaker that participated in the elicitation:

formal written (fw) text files
informal written (iw) chat exports
formal spoken (fs) transcriptions
informal spoken (is) transcriptions
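The four register codes combine the values of the meta data fields formality and mode. A minimal sketch of that correspondence follows; the names `REGISTERS` and `register_meta` are hypothetical and not part of the pipeline.

```python
# Register code -> (formality, mode), following the list above.
REGISTERS = {
    "fw": ("formal", "written"),
    "iw": ("informal", "written"),
    "fs": ("formal", "spoken"),
    "is": ("informal", "spoken"),
}


def register_meta(code):
    """Return the formality and mode meta data values for a register code."""
    formality, mode = REGISTERS[code]
    return {"formality": formality, "mode": mode}


print(register_meta("iw"))  # {'formality': 'informal', 'mode': 'written'}
```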

All data types are divided into above-word-level ("phrasal") segments:

fw: sentences indicated by punctuation used by the participant
iw: messages and lines indicated by separators used by the participant (punctuation, new line, message separation)
fs and is: communication units (CUs) indicated by transcriber (some projects might use intonation phrases additionally)

We annotate our CUs on the annotation tier "cu" (in lowercase letters). These segments are then automatically tokenized (roughly speaking, split into character sequences separated by whitespace), which provides our first word-level segmentation: the diplomatic layer, consisting of graphemic words represented in the phrasal segments.

This word-level segmentation is then to be normalized to facilitate search and automatic annotation.

The result of normalization is an additional word-level segmentation layer, similar to dipl. Starting at the diplomatic tokenization, explicit rules are to be applied to obtain orthographically normalized tokens on the norm layer. Starting off as a copy of the diplomatic token layer, the normalized tokens can (but do not necessarily need to) undergo operations such as editing, deletion, split or merge.
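One way to picture the relation between the two layers is as a list of aligned pairs, each linking one or more diplomatic tokens to zero or more normalized tokens. The sketch below is illustrative only and does not describe how EXMARaLDA stores the tiers; `classify` and `alignment` are hypothetical names, and "copy" is used here for tokens that stay unchanged.

```python
def classify(dipl, norm):
    """Name the operation that maps a group of dipl tokens to norm tokens."""
    if not norm:
        return "deletion"
    if len(dipl) == 1 and len(norm) > 1:
        return "split"
    if len(dipl) > 1 and len(norm) == 1:
        return "merge"
    if dipl == norm:
        return "copy"
    return "edit"


# what / cha / gonna / do -> what / you / going / to / do
alignment = [
    (["what"], ["what"]),
    (["cha"], ["you"]),
    (["gonna"], ["going", "to"]),
    (["do"], ["do"]),
]
print([classify(d, n) for d, n in alignment])  # ['copy', 'edit', 'split', 'copy']
```

Because every pair keeps its diplomatic side, the original form stays retrievable regardless of the operation applied.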

Our normalization is not the last step of analysis and does not remove information from the overall corpus
Normalization takes place mainly on word-level to obtain standardized forms w.r.t. an orthography of reference
We will agree on an orthography of reference for each language in our corpus
The normalized layer uses the language's native script, whereas the diplomatic layer sticks to the script used by the participants for the written registers (fw, iw)
We do not normalize syntax. This way we are able to learn about an underlying grammar more easily rather than analysing the elicited language in terms of a standard grammar. Note that this comes with disadvantages but is more appropriate for our projects' idea. We want to obtain a layer that is orthographically normalized but still allows for syntactic analyses. This is why we do not normalize syntax and try to keep as many elements as possible.
Normalization is also a technically necessary process. Orthographically standardized forms help all researchers, those not familiar with our data included, to search linguistic phenomena. Furthermore, annotation to be integrated in our corpus can be pre-generated automatically and efforts of correction are reduced.
The highest priority is to ensure that we will always be able to answer our research questions.
We do not alter lexical choice when normalizing.
The norm layer will enable us to identify grammatical units.
A normalization decision is always influenced by the question of how controversial and/or lossy in terms of information that decision might be.
Since this is our first layer of normalization, we normalize very carefully.
There will be no grammatical normalizations (adaption of cases or gender or other grammatical features that deviate from our understanding of "standard")
We normalize to a single standardized form (if possible) to reduce noise in the tokenization
We always normalize as far as we can go, which means we are more careful in some cases compared to others. As an example, 're stays 're in the normalization, because the tense is unclear, whereas 'll can be normalized to will, since it is entirely uncontroversial what the diplomatic form expresses.
When normalizing, we try to be as uninterpretative as possible.
1:1 mappings from diplomatic to normalized tokens are easier to decide on, because the alteration in the form is easy to trace.
The standardized representation should be influenced by the underlying orthography explicitly and implicitly, e. g. for German clitics we do not use an apostrophe in the separated token, whereas we do for English.
Normalization should not alter meaning (if possible).

language	orthography of reference	reference
English	American English	Merriam-Webster
German	„Neue deutsche Rechtschreibung“	Amtliche Regelung der dt. Rechtschreibung, Duden
Greek	Standard Modern Greek	Λεξικό της κοινής νεοελληνικής and David Holton, Peter Mackridge, Irene Philippaki-Warburton (1997) Greek: A Comprehensive Grammar of the Modern Language, London: Routledge
Russian	современный русский литературный язык	Малый академический словарь (МАС) https://rus-academic-dict.slovaronline.com/
Turkish	Turkish / Modern Standard Turkish (ISO 639-3 = tur)	Türk Dil Kurumu

For the written registers (iw, fw), we currently lack a syntactic unit. Furthermore, CUs as segmentation units were, amongst other reasons, chosen due to their applicability for spoken and written data. Having CUs in all types of texts facilitates cross-mode / cross-register comparisons.

To do this in EXMARaLDA, you first need to create the respective annotation tier. After having opened your file in EXMARaLDA, add a new tier through the menu bar (Tier > Add Tier). Choose the following settings:

Speaker: dipl [dipl]
Type: A(nnotation)
Category: cu

Because it depends on the diplomatic tokenization, the tier has one timeslot per event on dipl. To annotate a CU, merge all timeslots in the CU tier that overlap the tokens (dipl) belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "cu" to the newly created span.

The CU tier does not explicitly contain any text. By overlapping its respective tokens, the entire text of a CU is already defined and can be searched.

Please do not edit the CU layer of the transcriptions. They are final.

For each dipl token there is a language annotation. When preparing your data, this is automatically set to the language of elicitation. Nevertheless, in some cases you might have to adapt that value. Please use the ISO 639-2 language code. A list of valid codes can be found here.

For cancellations, pauses, emojis and non-verbal material, please delete the language annotation.

The rules are always applied on the normalized tokenization. The diplomatic tokenization remains as is.

The following lists and examples are planned to be extended and updated with more examples, especially from Greek, Russian and Turkish.

Notation: Normalization rules are formulated as diplomatic token(s) → normalized token(s). Token boundaries are represented with a slash (/).
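The notation can be read mechanically, which may help when collecting rules and examples in a list. The following sketch assumes that token boundaries are always written as a slash surrounded by spaces, so that the word-final slash of a cancelled token (e.g. acc/) is not mistaken for a boundary; `parse_rule` is a hypothetical helper.

```python
def parse_rule(rule):
    """Split 'dipl → norm' into two token lists; ' / ' marks token boundaries."""
    dipl_side, norm_side = rule.split("→")

    def tokens(side):
        # Empty strings are dropped, so an empty norm side yields no tokens.
        return [t.strip() for t in side.strip().split(" / ") if t.strip()]

    return tokens(dipl_side), tokens(norm_side)


print(parse_rule("wanna → want / to"))  # (['wanna'], ['want', 'to'])
print(parse_rule("car / acc/ → car"))   # (['car', 'acc/'], ['car'])
print(parse_rule("Autounf/ →"))         # (['Autounf/'], [])
```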

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comments / Rationale / Criticism
(CU-initial) capitalization	do not correct. Only adapt if a word itself has to be capitalized. Do not capitalize merely because of a CU-initial position	am / i / in / london → am / I / in / London	das / auto / konnte / nicht / mehr / bremsen → das / Auto / konnte / nicht / mehr / bremsen

Note: Rule of thumb: Capitalize, if the word needs to be capitalized in the orthography of reference. Only introduce capitalization if it is the correct spelling independent of whether it appears in an initial position or not. Also, do not undo CU-initial capitalization if provided by writer.

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comments / Rationale / Criticism
orthographic errors or phonetic assimilations / slip of the tongue	normalize orthography	orthographie → orthography	hunt → Hund fümf → fünf	~	¹
orthographic errors or phonetic assimilations / slip of the tongue (within vocabulary)	correct to target hypothesis if meaning of corrected form is much more likely AND the normalization implies a change in category	I / was / their → I / was / there	ihr / seit / dort → ihr / seid / dort Umfall → Unfall	~	²
written data: composed forms	refer to above mentioned reference and normalize	inter-national → international caraccident → car / accident	Auto / Unfall → Autounfall

¹ This is the prototypical case of normalization.

² Both corrections are much more likely, since it is very hard to derive any meaning from the uncorrected examples, considering they were meant to be this way. Additionally, since it is 1-1 normalization in terms of words, normalization is not particularly invasive. Last but not least, since the elicitations take place in a more or less closed setting, it is much easier to make a judgement on the likelihood of an expression in a case of potential within-vocabulary errors.

Phenomenon	What to do on norm level?	English example	German example	Russian example
missing words	do not add them, we do not normalize syntax	you / there / ? → you / there / ?	bin / unterwegs → bin / unterwegs

Phenomenon	What to do on norm level?	English example	German example	Russian example
repetitions	keep, normalize orthography	the / the / cAr → the / the / car	ein / ein / Autounfall

Phenomenon	What to do on norm level?	English example	German example	Russian example
self-corrections	see repetitions / cancellations	the / hatchback / (eh) / car → the / hatchback / car (filled pause deleted, rest remains)	ein / Typ / ein / Fußgänger / hat / dann remains as is, whereas ein / Fußg/ / ein / Typ → ein / ein / Typ, since Fußg/ was explicitly marked as cancelled by the transcriber

Note: The proposed procedure for self-corrections is the least theory-driven, thus, avoids making any grammatical assumptions when normalizing. Nevertheless, or therefore, it also might be very controversial. It definitely does not facilitate later syntactic annotation.

Phenomenon	What to do on norm level?	English example	German example	Russian example
spoken data: cancellations³	drop the incomplete element	car / acc/ → car	Autounf/ →

Note: Sometimes it is difficult to tell whether something is a cancellation or not, because the incomplete part might already be a word on its own. In these cases we rely on the transcriber's decision: If something is marked as cancellation in the transcript, we stick to that interpretation. Nevertheless, that should not keep your project from revising transcriptions (given the current state of corpus creation still allows for such revisions).
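Since the decision rests entirely on the transcriber's marking, dropping cancellations amounts to removing every token that ends in a slash. A minimal sketch, with the hypothetical name `drop_cancellations`:

```python
def drop_cancellations(dipl_tokens):
    """Remove tokens that the transcriber marked as cancelled with a final '/'."""
    return [tok for tok in dipl_tokens if not tok.endswith("/")]


print(drop_cancellations(["car", "acc/"]))                # ['car']
print(drop_cancellations(["ein", "Fußg/", "ein", "Typ"]))  # ['ein', 'ein', 'Typ']
```

Unmarked material such as ein / Typ / ein / Fußgänger passes through unchanged, in line with the rule for self-corrections.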

Phenomenon	What to do on norm level?	English example	German example	Russian example
spoken data: phonetic markers	remove and normalize to orthography	a::nd → and	und / da::nn → und / dann
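Removing the lengthening markers can be pre-generated automatically before manual normalization. The sketch below only handles the colon markers and is an assumption about how such a step could look, not the project's tooling; it strips colons that directly follow a letter, so it would also affect tokens such as times written with a colon and should be followed by manual checking.

```python
import re


def strip_lengthening(token):
    """Remove ':' and '::' lengthening marks that directly follow a character."""
    return re.sub(r"(?<=\w):+", "", token)


print(strip_lengthening("a::nd"))   # and
print(strip_lengthening("da::nn"))  # dann
print(strip_lengthening("mum:"))    # mum
```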

Phenomenon	What to do on norm level?	English example	German example	Russian example
reduced forms	if reconstructable normalize orthographically⁴	ca → car	nich → nicht wolln → wollen

⁴

Note: reduction has to be out of vocabulary. If not, check with the guidelines on within-vocabulary errors above.

Note: There might be (even in the dictionary) widely accepted variants of standardized writings that are more or less the result of reduction (or other processes). Examples are cross instead of across in English, grade instead of gerade in German. We need to focus on the goal of normalization: To have a text segmentation where orthographic variation is reduced / removed, i. e. where multiple expressions are mapped to one standard representation. Thus, as long as there is no argument that normalizing an item alters meaning, these reductions should be normalized. We only deal with mappings from one token to one token, i. e. retrieving the originally chosen expression is very easy and we do not lose information. A counterexample are cases of drauf in German. As a pronominal adverb it is often a reduction of darauf and should be normalized as that. Nevertheless, it can also be the separated particle of the particle verb draufgehen (to die, fig.) and normalizing it to darauf would not be appropriate/"correct". When normalizing, we will most likely encounter much more subtle cases of difference in meaning. We need to discuss and document our decisions.

Phenomenon	What to do on norm level?	English example	German example	Russian example
colloquial forms, dialect	normalize	coz → because rulz → rules cha → you	nix → nichts ick → ich jetze → jetzt itzt → jetzt

Phenomenon	What to do on norm level?	English example	German example	Russian example
merged forms	split into normalized items as long as both items are visibly represented in the merged form	wanna → want / to gonna → going / to ⁵	kannste → kannst / du ⁶

⁵

Note: regarding (1) and (2)
what / cha / gonna / do → what / you / going / to / do

⁶

Note: kannst / mir / mal / helfen
remains as is, since there is no overt material for du without assuming a reduction of kannst, i. e. this is a case of missing material, not merge (see above)

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comment
merges / clitics	separate and normalize, if possible	isnt → is / n't withe / ball → with / e / ball he's / guilty → he / 's / guilty	binsch → bin / ich

Note: We normalize a clitic if there is no ambiguity about the standard form. If the standard form cannot be determined unambiguously, the clitic is represented as a clitic, but in a standardized way (see table below for examples).

We always separate the clitics, i. e. divide a single diplomatic token in two norm tokens. See the following list of examples:

Diplomatic	Normalized
n't	not
'll	will
's	's
're	're
isn't or isnt	is / not
aren't or arent (or even arnt)	are / not
cannot or can't or cant	can / not
won't or wont	will / not
didn't or didnt	did / not
don't or dont	do / not
doesn't or doesnt (or even dosnt)	does / not
ain't or aint	ai / not
couldn't or couldnt	could / not
shouldn't or shouldnt	should / not
wouldn't or wouldnt	would / not
she'll	she / will
he's	he / 's
you're	you / 're
Thomas' / computer	Thomas / ' / computer
Anke's / mail or Ankes / mail	Anke / 's / mail
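
The list above can be read as a lookup from one diplomatic token to a sequence of norm tokens. The following is a minimal sketch of that idea, not part of the RUEG pipeline: the names `CLITIC_NORM` and `normalize_clitics` are hypothetical, only a few rows are included, and the lowercase lookup is an assumption made for illustration.

```python
# Hypothetical lookup table built from a few rows of the list above.
CLITIC_NORM = {
    "isn't": ["is", "not"],
    "isnt": ["is", "not"],
    "can't": ["can", "not"],
    "cannot": ["can", "not"],
    "won't": ["will", "not"],
    "ain't": ["ai", "not"],
    "she'll": ["she", "will"],
    "he's": ["he", "'s"],
    "you're": ["you", "'re"],
}


def normalize_clitics(dipl_tokens):
    """Pair each diplomatic token with its list of norm tokens.

    Tokens without an entry are passed through unchanged.
    """
    return [(tok, CLITIC_NORM.get(tok.lower(), [tok])) for tok in dipl_tokens]


pairs = normalize_clitics(["he's", "guilty"])
```

Because the diplomatic token is kept next to its norm tokens, the original form stays retrievable. Ambiguous forms (e.g. 's as is, has or a possessive marker) are left as standardized clitics, in line with the note above.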

Separate clitics analogous to English clitics, but do not use an apostrophe.

Represent them according to the chosen orthographic standard.

Concatenated morphological elements can remain a unit, since this is part of the language. If you prefer to separate them anyway, make sure you work consistently.

Phenomenon	What to do on norm level?	English example	German example	Russian example
lexicalized merged forms	+ do not normalize + work with list? + refer to orthography reference?	Rock'n'Roll → Rock'n'Roll	im → im zum → zum beim → beim ins → ins

Phenomenon	What to do on norm level?	English example	German example	Russian example
non-standard variation within a paradigm	normalize	$\dots$	helf / mir → hilf / mir

Note: Non-standardized but maybe widely established and accepted forms can be understood as alternative spellings. For alternative spellings our rules guide us to choose a standard and always use that to guarantee consistency. Therefore, normalizing to the true standard form is a good way to go. It is also a safe procedure, since we are dealing with single word to single word mappings, i. e. the original diplomatic form is always retrievable.

Phenomenon	What to do on norm level?	English example	German example	Russian example
speaker codes	do not normalize

Phenomenon	What to do on norm level?	English example	German example	Russian example
foreign material / proper names in a foreign language	normalize according to agreement		leik → like

Note: The transcription of foreign material is always phonographic. The normalization follows the conventions of the respective language. For example, the phonographic leik (dipl) might be normalized as like (en) on the German norm layer, whereas for a Russian speaker who says the German word Autowerkstatt in a Russian context, both the dipl and the norm form will be phonographic representations (аутоверкштат).

Also: Remember to adapt the language annotation for these cases.

Phenomenon	What to do on norm level?	English example	German example	Russian example
Abbreviations	(merge and) normalize according to standard orthography		pekawe → PKW

Phenomenon	What to do on norm level?	English example	German example	Russian example
cardinal numbers	normalize according to regulations in orthography of reference

Note: In the written data, it is enough to correct the spelling of numeral words, there is no need to convert actual numbers to words or vice versa.

Phenomenon	What to do on norm level?	English example	German example	Russian example
floating point numbers	normalize as numbers	one / point / three → 1.3	eins / Komma / drei → 1,3

Phenomenon	What to do on norm level?	English example	German example	Russian example
times and dates	are a composition of elements for which there are normalization rules to be applied individually

Phenomenon	What to do on norm level?	English example	German example	Russian example
fractions	normalize/keep as words	one / third	ein / Drittel

Phenomenon	What to do on norm level?	English example	German example	Russian example
pauses, noise, turn-holders, laughing, ...	not represented, i. e. delete token

Phenomenon	What to do on norm level?	English example	German example	Russian example
filled pause vs. particle, etc.	If you decide to consider certain filled pauses lexical units, decide on a standardized representation and represent that unit always the same way in the normalization layer. If you are currently undecided on the status of such an element and this is actually part of your research, delete that element on the normalization layer.

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comment
interaction/communication with elicitor	not represented	...	...	There is a transcription standard for these events. These events will not be normalized. Delete these events on the norm layer.

Phenomenon	What to do on norm level?	English example	German example	Russian example
incomprehensible, no hypothesis (unknown material)	not represented, i. e. delete token on norm	(UNK)	(UNK)

Phenomenon	What to do on norm level?	English example	German example	Russian example
incomprehensible with hypothesis	hypothesis is represented in a normalized version	(car) → car	(jetze) → jetzt

Phenomenon	What to do on norm level?	English example	German example	Russian example
punctuation	do not add nor normalize, but separate on both dipl and norm	accident.Then / the → accident / . / Then / the	,dass → , / dass

NOTE: This is a phenomenon you may face in written data. Since there is no obligation for authors to stick to pure typing-conventions and use spaces after punctuation, we will treat cases like the above as tokenization errors caused by the pipeline. To remain within a certain frame of complexity we cannot resolve these issues within automatic tokenization. Please perform the separation during normalization on both - dipl and norm. Each element of punctuation should be its own token, unless it is a sequence of punctuation characters without whitespaces in between. In these cases the characters remain a unit (e.g. "..." stays like this on dipl and norm).
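
The separation rule in the note above can be sketched with a regular expression. This is only an illustration under stated assumptions: the function name `split_punctuation` and the chosen set of punctuation characters are hypothetical, and the sketch does not replace the manual separation during normalization.

```python
import re

# A run of punctuation characters without whitespace stays one token.
# The character set is an assumption chosen for this sketch.
_PUNCT_RUN = re.compile(r"([.,;:!?]+)")


def split_punctuation(token):
    """Split a wrongly merged token into words and punctuation runs."""
    parts = _PUNCT_RUN.split(token)
    # re.split yields empty strings at the edges; drop them.
    return [part for part in parts if part]


tokens = split_punctuation("accident.Then")
```

Note that such a rule would also split decimal numbers, conventionalised abbreviations and emoticons, so these would need to be exempted before applying it to real data.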

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comment
multiple representations allowed in orthography of reference	Make a decision, document it and stay consistent.		Albtraum vs. Alptraum	~	This also implies, that if a participant uses the alternative form you should normalize to the variant you chose as a standard in the corpus.

If there is a mismatch in case, gender or any other grammatical property/feature, do not normalize it. Such mismatches must remain open to investigation. You might, of course, find non-trivial cases on the border between orthographic normalization and an actual grammatical target hypothesis. Document these cases and raise a discussion if necessary.

we prepare lemma, pos and feature annotations for you (automatically)
you will correct these annotations in EXMARaLDA
you will find the data in the same subfolder of exb/ as during normalization

If you ever face the following error message when opening an EXMARaLDA file:

Tier ... is not stratified. Please choose a method for stratifying the tier:

Choose Stratify by deletion.

binds several occurrences within the corpus to a common type
facilitates / enables search for lexical items
lemma: standardized form which might occur in different realizations / variants
lemma usually (but not necessarily!) determined by paradigms, i. e.:

inflected form → prototype = a single inflected form within paradigm or uninflected form = lemma

lemmatization varies across annotation schemes and languages
lemmatization is linked to part of speech

we would like to categorize several lemmata to more abstract categories
factors for determining a category can be:
syntactic / distributional
semantic
morphological
graphical
...

part of speech is the baseline for many further annotations
we need correct annotations
we will measure agreement
highly connected to lemmatization, thus underlies the same restrictions and parameters of variation

Always trust the guidelines more than your “grammatical intuition”, but in cases of doubt consult both.

language specific: you might have to find new ways / rules for undescribed phenomena (Please document!)
UD: strictly stick to UD guidelines for your language and please do not decide by what seems more logical to you

English	German	Greek	Russian	Turkish
British National Corpus Part of Speech Tagset	STTS 2.0	Universal POS tags	MyStem Morphology	MULTILIT

Universal Dependencies POS-tags, lemmas, and features

English	German	Greek	Russian	Turkish
correct BNC-POS, lemma, features (?)	correct STTS-POS, lemma, features (?)	correct UD-POS, UD features, lemma	correct MyStem-POS, lemma, features, and UD-POS	correct MULTILIT-POS, lemma, features (?), and UD-POS (?)

English	German	Greek	Russian	Turkish
derivable	derivable	needs manual correction	needs manual correction (?)	derivable (?)

delete tier norm [CU]
feel free to move the tiers up and down
rename pos_1_1 to pos_lang
look out for errors in the data and report them immediately

lemma "F16" is tagged as a proper noun with the respective tag from the specific tagset you are using (e.g. "PROPN" from Universal Postags)
all kinds of greetings should be treated as interjections and tagged with the respective tag from the specific tagset you are using (e.g. "INTJ" from Universal Postags), unless there is a specific tag for greetings in the language specific tagset you are using
regarding the lemmatization of informal greetings: you can just copy the word form from the norm/dipl-layer

Each language has additional transcription and annotation decisions. These are documented in the following sub-sections. We apologize for any parts of the documentation that are not yet available in English; these will be translated and included in future releases of the corpus.

Transcription decisions: English

Spelling lists for CU Layer.

1. Compounds with hyphens

rear-end/rear-ended > rearend/rearended (if it is a verb, but if it is a noun, e.g. "rear end of the car", insert space)

2. Hesitation markers

ɑ > ah
ɛ > eh
ɹ [syllabic] > er
oʊ > oh
ə > uh
əm/ʌm > um
em
oh
ha
mm

3. Merged forms

couldn't > couldn=t
didn't > didn=t
doesn't > doesn=t
don't > don=t
I'd > i=d
I'm > i=m
it'd > it=d
shouldn't > shouldn=t
that's > that=s
there's > there=s
they're > they=re
Unclear determiner ["a" versus "the"] > (det) TBD
wasn't > wasn=t
we'd > we=d
what's > what=s
who's > who=s
you're > you=re
guy's ball -> guys, guy's (guy is) running -> guy=s
The following are lexicalized forms that appear in the dictionary: gonna, kinda, wanna.
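
The merged-form spellings listed above follow a regular pattern: lower case, with the apostrophe replaced by an equal sign. A minimal sketch of that pattern, assuming a hypothetical helper name `mark_merged_form`, could look as follows.

```python
def mark_merged_form(token):
    """Rewrite an apostrophe contraction in the CU-layer spelling (I'd -> i=d)."""
    return token.lower().replace("'", "=")


example = mark_merged_form("I'd")
```

The helper must not be applied to possessives (guy's ball is written guys) or to lexicalized forms such as gonna, kinda and wanna, which contain no apostrophe and stay as they are.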

COMMENT: How do we handle ambiguous cases with determiners e.g. with a vs. with the suggestion: with=e

4. Dropped final stops

dropped final stop for 'and' > an(d)
dropped final stop for past tense verbs, i.e. 'happened' > happen(ed)

5. Notations

[coughing]
[gulping]
[laughing]
[pfing] for a sound like "pfff"
[tongueclicking], including tsking as disapproval, while thinking and just mouth opening with a click
[throatclearing]
[whispering]
[sighing]
[sniffing]
[imitating] - when a person imitates a sound of a crash or other sounds related to the story
[yawning]

6. Other

basketball
ɛf > ef (i.e., case number ef sixteen)
soccer ball
renault
volkswagen
aysap (for a.s.a.p. 'as soon as possible', if pronounced 'ay-sap', not 'ay ess ay pee')

7. Foreign language material

hello - preevyet (привет)
Olya (name), not Olia

ef > f
pe:em > pm
be:em:doubleyou > bmw
aysap > asap

ef sixteen > F16 (one token)
nine | one | one > 911 (one token)
-_- emoticons to be left as one token
consecutive punctuation such as "..." are to be left as one token, unless they are already separated on dipl layer.
twelve > twelve

preevyet (hello) > privet

sec > second
prob > "probably" or "problem", depending on the context

I
Renault, Volkswagen, Golf

rear end (if noun), rear-end (if adjective or verb) e.g. "rear-end collision", "rear-ended", "bumped into the rear end"

BNC: Tag List

Might be less compatible with American English spellings
Greater number of tags for accuracy; the tags are highly specific, though not all are necessary for our purposes (i.e., four categories for punctuation). Regardless, researchers searching for broader categories in the corpus should be able to do so by filtering the data appropriately.
Intuitive tag names
Multiple codes for determiners

Hi/Hello/Hey : ITJ (Interjection)
F16: NP0 (proper noun)
I : PNP
am -> be: VBB
like: ITJ (interjection)
okay (ie. 'he is okay'): AJ0
kind (of): AV0
e (det): AT0
same: AJ0
as: CJS
(in) front: PRP
behind: PRP
Police: NN0
911: NP0 (proper noun)
no (AT0) one (PNI)
as (PRP) well (AV0)
"ish" should be removed during lemmatization (i.e. "smallish" --> "small")

(in German)

After every interval there must be a space!

speaker (named according to Sigle, e.g. DEmo09FD; type: transcription)
possibly Comment (communication between transcribers; type: description; information about this level disappears later)

segmentation according to communication units (CU) more information here
- rule of thumb: independent sentences represent a CU. Dependent sentences (verb at the end) represent a CU with the respective independent sentence.
- written data files: the subject's punctuation is taken into account. If a dependent sentence is distinguished from the independent part by the placing of a dot, the dependent sentence is considered a CU.
no accentuation
no punctuation

linguistic material that refers to the task such as inquiries from subjects and possibly interruptions by researchers is marked in the following way: <Q> communication with elicitor </Q> and receives an own event.
strict use of small initial letters
predominantly orthographic transcription according to German spelling regulations
BUT: transcribe according to ear in cases that do NOT comply with the general standard of spoken language! Don't leave out or add anything.
- this includes for instance: nich vs. nicht, kein vs. keinen, jetz vs. jetzt, n vs. ein, is vs. ist, ne vs. eine, ma vs. mal, was vs. etwas, brauch vs. braucht, rum vs. herum, ham vs. haben, isch vs. ich
- when the complete ending is omitted (-en and not only Schwa), e.g. gesprung, angefang, dein
- dialectal, sociolectal and ethnolectal variation is put into writing (not extremely precise)
standard phenomena of spoken language that will NOT be transcribed but will follow German orthography are:
- omission of vowels in final syllable / deletion of Schwa: always written orthographically (sehn = sehen, machn = machen)
- devoicing in final position (e.g. bald not balt)
- word final er (e.g. koffer not koffa)
- word final g as ch (e.g. König not könich)
- ä stays ä (e.g. später not speta)
- sch in connection with sp/st is noted as sp/st
- simplification from pf to f is not noted
- v and f are used according to spelling regulations
- diphthongs are used according to spelling regulations
- individual words: eigentlich, irgendwie
the metalanguage is English (everything that is not produced by speakers, e.g. non-verbal)

A list of individual choices regarding spelling can be found here

Don't leave anything out that is spoken. Don't add anything that is not spoken.
no apostrophes as signs of omission

onomatopoeia/echoisms:

individual tokens (e.g. gutschi gutschi gutschi)
unless they are very short - then together (e.g. eieiei)

sequences, repetitions, discontinuities

transcribe as audibly perceived, don't leave anything out
mark the discontinuity of words with /
word internal discontinuity shall be marked such that the discontinued element is bracketed by dollar signs, e.g. dipl: Kinder$gar$ wagen

fusional forms

fusional forms (reduction + quick addition) are marked by an equal sign (e.g. is=er, ham=se, so=ne)
preposition + article fusions are not additionally marked, but are written together. We do not differentiate between different forms of prep + art. fusions. These include:
- zum, beim, aufm, mitm, fürs, ins, ans etc.
- double consonants: vonner, inner, mitter etc.
- BUT: auser, aufer, zuner etc.
a list of spellings can be found here

lengthening

orthographic spelling with lengthening h or doubling of vowels remains, colons (see following examples) are added
lengthening in letter combinations: in front of a lengthening h (e.g. spä::ht), after a diphthong (e.g. polzei::), after a double letter (e.g. see:), after the vocalization of r (e.g. über:)
longer than normal (0.2-2 sec.) with a colon (e.g. so la:ng)
strikingly long (from 2 sec.) with :: (e.g. so la::ng)
doubling of vowel syllables with % (e.g. tschü%üss); can be combined with lengthening colons (e.g. tschü:%üss); the same goes for diphthongs (e.g. hei%ei)
drawn-out and long aspirated consonants are also marked with colons (just like the vowels)

generally to be written together (e.g. lehrerzimmer)
compositions of noun+adjective (saumüde), two adjectives (supertoll) or adjective+preposition (übergeil) and particle verbs (leidtun, weggehen) are transcribed as one token
abbreviations/words with hyphen are not separately marked (e.g. "mathe emesa prüfung" for Mathe-MSA-Prüfung)
coordination of compounds WITHOUT hyphen (e.g. mittag und abendessen)
hyphens, provided that correct spelling intends them, remain, e.g. "t-shirt"
compounds with names as first constituent are spelled with a hyphen, e.g. "rewe-parkplatz"

write out in full
write complex numbers together (e.g. zweikommadrei)
divide fractions (e.g. zwei drittel)
divide school marks (e.g. zwei minus)
divide dates (e.g. elfter dritter)

one word (e.g. "beemwe" for BMW)
compounds of two abbreviations are not separately marked (e.g. "mercedes eselfka" for Mercedes-S11K)¹

spelling is kept (e.g. renault)

transcribe onto speaker-tier
pauses between two CUs receive an own event
pauses within a CU are transcribed within the CU, no own event
0.2-1.0 seconds: (-)
1.1-3.0 seconds: (--)
over 3 seconds: enter measured value in brackets, e.g. (5.5)
word internal pauses are marked the same way, e.g. auf(-)geschlossen (without space before and after the brackets), exception: word internal pause and filled pause (ähm) with spaces, e.g. auf (-) äh geschlossen
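
The pause thresholds above can be written down as a small function. This is a sketch only: the name `pause_symbol` is hypothetical, and two details are assumptions not stated in the guidelines, namely that durations are rounded to one decimal place and that pauses shorter than 0.2 seconds are not transcribed at all.

```python
def pause_symbol(seconds):
    """Return the transcription symbol for a measured pause length."""
    seconds = round(seconds, 1)  # assumption: measured to one decimal place
    if seconds < 0.2:
        return ""  # assumption: too short to be transcribed
    if seconds <= 1.0:
        return "(-)"
    if seconds <= 3.0:
        return "(--)"
    # over 3 seconds: enter the measured value in brackets
    return f"({seconds:.1f})"


symbol = pause_symbol(5.5)
```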

non-verbal actions such as laughing or coughing are noted on speaker-tier in square brackets, e.g. [laughing]
simultaneous laughing and speaking, as follows: [[laughing]ball]
assigning of CU: the same as hesitation markers, non-verbal actions are assigned to the following CU, e.g. "(-) | [laughing] (-) ähm ja das hat mir meine mama schon immer gesagt"

sound imitations are transcribed onto speaker-tier (e.g. "dann hör ick nur so wuh")
speaker-independent interruptions (e.g. loud car noise, beeping computer) are only transcribed as pause without specification

something incomprehensible within a CU is transcribed within the CU in brackets, so no extra interval: (UNK), or (UNK, 2.2) if it lasts longer than 2 seconds
if it's ambiguous to which CU it belongs, it receives an own event
assumptions about the content are put into brackets, each lexeme individually, e.g., (vermuteter) (Inhalt)
assumptions relate to existing lexemes, i.e. if one only hears "ga", one must either mark it as a discontinuity within the brackets, as (ga/) or as (UNK)

transcribe as heard in German orthography
EXCEPT with English material, this is kept in English spelling
AND: words that are listed in the Duden are not considered "Foreign Language Material". They are spelled accordingly (e.g. adieu)
draw up a list with the spelling of foreign words and their occurrence so they are always spelled the same way, this list can be found here

names of persons

replace name (first and surname) of speaker with speaker Sigle (e.g. DEmo02FD)
the genitive -s is added to the Sigle (e.g. DEmo02FDs)
first names of absent mentioned people do not have to be anonymised
surname or first and surname of absent people must be anonymised by speaker Sigle_P (e.g. DEmo02FD_P)

indication of place

only anonymise if conclusion about place of residence is possible, e.g. when street names are mentioned that could be the address of the speaker
school: {category}, e.g. Alexander-Puschkin-Schule as {schoolname}schule
street: {streetname}straße
if the place is inflected, e.g. plural, the inflection is suffixed: {streetname}straßen
incomprehensible names of persons without indicating category

particles/signals/interjections

negation: 'hm'hm, nee, 'ä'ä
affirmation/signal of reception: 'hmhm
signal of reception: hm, mhmh

fillers: äh, ähm, öh, öhm, (even when they are produced with a glottal stop), hm (even when only m is produced)

interrogation particle (belong to preceding CU)

always: ne, gell
sometimes: oder, ja, okay

A list of all interjections that occur can be found here and is continuously expanded throughout the transcription process.

symbols	meaning
`<Q>...</Q>`	questions regarding the task
(-)	pause 0.2-1.0 sec.
(--)	pause 1.1-3.0 sec.
(3.2)	pauses longer than 3 sec.
(UNK)	incomprehensible material
(UNK, 2.2)	incomprehensible material longer than 2 sec.
(assumption)	assumed material
[...]	non-verbal action
[[...]...]	non-verbal action & verbal material
:	conspicuously long vowels/consonants (0.2-2 sec.)
::	extraordinarily long vowels/consonants (over 2 sec.)
=	reduced forms with quickly following connection
/	word discontinuity
$...$	word internal discontinuity
%	double syllable
{...}	specification of an anonymised place
'	glottal stop
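
The symbols in the table above are regular enough to be recognised automatically, e.g. to check a transcription or to find the units that are deleted on the norm layer. The following sketch is an illustration under stated assumptions: the labels, the pattern list `SYMBOL_PATTERNS` and the function `classify` are hypothetical and cover only part of the table.

```python
import re

# Order matters: the more specific patterns are tried first.
SYMBOL_PATTERNS = [
    ("pause", re.compile(r"\((?:-{1,2}|\d+\.\d+)\)")),
    ("unknown", re.compile(r"\(UNK(?:, \d+\.\d+)?\)")),
    ("nonverbal+verbal", re.compile(r"\[\[[^\[\]]+\][^\[\]]+\]")),
    ("nonverbal", re.compile(r"\[[^\[\]]+\]")),
    ("assumption", re.compile(r"\([^()]+\)")),
    ("discontinuity", re.compile(r".+/")),
]


def classify(unit):
    """Return a label for one transcribed unit, or "word" if no symbol matches."""
    for label, pattern in SYMBOL_PATTERNS:
        if pattern.fullmatch(unit):
            return label
    return "word"


label = classify("(UNK, 2.2)")
```

Units labelled as pause, unknown or nonverbal correspond to material that is not represented on the norm layer, while an assumption such as (vermuteter) is carried over in its normalized form.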

This page documents the decisions taken concerning the spelling of specific words and documents the occurrences of foreign language material (FM).

baby (Duden spelling, when "bebi/be:bi" is produced)
basecap (as in the Duden)
beemwe
bissn (when not bisschen but bissen or bissn is produced, e.g. DEbi02FG_fsD: 22.05)
cap (as in the Duden) BUT käppi
cops (as in the Duden)
crash (as in the Duden)
einskommafünf liter flasche (1,5l-Flasche)
fauwe (VW)
genuch (when it is produced this way)
ha u (for HU, i.e. the abbreviation of Humboldt-Universität)
hey (Duden spelling, when "häi" is produced)
hi (Duden spelling, when "hai" is produced)
ick (ich in Berlin dialect)
iks üpsilon straße (for xy straße)
käppi
limousine (pronounced limusine, of course)
nix (as in the Duden)
nochmal
pekawe (preferably written without a colon, because a colon indicates a lengthening of more than 0.2 sec., which is usually absent in pekawe; if it is present, then of course with :)
pekawe fahrer (PKW-Fahrer)
revue passieren (as in the Duden)
so was
tschau (spelling recommended by the Duden)
tschüss
van (Van as in the Duden) (also minivan)

English expressions are transcribed following English orthography:

crashen (DEmo12MD_isD / 00:30, Ebi08MG_isD / 00:44) listed in the online Duden (accessed 15.04.2019, 9:57), value on lang: deu
bystander (DEmo32FD_isD / 01:24) not listed in the online Duden (accessed 15.04.2019, 10:00), value on lang: eng
dude (DEmo53FD_isD / 00:24) not listed in the online Duden (accessed 15.04.2019, 9:59), value on lang: eng
plep [pleb = German Prolo] (DEmo74MD_isD / 00:28) not listed in the online Duden (accessed 24.04.2019, 12:32), value on lang: eng

other foreign language material, as perceived:

dawei (rus)
vallah (ara)

kränk [from krank with English pronunciation] (DEmo71FD_isD / 01:35)

This page documents merged forms that occur in the German RUEG data and how they are transcribed.

Merged prepositions and articles; without extra-marking:

anner
aufer
aufm
aufn
auser
ausm
beim
hinterm
hinters
im
inner
ins
mitm
mitn
übern
vonnem
vonner
vors
zum
zur

son, sone (for solch, solche), i.e. when followed by a plural noun

auch=n (auch ein)
bist=e
dass=es
d=is (das ist)
d=is=n (das ist ein)
gab=s
geht=s (geht es)
gib=s
grad=n
hab=s
ha=ich (hab ich)
hat=ter (hat der / hat er)
hat=n
hat=s
hättest=e
ich=n
ich=s
is=er (ist er)
is=es (ist es)
is=n
kennst=e (kennst du)
muss=er
noch=n
nu=ma (nur mal)
ob=s
sa=ma (strongly contracted sag mal)
so=m (so einem)
so=n (so ein)
so=ne (so eine)
und=n
un=an (und dann, produced entirely without a plosive)
war=n
war=s
weil=et (weil es)
wie=s
wird=s (wird es)
wollt=er (wollte er; written this way because it cannot be decided whether it is wollt er or wollte er)
zwar=n

(in German)

The general normalisation guidelines of the RUEG corpus apply: Step 2: Normalisation

In the following, a few principles will be repeated and language specific decisions included.

orthographic normalisation
no normalisation in the syntax
no grammatical normalisation
discontinuities and pauses are erased and receive an empty event (on language level the value for discontinuities and pauses is erased -> empty event); EXCEPT word internal discontinuities (dipl: vorbeige$le$ rollt, norm: vorbeigerollt)
repetitions remain
spoken: non-verbal material, such as [laughing], is not transferred -> empty event
punctuation marks are not included

this also applies to the written files (change 7.10.2019)

reductions of determiners, adjectives and nouns are not normalised with respect to case and gender information*:

dipl	norm
schön guten tag	schön guten Tag
mit ein hund	mit ein Hund
so ein klein hund	so ein klein Hund

reductions and slips of the tongue in verbs, nouns etc. that are not related to case and gender marking are normalised:

dipl	norm
is	ist
Umfall	Unfall
gesprung	gesprungen

short forms of indefinite articles are normalised as:

dipl	norm
n	ein, einen
nen	ein, einen
ne	eine
eim, nem	einem

normalise 'nen' as 'ein' in cases of nominative masculine and nominative/accusative neuter, e.g.:
- dipl: "da is nen auto um die ecke gebogen". norm: "da ist ein auto..."
- dipl: "ich hab nen auto gesehen". norm: "ich habe ein auto..." BUT not in: dipl: "nen ne vollbremsung" as norm: "einen eine vollbremsung"
so=n either as "so ein" or as "so einen", depending on the context:
- dipl: "so=n typ hat mitm ball...". norm: "so ein Typ hat mitm Ball"
- dipl: "die frau hat so=n hund dabei". norm: "die Frau hat so einen Hund dabei"

According to the Duden and scholarly literature (e.g. Vogel 2006, Schäfer & Sayatz 2014), 'n' and 'nen' can each represent both ('ein', 'einen'). One normalises according to the principle of minimal deviation from the standardly expected form.
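
The rules for short forms of the indefinite article depend on case and gender, which the annotator has to determine from context. The sketch below only illustrates the decision logic; the function name `norm_indefinite_article` and the case/gender labels ("nom", "acc", "masc", "neut") are hypothetical, and any combination not covered by the rules above is returned unchanged so that it can be decided manually.

```python
# Case/gender combinations for which "n"/"nen" is normalised as "ein".
EIN_CONTEXTS = {("nom", "masc"), ("nom", "neut"), ("acc", "neut")}


def norm_indefinite_article(dipl, case=None, gender=None):
    """Return the norm form of a short indefinite article."""
    if dipl == "ne":
        return "eine"
    if dipl in {"eim", "nem"}:
        return "einem"
    if dipl in {"n", "nen"}:
        if (case, gender) in EIN_CONTEXTS:
            return "ein"
        if (case, gender) == ("acc", "masc"):
            return "einen"
    # Anything else is left for a manual decision.
    return dipl


norm = norm_indefinite_article("nen", case="acc", gender="neut")
```

For example, nen in "ich hab nen auto gesehen" (accusative neuter) comes out as ein, while n in "die frau hat so=n hund dabei" (accusative masculine) comes out as einen.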

hesitation markers are all normalised as „äh“, these include äh, ähm, öh, hm etc.

dipl	norm
äh, öh, ähm, hm etc.	äh

when meaning is clearly constant, determine and document a standard, such as:

dipl	norm
aufgrund, auf Grund	aufgrund
andren,anderen	anderen
bro, brother	brother (lang=eng)
Dicker, Digger	Dicker (29.05.2019)
etwas, was	etwas
grad, grade, gerade	gerade
gern, gerne	gerne
habe, hab	habe
hey, hi, hei (as greeting, not as outcry)	hi
langlaufen	entlanglaufen
mache , mach (imperative)	mach
nichts, nix	nichts
noch mal, nochmal	nochmal (28.05.2019)
rumspielen	herumspielen
rum	herum
runterfallen	herunterfallen
sodass, so dass (when conjunction)	sodass
vorn, vorne	vorne

when change of meaning is possible or when context is restricted, leave lexemes as they are, the variations remain:
- daran, dran
- darin, drin, drinnen
- drauf, darauf
- sone (as in "sone autos", so only for plural nouns), solche
- reinfahren, hereinfahren, hineinfahren
- auffahren, rauffahren, drauffahren
- reinpacken, einpacken, hineinpacken
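
The two lists above (variants with a documented standard, and variants that remain) can be kept as data so that every annotator applies the same decision. The following is a minimal sketch with hypothetical names (`NORM_STANDARD`, `KEEP_AS_IS`, `normalise_variant`); it contains only some single-token entries, ignores capitalisation on the norm layer, and leaves context-dependent pairs such as was/etwas to the annotator.

```python
# Variants with a documented standard (subset of the table above).
NORM_STANDARD = {
    "andren": "anderen",
    "grad": "gerade",
    "grade": "gerade",
    "gern": "gerne",
    "hab": "habe",
    "nix": "nichts",
    "rum": "herum",
    "vorn": "vorne",
    # hesitation markers are all normalised as "äh"
    "ähm": "äh",
    "öh": "äh",
}

# Variants that remain because a change of meaning is possible.
KEEP_AS_IS = {"daran", "dran", "darin", "drin", "drinnen", "drauf", "darauf"}


def normalise_variant(dipl):
    """Map a diplomatic token to the documented standard, if there is one."""
    if dipl in KEEP_AS_IS:
        return dipl
    return NORM_STANDARD.get(dipl, dipl)


norm = normalise_variant("grad")
```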

FM with German inflection, e.g.:

dipl	norm	lang
gecrasht	gecrasht	eng/deu

material that is included in the Duden, such as sorry, Van etc., is marked as deu on language level. The Online-Duden serves as a reference, the date of the viewing must be documented (list FM).

until twelve: spelled out
beginning with 13: numerals
in the written texts, keep the variation the subject chose
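
For the spoken data, the number rule above amounts to a simple threshold. A minimal sketch, assuming a hypothetical helper `norm_cardinal` and assuming that zero is treated like the other small numbers (the guidelines do not mention it):

```python
# German number words up to twelve; "null" for zero is an assumption.
GERMAN_NUMBER_WORDS = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs",
    "sieben", "acht", "neun", "zehn", "elf", "zwölf",
]


def norm_cardinal(number):
    """Spell out numbers until twelve, use numerals beginning with 13."""
    if 0 <= number <= 12:
        return GERMAN_NUMBER_WORDS[number]
    return str(number)


norm = norm_cardinal(16)
```

In the written texts the helper would not be applied, since the variant chosen by the subject is kept.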

dipl	norm
pekawe	PKW
ef sechzehn	F16

gender gap
- dipl: Fußgänger innen; norm: Fußgänger_innen

dipl	norm
deu	deutsch
eng	englisch
ara	arabisch
tur	türkisch
spa	spanisch

written texts

include CU level
'dass' as conjunction if spelled 'das' is normalised to 'dass'
punctuation marks:
- do not add any, do not correct any, except when missing a space:
dipl	norm
eingepackt.auf	eingepackt / . / auf
- several punctuation marks one behind the other without space, e.g. three dots: … leave in an event
dipl norm

... / ... /
- if there is a space in between, then also leave it, e.g. . /. / .
dipl norm

. . . /. / . / . /
Emojis
- include Emojis such as :) on norm
abbreviations/acronyms
- conventionalised abbreviations are left on norm
- unconventionalised abbreviations or acronyms are spelled out, e.g. dipl: kp norm: kein | Plan
- acronyms that are also "action words" (e.g., lol) are left this way on norm

(partly in German)

The guidelines can be found here: Westpfahl_Schmidt_Jonietz_Borlinghaus_STTS_2_0_2017.pdf

Here you find some data-specific decisions and some cases that are specified in STTS 2.0 and highlighted here:

Following Rehbein 2013, we add the tag EMO for emoticons and emoji to the STTS 2.0 tagset
F16 as NE
one word greetings and terms for saying goodbye as hi, hallo, tschüss are interjections (NGIRR)
speaker-codes, anonymised streetnames, etc. are proper nouns (NE)
names that were anonymised by the speaker, e.g., "Frau XX" or "XY Straße" receive the tag XY (non-word)
if it is not possible to decide on a POS tag, e.g., due to unfinished utterances, the event stays empty
conventionalised abbreviations (e.g., "d.h.") receive the POS tag ADV (guidelines p.13)
"also" receives the tag SEDM or ADV depending on the context:
- "also"/SEDM in the pre-prefield, e.g., "also/SEDM ich heiße..."
- "also"/ADV: adverbial connector, e.g. "also/ADV ging ich die Straße entlang", connector signaling a specification (without verb), e.g., "...eine Familie, also/ADV Frau, Mann, Kind" or a correction, e.g., "derweil ist dann ein Auto gekommen äh entgegen also entlanggekommen"
"wie" in "wie folgt" as KOKOM (see guidelines p.44 for other uses)
"als"
- "als"/KOUS if it introduces a subordinate clause
- "als"/KOKOM in prototypical cases such as "ich bin größer als du", here also in "ich möchte als Zeuge aussagen"
if "natürlich" can be replaced by "selbstverständlich" it receives the tag ADV
interrogative adverbs "wo, wie, worüber, warum" can be used as interrogatives or can serve as relative pronouns. In both cases, they get the POS tag PWAV (STTS, p. 26). Examples:
- "auf dem Mittelstreifen, wo/PWAV der Unfall passiert ist"
- "ich weiß nicht, wo/PWAV du bist"
- "wo/PWAV bist du"
"was, welche" can appear
- as interrogative pronouns, also in embedded contexts
  - substitutively: "Ich weiß nicht, was/PWS du gemacht hast"
  - attributively: "Welche/PWAT Farbe hat der Hut?";
- as interrogative pronouns with a relative use after verba dicendi/sentiendi
  - "Er erzählt, was er gesehen hat"
- as relative pronoun (PRELS) if the antecedent is mentioned previously
  - "das Kind, welches/PRELS sich auf der anderen Seite befand"

Further examples

token	POS tag
/aufgrund /von	/ADV /APPR
/aufgrund (des Unfalls)	/APPR
/bis /später	/APPR /ADJD
/gegenüber /von	/ADV /APPR
/gegenüber /dem /Auto	/APPR /ART /NN
/nichts /weiter	/PIS /PTKMWL
/weder /noch	/KON /KON
zwar	ADV
...	$.

lemma represents the shortest converging form
nominalisations stay (Verletzte, Folgendes, Fahrer, etc.). The lemma represents the shortest converging form, so that POS and lemma match (e.g., norm: "das Spielen", pos_lang: NN, lemma: Spielen)
speaker codes stay as they are
the lemma of merged forms of articles and prepositions is the preposition: norm:"aufm", lemma:"auf"; norm:"mitm", lemma:"mit"; norm:"zum", lemma:"zu"
dates are represented by @card@
cardinal numbers stay on lemma as they are on norm layer, e.g., "zwei", "16"
reflexive pronouns on lemma are their corresponding personal pronouns (e.g., sich zu er|sie|es)
ordinal numbers stay as they are on norm layer
different forms of one lexeme that differ only in gender and case marking are reduced to the shortest converging form (see table below); EXCEPTION: NN denoting persons stay in the same gender form as on norm layer, e.g., "Augenzeugin" and "Augenzeuge"
"der", "die", "das" are always reduced to "d", no matter whether they are used as article, relative pronoun or demonstrative pronoun
forms in plural get the singular form on lemma (e.g., norm: Einkäufe, lemma: Einkauf)

different forms	lemma
all, alle, alles, aller	all
andere, anderer, anderes	ander
eine, einer, ein	ein
der, die, das	d
diese, dieser, dieses (attributive demonstrative pronouns)	diese
dieser, dies, dieses (substitutive demonstrative pronoun)	dies
Folgendes, Folgende, Folgender	Folgende
jener, jenes, jene	jene
mein, meiner, meine, meins	mein
weit, weiter, weitere, weiterer, weiteres	weit
welche, welcher, welches	welch
vordere, vorderer, vorderes (ADJA)	vordere
zweit, zweite, zweiter, zweites	zweit
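
The lemma rules and the table above can be read as a lookup. The following is only a minimal sketch in Python, not part of the RUEG tooling: the names `MERGED_PREPOSITIONS`, `SHORTEST_FORMS` and `draft_lemma` are hypothetical, and forms whose lemma depends on the POS tag (e.g. "dieser", "weiter") are deliberately left out because they need manual judgment. Tokens that are not listed are returned unchanged, which is an assumption that only holds for cases such as cardinal and ordinal numbers or speaker codes.

```python
# Hypothetical helper illustrating the lemma table; not part of the corpus tooling.

# Merged article + preposition forms: the lemma is the preposition.
MERGED_PREPOSITIONS = {"aufm": "auf", "mitm": "mit", "zum": "zu"}

# Lemma -> forms, copied from the table (POS-dependent forms omitted).
SHORTEST_FORMS = {
    "all": ("all", "alle", "alles", "aller"),
    "ander": ("andere", "anderer", "anderes"),
    "ein": ("eine", "einer", "ein"),
    "d": ("der", "die", "das"),
    "Folgende": ("Folgendes", "Folgende", "Folgender"),
    "jene": ("jener", "jenes", "jene"),
    "mein": ("mein", "meiner", "meine", "meins"),
    "welch": ("welche", "welcher", "welches"),
    "zweit": ("zweit", "zweite", "zweiter", "zweites"),
}

# Reverse lookup: form -> lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in SHORTEST_FORMS.items() for form in forms
}


def draft_lemma(norm_token: str) -> str:
    """Return a draft lemma for a token of the norm layer."""
    if norm_token in MERGED_PREPOSITIONS:
        return MERGED_PREPOSITIONS[norm_token]
    # Assumption: anything not listed keeps its norm form and is checked manually.
    return FORM_TO_LEMMA.get(norm_token, norm_token)
```

For example, `draft_lemma("welches")` gives `"welch"` and `draft_lemma("aufm")` gives `"auf"`; every result still has to be checked by hand.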

no capital letters
abbreviations/acronyms are transcribed as full words (e.g. ДТП = дэтэпэ)
lower case for all words, even at the beginning of a sentence ==> exceptions: participant code, participant code + _P and symbols like (UNK) etc.
the transcription of the participant speech shall generally take place in accordance to the standard orthographic rules of Russian
but: if the participant articulates utterances or words which are typical neither of standard Russian nor of the oral vernacular language (повседневный язык), transcribe them as they were articulated by the participant

Example from USbi52MR_fsR:

Participant: потому что они два два (-) не видели (-) ==> Standard and vernacular Russian: потому что они друг-друга не видели ==> два два is not typical for the standard or oral vernacular Russian ==> Transcription: потому что они два два (-) не видели (-)

Example:

Participant: мужик играл с футболом ==> standard and vernacular Russian: мужик играл с мячом ==> с футболом in this context is typical neither of standard nor of vernacular Russian ==> Transcription: мужик играл с футболом
typical phenomena for standard and vernacular Russian, which shouldn't be transcribed:
- reduced vowels (if it is not a special dialect of Russian)
  
  Example:
  
  Participant: ана талкает каляску, а мужык играет смячикам ==> Transcription: она толкает коляску, а мужик играет с мячиком
- so called phonetic words (= words, which are articulated as one word ==> it often concerns prepositions and the following noun)
  
  Example:
  
  Participant: он вышел издому ==> Transcription: он вышел из дому

two tiers should be used
- 1. tier = speaker tier ==> is only used for the transcription of the participant speech; it gets marked with the participant code
- 2. tier = comment tier ==> the comment tier is an optional tier and used for communication between transcribers; later (that means: after the transcription) the comment tier will be deleted

NB:
- 1 independent/main clause (главное предложение) = 1 simple sentence (простое предложение)
- 1 independent clause (главное предложение) + 1 or more dependent clauses (придаточное предложение) = complex sentence (сложноподчинённое предложение)
- 1 independent clause (главное предложение) + 1 or more independent clauses (главное предложение) = compound sentence (сложносочинённое предложение)
hint: an independent clause can always stand alone; a dependent clause should never stand alone, because without its independent clause the dependent clause wouldn't make sense
in addition: a compound sentence can be easily recognized by certain conjunctions, which connect the independent clauses in that type of sentence: these conjunctions are coordinating (соединительный), adversative (противительный) or disjunctive (разделительный) conjunctions (союзы), such as и, но, а, или, либо...либо etc.
dependent clauses in complex sentences can be recognized by conjunctions and relativizers like потому что, когда, что, кто, который, чтобы, так как, но и etc.
the participant speech gets segmented in communication units (CUs)
1 CU correlates with 1 simple sentence or with 1 complex sentence; sentences, which consist of more than 1 independent clause (= compound sentence), are in every case more than 1 CU:
- Simple sentence
  
  Example from DEbi52FR_isR:
  
  я стала свидетельницей (-) а: (-) столкновение двух машин | ==> 1 CU
- Complex sentence
  
  Example from DEbi52FR_isR:
  
  виноваты были не машины а: (-) один (-) эм мужчина который (-) ну кот/ ещё более такой (-) молодой | ==> 1 CU
- Compound sentence
  
  Example from DEbi03FR_isR:
  
  хотела с тобой это поделить | но я была здесь на парковке у реве | ==> 2 CU's
if a compound sentence includes a VP coordination or an ellipsis, such sentence is to be annotated as one CU:

Example:

она вышла из магазина [subject ellipsis] уронила пакет и [subject ellipsis] пошла дальше | ==> 1 CU
discourse marker (ну, ну там, вот, так, как бы, получается, эм, то есть etc.) and the following utterance will be seen as one CU

Example from DEbi03FR_isR:

ну там я предполагаю там ребёнок внутри был | ==> 1 CU
discourse markers (ну, ну там, вот, так, как бы, получается, эм, то есть etc.), which specify the precursory utterance, will be seen as one CU, too

Example from DEbi02FR_fsR:

хм их было трое то есть э маленький ребёнок э: женщина и мужчина | ==> 1 CU ("маленький ребёнок э: женщина и мужчина" is a specification of "трое")
greetings (привет, здравствуйте, здорово, здрасте etc.) will be defined as an extra CU

Example from DEbi03FR_isR:

привет DEbi03FR_P | слушай я сейчас видела здесь такую ситуацию | ==> 2 CU's
question tags such as правда?, или?, правильно?, правильно понял?, не так ли? etc. belong to the previous CU

Example:

ты вася пупкин, правильно | ==> 1 CU
in case you are not sure, make fewer CUs, to facilitate the SUD annotation
keep in mind, that punctuation marks are not used at all ==> that means: no full stops, no commas etc.

replace the name of the participant with the respective code ==> e.g.: DEbi52FR
if whole names or surnames of the participant’s friends are mentioned, replace them with the participant code + _P

Example from DEbi52FR_isR:

привет DEbi52FR_P
places that could lead to the identification of the participant should be replaced as follows

Example:

я хожу в Leo-Tolstoi-Schule ==> я хожу в {schoolname}шуле

я живу на улице Шютценштрассе ==> я живу на улице {streetname}штрассе
anonymization in Audacity: the name of the participant should be anonymized with the aid of white noise

hesitation markers do not represent an own event ==> they belong to the concerned CU

Example from DEbi52FR_isR:

я когда шла э: на автобусную остановку (-) эм: ==> 1 CU
general notation:
- m-hm (confirming) = угу
- ehm = эм or э:м
- hm = хм
- eh/uh = э or э:
- ɑha = ага
- ah = а:

vowels pronounced longer than normal (under 2 seconds) are marked with a colon ==> e.g.: ну: да
vowels that are pronounced extremely long (2 seconds and longer) are marked with two colons ==> e.g.: ну:: да
long pronunciation is also possible for consonants ==> e.g.: тс: тише
doubling of a vowel syllable is marked with % ==> e.g.: ты точно сделал? да%а

pauses are transcribed on speaker tier
a pause between two CUs is marked as an own event ==> the pause gets two boundaries
pauses in a CU get transcribed within the concerned CU ==> they do not represent an own event
word internal pauses are marked in the words and without a space between the parts of the concerned word ==> e.g.: с э(-)тим мячиком ==> exception: pauses with эм in a word ==> e.g.: они на (-) эм крыли стол
general notation:
- 0.2-1 second ==> (-)
- 1-3 seconds ==> (--)
- longer than 3 seconds ==> time should be measured and noted in brackets ==> e.g.: (3.1), (5.5)
background noise like traffic noise, phone ringing or computer noise is noted as a pause
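
The pause notation above can be sketched as a small Python function. This is only an illustration, not project tooling: `pause_symbol` is a hypothetical name, and the handling of the boundary values is an assumption (a pause of exactly 1 second is written as `(-)`, exactly 3 seconds as `(--)`, and anything under 0.2 seconds is not transcribed), since the guidelines do not say which side the boundaries belong to.

```python
def pause_symbol(seconds: float) -> str:
    """Return the transcription symbol for a measured pause length."""
    if seconds < 0.2:
        # Assumption: gaps shorter than 0.2 s are not transcribed at all.
        return ""
    if seconds <= 1.0:
        return "(-)"
    if seconds <= 3.0:
        return "(--)"
    # Longer pauses are noted with the measured time, one decimal place.
    return f"({seconds:.1f})"
```

With these assumptions, `pause_symbol(0.5)` gives `(-)` and `pause_symbol(5.5)` gives `(5.5)`.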

merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements

Example from USbi52MR_fsR:

с одной стороны (-) дороги (-), э, шли муж=женой

general rule: reduced syllables should be transcribed in their full length, even if they were articulated differently

Example:

participant: она токо что шла на улице ==> transcription: она только что шла на улице
exception: if a word can be found with its reduced syllables in a dictionary (e.g. MAC) and the participant articulated the word in its reduced form, then the reduced form of the word should be preferred for transcription

Example:

participant: здрасте ==> transcription: здрасте
use / to mark unfinished words

Example from DEbi52FR_isR:

сегодня (-) э когда я шла на авто/ астобв/ (-) а (-) автобусную остановку

numbers should be transcribed as words, since Russian numbers are often declined or otherwise inflected

Example:

я вижу двух* женщин*
dates should be transcribed by words, too

Example:

я родился двадцать первого* января тысяча девятьсот девяносто пятого года*

general rule: foreign words should be transcribed into Russian as they are articulated
in addition: there is a special list for this case, where you can look up such words or add new ones ==> important: all transcribers have to transcribe these words into Russian in the same way
- Autowerkstatt = аутоверкштат
- Truck = трак
- Ort = орт
- REWE = реве
- Renault = рено
- also = алзо
- OK_ = окe
- WhatsApp = воцап
- {schoolname}schule = {schoolname}шуле
- {streetname}straße = {streetname}штрассе

table for russified lexicals

German/English word	Russified word	Code	File	Second
accident	аксидент	USbi06FR	fsR	6,94
accidentally	аксидальтально	USbi07MR	fsR	15,1
Aldi	алди	DEbi64MR	fsR	8
also	алзо	DEbi56FR	fsR	36,85
Autowerkstatt	аутоверкштат	DEbi51FR	fsR	93,75
bag	бег	USbi59FR	isR	164,23
Ball (mit dem)	болом	DEbi12FR	fsR	16,9
Ball	бол	DEbi12FR	fsR	19
in the back	ин зе бэк	USbi74MR	isR	53,92
ciao	чао	DEbi04MR	isR	29,4
case	кэйз	USbi86FR	fsR	4,62
crash	крэш	DEbi12FR	fsR	47,78
crashed	крэшовали	DEbi15MR	isR	37,56
hey	хей	USbi73FR	isR	0,64
like	лайк	USbi86FR	fsR	73,13
message	месседж	USbi16FR	isR	8,23
911/nine one one	найн уон уон	USbi59FR	isR	83,64
911/nine eleven	найн элэвэн	USbi73FR	fsR	59,29
OK	окей	USbi05FR	isR	64,7
Ort	орт	DEbi53FR	fsR	23,19
parking lot	паркинг лот	USbi74MR	isR	7,48
Renault	рено	DEbi10MR	isR	31,71
representative	рэпрезэнтэтиф	USbi74MR	fsR	3,85
Rewe	реве	DEbi03FR	isR	13,9
Schützenstraße	Шютценштрассе	DEbi04MR	fsR	5,14
spilled/spilt	сплыть	USbi58FR	fsR	36,41
stopped	стопт	USbi58FR	fsR	25,22
stroller	строллер	USbi79MR	isR	51,8
Truck	трак	USbi52MR	fsR	77,39
turn	торн	USbi74MR	isR	36,32
Vans	вэнс	USbi59FR	isR	12,59
WhatsApp	воцап	USbi52MR	isR

if a conventionalized spelling in Russian already exists for a foreign word, the conventionalized spelling should be preferred

Example:

Messenger = мессенджер

non-verbal events like laughing or coughing are noted in square brackets on speaker tier and always belong to the concerned CU
general notation
- [throatclearing]
- [coughing]
- [laughing]
- [pfing] ==> for a sound like „pff“
- [sighing]
- [sniffing]
- [tongueclicking] ==> including tsking as disapproval, while thinking and just mouth opening with a click
- [yawning]
- [gulping]
- [whispering]
- [breathing]
if the participant speaks and makes a non-verbal event at the same time, it is noted as:
- [[coughing]word]
- [[laughing]word]
- [[sighing]word]
- [[tisking]word]
- [[yawning]word]
- [[gulping]word]
- [[whispering]word]
Example from DEbi52FR_isR:

ты [[laughing]знаешь] что сегодня случилось
uninterpretable material is to be marked as (UNK) on speaker-tier
if it is not clear, to which CU the UNK belongs, make an own event ==> that means: write the UNK between two boundaries
if the UNK is longer than two seconds, measure the time and write the time together with UNK in one bracket ==> e.g.: (UNK, 2.1)
assumed content is noted in brackets, each token has to be separated ==> e.g.: (assumed) (content)
background noise such as traffic noise, phone ringing or Computer noise should be noted as pauses

Symbol	Meaning
`<Q> speech </Q>` ==> e.g.: `<Q> можно я ещё раз </Q>`	for questions to the procedure on the part of the participant or for verbal interventions on the part of the elicitor
(-)	for pauses 0.2-1 second
(--)	for pauses 1-3 seconds
(time) ==> e.g.: (3.1)	for pauses longer than 3 seconds
(UNK)	for uninterpretable material
(UNK, time) ==> e.g.: (UNK, 2.1)	for uninterpretable material longer than 2 seconds
(assumed word)	for assumed material
[non-verbal action] ==> e.g.: [laughing]	for non-verbal material
[[non-verbal action]word] ==> e.g.: [[laughing]знаешь]	for a non-verbal & verbal event
:	for unusually long vocal or consonant (under 2 seconds)
::	for unusually long vocal or consonant (longer than 2 seconds)
=	for merged words
/	for interruption of a word
%	for doubled syllables
{...} ==> e.g.: {schoolname}шуле	for anonymised places

the conversion of the participant speech ¹ into a correct orthographic form ², which correlates with the applicable linguistic norm/with the linguistic standards of the respective language (in this case of the Russian language), is called normalization
during the normalization of the participant speech, morphosyntactic (grammatical) mistakes are left as they are and do not get changed into their grammatically correct forms

Example:

Participant: | Я калаской видел | ==> Normalization: | Я коляской видел | ³
the normalization is manually carried out in the application program EXMARaLDA

EXMARaLDA on normalization level is structured in the following layers:
- dipl-[dipl]-layer
- norm-[norm]-layer
- dipl-[language]-layer
- dipl-[CU]-layer ⁴

Layer	Function
dipl-[dipl]	shows the original speech of the participant, which never gets corrected - neither grammatically nor orthographically
norm-[norm]	here the speech of the participant should be manually normalized (orthographical - not grammatical - corrected)
dipl-[language]	shows, to which language the concerned word or emoji belongs ⁵
dipl-[CU]	here the speech of the participant is segmented into communication units (which is one of the results of the transcription); in case of written files, the person who normalizes the file first has to add the dipl-[CU]-layer (because it is absent) ⁶ and then segment the participant speech into CUs.

... files from DEbi---R; USbi---R and RUmo---R with the following symbols at the end:
- _fsR (formal spoken Russian)
- _fwR (formal written Russian)
- _isR (informal spoken Russian)
- _iwR (informal written Russian)
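
The file name suffixes listed above can be decoded mechanically. The following Python sketch is only an illustration under stated assumptions: `parse_file_stem` is a hypothetical helper, it expects the file name without its extension, and it treats everything before the underscore as the participant code without checking that code's internal structure.

```python
import re
from typing import Optional

# <participant code>_<f|i><s|w>R, e.g. DEbi52FR_isR
SUFFIX = re.compile(r"^(?P<participant>[^_]+)_(?P<formality>[fi])(?P<mode>[sw])R$")

FORMALITY = {"f": "formal", "i": "informal"}
MODE = {"s": "spoken", "w": "written"}


def parse_file_stem(stem: str) -> Optional[dict]:
    """Decode a file stem such as 'DEbi52FR_isR'; return None if it does not match."""
    match = SUFFIX.match(stem)
    if match is None:
        return None
    return {
        "participant": match.group("participant"),
        "formality": FORMALITY[match.group("formality")],
        "mode": MODE[match.group("mode")],
        "language": "Russian",  # the trailing R in the suffixes listed above
    }
```

For `"DEbi52FR_isR"` the sketch yields the participant code `DEbi52FR` with formality `informal` and mode `spoken`, which tells you, for instance, whether a dipl-[CU]-layer still has to be added (written files).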

1. step: Push/Pull/Fetch in GitHub
2. step: Open EXMARaLDA Partitur-Editor
3. step: File ==> Open ==> rueg repository ==> GitHub ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
4. step: verify (in case of the spoken files) whether the CUs on the dipl-[CU]-layer conform to the CU-guidelines - if not, please correct them; in case of the written files you have to add an extra CU-layer and then segment the speech of the participant into CUs according to the CU-guidelines
5. step: verify whether every single word is assigned its correct language on the dipl[language]-layer - if not, please correct it
6. step: normalization according to the orthographical rules of the Малый академический словарь MAC ==> transfer the speech of the participant from the dipl-[dipl]-layer in its orthographically correct form to the norm-[norm]-layer and delete all phenomena which are not necessary for the following annotation levels (lemmatization, POS-tagging etc.) ==> which phenomena this concerns in practice can be taken from the table in 4. Normalization guidelines and problems
7. step: Save your results
8. step: go to GitHub ==> submit your file ==> push/pull/fetch

Phenomenon/Problem	What to do on norm-[norm]-layer	Example dipl-[dipl]-layer	result on norm-[norm]-layer
pauses	delete them ⁷	привет (-) ты не (--) ты не поверишь что случилось	привет ты не ты не поверишь что случилось
hesitation markers	delete them ⁸	там короче эм шла женщина с коляской и э	там короче шла женщина с коляской и
phonetic markers	delete them ⁹	мальчик играл с мячиком и:: и потом мячик покатился на дорогу	мальчик играл с мячиком и и потом мячик покатился на дорогу
emojis	they don´t get removed and receive an own event	/ :-P привет / я / сегодня / увидел / аварию :-( /	/ :-P / привет/ я / сегодня / увидел / аварию / :-( /
punctuation marks (in general)	they don´t get removed and receive an own event ¹⁰	/ мячик / покатился / на / дорогу . / потом / собака / начала / лаять . /	мячик / покатился / на / дорогу /. / потом / собака / начала / лаять / . /
punctuation marks (many in immediate succession without spaces between them)	keep them in one event	/это / просто / был / кошмар / !!!!!	/это / просто / был / кошмар / !!!!! /
punctuation marks (many in immediate succession with spaces between them)	every punctuation mark receives an own event	/это / просто / был / кошмар / ! ! ! ! !	/это / просто / был / кошмар / ! / ! / ! / ! / ! /
merges / clitics ¹¹	separate and normalize, if possible	муж=женой	муж с женой
repetitions	they don´t get removed	/ потом / мяч / покатился /на /на/ дорогу /	/ потом / мяч / покатился /на / на / дорогу /
self-correction in whole words	they don´t get removed and each of them receives an own event	/ парень / играл / с/ в /мячик /	/ парень / играл / с / в /мячик /
unknown material	delete it ¹²	UNK
interaction/communication with the elicitor	delete it ¹³	`<Q> можно я ещё раз </Q>`
interrupted speech	delete it ¹⁴	пошёл чтобы пс/ посмотреть	пошёл чтобы посмотреть
non-verbal material	delete it ¹⁵	первая машина [tongueclicking] тормозила	первая машина тормозила
foreign material	normalize according to agreement and adapt the language from which the word originally comes
capitalization (in written data ==> iw, fw)	do not correct it at the beginning of a new sentence, but correct it in the middle of a sentence, if it doesn´t conform to the russian orthographic rules	/ . / В / семье / был / Муж / , / Жена / и / их / ребёнок / . /	/ . / В / семье / был / муж / , / жена / и / их / ребёнок / . /
Abbreviations	(merge and) normalize according to standard orthography	100 км/ч	сто / км/ч
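
The "delete" rows of the table above can be turned into a first draft automatically. The Python sketch below is only an illustration for spoken files, not the project's workflow: `draft_norm` and `HESITATIONS` are hypothetical names, the hesitation list is an illustrative subset, and keeping the word in `[[laughing]word]` is an assumption, since the table only says that non-verbal material is deleted. Merges such as муж=женой are left untouched because they need manual normalization.

```python
import re

# Assumption: illustrative subset of hesitation markers, extend as needed.
HESITATIONS = {"эм", "э", "хм"}


def draft_norm(dipl: str) -> str:
    """Produce a draft norm string from a spoken dipl string."""
    # Interaction with the elicitor.
    text = re.sub(r"<Q>.*?</Q>", " ", dipl)
    # Unknown material: (UNK) or (UNK, 2.1).
    text = re.sub(r"\(UNK(?:,\s*\d+(?:\.\d+)?)?\)", " ", text)
    # Pauses: (-), (--), (3.1).
    text = re.sub(r"\((?:-{1,2}|\d+(?:\.\d+)?)\)", " ", text)
    # Assumption: in [[laughing]word] only the word is kept.
    text = re.sub(r"\[\[[^\]]*\]([^\]]*)\]", r"\1", text)
    # Remaining non-verbal material such as [tongueclicking].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Phonetic lengthening markers (spoken files only; would damage emoticons).
    text = text.replace(":", "")
    # Drop interrupted words (ending in "/") and hesitation markers.
    tokens = [
        tok for tok in text.split()
        if not tok.endswith("/") and tok not in HESITATIONS
    ]
    return " ".join(tokens)
```

Applied to the table's own examples, `draft_norm("пошёл чтобы пс/ посмотреть")` gives `пошёл чтобы посмотреть`. The result is only a draft: spelling, merges and the matching events on the dipl-[language]-layer still have to be handled manually.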

Each token on dipl-[dipl]-layer has to be marked with an abbreviation indicating the language it belongs to (language value). In general, this is done automatically, but in some cases you have to change the language value, for example in case of foreign words in the participant's speech on dipl-[dipl]-layer. If this occurs, please change the language value (abbreviation). If you don't know the abbreviation of the respective language, use the ISO 639-2 language code. A list of valid codes can be found here.

Language	language value on dipl-[language]-layer
Russian	rus
German	ger
English	eng

Common short or colloquial forms that are acceptable according to Малый академический словарь or other academic dictionaries shown on https://rus-academic-dict.slovaronline.com:

normalized	short or colloquial form
этот etc.	тот, того, та, той, то, те, тех
здравствуйте	здрасте
сейчас/час	щас
привет	здарова/здорова

Common short or colloquial forms that are not acceptable according to Малый академический словарь:

normalized	short or colloquial form
только	тока

7. Comments

¹ The original speech of the participant, which never gets changed (neither in process of normalization nor in process of lemmatization or POS-tagging), is located on the dipl-[dipl]-layer in EXMARaLDA.

² The orthographically corrected speech of the participant is located on the norm-[norm]-layer in EXMARaLDA. In the most cases, on this layer the events (gaps) are already filled in, but the orthographic accuracy of the words should be checked manually each time.

³ This CU gets translated into its orthographically correct form (калаской ==> коляской). In contrast to this, the grammatical (morphological and syntactic) incorrectness of the CU (the correct government of the verb видеть is видеть кого/что? (Acc.) ==> Я видел коляску, and the more correct word order in this case would be Я вижу калаской) is not corrected at all.

⁴ In case of written files (_fwR, _iwR) a dipl-[CU]-layer has to be added (in case of spoken files the dipl-[CU]-layer already exists) and the speech of the participant from the dipl-[dipl]-layer has to be segmented in CUs on the added dipl-[CU]-layer.

⁵ Keep in mind that foreign words or emojis (e.g. ХД) get transliterated into the Russian alphabet, but they still remain foreign words (even if they are written in Russian). Therefore, foreign words have to be marked on dipl-[language]-layer with the abbreviation (usually ger or eng) of the language from which they originally came.

⁶ You can add a new layer in EXMARaLDA through the menu bar (Tier ==> Add Tier). Then choose the following settings:

Speaker: dipl [dipl]
Type: A(nnotation)
Category: CU

Because it depends on the diplomatic tokenization, the tier has one timeslot per event on dipl-[dipl]-layer. To annotate a CU, merge all timeslots on the added dipl-[CU]-layer that overlap the tokens on dipl-[dipl]-layer belonging to the CU (including sentence-final punctuation). Finally, assign a simple annotation value "CU" to the newly created span and everything is ready.

The CU tier does not explicitly contain any text. By overlapping its respective tokens, the entire text of a CU is already defined and can be searched.

⁷ If you delete pauses on norm-[norm]-layer, then you also have to delete additionally the concerned events (gaps) on dipl-[language]-layer.

⁸ If you delete hesitation markers on norm-[norm]-layer, then you also have to delete additionally the concerned events (gaps) on dipl-[language]-layer.

⁹ If you delete phonetic markers on norm-[norm]-layer, then you also have to delete additionally the concerned events (gaps) on dipl-[language]-layer.

¹⁰ General rule: We do not add or correct any punctuation marks. Only if a punctuation mark sticks to a word do we have to correct it on norm-[norm]-layer.

¹¹ Clitics are synsemantic/function words (= words without an own semantic meaning [==> therefore, they have a grammatic meaning]), which phonologically "fuse" into their immediate "neighbour" words (which are in most of the cases nouns). That means, that clitics (which are in most of the cases prepositions) and their immediate "neighbour words" (nouns) get articulated as one (phonological) word. Depending on their position (behind or in front of their nouns to which they belong), clitics can be classified in proclitics (if the clitic stands in front of its noun to which it belongs) and in enclitics (if the clitic stands behind its noun to which it belongs).

Example:

работать из (=clitic) дому (=noun) ==> articulation: издому, врач на (=clitic) дом (=noun) ==> articulation: надом

¹² If you delete unknown material on norm-[norm]-layer, then you also have to delete additionally the concerned events (gaps) on dipl-[language]-layer

¹³ If you delete the interaction/communication with the elicitor on norm-[norm]-layer, then you also have to delete additionally the concerned events (gaps) on dipl-[language]-layer

¹⁴ If you delete interrupted speech on norm-[norm]-layer, then you also have to delete additionally the concerned events (gaps) on dipl-[language]-layer

¹⁵ If you delete non-verbal material on norm-[norm]-layer, then you also have to delete additionally the concerned events (gaps) on dipl-[language]-layer

for normalization of the participant speech according to the orthographical rules of Russian on norm[norm]-layer use the Малый академический словарь MAC or Викисловарь

Lemmatization

the term lemma may be defined as the base form of a word
the base form of a word is the form, you can usually find in a dictionary
for verbs the base form correlates with the infinitive, for nouns with the nominative, and for adjectives with the nominative in its masculine form
the conversion of a word into its base form is called lemmatization
the lemmatization is carried out semi-automatically in the application program EXMARaLDA using two POS- and lemma-taggers U-POS and MyStem, however the accuracy of the taggers should be checked manually each time
the lemmas or base forms of the words can be found in MyStem on the norm[mystem_lex] and in U-POS on the norm[lemma] layer

POS-Tagging

the term tagging means that each word of the participant is attributed with its part of speech (POS)
the tagging is carried out in the application program EXMARaLDA by semi-automatic U-POS and MyStem taggers, but the accuracy of the taggers should be checked manually each time
there are two taggers in EXMARaLDA, which assume the task of POS-tagging - U-POS and MyStem
keep in mind that these two taggers are similar to each other, but not absolutely identical ¹

U-POS-Layers

to the U-POS-software belong the layers from norm[Animacy] to norm[voice] as well as the norm[lemma] and the norm[pos] layer
each layer in U-POS (and MyStem) correlates with a grammatical category
the meaning of each grammatical category in U-POS gets explained in the following table:

Layer	Grammatical category	Grammeme	Part of speech
norm[Animacy]	Одушевлённость	Одушевлённость (Anim); Неодушевлённость (Inan)	concerns only nouns
norm[Aspect]	Вид	Cовершенный вид [что сделать?] (Perf); Несовершенный вид [что делать?] (Imp)	concerns only verbs
norm[Case]	Падеж	им.п. (Nom); род.п. (Gen); дат.п. (Dat); вин.п. (Acc); твор.п. (Ins); предл.п. (Loc); зват.п. (Voc)	concerns all nominal categories of POS
norm[Degree]	Степень сравнения	положительная (Pos); сравнительная (Cmp); превосходная (Sup)	concerns adjectives and adverbs
norm[Foreign]	иностранное слово	(Yes)	concerns all words, which do not belong to the Russian language
norm[Gender]	Род	муж.р. (Masc); жен.р. (Fem); сред.р. (Neut)	concerns only nouns, adjectives and pronouns
norm[Mood]	Наклонение	изъяв.н. (Ind); услов.н. (Cnd); повел.н. (Imp)	concerns only verbs
norm[Number]	Число	Единственное (Sing); Множественное (Plur)	concerns nouns, adjectives, personal pronouns and verbs
norm[Person]	Лицо	Первое лицо (1); Второе лицо (2); Третье лицо (3)	concerns personal pronouns and verbs
norm[Tense]	Время	Настоящее (Pres); Прошедшее (Past); Будущее (Fut)	concerns verbs and participles
norm[VerbForm]	Форма глагола	Неопределённая форма глагола (Inf); Финитная форма глагола (Fin); Причастие (Part); Деепричастие/Герундий (conv)	concerns verbs
norm[voice]	Залог	Действительный (Act); middle voice (Mid); Страдательный (Pas)	concerns verbs and participles
norm [lemma]	Base form of a word (Начальная форма слова)	------	concerns all parts of speech
norm[pos]	POS-Determination of the given word according to UPOS principles	существительное (NOUN); глагол (VERB); прилагательное (ADJ); determiner (DET) [abandon in all cases] ...	concerns all parts of speech
norm[Reflex]	Real reflexive verbs (настоящие возвратные глаголы) ²	(Yes)	concerns verbs and participles

MyStem-Layers

to the MyStem-tagger belong the norm[mystem_gr] and the norm[mystem_lex] layers
each layer in MyStem (and U-POS) correlates with a grammatical category
the meaning of each grammatical category in MyStem can be explained as in the following table:

Layer	Grammatical category	Grammeme	Part of speech
norm[mystem_gr]	POS-Determination of the given word according to MyStem principles	Every redundant grammeme on this layer gets deleted, except the first grammeme and - if they appear - the grammeme of transitivity (tran/intr) ³ and parenthesis (parenth)	concerns all parts of speech
norm[mystem_lex]	Base form of a word	should conform with the base form in U-POS	concerns all parts of speech
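
The reduction rule for the norm[mystem_gr] layer (keep the first grammeme plus transitivity and parenthesis) can be sketched in Python as follows. This is only an illustration: `reduce_mystem_gr` and `KEEP` are hypothetical names, and the input format (grammemes separated by `,` or `=`) as well as the sample strings in the usage note are assumptions about what the layer contains before cleanup. Further non-redundant grammemes can be passed in through `keep`.

```python
import re

# Grammemes that survive besides the first one (transitivity, parenthesis).
KEEP = frozenset({"tran", "intr", "parenth"})


def reduce_mystem_gr(gr: str, keep=KEEP) -> str:
    """Keep the first grammeme and every later grammeme listed in `keep`."""
    # Assumption: grammemes are separated by "," or "=".
    parts = [p.strip() for p in re.split(r"[,=]", gr) if p.strip()]
    if not parts:
        return ""
    kept = [parts[0]] + [p for p in parts[1:] if p in keep]
    return ",".join(kept)
```

Under these assumptions, a hypothetical value such as `"V,ipf,intr=praet,sg,indic,f"` is reduced to `"V,intr"`, and `reduce_mystem_gr("S,persn,m,anim=nom,sg", KEEP | {"persn"})` gives `"S,persn"`. The output should still be checked manually against the U-POS layers.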

... files from DEbi---R; USbi---R and RUmo---R with following symbols at the end:
- _fsR (formal spoken Russian)
- _fwR (formal written Russian)
- _isR (informal spoken Russian)
- _iwR (informal written Russian)

1. step: Push/Pull/Fetch in GitHub
2. step: Open EXMARaLDA Partitur-Editor
3. step: File ==> Open ==> rueg repository ==> GitHub (or SmartGit) ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
4. step: Verify if the CUs in every file correlate with the CU-guidelines - if not, please correct it
5. step: Verify if every word correlates with its right language on the dipl[language]-layer - if not, please correct it
6. step: POS-Tagging ==> verify the accuracy of the POS-Tagging-softwares (U-POS and MyStem)
7. step: Delete all features from the norm[mystem_gr]-layer except the first one and - if available - the features of transitivity, parenthesis and other features which are not redundant with U-POS features
8. step: Save your results
9. step: Go to GitHub (SmartGit) ==> submit your file ==> push/pull/fetch -> commit

Phenomenon/Problem	Solution	Example
participant code	dipl[language]: rus; norm[Foreign]: Yes; norm[mystem_gr]: S, persn; norm[mystem_lex]: USbi05FR; norm[lemma]: USbi05FR; norm[pos]: PROPN; all other grammemes on UPOS-layers get deleted	здравствуйте меня зовут USbi05FR
emojis	dipl[language]: rus; norm[pos]: SYM; all other grammemes on UPOS-layers get deleted	-----
foreign words, e.g. english words: examine each grammatically e.g. анд	dipl[language]: eng; norm [Foreign]: Yes; norm[mystem_gr]: CONJ; norm[pos]: CCONJ; norm[mystem_lex]:анд; norm[lemma]:анд	and = анд
items, e.g. English items: examine each grammatically e.g. а(н)	dipl[language]: eng; norm [Foreign]: Yes; norm[mystem_gr]: ANUM; norm[mystem_lex]:а(н); norm[lemma]:а(н); norm[pos]: DET [abandon in all cases]	a(n) = а(н)
words with the letter ё	ё is written on all layers except the dipl layer ==> nothing is changed on the dipl layer ==> norm[norm]: …ё…; norm[lemma]: …ё…; norm[mystem_lex]: …ё…	-----
ага	norm[mystem_gr]: PART; norm[mystem_lex]: ага; norm[lemma]: ага; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	-----
блин	norm[mystem_gr]: INTJ; norm[mystem_lex]: блин; norm[lemma]: блин; norm[pos]: INTJ; all other grammemes on UPOS-layers get deleted	ну блин
быстро	norm[Degree] Pos ⁴; norm[mystem_gr]: ADV; norm[mystem_lex]: быстро; norm[lemma]: быстро; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	эта машина очень быстро ехала
быть	norm[Aspect]: Imp; norm[Gender]: Fem; norm[Mood]: Ind; norm[Number]: Sing; norm[Tense]: Past; norm[VerbForm]: Fin; norm[Voice]: Act; norm[mystem_gr]: V,intr; norm[mystem_lex]: быть; norm[lemma]: быть; norm[pos]: AUX ⁵	она была уверена
быть	norm[Aspect]: Imp; norm[Gender]: Fem; norm[Mood]: Ind; norm[Number]: Sing; norm[Tense]: Past; norm[VerbForm]: Fin; norm[Voice]: Act; norm[mystem_gr]: V,intr; norm[mystem_lex]: быть; norm[lemma]: быть; norm[pos]: VERB ⁶	там была собака
весь	norm[Case]: Gen; norm[Gender]: Fem; norm[Number]: Sing; norm[mystem_gr]: APRO ⁷; norm[mystem_lex]: весь; norm[lemma]: весь; norm[pos]: PRON	от всей души; что скажешь к всему этому
вообще	norm[mystem_gr]: ADV,parenth; norm[mystem_lex]: вообще; norm[lemma]: вообще; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	ну вообще там была ещё одна машина
вот in function to replace something	norm[mystem_gr]: ADVPRO; norm[mystem_lex]: вот; norm[lemma]: вот; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	вот он идёт
вот in function of a modal particle	norm[mystem_gr]: PART; norm[mystem_lex]: вот; norm[lemma]: вот; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	вот а потом мальчик побежал за мячом
врезаться	norm[Aspect]: Perf; norm[Gender]: Fem; norm[Mood]: Ind; norm[Number]: Sing; norm[Tense]: Past; norm[VerbForm]: Fin; norm[Voice]: Mid; norm[mystem_gr]: V, intr; norm[mystem_lex]: врезаться; norm[lemma]: врезаться; norm[pos]: VERB; norm[Reflex]: Yes; all other grammemes on UPOS-layers get deleted	одна машина врезалась в другую
вроде	norm[mystem_gr]: PART; norm[mystem_lex]: вроде; norm[lemma]: вроде; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	вроде никто не пострадал
всё (ещё, равно)	norm[Case]: Nom; norm[Gender]: Neut; norm[Number]: Sing; norm[mystem_gr]: APRO; norm[mystem_lex]: всё; norm[lemma]: всё; norm[pos]: PRON	это всё; всё равно; всё ещё
всё-таки	norm[mystem_gr]: PART; norm[mystem_lex]: всё-таки; norm[lemma]: всё-таки; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	он всё-таки поступил по-своему
всё-таки after conjunctions и, а, но	norm[mystem_gr]: CONJ; norm[mystem_lex]: всё-таки; norm[lemma]: всё-таки; norm[pos]: SCONJ; all other grammemes on UPOS-layers get deleted	как ни крути, а всё-таки придётся решить эту проблему
да	norm[mystem_gr]: PART, parenth; norm[mystem_lex]: да; norm[lemma]: да; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	да так всё произошло
давай	norm[Aspect]: Imp; norm[Mood]: Imp; norm[Number]: Sing; norm[Person]: 2; norm[VerbForm]: Fin; norm[Voice]: Act; norm[mystem_gr]: V,tran; norm[mystem_lex]: давать; norm[lemma]: давать; norm[pos]: VERB; all other grammemes on UPOS-layers get deleted	давай
два	norm[Case]: Nom; norm[Gender]: Fem; norm[mystem_gr]: NUM ⁸; norm[mystem_lex]: два; norm[lemma]: два; norm[pos]: NUM	стукнулись две машины
должен, должна, должно, должны	norm[Gender]: Masc; norm[Number]: Sing; norm[Variant]: Short; norm[mystem_gr]: A, praed; norm[mystem_lex]: должен; norm[lemma]: должен; norm[pos]: ADJ; all other grammemes on UPOS-layers get deleted	он должен был позвонить в полицию, но в конце не звонил
другой	norm[Case]: Acc; norm[Gender]: Fem; norm[Number]: Sing; norm[mystem_gr]: APRO ⁹; norm[mystem_lex]: другой; norm[lemma]: другой; norm[pos]: ADJ	одна машина врезалась в другую
ДТП (дорожно-транспортное происшествие)	norm[Animacy]: Inan; norm[Case]: Gen; norm[Gender]: Neut (because of происшествие); norm[Number]: Sing; norm[mystem_gr]: S,abbr; norm[mystem_lex]: ДТП; norm[lemma]: ДТП; norm[pos]: PROPN	я стал свиделем ДТП
его, её, их as possessive pronouns	norm[Case]: Gen; norm[Gender]: Fem; norm[Number]: Sing; norm[Person]: 3; norm[mystem_gr]: SPRO; norm[mystem_lex]: она; norm[lemma]: она; norm[pos]: PRON	он уронил её пакет
ехавший	norm[Aspect]: Imp; norm[Case]: Nom; norm[Gender]: Masc; norm[Number]: Sing; norm[Tense]: Past; norm[VerbForm]: Part; norm[Voice]: Act; norm[mystem_gr]: V, intr; norm[mystem_lex]: ехать; norm[pos]: VERB; all other grammemes on UPOS-layers get deleted	второй водитель ехавший сзади не успел притормозить
ещё	norm[mystem_gr]: ADV; norm[mystem_lex]: ещё; norm[lemma]: ещё; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	там ещё стояла женщина рядом с машиной
женат	norm[Gender]: Masc; norm[Number]: Sing; norm[Variant]: Short; norm[mystem_gr]: A, praed; norm[mystem_lex]: женатый; norm[lemma]: женатый; norm[pos]: ADJ; all other grammemes on UPOS-layers get deleted	он видимо женат
заезжая	norm[Aspect]: Imp; norm[Tense]: Pres; norm[VerbForm]: Conv; norm[Voice]: Act; norm[mystem_gr]: V,intr,ger; norm[mystem_lex]: заезжать; norm[lemma]: заезжать; norm[pos]: VERB; all other grammemes on UPOS-layers get deleted	одновременно заезжая пара машин
здравствуйте, пока, привет	norm[mystem_gr]: INTJ; norm[mystem_lex]: здравствуйте; norm[lemma]: здравствуйте; norm[pos]: INTJ; all other grammemes on UPOS-layers get deleted	здравствуйте я звоню по поводу
здрасте, приветик	norm[mystem_gr]: INTJ, inform; norm[mystem_lex]: здрасте; norm[lemma]: здрасте; norm[pos]: INTJ; all other grammemes on UPOS-layers get deleted	здрасте я звоню по поводу
значит as вводное слово	norm[Aspect]: Imp; norm[Mood]: Ind; norm[Number]: Sing; norm[Person]: 3; norm[Tense]: Pres; norm[VerbForm]: Fin; norm[Voice]: Act; norm[mystem_gr]: V, parenth, tran; norm[mystem_lex]: значить; norm[lemma]: значить; norm[pos]: VERB ; all other grammemes on UPOS-layers get deleted	значит он уронил всё и пошёл
играть	norm[Aspect]: Imp; norm[Mood]: Ind; norm[Number]: Sing; norm[Person]: 3; norm[Tense]: Past; norm[VerbForm]: Fin; norm[Voice]: Act; norm[mystem_gr]: V, tran ¹⁰; norm[mystem_lex]: играть; norm[lemma]: играть; norm[pos]: VERB	мальчик играл с мячом
как at the beginning of dependent/subordinate clause	norm[mystem_gr]: CONJ; norm[mystem_lex]: как; norm[lemma]: как; norm[pos]: SCONJ; all other grammemes on UPOS-layers get deleted	он не знает как это делается
как in case of comparison or emphasizing	norm[mystem_gr]: PART; norm[mystem_lex]: как; norm[lemma]: как; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	водитель тупой как пробка
как at the beginning of direct questions or at the beginning of indirect questions in subordinate clauses	norm[mystem_gr]: ADVPRO; norm[mystem_lex]: как; norm[lemma]: как; norm[pos]: PRON; all other grammemes on UPOS-layers get deleted	как у тебя дела; подскажите как пройти к библиотеке
как in function of a subordinate conjunction without a comparison meaning, but in form of an adverb	norm[mystem_gr]: ADVPRO; norm[mystem_lex]: как; norm[lemma]: как; norm[pos]: PRON; all other grammemes on UPOS-layers get deleted	мальчик показал как пройти к дому; я не знаю как это сделать
кажется as вводное слово	norm[Aspect]: Imp; norm[Mood]: Ind; norm[Number]: Sing; norm[Person]: 3; norm[Tense]: Pres; norm[VerbForm]: Fin; norm[Voice]: Act; norm[mystem_gr]: V, parenth, tran; norm[mystem_lex]: казаться; norm[lemma]: казаться; norm[pos]: VERB	кажется водитель не вовремя видел мячик
км/ч	norm[mystem_gr]: S, abbr; norm[mystem_lex]: км/ч; norm[lemma]: км/ч; norm[pos]: NOUN; all other grammemes on UPOS-layers get deleted	сто км/ч
какой	norm[Case]: Nom; norm[Gender]: Masc; norm[Number]: Sing; norm[mystem_gr]: APRO¹¹; norm[mystem_lex]: какой; norm[lemma]: какой; norm[pos]: PRON; all other grammemes on UPOS-layers get deleted	там шёл какой-то мужик
короче as вводное слово	norm[Degree]: Cmp; norm[mystem_gr]: ADV, parenth; norm[mystem_lex]: коротко; norm[lemma]: коротко; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	короче там шла женщина с коляской
который	norm[Case]: Nom; norm[Gender]: Masc; norm[Number]: Sing; norm[mystem_gr]: APRO ¹²; norm[pos]: PRON	этот мальчик ну который там играл с мячиком он
мой, твой	norm[Case]: Gen; norm[Gender]: Masc; norm[Number]: Sing; norm[mystem_gr]: APRO; norm[mystem_lex]: мой; norm[lemma]: мой; norm[pos]: PRON	я звоню вам с моего телефона
мол as вводное слово	norm[mystem_gr]: PART, parenth; norm[mystem_lex]: мол; norm[lemma]: мол; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	----
мужик	norm[Animacy]: Anim; norm[Case]: Nom; norm[Gender]: Masc; norm[Number]: Sing; norm[mystem_gr]: S,inform; norm[pos]: NOUN; all other grammemes on UPOS-layers get deleted	мужик побежал на дорогу
наверно, похоже as вводное слово	norm[mystem_gr]: ADV, parenth; norm[mystem_lex]: наверно; norm[lemma]: наверно; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	он наверно этого не знал
никто	norm[Case]: Acc; norm[Gender]: Masc; norm[mystem_gr]: SPRO; norm[mystem_lex]: никто; norm[lemma]: никто; norm[pos]: PRON; all other grammemes on UPOS-layers get deleted	я никого не видел
нет	norm[mystem_gr]: PART, parenth; norm[mystem_lex]: нет; norm[lemma]: нет; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	нет не поеду ни за что
ну	norm[mystem_gr]: PART; norm[mystem_lex]: ну; norm[lemma]: ну; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	ну что я могу сказать
нужно, можно, надо	norm[mystem_gr]: ADV, praed; norm[mystem_lex]: нужно; norm[lemma]: нужно; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	-----
ого	norm[mystem_gr]: PART; norm[mystem_lex]: ого; norm[lemma]: ого; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	-----
один	norm[Case]: Nom; norm[Gender]: Masc; norm[Number]: Sing; norm[mystem_gr]: ANUM; norm[mystem_lex]: один; norm[lemma]: один; norm[pos]: NUM	я видел как один человек позвонил в полицию
окей	norm[mystem_gr]: PART; norm[mystem_lex]: окей; norm[lemma]: окей; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	-----
первый	norm[Case]: Nom; norm[Gender]: Fem; norm[Number]: Sing; norm[mystem_gr]: ANUM; norm[mystem_lex]: первый; norm[lemma]: первый; norm[pos]: NUM	первая машина свернула с дороги на парковку и резко остановилась	stehen lassen
пока (conjunction)	norm[mystem_gr]: CONJ; norm[mystem_lex]: пока; norm[lemma]: пока; norm[pos]: SCONJ; all other grammemes on UPOS-layers get deleted	пока она доставала продукты из машины мальчик играл с мячом
пока (leave-taking)	norm[mystem_gr]: INTJ; norm[mystem_lex]: пока; norm[lemma]: пока; norm[pos]: INTJ; all other grammemes on UPOS-layers get deleted	пока пока
потом, затем	norm[mystem_gr]: ADVPRO; norm[mystem_lex]: потом; norm[lemma]: потом; norm[pos]: PRON; all other grammemes on UPOS-layers get deleted	потом машины стукнулись
потому, поэтому	norm[mystem_gr]: ADVPRO; norm[mystem_lex]: потому; norm[lemma]: потому; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	потому что водитель был пьяный
раз	norm[Animacy]:Inan; norm[Case]: Nom; norm[Gender]: Masc; norm[Number]: Sing; norm[mystem_gr]: S,m,inan ; norm[pos]: NOUN; all other grammemes on UPOS-layers get deleted	которая как раз въехала
ранен	norm[Aspect]: Imp; norm[Gender]: Masc; norm[Number]: Sing; norm[Tense]: Past; norm[Variant]: Short; norm[VerbForm]: Part; norm[Voice]: Pass; norm[mystem_gr]: V, tran, praed; norm[mystem_lex]: ранить; norm[lemma]: ранить; norm[pos]: VERB; all other grammemes on UPOS-layers get deleted	никто не ранен
свой	norm[Case]: Acc; norm[Gender]: Masc; norm[Number]: Sing; norm[mystem_gr]: APRO ¹³; norm[pos]: PRON	он любит свой народ
сзади	norm[mystem_gr]: ADV; norm[mystem_lex]: сзади; norm[lemma]: сзади; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	а сзади как раз машина подъезжает
сзади	norm[mystem_gr]: PR; norm[mystem_lex]: сзади; norm[lemma]: сзади; norm[pos]: ADP; all other grammemes on UPOS-layers get deleted	а сзади неё как раз две машины подъезжают
собакин	norm[Case]: Acc; norm[Number]: Plur; norm[mystem_gr]: APRO,poss; norm[mystem_lex]: собакин; norm[lemma]: собакин; norm[pos]: ADJ; all other grammemes on UPOS-layers get deleted	тётя и дядя я думаю это собакины
спасибо	norm[mystem_gr]: INTJ; norm[mystem_lex]: спасибо; norm[lemma]: спасибо; norm[pos]: INTJ; all other grammemes on UPOS-layers get deleted	-----
судя	norm[Aspect]: Imp; norm[Tense]: Pres; norm[VerbForm]: Conv; norm[Voice]: Mid; norm[mystem_gr]: V, intr, ger; norm[mystem_lex]: судить; norm[lemma]: судить; norm[pos]: VERB; all other grammemes on UPOS-layers get deleted	судя по тому что случилось
там, так, тут	norm[mystem_gr]: ADVPRO; norm[mystem_lex]: там; norm[lemma]: там; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	там женщина шла по дороге
типа	norm[mystem_gr]: PART,parenth; norm[mystem_lex]: типа; norm[lemma]: типа; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	ну типа того
то at the beginning of subordinate clauses	norm[mystem_gr]: CONJ; norm[mystem_lex]: то; norm[lemma]: то; norm[pos]: SCONJ; all other grammemes on UPOS-layers get deleted	если у вас ещё вопросы возникнут то свяжитесь со мной
то in function to replace sth.	norm[Case]: Nom; norm[Gender]: Neut; norm[Number]: Sing; norm[mystem_gr]: APRO; norm[mystem_lex]: тот; norm[lemma]: тот; norm[pos]: PRON; all other grammemes on UPOS-layers get deleted	произошло то что мы все предвидели
тоже, только	norm[mystem_gr]: PART; norm[mystem_lex]: тоже; norm[lemma]: тоже; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	он тоже вышел из машины
тот, этот, такой	norm[Case]: Dat; norm[Gender]: Fem; norm[Number]: Sing; norm[mystem_gr]: APRO¹¹; norm[mystem_lex]: тот; norm[lemma]: тот; norm[pos]: DET; all other grammemes on UPOS-layers get deleted	по той же дороге ехали ещё две машины
увидев	norm[Aspect]: Perf; norm[Tense]: Past; norm[VerbForm]: Conv; norm[Voice]: Act; norm[mystem_gr]: V, tran, ger; norm[mystem_lex]: увидеть; norm[lemma]: увидеть; norm[pos]: VERB; all other grammemes on UPOS-layers get deleted	собака увидев мяч кинулась на него
ф	dipl[language]: rus; norm[mystem_gr]: S,persn; norm[mystem_lex]: ф; norm[lemma]: ф; norm[pos]: PROPN; all other grammemes on UPOS-layers get deleted	ф шестнадцать
хз (хер знает)	norm[mystem_gr]: INTJ, abbr, parenth; norm[mystem_lex]: хз; norm[lemma]: хз; norm[pos]: INTJ	Водители обсуждали ситуацию но полиции не было хз
чуть-чуть	norm[mystem_gr]: ADV; norm[mystem_lex]: чуть-чуть; norm[lemma]: чуть-чуть; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	он чуть-чуть опоздал
щас	norm[mystem_gr]: ADV,inform; norm[mystem_lex]: щас; norm[lemma]: щас; norm[pos]: ADV; all other grammemes on UPOS-layers get deleted	щас приду
это in function to replace sth.	norm[Case]: Nom; norm[Gender]: Neut; norm[Number]: Sing; norm[mystem_gr]: APRO; norm[mystem_lex]: этот; norm[lemma]: этот; norm[pos]: PRON; all other grammemes on UPOS-layers get deleted	он ему это сказал
это after dash (тире)	norm[mystem_gr]: PART; norm[mystem_lex]: это; norm[lemma]: это; norm[pos]: PART; all other grammemes on UPOS-layers get deleted	мама - это самый родной человек на свете
я	norm[Case]: Nom; norm[Number]: Sing; norm[Person]: 1; norm[mystem_gr]: SPRO ¹⁴; norm[pos]: PRON; all other grammemes on UPOS-layers get deleted	-----
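
The annotation column of the table above follows a regular "layer: value; layer: value" pattern. As a minimal sketch, assuming one wanted to check such a cell automatically, the following hypothetical helper (parse_annotation is not part of any RUEG tool) splits a cell into its layer/value pairs and notes whether the remaining UPOS layers are to be cleared:

```python
import re

# One "tier[layer]: value" pair; a missing space after the colon is tolerated.
PAIR = re.compile(r"^\s*(\w+)\[(\w+)\]\s*:\s*(.+?)\s*$")


def parse_annotation(cell):
    """Return ({'tier[layer]': value}, clear_rest) for one table cell.

    clear_rest is True when the cell contains the instruction
    'all other grammemes on UPOS-layers get deleted'.
    """
    values = {}
    clear_rest = False
    for part in cell.split(";"):
        part = part.strip()
        if not part:
            continue
        if part.lower().startswith("all other"):
            clear_rest = True
            continue
        match = PAIR.match(part)
        if match:
            tier, layer, value = match.groups()
            values[f"{tier}[{layer}]"] = value
    return values, clear_rest


cell = ("norm[mystem_gr]: PART; norm[mystem_lex]: ну; norm[lemma]: ну; "
        "norm[pos]: PART; all other grammemes on UPOS-layers get deleted")
values, clear_rest = parse_annotation(cell)
print(values["norm[pos]"], clear_rest)
```

Parts that do not match the pattern (for example a misspelled layer name) are silently skipped, so comparing the number of parsed pairs with the number of semicolon-separated parts is one way to spot typos in a cell.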

5. Comments

¹ U-POS and MyStem use partly different features for the POS-tagging of words.

 Example: For the Russian personal pronoun я, U-POS only classifies it as a pronoun (PRON) and gives no further specification in this context. MyStem, in contrast, specifies the type of pronoun.
          MyStem classifies я as a noun-pronoun (SPRO).

² In general, all reflexive verbs in Russian can be identified by the verb postfix -ся, but not all verbs that end in -ся are reflexive. Verbs with a transitive word stem and the postfix -ся are not reflexive verbs, but verbs in the passive voice. When in doubt, check the Russian verb by translating it into German. If you can translate the Russian verb into German with sich..., then it is very likely a true reflexive verb and should be marked with Yes on the norm[Reflex]-layer and with Mid on the norm[Voice]-layer. If that is not possible and you have to translate the verb with the passive construction wird/werden...ge-..., then it is very likely a transitive verb in its passive form. In this case the word is marked with Pass on the norm[Voice]-layer and the norm[Reflex]-layer stays empty.

 Example: Книга читается.
          Das Buch liest sich. ==> This translation does not make sense (except in fairy tales), because a book
                                   usually cannot read itself.
          Das Buch wird gelesen. ==> This translation is more logical (assuming the context is not a fairy tale),
                                     because the word stem is a transitive verb with the postfix -ся. The verb
                                     therefore expresses the passive: the book is read by someone who is unknown
                                     or who is not mentioned.

 Example: Человек развивается.
          Der Mensch wird entwickelt. ==> Развивать is a transitive verb, and the postfix -ся could suggest that
                                          this is the passive voice. This reading is possible and, without
                                          context, difficult to rule out. Since there is no context here, follow
                                          the general meaning in which this sentence is commonly used.
          Der Mensch entwickelt sich. ==> This is the general, frequently used meaning of the sentence. In this
                                          meaning the verb is reflexive, not passive. Prefer this reading when
                                          the context is missing or unclear.
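
The decision rule of comment ² can be summarized as a small sketch. It is only an illustration: both inputs are the annotator's own judgements (nothing is analyzed automatically), the function name is hypothetical, and Pass is the value used in the ранен row of the table above.

```python
def voice_for_sya_verb(has_transitive_stem, translates_with_sich):
    """Return (norm[Voice], norm[Reflex]) for a verb ending in -ся.

    translates_with_sich: the verb's usual reading can be rendered
        with German 'sich ...' (checked first, as in Человек развивается).
    has_transitive_stem: the stem without -ся is a transitive verb.
    """
    if translates_with_sich:
        return "Mid", "Yes"
    if has_transitive_stem:
        return "Pass", ""      # norm[Reflex] stays empty
    return None, None          # no decision; consult the context


print(voice_for_sya_verb(True, False))   # Книга читается
print(voice_for_sya_verb(True, True))    # Человек развивается
```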

³ Transitive verbs are verbs that govern direct objects (objects in the accusative without a preposition); no preposition stands between the verb and the accusative object. Only transitive verbs can form the passive voice. The passive voice can be recognized by the word stem of a transitive verb + postfix -ся.

 Example: Мальчик  читает книгу. Книга читается  мальчиком. 
          Junge (Nom) liest (tran.verb) Buch (Acc.obj. without preposition). Buch (Nom) wird gelesen (pass. voice of a 
          tran.verb) vom Jungen (Inst).

Intransitive verbs are verbs that govern indirect objects (objects in the accusative with a preposition or objects in other grammatical cases). A preposition can appear between the verb and its object(s). The objects can appear in the accusative with a preposition, in the dative with or without a preposition, in the genitive with or without a preposition, in the instrumental with or without a preposition, and in the locative with a preposition (objects in the locative always take a preposition, which is why the Russian locative is called the prepositional case). Intransitive verbs cannot form the passive voice.

 Example: Папа звонит маме. *Мама звонится папой.
          Papa (Nom) ruft (intr.verb) an Mama (Dat.). *Mama wird angerufen von Papa.

⁴ Keep in mind that not all adverbs and not all adjectives can form degrees. The adverb сегодня or the adjective другой can't form degrees. In these cases you should delete the token on the norm[degree]-layer.

⁵ In this case быть has the function of an auxiliary (Hilfsverb). The main act/main verb of the sentence is therefore not быть, but уверен (in combination with быть). For this reason the word быть is defined as AUX on the norm[pos]-layer.

⁶ In this case быть is the main act of the sentence and therefore has the function of the main verb (Vollverb). For this reason the word быть is defined as VERB on the norm[mystem_lex]-layer and the norm[pos]-layer.

⁷ The pronoun весь has these grammatical features, if it can be translated as ganz/целый. In these cases весь can be seen more as an adjective, therefore APRO and PRO.

¹⁵ The pronoun весь has these grammatical features if it can be translated as all/aller. In these cases весь is used to replace a noun or a phrase and to refer back to an element, word or situation which was already introduced in the discourse and which the speaker does not repeat, therefore DET and SPRO.

⁸ In comparison to один, два is defined on the norm[mystem_gr]-layer as NUM, because it is not inflected like an adjective. Therefore, один gets ANUM on the norm[mystem_gr]-layer (because it inflects like an adjective) and два gets NUM (because it does not). Furthermore, unlike один, два has no plural paradigm.

⁹ The word другой is defined on the norm[mystem_gr]-layer as APRO, because it is inflected like an adjective but has the function of an SPRO, replacing other nouns, therefore APRO and ADJ. Furthermore, другой can't form degrees, so the event on the norm[degree]-layer should be empty.

¹⁰ In this context the verb играть is intransitive, because the Russian preposition c usually requires the instrumental. However, there exist cases, in which играть can be used as a transitive verb.

 Example: Вася играет дурака в этом спектакле.
          Vasja (Nom) spielt (tran.verb) den Dummen (acc.object without a preposition between verb and object)  in diesem 
          Stück (Loc).

Therefore, all verbs which might have a transitive meaning in other contexts have to be defined as transitive on the MyStem layer, even if the verb is used intransitively in the current context! The reason is that a verb which can (theoretically) be used transitively is always treated as a verb with a transitive basic meaning, regardless of whether this transitive meaning appears in the current situation.

¹³ The pronoun свой is defined on norm[mystem_gr]-layer as APRO, because it gets inflected like an adjective, therefore APRO.

¹² Words like такой or который are defined on norm[mystem_gr]-layer as APRO, because in Russian these pronouns get inflected like adjectives, therefore APRO.

¹⁶ то есть is treated as two separate words, because there is no hyphen (дефис) combining the two words into one ==> то is a word of its own and есть is a word of its own. Therefore, each word is treated as its own token, gets its own event and has to be determined grammatically on its own. The same applies to words like потому что or только что. They are treated as two separate words, get their own events and have to be grammatically determined on their own.
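
A minimal sketch of the tokenization consequence of comment ¹⁶, assuming tokens are simply separated at whitespace (the actual tokenizer used in the project may differ): hyphenated forms such as всё-таки stay one token, while то есть and потому что become two tokens each.

```python
def split_tokens(text):
    """Split at whitespace only; a hyphen never separates tokens."""
    return text.split()


print(split_tokens("то есть он всё-таки опоздал потому что проспал"))
```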

¹¹ Words like тот or этот are defined on the norm[mystem_gr]-layer as APRO, because these pronouns are inflected like adjectives. They are defined on the norm[pos]-layer as DET, because they additionally have a determining (referring) function: they refer back to an element, word or situation which was already introduced in the discourse and which the speaker does not repeat.

¹⁴ All personal pronouns are defined on norm[mystem_gr]-layer as SPRO and on norm[pos]-layer as PRON. Personal pronouns get defined on norm[mystem_gr]-layer as SPRO, because in Russian these pronouns replace other nouns (существительные), therefore SPRO.

If you have trouble deciding which part of speech a word belongs to, look the word up in the Национальный корпус русского языка and check their results or solution. But keep in mind that they have analyzed the speech of their participants partly under different conditions and assumptions.
All U-POS features are available here: Universal features part 1 and Universal features part 2
All MyStem features are available here: MyStem features
If you have trouble deciding whether a word is transitive or intransitive, or if you simply don't know which grammatical case a word appears in, use Викисловарь

create a TextGrid on Praat
import a TextGrid to EXMARaLDA

speaker tier (e.g TUmo01MT; type: transcription)
optional tier for segmentation in Intonation Phrases (IP)
Normalization in EXMARaLDA

According to Communication Units (CU) Communication_unit__P4_10.12.2018.pdf
No punctuation

Replace name of participant with the respective code (e.g TUmo01MT)
If whole names or surnames of friends are mentioned, replace with the participant code + _P (e.g. TUmo02FT_P)
Places that could lead to the identification of a participant (e.g. Atatürk okulunda = Axxx{schoolname} okulunda, Kızılay caddesi = Kxxx{streetname} caddesi)
if a phone number is mentioned, please anonymize it as {phonenumber}
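
The place-name convention above (first letter, then xxx, then the kind of place in braces) can be written down as a short sketch. The helper name is hypothetical, and deciding which tokens actually need anonymization remains a manual step.

```python
def anonymize_place(name, kind):
    """Keep the first letter, mask the rest, append the kind in braces."""
    return f"{name[0]}xxx{{{kind}}}"


print(anonymize_place("Atatürk", "schoolname"), "okulunda")
print(anonymize_place("Kızılay", "streetname"), "caddesi")
```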

If this is not possible mark those passages as: <Q> communication with elicitor </Q>

Merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements
Examples from TUmo10MT_isT: n=apıyorsun (= ne yapıyorsun), TUmo11MT_isT: n=aber (= ne haber)

Tag Questions

tag questions (de mi) do not constitute a separate CU

reduced syllables are transcribed as articulated
Examples: bi tane (= bir tane), gidiyo (= gidiyorsun), yakıyosun (= yakıyorsun), içbiri (= hiçbiri)
Use / to mark unfinished words, e.g. “Çarb/ çarptı derken oldu bitti“

pronounced sounds are transcribed as articulated (e.g. gardaşım (= kardeşim)), but sounds which are not typical for Turkish are not represented.

0.2 - 1 sec: (-)
1-3 secs: (--)
More than 3 secs: (5.5) to be measured
Word-internal pauses are marked as follows: top(-)la - no space between the parts.
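
As a sketch of the pause notation, the following hypothetical helper maps a measured duration to its symbol. It uses the thresholds listed above; how the exact boundary values (1 s, 3 s) are assigned is an assumption, as is leaving pauses under 0.2 s unmarked.

```python
def pause_symbol(seconds):
    """Return the transcription symbol for a pause of the given length."""
    if seconds < 0.2:
        return ""                  # assumed: too short to be marked
    if seconds <= 1.0:
        return "(-)"
    if seconds <= 3.0:
        return "(--)"
    return f"({seconds:.1f})"      # measured pauses, e.g. (5.5)


for duration in (0.5, 2.0, 5.5):
    print(duration, pause_symbol(duration))
```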

vowels pronounced longer than normal (under 2sec) are marked with : (e.g. canı:m)
vowels that are pronounced extremely long (2sec and more) are marked with :: (e.g. canı::m)
also possible for consonants (e.g. tamam:)
doubling of vowel syllables with % (e.g. ba%ay)

non-verbal events such as a participant laughing or coughing are noted in square brackets on speaker tier, e.g. [laughing], [whispering], [clears throat], [sighs], [sniffs], [snapsfingers]
if participants speak and laugh at the same time, it is noted as: [[laughing]speech]

uninterpretable material is to be marked as (UNK) on Speaker-tier
longer than 2secs: (UNK, 2.1)
assumed content in brackets, each token separated: (assumed) (content)

e (short "e") ee (long "ee") ı (short "ı") ııı (long "ııı")
thinking: "hmm, eem, ımm"
agreement: "hıhı"
negation: "ı ıh"
disappointment: "tüh"

original spelling will be kept.

Keep conventionalized spelling (e.g. Renault = renault)

Numbers are spelled (e.g 155 = yüz elli beş)

Symbols	Meaning
`<Q> araştırmacıyla iletişim </Q>`	instances of questions concerning the procedure and/or verbal interventions of elicitators
(-)	0.5 - 1sec
(--)	pauses 1-3secs
(3.2)	pauses longer than 3secs
(UNK)	uninterpretable material
(UNK, 2.2)	uninterpretable material longer than 2secs
(assumption)	assumed material
[gülüşmeler/fısıldaşmalar]	non-verbal material
[[gülüşme]konuşma]	non-verbal & verbal event
:	unusually long vowel or consonant (under 2secs)
::	unusually long vowel or consonant (longer than 2secs)
=	merged forms
/	interruption of a word
%	doubled syllables
{...}	specification of an anonymised place

Basics

Orthographic normalization.
No changes in grammar (case, tense etc).
Pauses are deleted on a norm layer.
Non-verbal material (laughing, coughing) is deleted on a norm level.
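
The two deletion rules above can be sketched as follows. This is an illustration under assumptions: each pause or bracketed event is taken to be a single dipl token, deletion is represented by None, and the function name is hypothetical. For the combined notation [[laughing]speech], only the speech part is kept.

```python
import re

PAUSE = re.compile(r"^\((?:-{1,2}|\d+(?:\.\d+)?)\)$")   # (-), (--), (3.2)
NON_VERBAL = re.compile(r"^\[[^\[\]]+\]$")              # [laughing]
LAUGHED_SPEECH = re.compile(r"^\[\[[^\]]+\](.+)\]$")    # [[laughing]speech]


def norm_token(dipl_token):
    """Return the norm-layer token, or None if it is deleted on norm."""
    if PAUSE.match(dipl_token) or NON_VERBAL.match(dipl_token):
        return None
    combined = LAUGHED_SPEECH.match(dipl_token)
    if combined:
        return combined.group(1)
    return dipl_token


dipl = ["araba", "(-)", "[laughing]", "[[laughing]geldi]", "(3.2)"]
print([norm_token(token) for token in dipl])
```

Tokens such as (UNK) or assumed material in parentheses do not match the pause pattern and are passed through unchanged, so they can be handled by their own rules.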

Anonymized material

Participant's codes are deleted on a norm layer.
Anonymized material should be left the same way as on a dipl layer (e.g., kxxx{streetname}da).
- if an anonymized token is followed by an ending only, the ending should be placed in one box;
- if an anonymized token is followed by a word, like sokakta, mahallede etc, these words are placed in the next box.

Capitalization

Proper names are capitalized on a norm layer.
Names of countries, nations are capitalized (e.g., Almanya, Almanlar);
Names of days, months are capitalized (e.g., Salı, Ekim);
Nouns like sokak, cadde, mahalle are capitalized in case they are used together with proper names (e.g., Necatibey Sokağı'nda)
German nouns are capitalized (e.g., Auto)
For the capitalization norms of the word "Allah", check TDK Allah (Atasözü, deyim ve birleşik fiiller)
If a common noun is capitalized when it is not placed at the beginning of a sentence (e.g., Futbol Topu), on a norm level it should be corrected as (e.g., futbol topu).
if the whole text or some of its parts was written in CAPS, it should be changed to lower case except where Turkish standard orthography demands capitalization.

Apostrophe

Proper names (e.g., Starbucks), country names (e.g., Almanya) and nouns like sokak, cadde, mahalle require an apostrophe when they are used together with proper names and appear in the dative, accusative, or locative case (e.g., Necatibey Sokağı'nda);

Dates and Numbers

Code 'ef on altı'/ 'fe on altı' is represented as F16 (written in one box);
If a date is written as 15.10.2018, it stays as 15.10.2018 on a norm level;
But if a date is spelled out (e.g., on beş ekim), the day is represented with numbers, but the month is spelled out and capitalized (e.g., 15 Ekim).

Cancellations, Repairings

If a canceled/repaired token is a meaningful word, then it stays on a norm level, but without a slash (e.g., çarpma/ çarpıştı --> çarpma çarpıştı -- in separate boxes).
if a canceled/repaired token is a word, but is not meaningful according to the context, we delete it on a norm layer (e.g., arı/ araba geliyordu --> araba geliyordu);
if a canceled/repaired token is not a word (e.g., kö/ köpek havladı), we delete it on a norm layer (e.g., kö/ köpek havladı --> köpek havladı).

Assumed material

if on a diplomatic layer, a word is placed in parentheses as assumed material, we leave the parentheses out and keep only the word itself (e.g., (muhtemelen) --> muhtemelen);
if the assumed material is not a word (e.g., trafik kasağı), we try to guess the closest meaningful word and normalize it as trafik kazası.
if the meaning of the assumed material cannot be guessed, we leave it empty on a norm layer.
orthographic mistakes (especially in written data) are corrected (for example, qma --> ama, cüpike --> köpek).

Foreign Material

Foreign material is normalized according to the rules of the language it is taken from.

Hesitation Markers

dipl layer	norm layer
e, ee, eee	e
ı, ıı, ııı	e
em, hm	e
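
The hesitation-marker table above amounts to a small lookup. A minimal sketch, assuming hesitation markers arrive as separate tokens (the function name is hypothetical, and accepting more than three repeated letters goes slightly beyond the forms listed in the table):

```python
import re

# e, ee, eee / ı, ıı, ııı / em, hm are all normalized to "e".
HESITATION = re.compile(r"^(?:e+|ı+|em|hm)$")


def normalize_hesitation(token):
    """Map a dipl-layer hesitation marker to 'e'; leave other tokens alone."""
    return "e" if HESITATION.match(token) else token


print([normalize_hesitation(t) for t in ["eee", "ııı", "hm", "ev"]])
```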

Punctuation

Each punctuation mark is placed in a separate box.
A triple-dot punctuation mark (...) is placed in one box.

Material which is not found in TDK (please update the list during the normalization)

dipl layer	norm layer
laylaylom/lay lay lom	laylaylom
boooah	boah
off	of

Long Vowels & Consonants

Words with long vowels & consonants (whether represented with a colon : or with doubled vowels) are spelled according to the orthographic norms (e.g., ya: --> ya)

Emojis

Emojis are represented on a norm layer as they are on a dipl layer;
One emoji is placed into one box.

Dipl[lang] Layer

The code of Turkish is tur
if we see UNKNOWN instead of tur on a dipl[lang] layer, we change it into tur manually.
we delete the language code in case of pauses, non-verbal material (like laughing, coughing), punctuation marks.
we change the language code for foreign material, BUT only when the word has not yet been adopted into Turkish. E.g., hasar is still tur, and tişört is tur, but bye bye is eng.
delete language code for participants' codes

code	language
tur	Turkish
deu	German
eng	English
ara	Arabic
kur	Kurdish
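
The dipl[lang] rules above can be sketched as a single function. Several points are assumptions: the participant-code pattern is inferred only from examples such as TUmo01MT and TUmo02FT_P, the decision whether a word counts as foreign is passed in by the annotator as foreign_code, and the function name is hypothetical.

```python
import re
import string

# Pauses such as (-), (--), (3.2) and bracketed non-verbal events.
NO_LANGUAGE = re.compile(r"^(?:\((?:-{1,2}|\d+(?:\.\d+)?)\)|\[.*\])$")
# Assumed shape of participant codes, e.g. TUmo01MT or TUmo02FT_P.
PARTICIPANT_CODE = re.compile(r"^[A-Z]{2}[a-z]{2}\d{2}[A-Z]{2}(?:_P)?$")


def dipl_lang(token, current_code, foreign_code=None):
    """Return the language code for a token, or '' if the code is deleted."""
    if (NO_LANGUAGE.match(token)
            or PARTICIPANT_CODE.match(token)
            or all(ch in string.punctuation for ch in token)):
        return ""
    if foreign_code is not None:      # annotator judged the word as foreign
        return foreign_code
    if current_code == "UNKNOWN":
        return "tur"
    return current_code


print(dipl_lang("hasar", "UNKNOWN"))
print(dipl_lang("bye", "tur", foreign_code="eng"))
```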

Dipl[line] and dipl[message] Layers

disregard these layers.
