Annotation Step 2: Normalization

Our corpus pipeline handles four types of input for each speaker who participated in the elicitation:

formal written (fw) text files
informal written (iw) chat exports
formal spoken (fs) transcriptions
informal spoken (is) transcriptions

All data types are divided into above-word-level ("phrasal") segments:

fw: sentences indicated by punctuation used by the participant
iw: messages and lines indicated by separators used by the participant (punctuation, new line, message separation)
fs and is: communication units (CUs) indicated by the transcriber (some projects might additionally use intonation phrases)

We annotate our CUs on the annotation tier "cu" (in lowercase letters). These segments are then automatically tokenized (roughly speaking, divided into character sequences between two whitespace characters), which provides our first word-level segmentation: the diplomatic layer (dipl), consisting of graphemic words represented in the phrasal segments.
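The tokenization step can be sketched roughly as follows. This is only an illustration that assumes plain whitespace splitting; the function name `tokenize_dipl` is hypothetical and the actual pipeline may differ in detail.

```python
def tokenize_dipl(segment: str) -> list[str]:
    """Split a phrasal segment into diplomatic tokens at whitespace."""
    # str.split() without arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return segment.split()


print(tokenize_dipl("am i in london"))
# Punctuation typed without a following space stays inside one token:
print(tokenize_dipl("accident.Then the"))
```

Because the split looks at whitespace only, a string such as accident.Then remains a single token at this stage and has to be separated later.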

This word-level segmentation is then to be normalized to facilitate search and automatic annotation.

The result of normalization is an additional word-level segmentation layer, similar to dipl. Starting from the diplomatic tokenization, explicit rules are applied to obtain orthographically normalized tokens on the norm layer. Starting off as a copy of the diplomatic token layer, the normalized tokens can (but need not) undergo operations such as editing, deletion, splitting or merging.
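One way to picture the relation between the two layers is a list of alignment units, each pairing diplomatic tokens with the normalized tokens derived from them. The sketch below is an assumption made for illustration (the class `Unit` and all function names are hypothetical); it is not the storage format used by the corpus tools.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    dipl: list  # diplomatic tokens covered by this unit
    norm: list  # normalized tokens aligned to them (may be empty)


def copy_layer(dipl_tokens):
    # The norm layer starts off as a 1:1 copy of the dipl layer.
    return [Unit([t], [t]) for t in dipl_tokens]


def edit(layer, i, form):
    layer[i].norm = [form]          # one token -> one corrected token


def delete(layer, i):
    layer[i].norm = []              # token has no representation on norm


def split(layer, i, forms):
    layer[i].norm = list(forms)     # one dipl token -> several norm tokens


def merge(layer, i, form):
    # Units i and i+1 are joined; several dipl tokens -> one norm token.
    nxt = layer.pop(i + 1)
    layer[i] = Unit(layer[i].dipl + nxt.dipl, [form])


def norm_tokens(layer):
    return [t for u in layer for t in u.norm]


def dipl_tokens(layer):
    return [t for u in layer for t in u.dipl]


source = ["gonna", "cAr", "acc/", "Auto", "Unfall"]
layer = copy_layer(source)
split(layer, 0, ["going", "to"])
edit(layer, 1, "car")
delete(layer, 2)
merge(layer, 3, "Autounfall")
print(norm_tokens(layer))
print(dipl_tokens(layer) == source)
```

Whatever operation is applied, the diplomatic tokens stay retrievable, which mirrors the principle that normalization does not remove information from the corpus.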

Our normalization is not the last step of analysis and does not remove information from the overall corpus
Normalization takes place mainly on word-level to obtain standardized forms w.r.t. an orthography of reference
We will agree on an orthography of reference for each language in our corpus
The normalized layer uses the language's native script, whereas the diplomatic layer sticks to the script used by the participants for the written registers (fw, iw)
We do not normalize syntax. This way we are able to learn about an underlying grammar more easily, rather than analysing the elicited language in terms of a standard grammar. Note that this comes with disadvantages but is more appropriate for our projects' idea. We want to obtain a layer that is orthographically normalized but still allows syntactic analyses. This is why we do not normalize syntax and try to keep as many elements as possible.
Normalization is also a technically necessary process. Orthographically standardized forms help all researchers, including those not familiar with our data, to search for linguistic phenomena. Furthermore, annotations to be integrated in our corpus can be pre-generated automatically, which reduces the correction effort.
The highest priority is to ensure that we will always be able to answer our research questions.
We do not alter lexical choice when normalizing.
The norm layer will enable us to identify grammatical units.
A normalization decision is always influenced by the question of how controversial and/or lossy in terms of information that decision might be.
Since this is our first layer of normalization, we normalize very carefully.
There will be no grammatical normalizations (adaptation of case, gender or other grammatical features that deviate from our understanding of "standard")
We normalize to a single standardized form (if possible) to reduce noise in the tokenization
We always normalize as far as we can go, which means we are more careful in some cases than in others. For example, 're stays 're in the normalization because its tense is unclear, whereas 'll can be normalized to will, since it is entirely uncontroversial what the diplomatic form expresses.
When normalizing, we try to be as uninterpretative as possible.
1:1 mappings between diplomatic and normalized tokens are easier to decide on, because the alteration of the form is easy to trace.
The standardized representation should be influenced by the underlying orthography explicitly and implicitly, e. g. for German clitics we do not use an apostrophe in the separated token, whereas we do for English.
Normalization should not alter meaning (if possible).

language	orthography of reference	reference
English	American English	Merriam-Webster
German	„Neue deutsche Rechtschreibung“	Amtliche Regelung der dt. Rechtschreibung, Duden
Greek	Standard Modern Greek	Λεξικό της κοινής νεοελληνικής and David Holton, Peter Mackridge, Irene Philippaki-Warburton (1997) Greek: A Comprehensive Grammar of the Modern Language, London: Routledge
Russian	современный русский литературный язык	Малый академический словарь (МАС) https://rus-academic-dict.slovaronline.com/
Turkish	Turkish / Modern Standard Turkish (ISO 639-3 = tur)	Türk Dil Kurumu

For the written registers (iw, fw), we currently lack a syntactic unit. Furthermore, CUs as segmentation units were, amongst other reasons, chosen due to their applicability for spoken and written data. Having CUs in all types of texts facilitates cross-mode / cross-register comparisons.

To do this in EXMARaLDA, you first need to create the respective annotation tier. After having opened your file in EXMARaLDA, add a new tier through the menu bar (Tier > Add Tier). Choose the following settings:

Speaker: dipl [dipl]
Type: A(nnotation)
Category: cu

Because the tier depends on the diplomatic tokenization, it has one timeslot per event on dipl. To annotate a CU, merge all timeslots in the CU tier that overlap the tokens (dipl) belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "cu" to the newly created span.

The CU tier does not explicitly contain any text. By overlapping its respective tokens, the entire text of a CU is already defined and can be searched.
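The idea that a CU carries no text of its own can be sketched as follows. The representation of a CU as a pair of token indices is an assumption made for this illustration and is not how EXMARaLDA stores tiers; `cu_texts` is a hypothetical helper.

```python
def cu_texts(dipl_tokens, cu_spans):
    """Return the text of each CU as the dipl tokens its span covers."""
    return [" ".join(dipl_tokens[start:end]) for start, end in cu_spans]


dipl = ["am", "i", "in", "london", "?", "bin", "unterwegs", "."]
# Each span is (index of first token, index after the last token),
# so sentence-final punctuation is part of the CU.
spans = [(0, 5), (5, 8)]
print(cu_texts(dipl, spans))
```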

Please do not edit the CU layer of the transcriptions. They are final.

For each dipl token there is a language annotation. When preparing your data, this is automatically set to the language of elicitation. Nevertheless, in some cases you might have to adapt that value. Please use the ISO 639-2 language code. A list of valid codes can be found here.

For cancellations, pauses, emojis and non-verbal material, please delete the language annotation.

The rules are always applied on the normalized tokenization. The diplomatic tokenization remains as is.

The following lists and examples will be extended and updated with more examples, especially from Greek, Russian and Turkish.

Notation: Normalization rules are formulated as diplomatic token(s) → normalized token(s). Token boundaries are represented with a slash (/).
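To make the notation concrete, the following sketch parses a rule written in this format into its two token lists. The function `parse_rule` is hypothetical; it assumes that token boundaries are always written as a slash surrounded by spaces, so that the slash marking a cancellation (as in acc/) stays part of its token.

```python
def parse_rule(rule: str):
    """Parse 'dipl tokens → norm tokens' into two lists of tokens."""
    dipl_side, norm_side = rule.split("→", 1)

    def side_tokens(side):
        side = side.strip()
        # An empty side means that nothing is represented on that layer.
        return side.split(" / ") if side else []

    return side_tokens(dipl_side), side_tokens(norm_side)


print(parse_rule("wanna → want / to"))
print(parse_rule("car / acc/ → car"))
print(parse_rule("Autounf/ →"))
```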

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comments / Rationale / Criticism
(CU-initial) capitalization	do not correct. Only adapt if a word itself has to be capitalized. Do not capitalize merely because of a CU-initial position	am / i / in / london → am / I / in / London	das / auto / konnte / nicht / mehr / bremsen → das / Auto / konnte / nicht / mehr / bremsen

Note: Rule of thumb: Capitalize, if the word needs to be capitalized in the orthography of reference. Only introduce capitalization if it is the correct spelling independent of whether it appears in an initial position or not. Also, do not undo CU-initial capitalization if provided by writer.
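The rule of thumb can be sketched as a lookup that only ever introduces capitalization and never removes it. The word list `REFERENCE` is a tiny hypothetical stand-in for the orthography of reference, and `normalize_case` is an illustrative name.

```python
# Hypothetical excerpt of reference spellings, keyed by lowercase form.
REFERENCE = {"i": "I", "london": "London", "auto": "Auto", "das": "das"}


def normalize_case(token, reference=REFERENCE):
    ref = reference.get(token.lower())
    # Capitalize only if the reference spelling itself is capitalized.
    if ref is not None and ref[0].isupper():
        return ref
    # Otherwise keep what the participant wrote, including any
    # CU-initial capital they provided.
    return token


print([normalize_case(t) for t in ["am", "i", "in", "london"]])
print([normalize_case(t) for t in ["Das", "auto", "konnte"]])
```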

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comments / Rationale / Criticism
orthographic errors or phonetic assimilations / slip of the tongue	normalize orthography	orthographie → orthography	hunt → Hund; fümf → fünf	~	¹
orthographic errors or phonetic assimilations / slip of the tongue (within vocabulary)	correct to target hypothesis if meaning of corrected form is much more likely AND the normalization implies a change in category	I / was / their → I / was / there	ihr / seit / dort → ihr / seid / dort; Umfall → Unfall	~	²
written data: composed forms	refer to above mentioned reference and normalize	inter-national → international; caraccident → car / accident	Auto / Unfall → Autounfall

¹ This is the prototypical case of normalization.

² Both corrections are much more likely, since it is very hard to derive any meaning from the uncorrected examples if we assume they were meant as written. Additionally, since this is a 1:1 normalization in terms of words, it is not particularly invasive. Last but not least, since the elicitations take place in a more or less closed setting, it is much easier to judge the likelihood of an expression in cases of potential within-vocabulary errors.

Phenomenon	What to do on norm level?	English example	German example	Russian example
missing words	do not add them, we do not normalize syntax	you / there / ? → you / there / ?	bin / unterwegs → bin / unterwegs

Phenomenon	What to do on norm level?	English example	German example	Russian example
repetitions	keep, normalize orthography	the / the / cAr → the / the / car	ein / ein / Autounfall

Phenomenon	What to do on norm level?	English example	German example	Russian example
self-corrections	see repetitions / cancellations	the / hatchback / (eh) / car → the / hatchback / car (filled pause deleted, rest remains)	ein / Typ / ein / Fußgänger / hat / dann remains as is, whereas ein / Fußg/ / ein / Typ → ein / ein / Typ, since Fußg/ was explicitly marked as cancelled by the transcriber

Note: The proposed procedure for self-corrections is the least theory-driven and thus avoids making any grammatical assumptions when normalizing. Nevertheless, or perhaps for that very reason, it might also be very controversial. It definitely does not facilitate later syntactic annotation.

Phenomenon	What to do on norm level?	English example	German example	Russian example
spoken data: cancellations³	drop the incomplete element	car / acc/ → car	Autounf/ →

Note: Sometimes it is difficult to tell whether something is a cancellation or not, because the incomplete part might already be a word on its own. In these cases we rely on the transcriber's decision: If something is marked as cancellation in the transcript, we stick to that interpretation. Nevertheless, that should not keep your project from revising transcriptions (given the current state of corpus creation still allows for such revisions).
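Relying on the transcriber's marking can be sketched as a simple filter. The sketch assumes, based on the examples above, that a cancelled element is a token ending in a slash; the function name is hypothetical.

```python
def drop_cancellations(tokens):
    """Remove tokens the transcriber marked as cancelled (trailing '/')."""
    # A bare "/" is not treated as a cancellation, only word/ forms are.
    return [t for t in tokens if not (len(t) > 1 and t.endswith("/"))]


print(drop_cancellations(["car", "acc/"]))
print(drop_cancellations(["ein", "Fußg/", "ein", "Typ"]))
```

The filter makes no attempt to guess unmarked cancellations, in line with the decision to stick to the transcript.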

Phenomenon	What to do on norm level?	English example	German example	Russian example
spoken data: phonetic markers	remove and normalize to orthography	a::nd → and	und / da::nn → und / dann
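For the lengthening marker shown in the examples, removal could look like the sketch below. It assumes that colons inside a word are the only phonetic marker to be stripped; other transcription symbols would need their own handling, and `strip_lengthening` is a hypothetical name.

```python
import re


def strip_lengthening(token):
    """Remove lengthening colons from a spoken-data token."""
    # Leave tokens without any letter (e.g. a lone ":") untouched.
    if not any(ch.isalpha() for ch in token):
        return token
    return re.sub(r":+", "", token)


print([strip_lengthening(t) for t in ["und", "da::nn"]])
```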

Phenomenon	What to do on norm level?	English example	German example	Russian example
reduced forms	if reconstructable normalize orthographically⁴	ca → car	nich → nicht; wolln → wollen

⁴

Note: reduction has to be out of vocabulary. If not, check with the guidelines on within-vocabulary errors above.

Note: There might be (even in the dictionary) widely accepted variants of standardized writings that are more or less the result of reduction (or other processes). Examples are cross instead of across in English, grade instead of gerade in German. We need to focus on the goal of normalization: to have a text segmentation where orthographic variation is reduced or removed, i. e. where multiple expressions are mapped to one standard representation. Thus, as long as there is no argument that normalizing an item alters its meaning, these reductions should be normalized. We only deal with mappings from one token to one token, i. e. retrieving the originally chosen expression is very easy and we do not lose information. Counterexamples are cases of drauf in German. As a pronominal adverb it is often a reduction of darauf and should be normalized as such. Nevertheless, it can also be the separated particle of the particle verb draufgehen (to die, fig.), and normalizing it to darauf would not be appropriate/"correct". When normalizing, we will most likely encounter much more subtle cases of difference in meaning. We need to discuss and document our decisions.

Phenomenon	What to do on norm level?	English example	German example	Russian example
colloquial forms, dialect	normalize	coz → because; rulz → rules; cha → you	nix → nichts; ick → ich; jetze → jetzt; itzt → jetzt

Phenomenon	What to do on norm level?	English example	German example	Russian example
merged forms	split into normalized items as long as both items are visibly represented in the merged form	wanna → want / to; gonna → going / to ⁵	kannste → kannst / du ⁶

⁵

Note: regarding (1) and (2)
what / cha / gonna / do → what / you / going / to / do

⁶

Note: kannst / mir / mal / helfen
remains as is, since there is no overt material for du without assuming a reduction of kannst, i. e. this is a case of missing material, not merge (see above)

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comment
merges / clitics	separate and normalize, if possible	isnt → is / not; withe / ball → with / e / ball; he's / guilty → he / 's / guilty	binsch → bin / ich

Note: We normalize a clitic if there is no ambiguity about the standard form. If the standard form cannot be determined unambiguously, the clitic is represented as a clitic, but in a standardized way (see table below for examples).

We always separate the clitics, i. e. divide a single diplomatic token in two norm tokens. See the following list of examples:

Diplomatic	Normalized
n't	not
'll	will
's	's
're	're
isn't or isnt	is / not
aren't or arent (or even arnt)	are / not
cannot or can't or cant	can / not
won't or wont	will / not
didn't or didnt	did / not
don't or dont	do / not
doesn't or doesnt (or even dosnt)	does / not
ain't or aint	ai / not
couldn't or couldnt	could / not
shouldn't or shouldnt	should / not
wouldn't or wouldnt	would / not
she'll	she / will
he's	he / 's
you're	you / 're
Thomas' / computer	Thomas / ' / computer
Anke's / mail or Ankes / mail	Anke / 's / mail
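The table can be read as a lookup for the negated forms plus a small set of suffix rules. The following sketch encodes only the entries listed above; the names `NEGATIONS`, `SUFFIXES` and `split_clitic` are hypothetical, and the lookup is blind to context, so forms such as wont or cant that are also words of their own still need a human decision.

```python
_NEGATED = {
    "is": ["isn't", "isnt"], "are": ["aren't", "arent", "arnt"],
    "can": ["cannot", "can't", "cant"], "will": ["won't", "wont"],
    "did": ["didn't", "didnt"], "do": ["don't", "dont"],
    "does": ["doesn't", "doesnt", "dosnt"], "ai": ["ain't", "aint"],
    "could": ["couldn't", "couldnt"], "should": ["shouldn't", "shouldnt"],
    "would": ["wouldn't", "wouldnt"],
}
NEGATIONS = {form: [base, "not"] for base, forms in _NEGATED.items() for form in forms}

# (clitic as written, standardized form on norm)
SUFFIXES = [("'ll", "will"), ("'s", "'s"), ("'re", "'re")]


def split_clitic(token):
    """Return the norm tokens for one dipl token (unchanged if no rule fits)."""
    key = token.lower().replace("’", "'")  # treat typographic apostrophes alike
    if key in NEGATIONS:
        parts = list(NEGATIONS[key])
        if token[0].isupper():
            # Do not undo capitalization provided by the writer.
            parts[0] = parts[0].capitalize()
        return parts
    for suffix, norm in SUFFIXES:
        if key.endswith(suffix) and len(key) > len(suffix):
            return [token[: -len(suffix)], norm]
    return [token]


for t in ["isnt", "she'll", "he's", "you're", "Won't", "car"]:
    print(t, "→", " / ".join(split_clitic(t)))
```

Forms without an apostrophe before s (such as Ankes) are deliberately not covered, since they cannot be told apart from ordinary words ending in s without further knowledge.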

German clitics: Separate clitics analogously to English clitics, but do not use an apostrophe.

Greek and Russian clitics: Represent them according to the chosen orthographic standard.

Turkish clitics: Concatenated morphological elements can remain a unit, since this is part of the language. If you prefer to separate them anyway, make sure you work consistently.

Phenomenon	What to do on norm level?	English example	German example	Russian example
lexicalized merged forms	+ do not normalize + work with list? + refer to orthography reference?	Rock'n'Roll → Rock'n'Roll	im → im; zum → zum; beim → beim; ins → ins

Phenomenon	What to do on norm level?	English example	German example	Russian example
non-standard variation within a paradigm	normalize	...	helf / mir → hilf / mir

Note: Non-standardized but maybe widely established and accepted forms can be understood as alternative spellings. For alternative spellings our rules guide us to choose a standard and always use that to guarantee consistency. Therefore, normalizing to the true standard form is a good way to go. It is also a safe procedure, since we are dealing with single word to single word mappings, i. e. the original diplomatic form is always retrievable.

Phenomenon	What to do on norm level?	English example	German example	Russian example
speaker codes	do not normalize

Phenomenon	What to do on norm level?	English example	German example	Russian example
foreign material / proper names in a foreign language	normalize according to agreement		leik → like

Note: The transcription of foreign material is always phonographic. The normalization follows the conventions of the respective language. For example, the phonographic leik (dipl) might become like (en) on the norm layer of a German text, whereas if a Russian speaker says the German word Autowerkstatt in a Russian context, both the dipl and the norm form will be phonographic representations (аутоверкштат).

Also: Remember to adapt the language annotation for these cases.

Phenomenon	What to do on norm level?	English example	German example	Russian example
Abbreviations	(merge and) normalize according to standard orthography		pekawe → PKW

Phenomenon	What to do on norm level?	English example	German example	Russian example
cardinal numbers	normalize according to regulations in orthography of reference

Note: In the written data, it is enough to correct the spelling of numeral words, there is no need to convert actual numbers to words or vice versa.

Phenomenon	What to do on norm level?	English example	German example	Russian example
floating point numbers	normalize as numbers	one / point / three → 1.3	eins / Komma / drei → 1,3
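The merge of spoken decimal numbers into a single norm token can be sketched as follows. The sketch is deliberately limited to one digit word on each side of the separator; the digit tables and the function `merge_decimal` are assumptions for illustration.

```python
DIGITS_EN = {w: str(i) for i, w in enumerate(
    "zero one two three four five six seven eight nine".split())}
DIGITS_DE = {w: str(i) for i, w in enumerate(
    "null eins zwei drei vier fünf sechs sieben acht neun".split())}


def merge_decimal(tokens, digits, separator_word, separator_char):
    """Merge 'digit separator digit' token triples into one number token."""
    out, i = [], 0
    while i < len(tokens):
        window = [t.lower() for t in tokens[i:i + 3]]
        if (len(window) == 3 and window[0] in digits
                and window[1] == separator_word and window[2] in digits):
            out.append(digits[window[0]] + separator_char + digits[window[2]])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


print(merge_decimal(["one", "point", "three"], DIGITS_EN, "point", "."))
print(merge_decimal(["eins", "Komma", "drei"], DIGITS_DE, "komma", ","))
```

The decimal separator follows the orthography of reference, hence the point for English and the comma for German.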

Phenomenon	What to do on norm level?	English example	German example	Russian example
times and dates	are a composition of elements for which there are normalization rules to be applied individually

Phenomenon	What to do on norm level?	English example	German example	Russian example
fractions	normalize/keep as words	one / third	ein / Drittel

Phenomenon	What to do on norm level?	English example	German example	Russian example
pauses, noise, turn-holders, laughing, ...	not represented, i. e. delete token

Phenomenon	What to do on norm level?	English example	German example	Russian example
filled pause vs. particle, etc.	If you decide to consider certain filled pauses lexical units, decide on a standardized representation and represent that unit always the same way in the normalization layer. If you are currently undecided on the status of such an element and this is actually part of your research, delete that element on the normalization layer.

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comment
interaction/communication with elicitor	not represented	...	...	There is a transcription standard for these events. These events will not be normalized. Delete these events on the norm layer.

Phenomenon	What to do on norm level?	English example	German example	Russian example
incomprehensible, no hypothesis (unknown material)	not represented, i. e. delete token on norm	(UNK)	(UNK)

Phenomenon	What to do on norm level?	English example	German example	Russian example
incomprehensible with hypothesis	hypothesis is represented in a normalized version	(car) → car	(jetze) → jetzt

Phenomenon	What to do on norm level?	English example	German example	Russian example
punctuation	do not add nor normalize, but separate on both dipl and norm	accident.Then / the → accident / . / Then / the	,dass → , / dass

NOTE: This is a phenomenon you may face in written data. Since authors are under no obligation to follow typing conventions and put spaces after punctuation, we treat cases like the above as tokenization errors caused by the pipeline. To keep complexity within bounds, we cannot resolve these issues within automatic tokenization. Please perform the separation during normalization on both dipl and norm. Each element of punctuation should be its own token, unless it is a sequence of punctuation characters without whitespace in between. In these cases the characters remain a unit (e.g. "..." stays like this on dipl and norm).
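As an aid for this manual step, the separation could be sketched like this. The set of punctuation characters and the exception for digit sequences such as 1,3 are assumptions of the sketch; abbreviations with internal periods would still need manual attention. The function name is hypothetical.

```python
import re

# Runs of sentence punctuation, or runs of anything else.
_PIECES = re.compile(r"[.,;:!?]+|[^.,;:!?]+")
# Numbers like 1.3 or 1,3 are kept together.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def separate_punctuation(token):
    """Split a written-data token so that punctuation forms its own token."""
    if _NUMBER.fullmatch(token):
        return [token]
    # Adjacent punctuation characters stay one unit, e.g. "...".
    return _PIECES.findall(token)


for t in ["accident.Then", ",dass", "...", "1,3", "isn't"]:
    print(t, "→", " / ".join(separate_punctuation(t)))
```

Apostrophes are not in the punctuation set, so clitic forms are left for the clitic rules.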

Phenomenon	What to do on norm level?	English example	German example	Russian example	Comment
multiple representations allowed in orthography of reference	Make a decision, document it and stay consistent.		Albtraum vs. Alptraum	~	This also implies that if a participant uses the alternative form, you should normalize to the variant you chose as the standard in the corpus.

If there is a mismatch in case, gender or any other grammatical property/feature, do not normalize it. This must remain open to investigation. You might, of course, find non-trivial cases between orthographic normalization and an actual grammatical target hypothesis. Document these cases and raise a discussion if necessary.

RUEG Corpus Documentation

Segmentations in our corpus

What is normalization in our corpus and what is its purpose

Underlying principles and things to keep in mind

Orthography of reference

Additional tasks

Annotate CUs for written data

Adapt language annotation

Normalization guidelines

English clitics

German clitics

Greek and Russian clitics

Turkish clitics

Special material / cases

Grammatical deviations