Annotation Step 2: Normalization

Segmentations in our corpus

Our corpus pipeline handles four types of input for each speaker who participated in the elicitation:

  • formal written (fw) text files
  • informal written (iw) chat exports
  • formal spoken (fs) transcriptions
  • informal spoken (is) transcriptions

All data types are divided into above-word-level ("phrasal") segments:

  • fw: sentences, indicated by the punctuation used by the participant
  • iw: messages and lines, indicated by the separators used by the participant (punctuation, new line, message separation)
  • fs and is: communication units (CUs), indicated by the transcriber (some projects might additionally use intonation phrases)

We annotate our CUs on the annotation tier "cu" (in lowercase letters). These segments are then automatically tokenized (roughly speaking, divided into character sequences between two whitespaces), which provides our first word-level segmentation: the diplomatic layer (dipl), consisting of the graphemic words contained in the phrasal segments.
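For illustration only, a minimal Python sketch of this whitespace-based tokenization; the function name and the example CU are invented here and are not part of the actual pipeline:

```python
def tokenize_dipl(cu_text: str) -> list[str]:
    """Split a phrasal segment (CU) into diplomatic tokens at whitespace.

    Rough illustration of "character sequences between two whitespaces";
    the real pipeline may treat edge cases (tabs, multiple spaces) differently.
    """
    return cu_text.split()

# Hypothetical CU from an informal written (iw) text:
print(tokenize_dipl("am i in london"))  # ['am', 'i', 'in', 'london']
```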

This word-level segmentation is then to be normalized to facilitate search and automatic annotation.

What is normalization in our corpus and what is its purpose

The result of normalization is an additional word-level segmentation layer, similar to dipl. Starting from the diplomatic tokenization, explicit rules are applied to obtain orthographically normalized tokens on the norm layer. Starting off as a copy of the diplomatic token layer, the normalized tokens can (but do not necessarily need to) undergo operations such as editing, deletion, splitting or merging.
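As a rough illustration of these operations (a sketch only, not our actual data model; all names below are hypothetical), the relation between dipl and norm can be thought of as a list of aligned token groups:

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    """One group of diplomatic tokens aligned to its normalized tokens."""
    dipl: list[str]
    norm: list[str]

# Hypothetical CU "i wanna go" (iw register):
alignments = [
    Alignment(dipl=["i"],     norm=["I"]),           # edit (capitalization)
    Alignment(dipl=["wanna"], norm=["want", "to"]),  # split
    Alignment(dipl=["go"],    norm=["go"]),          # unchanged copy
]

# A deletion would have norm=[], and a merge would map several dipl tokens
# to one norm token, e.g. dipl=["Auto", "Unfall"] -> norm=["Autounfall"].
```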

Underlying principles and things to keep in mind

  1. Our normalization is not the last step of analysis and does not remove information from the overall corpus

  2. Normalization takes place mainly on word-level to obtain standardized forms w.r.t. an orthography of reference

  3. We will agree on an orthography of reference for each language in our corpus

  4. The normalized layer uses the language's native script, whereas the diplomatic layer sticks to the script used by the participants for the written registers (fw, iw)

  5. We do not normalize syntax. This way we can learn about an underlying grammar more easily, rather than analysing the elicited language in terms of a standard grammar. Note that this comes with disadvantages but is more appropriate for our projects' idea. We want to obtain a layer that is orthographically normalized but still allows for syntactic analyses. This is why we do not normalize syntax and try to keep as many elements as possible.

  6. Normalization is also a technically necessary process. Orthographically standardized forms help all researchers, including those not familiar with our data, to search for linguistic phenomena. Furthermore, annotations to be integrated into our corpus can be pre-generated automatically, which reduces the correction effort.

  7. The highest priority is to ensure that we will always be able to answer our research questions.

  8. We do not alter lexical choice when normalizing.

  9. The norm layer will enable us to identify grammatical units.

  10. A normalization decision is always influenced by the question of how controversial and/or lossy in terms of information that decision might be.

  11. Since this is our first layer of normalization, we normalize very carefully.

  12. There will be no grammatical normalization (adaptation of case, gender or other grammatical features that deviate from our understanding of "standard")

  13. We normalize to a single standardized form (if possible) to reduce noise in the tokenization

  14. We always normalize as far as we can go, which means we are more careful in some cases than in others. For example, 're stays 're in the normalization because the tense is unclear, whereas 'll can be normalized to will, since it is entirely uncontroversial what the diplomatic form expresses.

  15. When normalizing, we try to be as uninterpretative as possible.

  16. 1:1 mappings from diplomatic to normalized tokens are easier to decide on, because the alteration in form is easy to trace.

  17. The standardized representation should be influenced by the underlying orthography explicitly and implicitly, e. g. for German clitics we do not use an apostrophe in the separated token, whereas we do for English.

  18. Normalization should not alter meaning (if possible).

Orthography of reference

  • English: American English (reference: Merriam-Webster)
  • German: „Neue deutsche Rechtschreibung“ (references: Amtliche Regelung der dt. Rechtschreibung, Duden)
  • Greek: Standard Modern Greek (references: Λεξικό της κοινής νεοελληνικής; David Holton, Peter Mackridge, Irene Philippaki-Warburton (1997), Greek: A Comprehensive Grammar of the Modern Language, London: Routledge)
  • Russian: современный русский литературный язык (the contemporary Russian literary language; reference: Малый академический словарь (МАС), https://rus-academic-dict.slovaronline.com/)
  • Turkish: Modern Standard Turkish (ISO 639-3: tur; reference: Türk Dil Kurumu)

Additional tasks

Annotate CUs for written data

For the written registers (iw, fw), we currently lack a syntactic unit. Furthermore, CUs as segmentation units were, amongst other reasons, chosen due to their applicability for spoken and written data. Having CUs in all types of texts facilitates cross-mode / cross-register comparisons.

To do this in EXMARaLDA, you first need to create the respective annotation tier. After having opened your file in EXMARaLDA, add a new tier through the menu bar (Tier > Add Tier). Choose the following settings:

  • Speaker: dipl [dipl]
  • Type: A(nnotation)
  • Category: cu

Since the tier depends on the diplomatic tokenization, it has one timeslot per event on dipl. To annotate a CU, merge all timeslots in the CU tier that overlap the tokens (dipl) belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "cu" to the newly created span.

The CU tier does not explicitly contain any text. Because it overlaps its respective tokens, the entire text of a CU is already defined and can be searched.
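The following minimal Python sketch (hypothetical data structures, not EXMARaLDA's API) illustrates how the text of a CU is recoverable purely from the timeslots its span overlaps:

```python
# Hypothetical dipl events as (start_slot, end_slot, text) triples:
dipl_events = [(0, 1, "am"), (1, 2, "i"), (2, 3, "in"), (3, 4, "london"), (4, 5, ".")]

# One merged annotation span on the cu tier, carrying the value "cu":
cu_event = (0, 5, "cu")

# The CU's text is simply the concatenation of all dipl events
# whose interval lies inside the span of the cu event:
cu_text = " ".join(text for start, end, text in dipl_events
                   if start >= cu_event[0] and end <= cu_event[1])
print(cu_text)  # am i in london .
```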

Please do not edit the CU layer of the transcriptions. They are final.

Adapt language annotation

For each dipl token there is a language annotation. When preparing your data, this is automatically set to the language of elicitation. Nevertheless, in some cases you might have to adapt that value. Please use the ISO 639-2 language code; consult the official ISO 639-2 code list for valid codes.

For cancellations, pauses, emojis and non-verbal material, please delete the language annotation.
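For illustration, a small Python sketch (hypothetical data structures, not tied to any tool) of the per-token language annotation with ISO 639-2 codes; the example tokens and the default value are invented:

```python
# ISO 639-2 codes for the corpus languages: English, German, Greek, Russian, Turkish.
CORPUS_LANGUAGES = {"eng", "deu", "ell", "rus", "tur"}

# Hypothetical dipl tokens of a German elicitation containing English material:
dipl_tokens = ["ich", "war", "like", "total", "verwirrt"]

# The pipeline initializes every token with the language of elicitation ...
lang_annotation = {i: "deu" for i in range(len(dipl_tokens))}

# ... and the annotator overrides individual tokens where necessary:
lang_annotation[2] = "eng"  # "like" is English material here
assert lang_annotation[2] in CORPUS_LANGUAGES

# For cancellations, pauses, emojis and non-verbal material the annotation
# is deleted instead, e.g. del lang_annotation[i] for the affected token.
```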

Normalization guidelines

The rules are always applied to the normalized tokenization; the diplomatic tokenization remains as is.

The following lists and examples will be extended and updated, especially with more examples from Greek, Russian and Turkish.

Notation: Normalization rules are formulated as diplomatic token(s) → normalized token(s). Token boundaries are represented with a slash (/).

Phenomenon: (CU-initial) capitalization
What to do on norm level: Do not correct; only adapt if a word itself has to be capitalized. Do not capitalize merely because of a CU-initial position.
English example: am / i / in / london → am / I / in / London
German example: das / auto / konnte / nicht / mehr / bremsen → das / Auto / konnte / nicht / mehr / bremsen

Note: Rule of thumb: capitalize if the word needs to be capitalized in the orthography of reference, i. e. only introduce capitalization if it is the correct spelling independent of whether the word appears in an initial position or not. Also, do not undo CU-initial capitalization if it was provided by the writer.

Phenomenon: orthographic errors or phonetic assimilations / slips of the tongue
What to do on norm level: normalize orthography
English example: orthographie → orthography
German examples: hunt → Hund; fümf → fünf
Comment: see note 1 below

Phenomenon: orthographic errors or phonetic assimilations / slips of the tongue (within vocabulary)
What to do on norm level: correct to the target hypothesis if the meaning of the corrected form is much more likely AND the normalization implies a change in category
English example: I / was / their → I / was / there
German examples: ihr / seit / dort → ihr / seid / dort; Umfall → Unfall
Comment: see note 2 below

Phenomenon: written data: composed forms
What to do on norm level: refer to the above-mentioned reference and normalize
English examples: inter-national → international; caraccident → car / accident
German example: Auto / Unfall → Autounfall

Note 1: This is the prototypical case of normalization.

Note 2: Both corrections are much more likely, since it is very hard to derive any meaning from the uncorrected examples even assuming they were meant this way. Additionally, since this is a 1:1 normalization in terms of words, the normalization is not particularly invasive. Last but not least, since the elicitations take place in a more or less closed setting, it is much easier to judge the likelihood of an expression in cases of potential within-vocabulary errors.

Phenomenon: missing words
What to do on norm level: do not add them; we do not normalize syntax
English example: you / there / ? → you / there / ?
German example: bin / unterwegs → bin / unterwegs
Phenomenon: repetitions
What to do on norm level: keep, normalize orthography
English example: the / the / cAr → the / the / car
German example: ein / ein / Autounfall
Phenomenon: self-corrections
What to do on norm level: see repetitions / cancellations
English example: the / hatchback / (eh) / car → the / hatchback / car (filled pause deleted, rest remains)
German examples: ein / Typ / ein / Fußgänger / hat / dann remains as is, whereas ein / Fußg/ / ein / Typ → ein / ein / Typ, since Fußg/ was explicitly marked as cancelled by the transcriber

Note: The proposed procedure for self-corrections is the least theory-driven and thus avoids making any grammatical assumptions when normalizing. Nevertheless (or therefore), it might also be very controversial. It definitely does not facilitate later syntactic annotation.

Phenomenon: spoken data: cancellations (see note 3 below)
What to do on norm level: drop the incomplete element
English example: car / acc/ → car
German example: Autounf/

Note 3: Sometimes it is difficult to tell whether something is a cancellation, because the incomplete part might already be a word of its own. In these cases we rely on the transcriber's decision: if something is marked as a cancellation in the transcript, we stick to that interpretation. Nevertheless, that should not keep your project from revising transcriptions (provided the current state of corpus creation still allows for such revisions).

Phenomenon: spoken data: phonetic markers
What to do on norm level: remove and normalize to orthography
English example: a::nd → and
German example: und / da::nn → und / dann
Phenomenon: reduced forms
What to do on norm level: if reconstructable, normalize orthographically (see note 4 below)
English example: ca → car
German examples: nich → nicht; wolln → wollen

Note 4: The reduction has to be out of vocabulary. If it is not, consult the guidelines on within-vocabulary errors above.

Note: There might be (even in the dictionary) widely accepted variants of standardized spellings that are more or less the result of reduction (or other processes). Examples are cross instead of across in English, or grade instead of gerade in German. We need to focus on the goal of normalization: to have a text segmentation where orthographic variation is reduced or removed, i. e. where multiple expressions are mapped to one standard representation. Thus, as long as there is no argument that normalizing an item alters its meaning, these reductions should be normalized. We only deal with mappings from one token to one token, i. e. retrieving the originally chosen expression is very easy and we do not lose information. A counterexample are cases of drauf in German. As a pronominal adverb it is often a reduction of darauf and should be normalized as such. Nevertheless, it can also be the separated particle of the particle verb draufgehen (to die, fig.), and normalizing it to darauf would not be appropriate/"correct". When normalizing, we will most likely encounter much more subtle cases of difference in meaning. We need to discuss and document our decisions.

Phenomenon: colloquial forms, dialect
What to do on norm level: normalize
English examples: coz → because; rulz → rules; cha → you
German examples: nix → nichts; ick → ich; jetze → jetzt; itzt → jetzt
Phenomenon: merged forms
What to do on norm level: split into normalized items as long as both items are visibly represented in the merged form
English examples: wanna → want / to; gonna → going / to (see note 5 below)
German example: kannste → kannst / du (see note 6 below)

Note 5 (regarding the two English examples): what / cha / gonna / do → what / you / going / to / do

Note 6: kannst / mir / mal / helfen remains as is, since there is no overt material for du without assuming a reduction of kannst, i. e. this is a case of missing material, not a merge (see above).

Phenomenon: merges / clitics
What to do on norm level: separate and normalize, if possible
English examples: isnt → is / n't; withe / ball → with / e / ball; he's / guilty → he / 's / guilty
German example: binsch → bin / ich

Note: We normalize a clitic if there is no ambiguity about the standard form. If the standard form cannot be determined unambiguously, the clitic is represented as a clitic, but in a standardized way (see table below for examples).

English clitics

We always separate the clitics, i. e. divide a single diplomatic token into two norm tokens. See the following list of examples:

Diplomatic → Normalized:

  • n't → not
  • 'll → will
  • 's → 's
  • 're → 're
  • isn't or isnt → is / not
  • aren't or arent (or even arnt) → are / not
  • cannot or can't or cant → can / not
  • won't or wont → will / not
  • didn't or didnt → did / not
  • don't or dont → do / not
  • doesn't or doesnt (or even dosnt) → does / not
  • ain't or aint → ai / not
  • couldn't or couldnt → could / not
  • shouldn't or shouldnt → should / not
  • wouldn't or wouldnt → would / not
  • she'll → she / will
  • he's → he / 's
  • you're → you / 're
  • Thomas' / computer → Thomas / ' / computer
  • Anke's / mail or Ankes / mail → Anke / 's / mail
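If clitic separation is to be pre-generated automatically, the list above can be read as a lookup table. The following Python sketch (hypothetical function and table names, repeating only a few of the mappings) shows the idea:

```python
# A few of the dipl -> norm mappings from the list above (not exhaustive):
CLITIC_SPLITS = {
    "isnt": ["is", "not"],  "isn't": ["is", "not"],
    "cant": ["can", "not"], "can't": ["can", "not"], "cannot": ["can", "not"],
    "dont": ["do", "not"],  "don't": ["do", "not"],
    "he's": ["he", "'s"],
    "you're": ["you", "'re"],
    "she'll": ["she", "will"],
}

def normalize_clitic(dipl_token: str) -> list[str]:
    """Return the norm tokens for one diplomatic token, splitting known clitics.

    Unknown tokens are passed through unchanged as a single norm token.
    """
    return CLITIC_SPLITS.get(dipl_token.lower(), [dipl_token])

print(normalize_clitic("dont"))    # ['do', 'not']
print(normalize_clitic("he's"))    # ['he', "'s"]
print(normalize_clitic("london"))  # ['london']
```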

German clitics

Separate clitics analogously to the English clitics, but do not use an apostrophe.

Greek and Russian clitics

Represent them according to the chosen orthographic standard.

Turkish clitics

Concatenated morphological elements can remain a unit, since this is part of the language. If you prefer to separate them anyway, make sure you work consistently.

Phenomenon: lexicalized merged forms
What to do on norm level: do not normalize (open points: work with a list? refer to the orthography of reference?)
English example: Rock'n'Roll → Rock'n'Roll
German examples: im → im; zum → zum; beim → beim; ins → ins
Phenomenon: non-standard variation within a paradigm
What to do on norm level: normalize
German example: helf / mir → hilf / mir

Note: Non-standardized but maybe widely established and accepted forms can be understood as alternative spellings. For alternative spellings our rules guide us to choose a standard and always use that to guarantee consistency. Therefore, normalizing to the true standard form is a good way to go. It is also a safe procedure, since we are dealing with single word to single word mappings, i. e. the original diplomatic form is always retrievable.

Special material / cases

Phenomenon: speaker codes
What to do on norm level: do not normalize
Phenomenon: foreign material / proper names in a foreign language
What to do on norm level: normalize according to agreement
English example: leik → like

Note: The transcription of foreign material is always phonographic. The normalization follows the conventions of the respective language: the phonographic leik (dipl) might become like (en) on the German norm layer, whereas for a Russian speaker who says Autowerkstatt in German within a Russian context, both the dipl and the norm form will be phonographic representations (аутоверкштат).

Also: Remember to adapt the language annotation for these cases.

Phenomenon: abbreviations
What to do on norm level: (merge and) normalize according to standard orthography
Example: pekawe → PKW
Phenomenon: cardinal numbers
What to do on norm level: normalize according to the regulations in the orthography of reference

Note: In the written data it is enough to correct the spelling of numeral words; there is no need to convert actual numbers to words or vice versa.

Phenomenon: floating point numbers
What to do on norm level: normalize as numbers
English example: one / point / three → 1.3
German example: eins / Komma / drei → 1,3
Phenomenon: times and dates
What to do on norm level: these are compositions of elements; apply the respective normalization rules to each element individually
Phenomenon: fractions
What to do on norm level: normalize / keep as words
English example: one / third
German example: ein / Drittel
Phenomenon: pauses, noise, turn-holders, laughing, ...
What to do on norm level: not represented, i. e. delete the token
Phenomenon: filled pause vs. particle, etc.
What to do on norm level: If you decide to consider certain filled pauses lexical units, decide on a standardized representation and always represent that unit in the same way on the normalization layer. If you are currently undecided on the status of such an element and this is actually part of your research, delete that element on the normalization layer.
Phenomenon: interaction / communication with the elicitor
What to do on norm level: not represented
Comment: There is a transcription standard for these events. These events will not be normalized; delete them on the norm layer.
Phenomenon: incomprehensible material without a hypothesis (unknown material)
What to do on norm level: not represented, i. e. delete the token on norm
English example: (UNK)
German example: (UNK)
Phenomenon: incomprehensible material with a hypothesis
What to do on norm level: the hypothesis is represented in a normalized version
English example: (car) → car
German example: (jetze) → jetzt
Phenomenon: punctuation
What to do on norm level: do not add or normalize punctuation, but separate it on both dipl and norm
English example: accident.Then / the → accident / . / Then / the
German example: ,dass → , / dass

NOTE: This is a phenomenon you may face in written data. Since there is no obligation for authors to stick to pure typing conventions and use spaces after punctuation, we treat cases like the above as tokenization errors caused by the pipeline. To remain within a certain frame of complexity, we cannot resolve these issues within automatic tokenization. Please perform the separation during normalization on both dipl and norm. Each element of punctuation should be its own token, unless it is a sequence of punctuation characters without whitespace in between; in these cases the characters remain a unit (e. g. "..." stays like this on dipl and norm).
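A minimal Python sketch of this separation rule, assuming a simple regular expression over a hypothetical set of punctuation characters (to be adjusted to the corpus languages); the same split would be applied on both dipl and norm:

```python
import re

# Split a graphemic word into word material and punctuation runs.
# A run of punctuation characters (no whitespace in between) stays one token.
PUNCT_RUN = re.compile(r"[.,;:!?]+|[^.,;:!?]+")

def separate_punctuation(token: str) -> list[str]:
    """'accident.Then' -> ['accident', '.', 'Then']; '...' -> ['...']"""
    return PUNCT_RUN.findall(token)

print(separate_punctuation("accident.Then"))  # ['accident', '.', 'Then']
print(separate_punctuation(",dass"))          # [',', 'dass']
print(separate_punctuation("..."))            # ['...']
```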

Phenomenon: multiple representations allowed in the orthography of reference
What to do on norm level: Make a decision, document it and stay consistent.
German example: Albtraum vs. Alptraum
Comment: This also implies that if a participant uses the alternative form, you should normalize it to the variant you chose as the standard in the corpus.

Grammatical deviations

If there is a mismatch in case, gender or any other grammatical property/feature, do not normalize it. This has to remain open to investigation. You might, of course, find non-trivial cases between orthographic normalization and an actual grammatical target hypothesis. Document these cases and raise a discussion if necessary.