Annotation Step 2: Normalization
Segmentations in our corpus
Our corpus pipeline handles four types of input for each speaker who participated in the elicitation:
- formal written (fw) text files
- informal written (iw) chat exports
- formal spoken (fs) transcriptions
- informal spoken (is) transcriptions
All data types are divided into segments above the word level ("phrasal" segments):
- fw: sentences indicated by punctuation used by the participant
- iw: messages and lines indicated by separators used by the participant (punctuation, new line, message separation)
- fs and is: communication units (CUs) indicated by the transcriber (some projects might additionally use intonation phrases)
We annotate our CUs on the annotation tier "cu" (in lowercase letters). These segments are then automatically tokenized (roughly speaking, divided into character sequences delimited by whitespace), which provides our first word-level segmentation: the diplomatic layer (dipl), consisting of the graphemic words represented in the phrasal segments.
This word-level segmentation is then normalized to facilitate search and automatic annotation.
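As a rough illustration of the automatic tokenization step, whitespace splitting can be sketched as follows. The function name `tokenize` is hypothetical and this is not the pipeline's actual implementation; it only shows why punctuation typed without a following space ends up inside a diplomatic token.

```python
def tokenize(segment):
    """Split a phrasal segment into graphemic words at whitespace."""
    # str.split() without arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return segment.split()


dipl = tokenize("am i in  london")
print(dipl)

# Punctuation typed without a following space stays inside the token.
print(tokenize("accident.Then the car"))
```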
What is normalization in our corpus and what is its purpose
The result of normalization is an additional word-level segmentation layer, similar to dipl. Starting from the diplomatic tokenization, explicit rules are applied to obtain orthographically normalized tokens on the norm layer. The norm layer starts off as a copy of the diplomatic token layer; its tokens can (but need not) undergo operations such as editing, deletion, splitting or merging.
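The four operations can be pictured with a small data structure that keeps every group of norm tokens aligned with the dipl tokens it came from. This is a minimal sketch under assumptions: the class `Unit` and all function names are hypothetical and do not reflect how EXMARaLDA or the corpus pipeline store the layers.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One alignment unit: dipl tokens and the norm tokens derived from them."""
    dipl: list
    norm: list


def copy_layer(dipl_tokens):
    # The norm layer starts as a 1:1 copy of the diplomatic layer.
    return [Unit([t], [t]) for t in dipl_tokens]


def edit(units, i, new_form):
    units[i].norm = [new_form]


def delete(units, i):
    # The dipl token is kept; only its norm counterpart is removed.
    units[i].norm = []


def split(units, i, parts):
    units[i].norm = list(parts)


def merge(units, first, last, merged_form):
    # Fuse units first..last (inclusive) into one unit with a single norm token.
    dipl = [t for u in units[first:last + 1] for t in u.dipl]
    units[first:last + 1] = [Unit(dipl, [merged_form])]


def norm_layer(units):
    return [t for u in units for t in u.norm]


units = copy_layer(["wanna", "the", "(eh)", "cAr"])
split(units, 0, ["want", "to"])   # wanna -> want / to
delete(units, 2)                  # filled pause is not represented on norm
edit(units, 3, "car")             # cAr -> car
print(norm_layer(units))

compound = copy_layer(["Auto", "Unfall"])
merge(compound, 0, 1, "Autounfall")  # Auto / Unfall -> Autounfall
print(norm_layer(compound))
```

Because each unit retains its dipl tokens, the original diplomatic form stays retrievable after every operation.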
Underlying principles and things to keep in mind
- Our normalization is not the last step of analysis and does not remove information from the overall corpus.
- Normalization takes place mainly on the word level, to obtain standardized forms with respect to an orthography of reference.
- We will agree on an orthography of reference for each language in our corpus.
- The normalized layer uses the language's native script, whereas the diplomatic layer sticks to the script used by the participants for the written registers (fw, iw).
- We do not normalize syntax. This way we can learn about an underlying grammar more easily than by analysing the elicited language in terms of a standard grammar. Note that this comes with disadvantages but is more appropriate for the idea behind our projects. We want to obtain a layer that is orthographically normalized but still allows syntactic analyses, which is why we try to keep as many elements as possible.
- Normalization is also a technically necessary process. Orthographically standardized forms help all researchers, including those not familiar with our data, to search for linguistic phenomena. Furthermore, annotations to be integrated into our corpus can be pre-generated automatically, which reduces correction effort.
- The highest priority is to ensure that we will always be able to answer our research questions.
- We do not alter lexical choice when normalizing.
- The norm layer will enable us to identify grammatical units.
- A normalization decision is always influenced by the question of how controversial and/or lossy in terms of information that decision might be.
- Since this is our first layer of normalization, we normalize very carefully.
- There will be no grammatical normalizations (adaptation of case, gender or other grammatical features that deviate from our understanding of "standard").
- We normalize to a single standardized form (if possible) to reduce noise in the tokenization.
- We always normalize as far as we can go, which means we are more careful in some cases than in others. For example, 're stays 're in the normalization because the tense is unclear, whereas 'll can be normalized to will, since it is entirely uncontroversial what the diplomatic form expresses.
- When normalizing, we try to be as uninterpretative as possible.
- 1:1 mappings between diplomatic and normalized tokens are easier to decide on, because the alteration of the form is easy to trace.
- The standardized representation should be influenced by the underlying orthography explicitly and implicitly, e. g. for German clitics we do not use an apostrophe in the separated token, whereas we do for English.
- Normalization should not alter meaning (if possible).
Orthography of reference
language | orthography of reference | reference |
---|---|---|
English | American English | Merriam-Webster |
German | „Neue deutsche Rechtschreibung“ | Amtliche Regelung der dt. Rechtschreibung, Duden |
Greek | Standard Modern Greek | Λεξικό της κοινής νεοελληνικής and David Holton, Peter Mackridge, Irene Philippaki-Warburton (1997) Greek: A Comprehensive Grammar of the Modern Language, London: Routledge |
Russian | современный русский литературный язык | Малый академический словарь (МАС) https://rus-academic-dict.slovaronline.com/ |
Turkish | Turkish / Modern Standard Turkish (ISO 639-3 = tur) | Türk Dil Kurumu |
Additional tasks
Annotate CUs for written data
For the written registers (iw, fw), we currently lack a syntactic unit. Furthermore, CUs as segmentation units were, amongst other reasons, chosen due to their applicability for spoken and written data. Having CUs in all types of texts facilitates cross-mode / cross-register comparisons.
To do this in EXMARaLDA, you first need to create the respective annotation tier. After having opened your file in EXMARaLDA, add a new tier through the menu bar (Tier > Add Tier). Choose the following settings:
- Speaker: dipl [dipl]
- Type: A(nnotation)
- Category: cu
Because it depends on the diplomatic tokenization, the tier has one timeslot per event on dipl. To annotate a CU, merge all timeslots in the CU tier that overlap the dipl tokens belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "cu" to the newly created span.
The CU tier does not explicitly contain any text. Because each CU span overlaps its tokens, the entire text of a CU is already defined and can be searched.
Please do not edit the CU layer of the transcriptions (fs, is); these CUs are final.
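The point that a CU span carries no text of its own, yet fully determines the text it covers, can be sketched with plain token indices. The index-pair representation and the name `cu_text` are assumptions made for illustration only; EXMARaLDA itself works with timeslots rather than list indices.

```python
# Diplomatic tokens of a short written text.
dipl = ["the", "car", "stopped", ".", "then", "it", "left", "."]

# Hypothetical CU spans as (first token index, last token index), inclusive.
# Each span carries only the annotation value "cu", never the text itself.
cu_spans = [(0, 3), (4, 7)]


def cu_text(tokens, span):
    """Recover the text of a CU from the tokens its span overlaps."""
    first, last = span
    return " ".join(tokens[first:last + 1])


for span in cu_spans:
    print(cu_text(dipl, span))
```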
Adapt language annotation
For each dipl token there is a language annotation. When preparing your data, this is automatically set to the language of elicitation. Nevertheless, in some cases you might have to adapt that value. Please use the ISO 639-2 language code. A list of valid codes can be found here.
For cancellations, pauses, emojis and non-verbal material, please delete the language annotation.
Normalization guidelines
The rules are always applied on the normalized tokenization. The diplomatic tokenization remains as is.
The following lists and examples will be extended and updated with more examples, especially from Greek, Russian and Turkish.
Notation: Normalization rules are formulated as diplomatic token(s) → normalized token(s). Token boundaries are represented with a slash (/).
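To make the notation concrete, a rule string can be read into its two token lists as sketched below. The function `parse_rule` is hypothetical and not part of any project tooling. It assumes that a token boundary is a slash with whitespace on both sides, so that a cancellation marker such as `acc/` stays part of its token.

```python
import re


def parse_rule(rule):
    """Split 'dipl tokens → norm tokens' into two lists of tokens."""
    dipl_side, arrow, norm_side = rule.partition("→")
    if not arrow:
        raise ValueError("rule must contain the arrow '→'")

    def tokens(side):
        # A boundary is a slash surrounded by whitespace; empty sides give [].
        return [t for t in re.split(r"\s+/\s+", side.strip()) if t]

    return tokens(dipl_side), tokens(norm_side)


print(parse_rule("am / i / in / london → am / I / in / London"))
print(parse_rule("car / acc/ → car"))
```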
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comments / Rationale / Criticism |
---|---|---|---|---|---|
(CU-initial) capitalization | do not correct. Only adapt if a word itself has to be capitalized. Do not capitalize merely because of a CU-initial position | am / i / in / london → am / I / in / London | das / auto / konnte / nicht / mehr / bremsen → das / Auto / konnte / nicht / mehr / bremsen |
Note: Rule of thumb: capitalize if the word needs to be capitalized in the orthography of reference, i. e. only introduce capitalization if it is the correct spelling regardless of whether the word appears in initial position. Also, do not undo CU-initial capitalization if provided by the writer.
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comments / Rationale / Criticism |
---|---|---|---|---|---|
orthographic errors or phonetic assimilations / slip of the tongue | normalize orthography | orthographie → orthography | hunt → Hund; fümf → fünf | ~ | 1 |
orthographic errors or phonetic assimilations / slip of the tongue (within vocabulary) | correct to target hypothesis if meaning of corrected form is much more likely AND the normalization implies a change in category | I / was / their → I / was / there | ihr / seit / dort → ihr / seid / dort; Umfall → Unfall | ~ | 2 |
written data: composed forms | refer to above mentioned reference and normalize | inter-national → international; caraccident → car / accident | Auto / Unfall → Autounfall |
Note 1: This is the prototypical case of normalization.
Note 2: Both corrections are much more likely than the uncorrected forms, since it is very hard to derive any meaning from the uncorrected examples if one assumes they were meant as written. Additionally, since this is a 1:1 normalization in terms of words, it is not particularly invasive. Last but not least, since the elicitations take place in a more or less closed setting, it is much easier to judge the likelihood of an expression in cases of potential within-vocabulary errors.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
missing words | do not add them, we do not normalize syntax | you / there / ? → you / there / ? | bin / unterwegs → bin / unterwegs |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
repetitions | keep, normalize orthography | the / the / cAr → the / the / car | ein / ein / Autounfall |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
self-corrections | see repetitions / cancellations | the / hatchback / (eh) / car → the / hatchback / car (filled pause deleted, rest remains) | ein / Typ / ein / Fußgänger / hat / dann remains as is, whereas ein / Fußg/ / ein / Typ → ein / ein / Typ, since Fußg/ was explicitly marked as cancelled by the transcriber |
Note: The proposed procedure for self-corrections is the least theory-driven one and thus avoids making any grammatical assumptions when normalizing. Nevertheless, or precisely for that reason, it might also be very controversial. It definitely does not facilitate later syntactic annotation.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
spoken data: cancellations | drop the incomplete element | car / acc/ → car | Autounf/ → |
Note: Sometimes it is difficult to tell whether something is a cancellation or not, because the incomplete part might already be a word on its own. In these cases we rely on the transcriber's decision: If something is marked as cancellation in the transcript, we stick to that interpretation. Nevertheless, that should not keep your project from revising transcriptions (given the current state of corpus creation still allows for such revisions).
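Since the decision rests entirely on the transcriber's marking, dropping cancellations on the norm layer amounts to a simple filter. The sketch below assumes that a cancellation is marked by a trailing slash on the token, as in the examples above. The function name is hypothetical.

```python
def drop_cancellations(tokens):
    """Remove tokens the transcriber marked as cancelled (trailing slash)."""
    return [t for t in tokens if not t.endswith("/")]


print(drop_cancellations(["car", "acc/"]))
print(drop_cancellations(["ein", "Fußg/", "ein", "Typ"]))
```

An unmarked fragment that happens to be a word on its own is left untouched, which matches the rule of relying on the transcript.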
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
spoken data: phonetic markers | remove and normalize to orthography | a::nd → and | und / da::nn → und / dann |
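Removing the markers can be sketched with a regular expression. This assumes that the only phonetic marker is the colon used for lengthening in the examples above, and that it occurs between two word characters. Other markers in a project's transcription convention would need their own patterns. The function name is hypothetical.

```python
import re


def strip_lengthening(token):
    """Remove lengthening colons that occur inside a word."""
    # Only colons between two word characters are removed, so a colon
    # that is a punctuation token of its own is left alone.
    return re.sub(r"(?<=\w):+(?=\w)", "", token)


print(strip_lengthening("a::nd"))
print(strip_lengthening("da::nn"))
```

The result still has to be checked against the orthography of reference, since removing the marker does not by itself guarantee a standard spelling.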
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
reduced forms | if reconstructable, normalize orthographically | ca → car | nich → nicht; wolln → wollen |
Note: reduction has to be out of vocabulary. If not, check with the guidelines on within-vocabulary errors above.
Note: There might be widely accepted variants of standardized spellings (even in the dictionary) that are more or less the result of reduction (or other processes). Examples are cross instead of across in English and grade instead of gerade in German. We need to focus on the goal of normalization: to have a text segmentation in which orthographic variation is reduced or removed, i. e. in which multiple expressions are mapped to one standard representation. Thus, as long as there is no argument that normalizing an item alters its meaning, these reductions should be normalized. We only deal with mappings from one token to one token, i. e. retrieving the originally chosen expression is very easy and we do not lose information. A counterexample is drauf in German. As a pronominal adverb it is often a reduction of darauf and should be normalized as such. However, it can also be the separated particle of the particle verb draufgehen (to die, fig.), in which case normalizing it to darauf would not be appropriate/"correct". When normalizing, we will most likely encounter much more subtle differences in meaning. We need to discuss and document our decisions.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
colloquial forms, dialect | normalize | coz → because; rulz → rules; cha → you | nix → nichts; ick → ich; jetze → jetzt; itzt → jetzt |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
merged forms | split into normalized items as long as both items are visibly represented in the merged form | wanna → want / to; gonna → going / to | kannste → kannst / du |
Note: regarding (1) and (2): what / cha / gonna / do → what / you / going / to / do
Note: kannst / mir / mal / helfen remains as is, since there is no overt material for du without assuming a reduction of kannst, i. e. this is a case of missing material, not a merge (see above).
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
merges / clitics | separate and normalize, if possible | isnt → is / not; withe / ball → with / e / ball; he's / guilty → he / 's / guilty | binsch → bin / ich |
Note: We normalize a clitic if there is no ambiguity about the standard form. If the standard form cannot be determined unambiguously, the clitic is represented as a clitic, but in a standardized way (see table below for examples).
English clitics
We always separate the clitics, i. e. divide a single diplomatic token into two norm tokens. See the following list of examples:
Diplomatic | Normalized |
---|---|
n't | not |
'll | will |
's | 's |
're | 're |
isn't or isnt | is / not |
aren't or arent (or even arnt) | are / not |
cannot or can't or cant | can / not |
won't or wont | will / not |
didn't or didnt | did / not |
don't or dont | do / not |
doesn't or doesnt (or even dosnt) | does / not |
ain't or aint | ai / not |
couldn't or couldnt | could / not |
shouldn't or shouldnt | should / not |
wouldn't or wouldnt | would / not |
she'll | she / will |
he's | he / 's |
you're | you / 're |
Thomas' / computer | Thomas / ' / computer |
Anke's / mail or Ankes / mail | Anke / 's / mail |
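The closed part of this table lends itself to a lookup. The following sketch covers only forms listed above. The dictionary and function names are hypothetical, and possessives such as Anke's are left out because they are open-class. Note that some apostrophe-less spellings (for example wont or cant) are also words in their own right, so a lookup like this can only propose candidates for manual checking.

```python
# Diplomatic form (lowercase) -> normalized tokens, taken from the table above.
ENGLISH_CLITICS = {
    "isn't": ["is", "not"], "isnt": ["is", "not"],
    "aren't": ["are", "not"], "arent": ["are", "not"], "arnt": ["are", "not"],
    "cannot": ["can", "not"], "can't": ["can", "not"], "cant": ["can", "not"],
    "won't": ["will", "not"], "wont": ["will", "not"],
    "didn't": ["did", "not"], "didnt": ["did", "not"],
    "don't": ["do", "not"], "dont": ["do", "not"],
    "doesn't": ["does", "not"], "doesnt": ["does", "not"], "dosnt": ["does", "not"],
    "ain't": ["ai", "not"], "aint": ["ai", "not"],
    "couldn't": ["could", "not"], "couldnt": ["could", "not"],
    "shouldn't": ["should", "not"], "shouldnt": ["should", "not"],
    "wouldn't": ["would", "not"], "wouldnt": ["would", "not"],
    "she'll": ["she", "will"],
    "he's": ["he", "'s"],      # 's stays a clitic: is / has is ambiguous
    "you're": ["you", "'re"],  # 're stays a clitic, see the principles above
}


def propose_clitic_split(token):
    """Return the proposed norm tokens for a dipl token, or the token itself."""
    key = token.lower()
    if key not in ENGLISH_CLITICS:
        return [token]
    parts = list(ENGLISH_CLITICS[key])
    # Keep an initial capital provided by the writer.
    if token[0].isupper():
        parts[0] = parts[0][0].upper() + parts[0][1:]
    return parts


print(propose_clitic_split("dosnt"))
print(propose_clitic_split("She'll"))
```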
German clitics
Separate clitics analogously to English clitics, but do not use an apostrophe.
Greek and Russian clitics
Represent them according to the chosen orthographic standard.
Turkish clitics
Concatenated morphological elements can remain a unit, since this is part of the language. If you prefer to separate them anyway, make sure you work consistently.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
lexicalized merged forms | do not normalize; work with list?; refer to orthography reference? | Rock'n'Roll → Rock'n'Roll | im → im; zum → zum; beim → beim; ins → ins |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
non-standard variation within a paradigm | normalize | ... | helf / mir → hilf / mir |
Note: Non-standardized but maybe widely established and accepted forms can be understood as alternative spellings. For alternative spellings our rules guide us to choose a standard and always use that to guarantee consistency. Therefore, normalizing to the true standard form is a good way to go. It is also a safe procedure, since we are dealing with single word to single word mappings, i. e. the original diplomatic form is always retrievable.
Special material / cases
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
speaker codes | do not normalize |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
foreign material / proper names in a foreign language | normalize according to agreement | | leik → like |
Note: The transcription of foreign material is always phonographic. The normalization follows the conventions of the respective language. For example, the phonographic leik (dipl) might become like (en) on the German norm layer, whereas if a Russian speaker says the German word Autowerkstatt in a Russian context, both the dipl and the norm form will be phonographic representations (аутоверкштат).
Also: Remember to adapt the language annotation for these cases.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
Abbreviations | (merge and) normalize according to standard orthography | | pekawe → PKW |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
cardinal numbers | normalize according to regulations in orthography of reference |
Note: In the written data, it is enough to correct the spelling of numeral words; there is no need to convert actual numbers to words or vice versa.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
floating point numbers | normalize as numbers | one / point / three → 1.3 | eins / Komma / drei → 1,3 |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
times and dates | treat as a composition of elements and apply the normalization rules for each element individually |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
fractions | normalize/keep as words | one / third | ein / Drittel |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
pauses, noise, turn-holders, laughing, ... | not represented, i. e. delete token |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
filled pause vs. particle, etc. | If you decide to consider certain filled pauses lexical units, decide on a standardized representation and represent that unit always the same way in the normalization layer. If you are currently undecided on the status of such an element and this is actually part of your research, delete that element on the normalization layer. |
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
interaction/communication with elicitor | not represented | ... | ... | | There is a transcription standard for these events. These events will not be normalized. Delete these events on the norm layer. |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
incomprehensible, no hypothesis (unknown material) | not represented, i. e. delete token on norm | (UNK) | (UNK) |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
incomprehensible with hypothesis | hypothesis is represented in a normalized version | (car) → car | (jetze) → jetzt |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
punctuation | neither add nor normalize, but separate on both dipl and norm | accident.Then / the → accident / . / Then / the | ,dass → , / dass |
NOTE: This is a phenomenon you may face in written data. Since authors are under no obligation to follow typing conventions and put a space after punctuation, we treat cases like the above as tokenization errors caused by the pipeline. To keep complexity manageable, we cannot resolve these issues within automatic tokenization. Please perform the separation during normalization on both dipl and norm. Each element of punctuation should be its own token, unless it is part of a sequence of punctuation characters without whitespace in between. In that case the characters remain a unit (e.g. "..." stays like this on dipl and norm).
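The separation rule can be sketched with a regular expression that cuts a token into alternating runs of punctuation and non-punctuation. This sketch rests on several assumptions: the punctuation set is limited to `. , ; : ! ?` so that apostrophes in clitics and slashes in cancellations are untouched, digit sequences joined by a single point or comma are treated as numbers and kept whole, and the function name is hypothetical. Its output is a suggestion for the manual step described above.

```python
import re

# Runs of assumed sentence punctuation, or runs of anything else.
_PIECES = re.compile(r"[.,;:!?]+|[^.,;:!?]+")
# Numbers such as 1.3 or 1,3 are not split.
_NUMBER = re.compile(r"\d+[.,]\d+")


def separate_punctuation(token):
    """Split off punctuation; a run of punctuation characters stays one unit."""
    if _NUMBER.fullmatch(token):
        return [token]
    return _PIECES.findall(token)


tokens = ["accident.Then", "the"]
separated = [piece for t in tokens for piece in separate_punctuation(t)]
print(separated)
print(separate_punctuation(",dass"))
print(separate_punctuation("..."))
```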
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
multiple representations allowed in orthography of reference | Make a decision, document it and stay consistent. | | Albtraum vs. Alptraum | ~ | This also implies that if a participant uses the alternative form, you should normalize to the variant you chose as the standard in the corpus. |
Grammatical deviations
If there is a mismatch in case, gender or any other grammatical property/feature, do not normalize it. This must remain open to investigation. You might, of course, find non-trivial cases that fall between orthographic normalization and an actual grammatical target hypothesis. Document these cases and raise a discussion if necessary.