Overview of Annotations
The RUEG corpus is a multi-layer corpus of both written and spoken language.
We use several annotation formats in the process of annotation, but all annotations, except for the dependency annotations, are part of the
EXMARaLDA file in the exb
directory.
In addition to the editable EXMARaLDA format, the corpus is also converted to the ANNIS format (annis
directory) for search and visualization.
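The .exb files are XML in EXMARaLDA's basic-transcription format, so they can be inspected with standard tools. The following minimal Python sketch lists the annotation tiers and their events; the file name is hypothetical, and the element and attribute names should be verified against your own files.

```python
# A minimal sketch of reading tiers and events from an EXMARaLDA .exb file;
# element and attribute names follow the basic-transcription format,
# but verify them against your own exports.
import xml.etree.ElementTree as ET

tree = ET.parse("example.exb")          # hypothetical file name
for tier in tree.iter("tier"):
    category = tier.get("category")     # e.g. "cu", "norm", "pos"
    events = [(e.get("start"), e.get("end"), e.text) for e in tier.iter("event")]
    print(category, len(events), "events")
```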
Dependencies between annotation layers
Most annotation layers depend on other annotations. This can lead to complex dependencies, as visualized by the following graph:
Meta data fields
In addition to the annotation layers, each document also has meta data fields, which are stored in the .meta
file next to each EXMARaLDA file.
The meta data is also included in the ANNIS format.
field name | type | description |
---|---|---|
speaker-id | String | |
formality | String | informal/formal |
mode | String | spoken/written |
speaker-bilingual | Boolean | yes/no |
elicitation-session | Number | 1 (monolinguals, bilinguals in first session) 2 (bilinguals in second session) |
elicitation-language | String | Language that is elicited from the speaker |
elicitation-country | String | |
elicitation-order | Number | 1-8 |
elicitator-good-id | String | project- and people-number of "good cop" |
elicitator-bad-id | String | project- and people-number of "bad cop" |
elicitation-date | String | 2018-XX-XX |
transcriber-id | String | comma-separated list of project- and person-number XX-XX |
normalizer-id | String | comma-separated list of project- and person-number XX-XX |
annotator-id | String | comma-separated list of project- and person-number XX-XX |
speaker-language-s | String | Languages as given by the participants and separated by comma |
speaker-age-group | String | children/adolescents/adults |
speaker-gender | String | m/f/d |
speaker-age | Number | two-digit number (age in years) |
speaker-AoO | Number | Age Of Onset in years (two-digits) |
speaker-AoO-answer | Number | complete, but anonymized answer string |
speaker-personality-score-X | Number | Personality score (1-7) for each of the questions 1-6 of the personality test |
speaker-extravert-score | Number | aggregated extravert score |
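As an illustration of how these fields can be consumed downstream, here is a small sketch that reads a .meta file into a dictionary and checks a few mandatory fields. The key = value line format and the file name are assumptions; adjust them to the actual layout of the .meta files.

```python
# Sketch: read a RUEG .meta file into a dict and check mandatory fields.
# Assumes one "key = value" pair per line; adjust if the real files differ.
from pathlib import Path

def read_meta(path):
    meta = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        meta[key.strip()] = value.strip()
    return meta

required = {"speaker-id", "formality", "mode", "speaker-bilingual"}
meta = read_meta("DEbi02FG_fsD.meta")   # hypothetical file name
missing = required - meta.keys()
if missing:
    print("missing meta fields:", sorted(missing))
```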
cu (Communication Unit)
Value set: open
Segmentation and transcription of Communication Units. For spoken data, the start and the end of the CUs are manually aligned with the audio.
See the transcription guidelines for details.
Processing steps
# | type | step | output format |
---|---|---|---|
1 | manual | Transcription | exb |
dipl (Tokenization)
Value set: open
Automatic tokenization of the text into words.
- as defined by the TreeTagger tokenization script
- extra handling for emojis and pauses
Language-specific differences
- language-specific abbreviations
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | TreeTagger | exb |
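The actual tokenization is done by the TreeTagger tokenization script; the sketch below only illustrates the kind of extra handling mentioned above (pauses and emoticons kept as single tokens) with a plain regular expression, and is not the script used in the pipeline.

```python
import re

# Illustrative tokenizer: keeps pause markers like (-), (--), (5.5) and simple
# emoticons such as :) as single tokens; = : % ' are allowed inside words
# (merged/lengthened forms), and a trailing / marks an unfinished word.
PAUSE = r"\((?:-{1,2}|\d+(?:\.\d+)?)\)"
EMOTICON = r"[:;=]-?[()DPp]|-_-"
WORD = r"\w+(?:[=%:']\w+)*/?"
TOKEN = re.compile(f"{PAUSE}|{EMOTICON}|{WORD}|\\S")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("the bl/ blue car (-) crashed :)"))
# ['the', 'bl/', 'blue', 'car', '(-)', 'crashed', ':)']
```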
norm (corpus-wide normalization)
Value set: open
A common normalization that is the same for written and spoken data. This allows a search across registers.
- segmented into graphemic words
- emojis are a single word
- text message acronyms are treated as a single word
- punctuation is considered a token if not part of an emoji
- following standard orthography
- no word order corrections
- no grammatical corrections
Language-specific differences
- script is normalized to language standard
- each language decided on
- orthographic standard
- clitics
- script
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | Copy base text | exb |
2 | manual | Normalize | exb |
lemma (Lemmatization)
Value set: open
Lemmatization based on the normalization (norm).
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | lemmatization (part of the POS-tagging) | exb |
2 | manual | correction | exb |
pos (Universal part of speech)
Value set: closed
Part of speech annotation using the Universal POS tags.
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | Automatic POS tagging | exb |
pos_lang (Language specific Part of speech)
Value set: closed
Part of speech annotation with a tag-set for each language.
- there is one common tag-set for each language
- text message acronyms get their own tag manually (or if the tagger supports it, automatically)
Different tagsets are used for each language:
language | tag set | reference |
---|---|---|
English | British National Corpus / Claws 4 | Leech et al. 1994 |
German | STTS 2.0 | Westpfahl 2014 |
Russian | MyStem tag set | Segalovich 2003 |
Turkish | MULTILIT tag set | Schroeder et al. 2015 |
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | Automatic POS tagging with tool | exb |
2 | manual | correction | exb |
language (Language/Foreign Material)
Value set: closed
Describes the language.
- per-token
- ISO three letter language code
- every token has this category assigned
- no dialects
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | Fill out default language | exb |
2 | manual | Mark foreign material | exb |
message (Chat Message span)
Value set: natural numbers
Span annotation for each message in the chat. Contains its consecutive number.
line (Chat Message line)
Value set: open
Span annotation with the chat message text as content.
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | | exb |
dep (Universal Dependencies)
Value set: closed
Automatic Universal Dependency parsing.
Processing steps
# | type | step | output format |
---|---|---|---|
1 | automatic | UD Parsing | CoNLL |
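Since the dependency layer is delivered in CoNLL format rather than in the exb file, it can be read with any CoNLL-U reader. A small sketch using the third-party conllu package (the file name is hypothetical, and the export is assumed to be standard CoNLL-U):

```python
# Sketch: inspect the automatically parsed dependency layer,
# assuming a standard CoNLL-U export (pip install conllu).
from conllu import parse

with open("DEbi02FG_fsD.conllu", encoding="utf-8") as f:   # hypothetical file name
    sentences = parse(f.read())

for sentence in sentences:
    for token in sentence:
        print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```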
Leech, Geoffrey, Roger Garside, and Michael Bryant. 1994. “CLAWS4: The Tagging of the British National Corpus.” In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics. Vol. 1.
Westpfahl, Swantje. 2014. “STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data.” In Proceedings of Law Viii-the 8th Linguistic Annotation Workshop, 1–10.
Segalovich, Ilya. 2003. “A Fast Morphological Algorithm with Unknown Word Guessing Induced by a Dictionary for a Web Search Engine.” In MLMTA, 273–80. Citeseer.
Schroeder, Christoph, Christin Schellhardt, Mehmet-Ali Akinci, Meral Dollnick, Ginesa Dux, Esin Işil Gülbeyaz, Anne Jähnert, et al. 2015. “MULTILIT.” Universität Potsdam. https://publishup.uni-potsdam.de/opus4-ubp/frontdoor/index/index/docId/8039.
Annotation Step 1: Transcription
Anonymisation
Anonymisation on Speaker tier
- Replace name of participant with the respective speaker code, e.g. USbi02FR
- If whole names or surnames of friends are mentioned, replace with the participant code + _P, e.g. USbi02FR_P
COMMENT: To our knowledge, we anonymize all parts of the speaker name, including cases where only the first name is mentioned.
- Places that could lead to the identification of a participant, such as street names, school names, etc.:
If the participant names, for example, "Friedrichstraße", you transcribe "{streetname}Straßezzz". You leave out the actual name of the street; in this example you replace "Friedrich". "zzz" has to be replaced by any inflectional suffixes you hear in the data, or erased if there are no inflectional suffixes. In this example you would transcribe "{streetname}Straße", because there are no inflectional suffixes. Accordingly, if schools are named, for example "Apple Highschools" ("s" at the end for an English plural form), you transcribe "{schoolname}schools". There should be no spaces following the {...}. The placeholder you put in between the {...} is in English, regardless of the language you elicit and transcribe in. Over time, a list of these placeholders should develop, so that all identifiers can be replaced and put between the {...} ("{streetname}Allee"). This list is the same for every project and language.
Segmentation
- Communication Unit (CU) is used as a segmentation unit
- No punctuation marks at all on the transcription layer
- No accents, no intonation patterns are marked
- In EXMARaLDA: blank space at the end of each event (no punctuation marks on the norm layer)
Spelling
- No capital letters
- Abbreviations/acronyms are transcribed as full words in the phonology of the language heard in the recording (e.g. German BMW = beemwe, English BMW = beemdoubleyou)
- speaker codes need to be partly capitalized to follow their correct pattern
Transcription
Adapted from KiDKo 2014
'Unwanted' material
- 'unwanted' material consists of questions from participants concerning the procedure and possible responses from the elicitor
- first, figure out whether you can exclude this kind of data containing 'unwanted' material and repeat the elicitation
- If this is not possible, mark those passages as:
<Q> communication with elicitor </Q>
- they get an extra-event
Merged forms
- Merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements
- Examples: so=ne (= so eine)
Reduced syllables
- reduced syllables are transcribed as articulated
- Examples: goin (= going), bi tane (= bir tane), hab ich ein Tadel bekommen (= einen Tadel)
Elisions, repetitions and interruption
- Do not leave anything out and do not add anything which is not there!
- Use / to mark unfinished words, e.g. “The bl/ blue car crashes um stops“
- word internal cancellations/corrections are transcribed as follows: dipl: "be$ha$ come" (norm: "become")
- Onomatopoeias/echoisms are separated tokens (e.g. gutschi gutschi gutschi), only transcribed as one single token if they are very short (e.g. eieiei)
Pauses
- always measured to the first decimal
- 0.2 - 1 sec: (-)
- 1 - 3 secs: (--)
- More than 3 secs: (5.5) to be measured
- Word-internal pauses are marked as follows: be(-)have
- keep in mind that there might be persons who talk really slowly (it makes no sense to put a pause after every word/token)
- pauses inside a CU do not get an extra-event on the CU tier
- pauses which occur between two CUs get an extra-event on the CU tier
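The thresholds above can be summarized in a few lines; the sketch below is only a restatement of the rules, not a tool used in the project.

```python
# Pause notation: 0.2-1 s -> (-), 1-3 s -> (--), above 3 s -> measured value,
# always to the first decimal; below 0.2 s nothing is marked.
def pause_marker(seconds):
    if seconds < 0.2:
        return ""
    if seconds <= 1.0:
        return "(-)"
    if seconds <= 3.0:
        return "(--)"
    return f"({seconds:.1f})"

print(pause_marker(0.6), pause_marker(2.0), pause_marker(5.5))   # (-) (--) (5.5)
```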
Long vowels & consonants
- vowels realized longer than normal (0.2 - 2 secs) are marked with : (e.g. so lo:ng)
- vowels realized longer than 2 seconds are marked with :: (e.g. so lo::ng)
- also possible for consonants (e.g. mum:)
- doubling of vowel syllables with % (e.g. by%ye, tschü%üß)
Non-verbal material
- non-verbal events such as a participant laughing or coughing are noted in square brackets on the transcription tier, e.g. [laughing]
- if participants speak and laugh at the same time, you note it as: [[laughing]speech]
List of meta comments used in the RUEG project
- [coughing]
- [gulping]
- [laughing]
- [pfing] for a sound like "pfff"
- [sighing]
- [throatclearing]
- [tongueclicking], including tsking as disapproval, while thinking and just mouth opening with a click
- [whispering]
- [stuttering]
- [imitating], for when they imitate a sound related to the story (e.g., car crash)
- [sniffing]
Uninterpretable material
- uninterpretable material is to be marked as (UNK) on Speaker-tier
- longer than 2secs: (UNK, 2.1)
- assumed content is placed in between brackets, each token separated: (assumed) (content)
- if the uninterpretable material can be identified as belonging to a CU, there is no separate event on the CU tier for it
Hesitation markers / Interjections / Reception markers
- For every language, we define a set of hesitation markers/interjections/reception markers
- create a list with those markers
- If heritage speakers use particles from their ‘other’ language, we transcribe them as they sound, consistent with the procedure for foreign language material
Foreign language material
- Choose a spelling for each item following one of those options:
- transcribe phonographically (e.g. engl. like = germ. leik) OR
- use orthographic spelling of the "other" language
- Create a list where you document the spelling of each item in alphabetic order
- put the file name that includes the word and the time of the appearance in the list
- each time you encounter foreign language material in your data, check the list to guarantee a consistent form for those items
Proper/Brand names from "foreign language"
- Keep conventionalized spelling (e.g. Renault = renault)
- document your decisions, create a list with those items
- Language-specific decisions:
  - Russian: put it in the spelling and script of the actually spoken language to avoid loss of phonetic/morphological/syntactic information
  - Turkish and Greek: use the Latin alphabet and conventionalized spelling
Table of symbols
Symbols | Meaning |
---|---|
<Q> communication with elicitor </Q> | instances of questions concerning the procedure and/or verbal interventions of elicitators |
(-) | pauses up to 1sec |
(--) | pauses 1-3secs |
(3.2) | pauses longer than 3secs |
(UNK) | uninterpretable material |
(UNK, 2.2) | uninterpretable material longer than 2secs |
(assumption) | assumed material |
[...] | non-verbal material |
[[...]...] | non-verbal & verbal event |
: | unusually long vowel or consonant (under 2 secs) |
:: | unusually long vowel or consonant (longer than 2 secs) |
= | merged forms |
/ | interruption of a word |
$...$ | word internal cancellations |
% | doubled syllables |
{...} | specification of an anonymised place |
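The symbol table lends itself to simple consistency checks over the transcription tier. The sketch below is a hypothetical helper, not part of the official tooling; it only looks at one line of text.

```python
import re

# Rough well-formedness checks based on the symbol table above.
def check_line(line):
    problems = []
    if line.count("<Q>") != line.count("</Q>"):
        problems.append("unbalanced <Q>...</Q>")
    if line.count("[") != line.count("]"):
        problems.append("unbalanced square brackets")
    if line.count("$") % 2 != 0:
        problems.append("unpaired $...$ markers")
    if re.search(r"\(-{3,}\)", line):
        problems.append("pause with three or more dashes (should be measured, e.g. (3.2))")
    return problems

print(check_line("and then (---) the car [[laughing]crashed"))
# ['unbalanced square brackets', 'pause with three or more dashes (should be measured, e.g. (3.2))']
```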
Annotation Step 2: Normalization
Segmentations in our corpus
Our corpus pipeline faces four types of input for each speaker who participated in the elicitation:
- formal written (fw) text files
- informal written (iw) chat exports
- formal spoken (fs) transcriptions
- informal spoken (is) transcriptions
All data types are divided into above-word-level ("phrasal") segments:
- fw: sentences indicated by punctuation used by the participant
- iw: messages and lines indicated by separators used by the participant (punctuation, new line, message separation)
- fs and is: communication units (CUs) indicated by transcriber (some projects might use intonation phrases additionally)
We annotate our CUs on the annotation tier "cu" (in lowercase letters). These segments are then automatically tokenized (roughly speaking, divided into character sequences between whitespaces), which provides our first word-level segmentation: the diplomatic layer, consisting of graphemic words represented in the phrasal segments.
This word-level segmentation is then to be normalized to facilitate search and automatic annotation.
What is normalization in our corpus and what is its purpose
The result of normalization is an additional word-level segmentation layer, similar to dipl. Starting from the diplomatic tokenization, explicit rules are applied to obtain orthographically normalized tokens on the norm layer. Starting off as a copy of the diplomatic token layer, the normalized tokens can (but do not necessarily) undergo operations such as editing, deletion, splitting or merging.
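One way to picture these operations is as alignments between spans of dipl tokens and spans of norm tokens; the data structure below is only an illustration (the examples are taken from the normalization guidelines further down), not the representation used in EXMARaLDA or ANNIS.

```python
from dataclasses import dataclass

# Illustration: each record maps a span of diplomatic tokens to a
# (possibly empty) span of normalized tokens.
@dataclass
class NormAlignment:
    dipl: list   # diplomatic tokens (never changed)
    norm: list   # normalized tokens after edit / delete / split / merge

examples = [
    NormAlignment(["cAr"], ["car"]),                     # 1:1 edit
    NormAlignment(["(-)"], []),                          # deletion: pause not represented on norm
    NormAlignment(["isnt"], ["is", "not"]),              # split: clitic separation
    NormAlignment(["Auto", "Unfall"], ["Autounfall"]),   # merge: written compound
]
```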
Underlying principles and things to keep in mind
- Our normalization is not the last step of analysis and does not remove information from the overall corpus
- Normalization takes place mainly on word-level to obtain standardized forms w.r.t. an orthography of reference
- We will agree on an orthography of reference for each language in our corpus
- The normalized layer uses the language's native script, whereas the diplomatic layer sticks to the script used by the participants for the written registers (fw, iw)
- We do not normalize syntax. This way we are able to learn about an underlying grammar more easily, rather than analysing the elicited language in terms of a standard grammar. Note that this comes with disadvantages but is more appropriate for the idea of our projects. We want to obtain a layer that is orthographically normalized but still allows syntactic analyses. This is why we do not normalize syntax and try to keep as many elements as possible.
- Normalization is also a technically necessary process. Orthographically standardized forms help all researchers, including those not familiar with our data, to search for linguistic phenomena. Furthermore, annotations to be integrated in our corpus can be pre-generated automatically, and correction efforts are reduced.
- The highest priority is to ensure that we will always be able to answer our research questions.
- We do not alter lexical choice when normalizing.
- The norm layer will enable us to identify grammatical units.
- A normalization decision is always influenced by the question of how controversial and/or lossy in terms of information that decision might be.
- Since this is our first layer of normalization, we normalize very carefully.
- There will be no grammatical normalization (adaptation of case, gender or other grammatical features that deviate from our understanding of "standard")
- We normalize to a single standardized form (if possible) to reduce noise in the tokenization
- We always normalize as far as we can go, which means we are more careful in some cases than in others. As an example, 're stays 're in the normalization, because the tense is unclear, whereas 'll can be normalized to will, since it is entirely uncontroversial what the diplomatic form expresses.
- When normalizing, we try to be as uninterpretative as possible.
- 1:1 mappings from diplomatic to normalized tokens are easier to decide on, because the alteration of the form is easy to trace.
- The standardized representation should be influenced by the underlying orthography explicitly and implicitly, e. g. for German clitics we do not use an apostrophe in the separated token, whereas we do for English.
- Normalization should not alter meaning (if possible).
Orthography of reference
language | orthography of reference | reference |
---|---|---|
English | American English | Merriam-Webster |
German | „Neue deutsche Rechtschreibung“ | Amtliche Regelung der dt. Rechtschreibung, Duden |
Greek | Standard Modern Greek | Λεξικό της κοινής νεοελληνικής and David Holton, Peter Mackridge, Irene Philippaki-Warburton (1997) Greek: A Comprehensive Grammar of the Modern Language, London: Routledge |
Russian | современный русский литературный язык | Малый академический словарь (МАС) https://rus-academic-dict.slovaronline.com/ |
Turkish | Turkish / Modern Standard Turkish (ISO 639-3 = tur) | Türk Dil Kurumu |
Additional tasks
Annotate CUs for written data
For the written registers (iw, fw), we currently lack a syntactic unit. Furthermore, CUs as segmentation units were, amongst other reasons, chosen due to their applicability for spoken and written data. Having CUs in all types of texts facilitates cross-mode / cross-register comparisons.
To do this in EXMARaLDA, you first need to create the respective annotation tier. After having opened your file in EXMARaLDA, add a new tier through the menu bar (Tier > Add Tier). Choose the following settings:
- Speaker: dipl [dipl]
- Type: A(nnotation)
- Category: cu
Because the tier depends on the diplomatic tokenization, it has one time slot per event on dipl. To annotate a CU, merge all time slots in the CU tier that overlap the tokens (dipl) belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "cu" to the newly created span.
The CU tier does not explicitly contain any text. By overlapping its respective tokens, the entire text of a CU is already defined and can be searched.
Please do not edit the CU layer of the transcriptions. They are final.
Adapt language annotation
For each dipl token there is a language annotation. When preparing your data, this is automatically set to the language of elicitation. Nevertheless, in some cases you might have to adapt that value. Please use the ISO 639-2 language code. A list of valid codes can be found here.
For cancellations, pauses, emojis and non-verbal material, please delete the language annotation.
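If you want to sanity-check the codes you enter, the third-party pycountry package can be used; note that its alpha_3 field follows ISO 639-3, which coincides with the ISO 639-2 terminological codes for the project languages, but bibliographic variants such as ger would need separate handling. This is only a convenience sketch, not part of the annotation workflow.

```python
# Sketch: validate three-letter language codes with pycountry (pip install pycountry).
import pycountry

def is_valid_code(code):
    return pycountry.languages.get(alpha_3=code) is not None

for code in ["deu", "eng", "rus", "tur", "ell"]:
    print(code, is_valid_code(code))
```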
Normalization guidelines
The rules are always applied on the normalized tokenization. The diplomatic tokenization remains as is.
The following lists and examples are planned to be extended and updated with more examples, especially from Greek, Russian and Turkish.
Notation: Normalization rules are formulated as diplomatic token(s) → normalized token(s). Token boundaries are represented with a slash (/).
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comments / Rationale / Criticism |
---|---|---|---|---|---|
(CU-initial) capitalization | do not correct. Only adapt if a word itself has to be capitalized. Do not capitalize merely because of a CU-initial position | am / i / in / london → am / I / in / London | das / auto / konnte / nicht / mehr / bremsen → das / Auto / konnte / nicht / mehr / bremsen |
Note: Rule of thumb: Capitalize, if the word needs to be capitalized in the orthography of reference. Only introduce capitalization if it is the correct spelling independent of whether it appears in an initial position or not. Also, do not undo CU-initial capitalization if provided by writer.
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comments / Rationale / Criticism |
---|---|---|---|---|---|
orthographic errors or phonetic assimilations / slip of the tongue | normalize orthography | orthographie → orthography | hunt → Hund fümf → fünf | ~ | 1 |
orthographic errors or phonetic assimilations / slip of the tongue (within vocabulary) | correct to target hypothesis if meaning of corrected form is much more likely AND the normalization implies a change in category | I / was / their → I / was / there | ihr / seit / dort → ihr / seid / dort Umfall → Unfall | ~ | 2 |
written data: composed forms | refer to above mentioned reference and normalize | inter-national → international caraccident → car / accident | Auto / Unfall → Autounfall |
This is the prototypical case of normalization.
Both corrections are much more likely, since it is very hard to derive any meaning from the uncorrected examples if one assumes they were intended as written. Additionally, since it is a 1:1 normalization in terms of words, the normalization is not particularly invasive. Last but not least, since the elicitations take place in a more or less closed setting, it is much easier to make a judgement on the likelihood of an expression in a case of potential within-vocabulary errors.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
missing words | do not add them, we do not normalize syntax | you / there / ? → you / there / ? | bin / unterwegs → bin / unterwegs |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
repetitions | keep, normalize orthography | the / the / cAr → the / the / car | ein / ein / Autounfall |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
self-corrections | see repetitions / cancellations | the / hatchback / (eh) / car → the / hatchback / car (filled pause deleted, rest remains) | ein / Typ / ein / Fußgänger / hat / dann remains as is, whereas ein / Fußg/ / ein / Typ → ein / ein / Typ, since Fußg/ was explicitly marked as cancelled by the transcriber |
Note: The proposed procedure for self-corrections is the least theory-driven, thus, avoids making any grammatical assumptions when normalizing. Nevertheless, or therefore, it also might be very controversial. It definitely does not facilitate later syntactic annotation.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
spoken data: cancellations | drop the incomplete element | car / acc/ → car | Autounf/ → |
Note: Sometimes it is difficult to tell whether something is a cancellation or not, because the incomplete part might already be a word on its own. In these cases we rely on the transcriber's decision: If something is marked as cancellation in the transcript, we stick to that interpretation. Nevertheless, that should not keep your project from revising transcriptions (given the current state of corpus creation still allows for such revisions).
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
spoken data: phonetic markers | remove and normalize to orthography | a::nd → and | und / da::nn → und / dann |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
reduced forms | if reconstructable, normalize orthographically | ca → car | nich → nicht wolln → wollen |
Note: reduction has to be out of vocabulary. If not, check with the guidelines on within-vocabulary errors above.
Note: There might be (even in the dictionary) widely accepted variants of standardized writings that are more or less the result of reduction (or other processes). Examples are cross instead of across in English, grade instead of gerade in German. We need to focus on the goal of normalization: to have a text segmentation where orthographic variation is reduced / removed, i. e. where multiple expressions are mapped to one standard representation. Thus, as long as there is no argument that normalizing an item alters meaning, these reductions should be normalized. We only deal with mappings from one token to one token, i. e. retrieving the originally chosen expression is very easy and we do not lose information. A counterexample is the case of drauf in German. As a pronominal adverb it is often a reduction of darauf and should be normalized as such. Nevertheless, it can also be the separated particle of the particle verb draufgehen (to die, fig.), and normalizing it to darauf would not be appropriate/"correct". When normalizing, we will most likely encounter much more subtle cases of difference in meaning. We need to discuss and document our decisions.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
colloquial forms, dialect | normalize | coz → because rulz → rules cha → you | nix → nichts ick → ich jetze → jetzt itzt → jetzt |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
merged forms | split into normalized items as long as both items are visibly represented in the merged form | wanna → want / to gonna → going / to | kannste → kannst / du |
Note on the English examples: what / cha / gonna / do → what / you / going / to / do
Note on the German example: kannst / mir / mal / helfen remains as is, since there is no overt material for du without assuming a reduction of kannst, i. e. this is a case of missing material, not a merge (see above)
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
merges / clitics | separate and normalize, if possible | isnt → is / n't withe / ball → with / e / ball he's / guilty → he / 's / guilty | auf=m → auf / m binsch → bin / ich |
Note: We normalize a clitic if there is no ambiguity about the standard form. If the standard form cannot be determined unambiguously, the clitic is represented as a clitic, but in a standardized way (see table below for examples).
English clitics
We always separate the clitics, i. e. divide a single diplomatic token into two norm tokens. See the following list of examples:
Diplomatic | Normalized |
---|---|
n't | not |
'll | will |
's | 's |
're | 're |
isn't or isnt | is / not |
aren't or arent (or even arnt) | are / not |
cannot or can't or cant | can / not |
won't or wont | will / not |
didn't or didnt | did / not |
don't or dont | do / not |
doesn't or doesnt (or even dosnt) | does / not |
ain't or aint | ai / not |
couldn't or couldnt | could / not |
shouldn't or shouldnt | should / not |
wouldn't or wouldnt | would / not |
she'll | she / will |
he's | he / 's |
you're | you / 're |
Thomas' / computer | Thomas / ' / computer |
Anke's / mail or Ankes / mail | Anke / 's / mail |
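Because the separation is purely lookup-based for the unambiguous cases, it can be sketched as a small mapping built from the examples above (the table is illustrative and not exhaustive):

```python
# Sketch: lookup-based clitic separation for English, built from the examples above.
CLITIC_MAP = {
    "isn't": ["is", "not"], "isnt": ["is", "not"],
    "aren't": ["are", "not"], "arent": ["are", "not"], "arnt": ["are", "not"],
    "cannot": ["can", "not"], "can't": ["can", "not"], "cant": ["can", "not"],
    "won't": ["will", "not"], "wont": ["will", "not"],
    "don't": ["do", "not"], "dont": ["do", "not"],
    "she'll": ["she", "will"],
    "he's": ["he", "'s"],      # ambiguous 's stays a clitic
    "you're": ["you", "'re"],  # ambiguous 're stays a clitic
}

def normalize_clitics(token):
    """Return the normalized token sequence for one diplomatic token."""
    return CLITIC_MAP.get(token.lower(), [token])

print(normalize_clitics("isnt"))    # ['is', 'not']
print(normalize_clitics("he's"))    # ['he', "'s"]
```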
German clitics
Separate clitics analogous to English clitics, but do not use an apostrophe.
Greek and Russian clitics
Represent them according to the chosen orthographic standard.
Turkish clitics
Concatenated morphological elements can remain a unit, since this is part of the language. If you prefer to separate them anyway, make sure you work consistently.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
lexicalized merged forms | + do not normalize + work with list? + refer to orthography reference? | Rock'n'Roll → Rock'n'Roll | im → im zum → zum beim → beim ins → ins |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
non-standard variation within a paradigm | normalize | ~ | helf / mir → hilf / mir |
Note: Non-standardized but maybe widely established and accepted forms can be understood as alternative spellings. For alternative spellings our rules guide us to choose a standard and always use that to guarantee consistency. Therefore, normalizing to the true standard form is a good way to go. It is also a safe procedure, since we are dealing with single word to single word mappings, i. e. the original diplomatic form is always retrievable.
Special material / cases
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
speaker codes | do not normalize |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
foreign material / proper names in a foreign language | normalize according to agreement | leik → like |
Note: The transcription of foreign material is always phonographic. The normalization sticks to the conventions of the respective language. The phonographic leik (dipl) might be like (en) on the German norm layer, whereas for a Russian speaker who says Autowerkstatt in German in a Russian context, both the dipl and the norm form will be phonographic representations (аутоверкштат).
Also: Remember to adapt the language annotation for these cases.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
Abbreviations | (merge and) normalize according to standard orthography | pekawe → PKW |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
cardinal numbers | normalize according to regulations in orthography of reference |
Note: In the written data, it is enough to correct the spelling of numeral words; there is no need to convert actual numbers to words or vice versa.
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
floating point numbers | normalize as numbers | one / point / three → 1.3 | eins / Komma / drei → 1,3 |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
times and dates | are a composition of elements for which there are normalization rules to be applied individually |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
fractions | normalize/keep as words | one / third | ein / Drittel |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
pauses, noise, turn-holders, laughing, ... | not represented, i. e. delete token |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
filled pause vs. particle, etc. | If you decide to consider certain filled pauses lexical units, decide on a standardized representation and represent that unit always the same way in the normalization layer. If you are currently undecided on the status of such an element and this is actually part of your research, delete that element on the normalization layer. |
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
interaction/communication with elicitor | not represented | ... | ... | There is a transcription standard for these events. These events will not be normalized; delete these events on the norm layer. |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
incomprehensible, no hypothesis (unknown material) | not represented, i. e. delete token on norm | (UNK) | (UNK) |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
incomprehensible with hypothesis | hypothesis is represented in a normalized version | (car) → car | (jetze) → jetzt |
Phenomenon | What to do on norm level? | English example | German example | Russian example |
---|---|---|---|---|
punctuation | do not add nor normalize, but separate on both dipl and norm | accident.Then / the → accident / . / Then / the | ,dass → , / dass |
NOTE: This is a phenomenon you may face in written data. Since there is no obligation for authors to stick to standard typing conventions and use spaces after punctuation, we treat cases like the above as tokenization errors caused by the pipeline. To remain within a certain frame of complexity, we cannot resolve these issues within automatic tokenization. Please perform the separation during normalization on both dipl and norm. Each element of punctuation should be its own token, unless it is a sequence of punctuation characters without whitespace in between. In these cases the characters remain a unit (e.g. "..." stays like this on dipl and norm).
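A small regex sketch of this separation rule (a hypothetical helper, to be applied only where punctuation is actually fused to a word; apostrophes inside clitics are covered by the clitic rules instead):

```python
import re

# Split punctuation fused to a word into separate tokens, but keep runs of
# punctuation without internal whitespace (e.g. "...") as one token.
SPLIT = re.compile(r"\w+|[^\w\s]+")

def separate_punctuation(token):
    return SPLIT.findall(token)

print(separate_punctuation("accident.Then"))   # ['accident', '.', 'Then']
print(separate_punctuation(",dass"))           # [',', 'dass']
print(separate_punctuation("..."))             # ['...']
```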
Phenomenon | What to do on norm level? | English example | German example | Russian example | Comment |
---|---|---|---|---|---|
multiple representations allowed in orthography of reference | Make a decision, document it and stay consistent. | ~ | Albtraum vs. Alptraum | ~ | This also implies that if a participant uses the alternative form, you should normalize to the variant you chose as a standard in the corpus. |
Grammatical deviations
If there is a mismatch in case, gender or any other grammatical property/feature, do not normalize it. This has to remain open to investigation. You might, of course, find non-trivial cases between orthographic normalization and an actual grammatical target hypothesis. Document these cases and raise a discussion if necessary.
Annotation Step 3: POS and Lemma
Tools involved
- we prepare lemma, pos and feature annotations for you (automatically)
- you will correct these annotations in EXMARaLDA
- you will find the data in the same subfolder of
exb/
as during normalization
If you ever face the following error message when opening an EXMARaLDA file:
Tier ... is not stratified. Please choose a method for stratifying the tier:
Choose Stratify by deletion.
Lemmatization
- binds several occurrences within the corpus to a common type
- facilitates / enables searching with / for lexical items
- lemma: standardized form which might occur in different realizations / variants
- lemma usually (but not necessarily!) determined by paradigms, i. e.: inflected form → prototype = a single inflected form within the paradigm or the uninflected form = lemma
- lemmatization varies across annotation schemes and languages
- lemmatization is linked to part of speech
Part of Speech (POS)
- we would like to categorize several lemmata to more abstract categories
- factors for determining a category can be:
- syntactic / distributional
- semantic
- morphological
- graphical
- ...
Know the guidelines and document difficult cases
- part of speech is the baseline for many further annotations
- we need correct annotations
- we will measure agreement
- highly connected to lemmatization, thus subject to the same restrictions and parameters of variation
Rule of thumb in cases of doubt (Lemma & POS)
Always trust the guidelines more than your “grammatical intuition”, but in cases of doubt consult both.
Language-specific POS vs. UD-POS
- language specific: you might have to find new ways / rules for undescribed phenomena (Please document!)
- UD: strictly stick to UD guidelines for your language and please do not decide by what seems more logical to you
POS – Schemes by Language
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
British National Corpus Part of Speech Tagset | STTS 2.0 | Universal POS tags | MyStem Morphology | MULTILIT |
All languages
Universal Dependencies POS-tags, lemmas, and features
- https://universaldependencies.org/
- https://universaldependencies.org/u/feat/
- https://universaldependencies.org/u/pos/
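Pre-annotations of this kind can be generated with any off-the-shelf UD tool; the sketch below uses stanza purely as an example (it is not necessarily the tagger used in the RUEG pipeline) and feeds it the already tokenized norm layer.

```python
# Sketch: pre-generate UD POS tags, lemmas and features with stanza
# (pip install stanza); shown for English only.
import stanza

stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma",
                      tokenize_pretokenized=True)   # the norm layer is already tokenized

doc = nlp([["the", "car", "crashed", "into", "the", "tree"]])
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos, word.lemma, word.feats)
```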
POS – Who does what?
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
correct BNC-POS, lemma, features (?) | correct STTS-POS, lemma, features (?) | correct UD-POS, UD features, lemma | correct MyStem-POS, lemma, features, and UD-POS | correct MULTILIT-POS, lemma, features (?), and UD-POS (?) |
UD-POS by language
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
derivable | derivable | needs manual correction | needs manual correction (?) | derivable (?) |
Remarks
- delete tier norm [CU]
- feel free to move the tiers up and down
- rename pos_1_1 to pos_lang
- look out for errors in the data and report them immediately
General decisions for all languages:
- lemma "F16" is tagged as a proper noun with the respective tag from the specific tagset you are using (e.g. "PROPN" from Universal Postags)
- all kinds of greetings should be treated as interjections and tagged with the respective tag from the specific tagset you are using (e.g. "INTJ" from Universal Postags), unless there is a specific tag for greetings in the language specific tagset you are using
- regarding the lemmatization of informal greetings: you can just copy the word form from the norm/dipl-layer
Language Specific Decisions
Each language has additional transcription and annotation decisions. These are documented in the following sub-sections. We apologize for any parts of the documentation that are not yet available in English; these will be translated and included in future releases of the corpus.
Transcription Decisions English
Transcription decisions: English
Spelling lists for CU Layer.
1. Compounds with hyphens
- rear-end/rear-ended > rearend/rearended (if it is a verb, but if it is a noun, e.g. "rear end of the car", insert space)
2. Hesitation markers
- ɑ > ah
- ɛ > eh
- ɹ [syllabic] > er
- oʊ > oh
- ə > uh
- əm/ʌm > um
- em
- oh
- ha
- mm
3. Merged forms
- couldn't > couldn=t
- didn't > didn=t
- doesn't > doesn=t
- don't > don=t
- I'd > i=d
- I'm > i=m
- it'd > it=d
- shouldn't > shouldn=t
- that's > that=s
- there's > there=s
- they're > they=re
- Unclear determiner ["a" versus "the"] > (det) TBD
- wasn't > wasn=t
- we'd > we=d
- what's > what=s
- who's > who=s
- you're > you=re
- guy's ball -> guys, guy's (guy is) running -> guy=s
- The following are lexicalized forms that appear in the dictionary: gonna, kinda, wanna.
COMMENT: How do we handle ambiguous cases with determiners, e.g. with a vs. with the? Suggestion: with=e
4. Notations
- [coughing]
- [gulping]
- [laughing]
- [pfing] for a sound like "pfff"
- [tongueclicking], including tsking as disapproval, while thinking and just mouth opening with a click
- [throatclearing]
- [whispering]
- [sighing]
- [sniffing]
- [imitating] - when a person imitates a sound of a crash or other sounds related to the story
- [yawning]
5. Other
- basketball
- ɛf > ef (i.e., case number ef sixteen)
- soccer ball
- renault
- volkswagen
- aysap (for a.s.a.p. 'as soon as possible', if pronounced 'ay-sap', not 'ay ess ay pee')
6. Foreign language material
- hello - preevyet (привет)
- Olya (name), not Olia
English Normalization
Individual acronyms and letters are represented with lowercase letters.
- ef > f
- pe:em > pm
- be:em:doubleyou > bmw
- aysap > asap
Numbers less than twelve are written out, while numbers greater than twelve are represented as numerals.
- ef sixteen > F16 (one token)
- nine | one | one > 911 (one token)
- -_- emoticons to be left as one token
- consecutive punctuation such as "..." are to be left as one token, unless they are already separated on dipl layer.
- twelve > twelve
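The number rule can be stated compactly; the helper below is only an illustration of the threshold (mapping an integer to its normalized written form), not a script from the project.

```python
# Numbers up to twelve are written out; larger numbers become numerals.
SMALL = ["zero", "one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven", "twelve"]

def normalize_number(n):
    return SMALL[n] if 0 <= n <= 12 else str(n)

print(normalize_number(9), normalize_number(16))   # nine 16
```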
Foreign language material
- preevyet (hello) > privet
Abbreviations are to be written out in normalized form, even if they are in the dictionary
- sec > second
- prob > "probably" or "problem", depending on the context
Capitalizations
- I
- Renault, Volkswagen, Golf
Other
- rear end (if noun), rear-end (if adjective or verb) e.g. "rear-end collision", "rear-ended", "bumped into the rear end"
English POS and Lemma
BNC: Tag List
- Might be less compatible with American English spellings
- Greater number of tags for accuracy; the tags are highly specific, though not all are necessary for our purposes (i.e., four categories for punctuation). Regardless, researchers searching for broader categories in the corpus should be able to do so by filtering the data appropriately.
- Intuitive tag names
- Multiple codes for determiners
Decisions
- Hi/Hello/Hey : ITJ (Interjection)
- F16: NP0 (proper noun)
- I : PNP
- am -> be: VBB
- like: ITJ (interjection)
- okay (i.e. 'he is okay'): AJ0
- kind (of): AV0
- e (det): AT0
- same: AJ0
- as: CJS
- (in) front: PRP
- behind: PRP
- Police: NN0
- 911: NP0 (proper noun)
- no (AT0) one (PNI)
Transcription Decisions German
Basics
Transcription program EXMARaLDA:
There must be a blank space after each interval!
Tiers
- Speaker (named after the speaker code, e.g. DEmo09FD; type: transcription)
- Comment (communication among transcribers; type: description; information on this tier is removed later)
- Anonymisation (for marking events that are to be masked with noise; type: description)
Segmentation and accent marking
- Segmentation into Communication Units (CUs)
- No accent marking
- No punctuation marks
- Rule of thumb: independent clauses form a CU; dependent clauses form a CU together with their respective independent clause
Transcription
General
- Linguistic material relating to the task, such as questions by participants and possible interruptions by the elicitor, is marked as follows:
<Q> communication with elicitor </Q>
and gets its own event.
- Strict lower-case spelling
- Largely orthographic transcription according to German spelling rules
- BUT: transcribe as heard in cases that are NOT the general standard of spoken language! Do not omit or add anything.
- this includes e.g.: nich vs. nicht, kein vs. keinen, jetz vs. jetzt, n vs. ein, is vs. ist, ne vs. eine, ma vs. mal, was vs. etwas, brauch vs. braucht, rum vs. herum, ham vs. haben, isch vs. ich
- when the complete ending is dropped (-en, not just schwa), e.g. gesprung, angefang, dein
- dialectal, sociolectal and ethnolectal variation is written down (not overly precisely)
- Standard phenomena of spoken language that are NOT transcribed as heard but according to German orthography are:
- omission of unstressed syllable vowels / schwa deletion: always write orthographically (sehn = sehen, machn = machen); see open questions
- final devoicing (e.g. bald, not balt)
- word-final er (e.g. koffer, not koffa)
- word-final g as ch (e.g. König, not könich)
- ä remains ä (e.g. später, not speta)
- sch in the combinations sp/st is written as sp/st
- simplification of pf to f is not noted
- v and f are used according to the spelling rules
- diphthongs are written according to the spelling rules
- individual words: eigentlich, irgendwie
- the metalanguage is English (everything that is not produced by the speakers, e.g. non-verbal material)
A list of individual spelling decisions can be found here
Omissions/repetitions/interruptions
- Do not leave out anything that is spoken. Do not add anything that is not spoken.
- no apostrophes as omission marks
Onomatopoeia/echoisms:
- separate tokens (e.g. gutschi gutschi gutschi)
- unless they are very short, then as one token (e.g. eieiei)
Sequences, repetitions, interruptions
- transcribe as heard, do not leave anything out
- mark word interruptions with /
- word-internal cancellations are marked by enclosing the cancelled element in dollar signs, e.g. dipl: Kinder$gar$ wagen
Pronunciation variation
Merged forms/reduced forms/omissions
- merged forms (reduction + fast linking) are marked with an equals sign (e.g. is=er, ham=se)
- verb + personal pronoun with = (e.g. hättest=e, is=er)
- others with = (e.g. so=ne)
- preposition + article mergers are not specially marked but written as one word. We do not distinguish between different forms of preposition + article mergers. These include:
  - zum, beim, aufm, mitm, fürs, ins, ans etc.
  - doubled consonants at a syllable joint: vonner, inner, mitter etc.
  - BUT: auser, aufer, zuner etc.
- A list of spellings can be found here
Lengthening
- Orthographic spelling with lengthening h or vowel doubling is retained; the colons (see the following points) are added
- Lengthening with letter combinations: before the lengthening h (e.g. spä::ht), after the diphthong (e.g. polzei::), after the double letter (e.g. see:), after r-vocalization (e.g. über:)
- Longer than normal (0.2-2 sec) with a colon (e.g. so la:ng)
- Noticeably long (2 sec or more) with :: (e.g. so la::ng)
- Doubling of vocalic syllables with % (e.g. tschü%üss); can be combined with lengthening colons (e.g. tschü:%üss); the same applies to diphthongs (e.g. hei%ei)
- Drawn-out and long aspirated consonants are also marked with colons (as for vowels)
Complex compounds
- Generally written as one word (e.g. lehrerzimmer)
- Compounds of noun + adjective (saumüde), two adjectives (supertoll) or adjective + preposition (übergeil) and particle verbs (leidtun, weggehen) are transcribed as one token
- Abbreviations/words with a hyphen are not specially marked (e.g. "mathe emesa prüfung" for Mathe-MSA-Prüfung)
- Coordination of compounds WITHOUT a hyphen (e.g. mittag und abendessen)
Numbers and dates
- Write out in full
- Complex numbers are also written as one word (e.g. zweikommadrei)
- Fractions are written separately (e.g. zwei drittel)
- School grades are written separately (e.g. zwei minus)
- Dates are written separately (e.g. elfter dritter)
Abbreviations / acronyms
- One word (e.g. "beemwe" for BMW)
- Compounds of two abbreviations are not specially marked (e.g. "mercedes eselfka" for Mercedes-S11K)
Proper/brand names
Spelling is retained (e.g. renault)
Pauses
- transcribe on the speaker tier
- pauses between two CUs get their own event
- pauses within a CU are transcribed inside that CU, no separate event
- 0.2-1.0 seconds: (-)
- 1.1-3.0 seconds: (--)
- more than 3 seconds: enter the measured value in the parentheses, e.g. (5.5)
- word-internal pauses are marked the same way, e.g. auf(-)geschlossen (no spaces before or after the parentheses); exception: word-internal pause plus filled pause (ähm) with spaces, e.g. auf (-) äh geschlossen
Extra-linguistic/non-verbal actions
- non-verbal actions such as laughing or coughing are noted on the speaker tier in square brackets, e.g. [laughing]
- simultaneous laughing and speaking as follows: [[laughing]ball]
- CU assignment: as with hesitation markers, a non-verbal action is assigned to the following CU, e.g. "(-) | [laughing] (-) ähm ja das hat mir meine mama schon immer gesagt"
Noises/non-verbal material
- imitated noises are transcribed on the speaker tier (e.g. "dann hör ick nur so wuh")
- interruptions not tied to the speaker (e.g. loud traffic noise, computer beeping) are only noted as a pause; internal note on the comment tier
Unintelligible/hard to understand
- unintelligible material within a CU is transcribed inside the CU in parentheses, i.e. no separate interval is created: (UNK), or (UNK, 2.2) for more than 2 seconds
- if it is not clear which CU the utterance belongs to, a separate event is created
- if the content is assumed, put each lexeme in parentheses separately, e.g. (vermuteter) (Inhalt); assumptions refer to existing lexemes, i.e. if you only hear "ga", it must be marked either as a cancellation inside the parentheses, as "(ga/)", or as (UNK)
Foreign language material
- transcribe as heard in German orthography
- EXCEPT for English material, which is kept in English spelling
- AND: as in the Duden for "foreign language material" that is listed there (e.g. adieu)
- create a list with the spelling of foreign words and their occurrences so that they are always spelled the same way; this list can be found here
Anonymisation
Personal names
- Replace the speaker's name (first and last name) with the speaker code (e.g. DEmo02FD)
- A genitive s is attached to this code (e.g. DEmo02FDs)
- First names of mentioned persons who are not present do not have to be anonymized
- Last names or first plus last names of persons who are not present must be anonymized with speaker code_P (e.g. DEmo02FD_P)
Place references
- only anonymize if the place of residence could be inferred, e.g. when street names are mentioned that could correspond to the speaker's address
- school: {category}, e.g. Alexander-Puschkin-Schule as {schoolname}schule
- street: {streetname}straße
- if the place name carries inflection, e.g. plural, it is attached: {streetname}straßen
- unintelligible personal names without specifying the category
Particles/signals/interjections
Reception signals
- negation: 'hm'hm, nee, 'ä'ä
- affirmation/reception signal: 'hmhm
- reception signal: hm, mhmh
Hesitation signals (hesitation markers)
- fillers: äh, ähm, öh, öhm (even if produced with a glottal stop), hm (even if only m is produced)
Question particles (belong to the preceding CU)
- always: ne, gell
- sometimes: oder, ja, okay
Interjections
A list of all interjections that may occur can be found here and is continuously extended during the transcription process.
Questions / doubts / unclear points
Special characters on the transcription level
Symbols | Meaning |
---|---|
<Q>...</Q> | questions concerning the task |
(-) | pause 0.2-1.0 sec |
(--) | pause 1.1-3.0 sec |
(3.2) | pauses longer than 3 sec |
(UNK) | unintelligible material |
(UNK, 2.2) | unintelligible material longer than 2 sec |
(assumption) | assumed material |
[...] | non-verbal action |
[[...]...] | non-verbal action & verbal material |
: | noticeably long vowels/consonants (0.2-2 sec) |
:: | exceptionally long vowels/consonants (more than 2 sec) |
= | reduced forms with fast linking |
/ | word interruption |
$...$ | word-internal cancellation |
% | syllable doubling |
{...} | specification of an anonymized place |
' | glottal stop |
Footnotes:
Reasons: consistency, agreement, readability
Hyphen on the norm level
Disruptive noises that interrupt the flow of speech: do we need this information later, and where should it go?
Do we proceed the same way with universities? And should a list also be compiled for these places, specifying which abbreviations stand for what?
Only on the norm level
- ach
- ah
- ah ja
- ebend
- ey
- ha (articulated exhaling + sth)
- halt
- ja
- joa
- jo
- na ja
Interjections from the KiDKo guidelines:
- ach
- ah
- aha
- ey, e:y
- grr
- hossa
- ih, i:h
- lol
- na
- oh
- oah
- ohoh
- 'oh'oh
- tja
- uh
- ui
- ja doch
- ach so
- ach ja
- oh nein
Individual spelling decisions
Alphabetical list of individual spelling decisions
- baby (Duden spelling, when "bebi/be:bi" is produced)
- basecap (as in the Duden)
- beemwe
- bissn (when not bisschen but bissen or bissn is produced, e.g. DEbi02FG_fsD: 22.05)
- cap (as in the Duden) BUT käppi
- cops (as in the Duden)
- crash (as in the Duden)
- einskommafünf liter flasche (for 1,5l-Flasche)
- fauwe (for VW)
- genuch (if produced that way)
- ha u (for HU, i.e. the abbreviation of Humboldt-Universität)
- hey (Duden spelling, when "häi" is produced)
- hi (Duden spelling, when "hai" is produced)
- ick (Berlin dialect for ich)
- iks üpsilon straße (for xy straße)
- käppi
- limousine (even when pronounced limusine)
- nix (as in the Duden)
- pekawe (or rather pekawe, because a colon indicates a lengthening over 0.2 sec, which is usually not present in pekawe; if it is, then of course with :)
- pekawe fahrer (for PKW-Fahrer)
- revue passieren (as in the Duden)
- so was
- tschau (spelling recommended by the Duden)
- van (Van as in the Duden) (also minivan)
Alphabetical list of foreign language material (with occurrences)
English expressions are transcribed in English orthography:
- crashen (DEmo12MD_isD / 00:30, Ebi08MG_isD / 00:44), listed in the online Duden (accessed 15.04.2019, 9:57), value on lang: deu
- bystander (DEmo32FD_isD / 01:24), not listed in the online Duden (accessed 15.04.2019, 10:00), value on lang: eng
- dude (DEmo53FD_isD / 00:24), not listed in the online Duden (accessed 15.04.2019, 9:59), value on lang: eng
- plep [pleb, German: Prolo] (DEmo74MD_isD / 00:28), not listed in the online Duden (accessed 24.04.2019, 12:32), value on lang: eng
all others as heard:
- dawei (rus)
- vallah (ara)
other:
- kränk [from krank with English pronunciation] (DEmo71FD_isD / 01:35)
Merged forms
Preposition + article mergers without marking
- anner
- aufer
- aufm
- aufn
- auser
- ausm
- beim
- hinterm
- hinters
- im
- inner
- ins
- mitm
- mitn
- übern
- vonnem
- vonner
- vors
- zum
- zur
Without =, because they are listed in the Duden
- son, sone (for solch, solche), i.e. with a following noun in the plural
Spellings with =
- auch=n (auch ein)
- bist=e
- dass=es
- d=is (das ist)
- d=is=n (das ist ein)
- gab=s
- geht=s (geht es)
- gib=s
- grad=n
- hab=s
- ha=ich (hab ich)
- hat=ter (hat der / hat er)
- hat=n
- hat=s
- hättest=e
- ich=n
- ich=s
- is=er (ist er)
- is=es (ist es)
- is=n
- kennst=e (kennst du)
- muss=er
- noch=n
- nu=ma (nur mal)
- ob=s
- sa=ma (strongly contracted sag mal)
- so=m (so einem)
- so=n (so ein)
- so=ne (so eine)
- und=n
- un=an (und dann, produced entirely without a plosive)
- war=n
- war=s
- weil=et (weil es)
- wie=s
- wird=s (wird es)
- wollt=er (wollte er; written this way because it cannot be decided whether it is wollt er or wollte er)
- zwar=n
German Normalization
Basics
- Orthographic normalization
- Syntax remains as is
- Grammar is not changed, i.e. no normalization of gender or declension: "schön guten tag", "mit ein hund", "so ein klein hund"
- Cancellations and pauses are deleted and get an empty event (on the language level the value for cancellations and pauses is deleted --> empty event); EXCEPTION for word-internal cancellations (dipl: vorbeige$le$ rollt, norm: vorbeigerollt)
- Repetitions remain
- spoken: non-verbal material such as [laughing] is not carried over --> empty event
- Punctuation marks are not added
Hesitation markers / filled pauses
- Hesitation markers are uniformly normalized to "äh"; this covers äh, ähm, öh, hm etc.
dipl | norm |
---|---|
äh, öh, ähm, hm etc. | äh |
No lexical changes
- if the meaning clearly stays the same, define and document a standard, such as:
dipl | norm |
---|---|
aufgrund, auf Grund | aufgrund |
andren, anderen | anderen |
Dicker, Digger | Dicker (29.05.2019) |
etwas, was | etwas |
grad, grade, gerade | gerade |
gern, gerne | gerne |
habe, hab | habe |
hey, hi, hei (as a greeting, not as an exclamation) | hi |
langlaufen | entlanglaufen |
mache, mach (imperative) | mach |
nichts, nix | nichts |
noch mal, nochmal | nochmal (28.05.2019) |
rumspielen | herumspielen |
rum | herum |
runterfallen | herunterfallen |
sodass, so dass (as a conjunction) | sodass |
vorn, vorne | vorne |
- if there is a possible change of meaning or context restrictions, leave the lexemes as they are; the variants remain:
- daran, dran
- darin, drin, drinnen
- drauf, darauf
- sone (as in "sone autos", i.e. only with a noun in the plural), solche
- reinfahren, hereinfahren, hineinfahren
- auffahren, rauffahren, drauffahren
- reinpacken, einpacken, hineinpacken
Foreign language material (FM) and translingual elements
- FM with German inflection, e.g.:
dipl | norm | lang |
---|---|---|
gecrasht | gecrasht | eng/deu |
- material included in the Duden, such as sorry, Van etc., is marked as deu on the language level
Numbers...
- up to twelve (zwölf) are written out
- from 13 on they are written as digits
- in the written texts, keep the variant the participant chose
Individual decisions
dipl | norm |
---|---|
pekawe | PKW |
ef sechzehn | F16 |
-
'nen' zu 'ein' normalisieren in Fällen von Nominativ Maskulin und Nom./Akk. Neutrum, z. B.:
- dipl: "da is nen auto um die ecke gebogen". norm: "da ist ein auto..."
- dipl: "ich hab nen auto gesehen". norm: "ich habe ein auto..."
BUT not in: dipl: "nen ne vollbremsung" to norm: "einen eine vollbremsung"
-
so=n becomes either "so ein" or "so einen", depending on the context:
- dipl: "so=n typ hat mitm ball...". norm: "so ein Typ hat mitm Ball"
- dipl: "die frau hat so=n hund dabei". norm: "die Frau hat so einen Hund dabei"
BUT:
- dipl: "sie hatte so ein hund dabei" stays on norm: "sie hatte so ein Hund dabei"
-
gender gap
- dipl: Fußgänger innen; norm: Fußgänger_innen
Language Values
dipl | norm |
---|---|
deu | deutsch |
eng | englisch |
ara | arabisch |
tur | türkisch |
spa | spanisch |
Written texts
- add a CU layer
-
Punctuation marks:
- none are added and none are corrected, except where a space is missing:
dipl | norm |
---|---|
eingepackt.auf | eingepackt / . / auf |
- several punctuation marks in immediate succession without spaces, e.g. three dots (…), stay in one event:
dipl | norm |
---|---|
... | ... |
- if there are spaces in between, keep them that way, e.g. . / . / . :
dipl | norm |
---|---|
. . . | . / . / . |
Emojis
- emojis such as :) and acronyms such as lol are carried over to norm and kept in one segment
German POS and Lemma
(examples in German)
Model: STTS 2.0 (Westpfahl et al.)
You can find the guidelines here: Westpfahl_Schmidt_Jonietz_Borlinghaus_STTS_2_0_2017.pdf
Decisions for the POS tag:
- F16 as NE
- one-word greetings/farewells such as hi, hallo, tschüss are interjections (NGIRR)
- speaker codes, anonymized street names etc. become proper names (NE)
- names anonymized by the participants themselves, such as "Frau XX" or "XY-Straße", become non-words (XY)
- three dots count as sentence-final punctuation ($.)
- if, because of truncations or other issues, it cannot be decided which POS tag a token should receive, the value is left empty
- conventionalized abbreviations such as "d.h." are tagged as ADV (see guidelines p. 13)
- "also" is SEDM, ADV or KOUS depending on the context:
- "also" as SEDM in the pre-prefield, e.g. "also/SEDM ich heiße..."
- "also" as ADV, mostly for specifications without a verb, e.g. "...eine Familie, also/ADV Frau, Mann, Kind"
- "also" as KOUS, e.g. "also/KOUS ging ich die Straße entlang"
- a few examples
token | POS tag |
---|---|
/bis /später | /APPR /ADJD |
/gegenüber /von | /ADV /APPR |
/gegenüber /dem /Auto | /APPR /ART /NN |
-
Examples
- DEbi56FR_fsD: und der hat natürlich/ADJD auf den Ball reagiert
Decisions for the lemma:
- nominalizations are kept (Verletzte, Folgendes, Fahrer...); following the table below, the smallest common form is then taken for the different forms (Verletzte, Folgende, Fahrer), so that pos and lemma match (e.g. [norm]: "das Spielen", [pos_lang]: NN, [lemma]: Spielen)
- speaker codes stay as they are
- merged article+preposition forms are lemmatized to the preposition: aufm becomes auf, mitm mit, zum zu, im in
- dates become @card@
- reflexives become personal pronouns (sich to er|sie|es), possessives do not; see also the lemma forms table
- ordinal numbers are kept (so that they match [lemma]) and are not turned into cardinal numbers
- for forms that differ by gender we always take the smallest common form (see table); exception: NN denoting persons keep the form that appears on [norm], e.g. "Augenzeugin" stays and does not become "Augenzeug"
- der, die, das always becomes d, regardless of whether it is used as an article or a relative pronoun
- preposition+article contractions (vom, beim, aufm) are always lemmatized to the preposition (vom --> von, beim --> bei, aufm --> auf)
- plural forms become singular forms, i.e. Einkäufe becomes Einkauf on [lemma]
Different forms | lemma |
---|---|
all, alle, alles, aller | all |
andere, anderer, anderes | ander |
eine, einer, ein | ein |
der, die, das | d |
diese, dieser, dieses (attributive demonstrative pronoun) | diese |
dieser, dies, dieses (substituting demonstrative pronoun) | dies |
Folgendes, Folgende, Folgender | Folgende |
jener, jenes, jene | jene |
mein, meiner, meine, meins | mein |
welche, welcher, welches | welch |
vordere, vorderer, vorderes (ADJA) | vordere |
zweit, zweite, zweiter, zweites | zweit |
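The contraction rule and the forms table can be combined into a small lookup for checking tagger output. This is a sketch with a subset of the table entries, not the project's actual lemmatization pipeline:

```python
from typing import Optional

# Contractions are always lemmatized to the bare preposition; the forms below
# share one lemma according to the table above.
CONTRACTION_LEMMAS = {
    "vom": "von", "beim": "bei", "aufm": "auf", "mitm": "mit",
    "zum": "zu", "zur": "zu", "im": "in", "ins": "in",
}
FORM_LEMMAS = {
    "all": "all", "alle": "all", "alles": "all", "aller": "all",
    "der": "d", "die": "d", "das": "d",
    "mein": "mein", "meiner": "mein", "meine": "mein", "meins": "mein",
    "welche": "welch", "welcher": "welch", "welches": "welch",
}

def lemma_for(form: str) -> Optional[str]:
    """Return the documented lemma for a form, or None if a tagger/manual check is needed."""
    token = form.lower()
    return CONTRACTION_LEMMAS.get(token) or FORM_LEMMAS.get(token)

assert lemma_for("aufm") == "auf"
assert lemma_for("das") == "d"
```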
Transcription Decisions Russian
0. General information
-
no capital letters
-
abbreviations/acronyms are transcribed as pronounced, i.e. spelled out as full words (e.g. ДТП = дэтэпэ)
-
lower case for all words, even at the beginning of a sentence ==> exceptions: participant code, participant code + _P and symbols like (UNK) etc.
-
the transcription of the participant's speech shall generally follow the standard orthographic rules of Russian
-
but: if the participant produces utterances or words that are typical neither of standard Russian nor of vernacular spoken Russian (повседневный язык), transcribe them as the participant articulated them
Example from USbi52MR_fsR:
Participant: потому что они два два (-) не видели (-) ==> Standard and vernacular Russian: потому что они друг-друга не видели ==> два два is not typical for the standard or oral vernacular Russian ==> Transcription: потому что они два два (-) не видели (-)
Example:
Participant: мужик играл с футболом ==> standard and vernacular Russian: мужик играл с мячом ==> с футболом is in this context typical neither of standard nor of vernacular Russian ==> Transcription: мужик играл с футболом
-
phenomena that are typical of standard and vernacular spoken Russian and should not be reflected in the transcription:
-
reduced vowels (unless it is a specific dialect of Russian)
Example:
Participant: ана талкает каляску, а мужык играет смячикам ==> Transcription: она толкает коляску, а мужик играет с мячиком
-
so-called phonetic words (= word groups that are articulated as a single word ==> this often concerns a preposition and the following noun)
Example:
Participant: он вышел издому ==> Transcription: он вышел из дому
-
1. Tiers
- two tiers should be used
-
- speaker tier ==> used only for the transcription of the participant's speech; it is labelled with the participant code
-
- comment tier ==> an optional tier used for communication between transcribers; it is deleted after the transcription is finished
-
2. Segmentation
-
NB:
- 1 independent/main clause (главное предложение) = 1 simple sentence (простое предложение)
- 1 independent clause (главное предложение) + 1 or more dependent clauses (придаточное предложение) = complex sentence (сложноподчинённое предложение)
- 1 independent clause (главное предложение) + 1 or more independent clauses (главное предложение) = compound sentence (сложносочинённое предложение)
-
hint: an independent clause can always stand alone; a dependent clause should never stand alone, because without its independent clause the dependent clause wouldn't make sense
-
in addition: a compound sentence can be easily recognized by certain conjunctions, which connect the independent clauses in that type of sentence: these conjunctions are coordinating (соединительный), adversative (противительный) or disjunctive (разделительный) conjunctions (союзы), such as и, но, а, или, либо...либо etc.
-
dependent clauses in complex sentences can be recognized by conjunctions like потому что, когда, что, кто, который, чтобы, так как, но и etc.
-
the participant's speech is segmented into communication units (CUs)
-
1 CU corresponds to 1 simple sentence or to 1 complex sentence; sentences that consist of more than 1 independent clause (= compound sentences) always make up more than 1 CU:
-
Simple sentence
Example from DEbi52FR_isR:
я стала свидетельницей (-) а: (-) столкновение двух машин | ==> 1 CU
-
Complex sentence
Example from DEbi52FR_isR:
виноваты были не машины а: (-) один (-) эм мужчина который (-) ну кот/ ещё более такой (-) молодой | ==> 1 CU
-
Compound sentence
Example from DEbi03FR_isR:
хотела с тобой это поделить | но я была здесь на парковке у реве | ==> 2 CUs
-
-
if a compound sentence includes a subject or verb ellipsis, every elliptical clause is defined as an extra CU
Example:
она вышла из магазина | [subject ellipsis] уронила пакет | и [subject ellipsis] пошла дальше | ==> 3 CUs
-
a discourse marker (ну, ну там, вот, так, как бы, получается, эм, то есть etc.) and the following utterance are treated as 1 CU
Example from DEbi03FR_isR:
ну там я предполагаю там ребёнок внутри был | ==> 1 CU
-
discourse markers (ну, ну там, вот, так, как бы, получается, эм, то есть etc.) that specify the preceding utterance also form 1 CU together with it
Example from DEbi02FR_fsR:
хм их было трое то есть э маленький ребёнок э: женщина и мужчина | ==> 1 CU ("маленький ребёнок э: женщина и мужчина" is a specification of "трое")
-
greetings (привет, здравствуйте, здорово, здрасте etc.) are defined as an extra CU
Example from DEbi03FR_isR:
привет DEbi03FR_P | слушай я сейчас видела здесь такую ситуацию | ==> 2 CUs
-
question tags such as правда?, или?, правильно?, правильно понял?, не так ли? etc. belong to the previous CU
Example:
ты вася пупкин, правильно | ==> 1 CU
-
if you are not sure, make two or more CUs: merging CUs afterwards is possible, whereas splitting a CU afterwards is not possible at all
-
keep in mind that punctuation marks are not used at all ==> that means: no full stops, no commas etc.
3. Anonymization
-
replace the name of the participant with the respective code ==> e.g.: DEbi52FR
-
if whole names or surnames of the participant’s friends are mentioned, replace them with the participant code + _P
Example from DEbi52FR_isR:
привет DEbi52FR_P
-
places that could lead to the identification of the participant should be replaced as follows
Example:
я хожу в Leo-Tolstoi-Schule ==> я хожу в {schoolname}шуле
я живу на улице Шютценштрассе ==> я живу на улице {streetname}штрассе
-
anonymization in Audacity: the name of the participant should be masked with white noise
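The replacement rules above can be illustrated with a small helper. This is a sketch only; all names in the example are invented, and in the actual workflow the replacements are made manually during transcription.

```python
# Sketch of the text-side anonymization rules: participant name -> code,
# friends' names -> code + _P, identifying places -> placeholder + russified suffix.
def anonymize(text, participant_name, participant_code,
              friend_names=(), school_names=(), street_names=()):
    text = text.replace(participant_name, participant_code)
    for name in friend_names:
        text = text.replace(name, participant_code + "_P")
    for name in school_names:
        text = text.replace(name, "{schoolname}шуле")
    for name in street_names:
        text = text.replace(name, "{streetname}штрассе")
    return text

print(anonymize("я хожу в Leo-Tolstoi-Schule", "Иван", "DEbi52FR",
                school_names=["Leo-Tolstoi-Schule"]))
# -> я хожу в {schoolname}шуле
```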
4. Hesitation markers
-
hesitation markers do not form their own event ==> they belong to the CU in question
Example from DEbi52FR_isR:
я когда шла э: на автобусную остановку (-) эм: ==> 1 CU
-
general notation:
- m-hm (confirming) = угу
- ehm = эм or э:м
- hm = хм
- eh/uh = э or э:
- ɑha = ага
- ah = а:
5. Long vowels and consonants
- vowels pronounced longer than normal (under 2 seconds) are marked with a colon ==> e.g.: ну: да
- vowels that are pronounced extremely long (2 seconds and longer) are marked with two colons ==> e.g.: ну:: да
- long pronunciation is also possible for consonants ==> e.g.: тс: тише
- doubled vowels/syllables are marked with % ==> e.g.: ты точно сделал? да%а
6. Pauses
- are transcribed on speaker tier
- a pause between two CUs is marked as its own event ==> the pause gets two boundaries
- pauses within a CU are transcribed inside that CU ==> they do not form their own event
- word-internal pauses are marked inside the word, without a space between the parts of the word ==> e.g.: с э(-)тим мячиком ==> exception: pauses with эм inside a word ==> e.g.: они на (-) эм крыли стол
- general notation:
- 0.2-1 second ==> (-)
- 1-3 seconds ==> (--)
- longer than 3 seconds ==> time should be measured and noted in brackets ==> e.g.: (3.1), (5.5)
- background noise such as traffic noise, phone ringing or computer noise is noted as a pause
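The notation maps directly onto measured durations. A sketch with the thresholds from the list above (durations below 0.2 seconds are assumed not to be marked):

```python
# Choose the pause symbol from a measured duration in seconds.
def pause_symbol(seconds: float) -> str:
    if seconds < 0.2:
        return ""                  # below the threshold: not transcribed as a pause
    if seconds <= 1.0:
        return "(-)"
    if seconds <= 3.0:
        return "(--)"
    return f"({seconds:.1f})"      # longer than 3 s: measured time in brackets

assert pause_symbol(0.5) == "(-)"
assert pause_symbol(2.0) == "(--)"
assert pause_symbol(3.1) == "(3.1)"
```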
7. Merged forms
-
merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements
Example from USbi52MR_fsR:
с одной стороны (-) дороги (-), э, шли муж=женой
8. Reduced syllables
-
general rule: reduced syllables are transcribed in their full form, even if they were articulated differently
Example:
participant: она токо что шла на улице ==> transcription: она только что шла на улице
-
exception: if a word is listed with its reduced syllables in a dictionary (e.g. MAC ==> Link) and the participant articulated the reduced form, then the reduced form of the word should be preferred for the transcription
Example:
participant: здрасте ==> transcription: здрасте (link to the MAC)
-
use / to mark unfinished words
Example from DEbi52FR_isR:
сегодня (-) э когда я шла на авто/ астобв/ (-) а (-) автобусную остановку
9. Numerals and dates
-
numbers should be transcribed as words, since Russian numerals are inflected
Example:
я вижу двух* женщин*
-
dates should also be transcribed as words
Example:
я родился двадцать первого* января тысяча девятьсот девяносто пятого года*
10. Spelling for russified lexicals
-
general rule: foreign words are transcribed into Russian (Cyrillic) as they are articulated
-
in addition: there is a special list for such cases where you can look up such words or add new ones (Link) ==> important: all transcribers have to transcribe these words into Russian identically
- Autowerkstatt = аутоверкштат
- Truck = трак
- Ort = орт
- REWE = реве
- Renault = рено
- also = алзо
- OK_ = окe
- WhatsApp = воцап
- {schoolname}schule = {schoolname}шуле
- {streetname}straße = {streetname}штрассе
-
table for russified lexicals
German/English word | Russified word | Code | File | Second |
---|---|---|---|---|
accident | аксидент | USbi06FR | fsR | 6,94 |
accidentally | аксидальтально | USbi07MR | fsR | 15,1 |
Aldi | алди | DEbi64MR | fsR | 8 |
also | алзо | DEbi56FR | fsR | 36,85 |
Autowerkstatt | аутоверкштат | DEbi51FR | fsR | 93,75 |
bag | бег | USbi59FR | isR | 164,23 |
Ball (mit dem) | болом | DEbi12FR | fsR | 16,9 |
Ball | бол | DEbi12FR | fsR | 19 |
in the back | ин зе бэк | USbi74MR | isR | 53,92 |
ciao | чао | DEbi04MR | isR | 29,4 |
case | кэйз | USbi86FR | fsR | 4,62 |
crash | крэш | DEbi12FR | fsR | 47,78 |
crashed | крэшовали | DEbi15MR | isR | 37,56 |
hey | хей | USbi73FR | isR | 0,64 |
like | лайк | USbi86Fr | fsR | 73,13 |
message | месседж | USbi16FR | isR | 8,23 |
911/nine one one | найн уон уон | USbi59FR | isR | 83,64 |
911/nine eleven | найн элэвэн | USbi73FR | fsR | 59,29 |
OK | окей | USbi05FR | isR | 64,7 |
Ort | орт | DEbi53FR | fsR | 23,19 |
parking lot | паркинг лот | USbi74MR | isR | 7,48 |
Renault | рено | DEbi10MR | isR | 31,71 |
representative | рэпрезэнтэтиф | USbi74MR | fsR | 3,85 |
Rewe | реве | DEbi03FR | isR | 13,9 |
Schützenstraße | Шютценштрассе | DEbi04MR | fsR | 5,14 |
spilled/spilt | сплыть | USbi58FR | fsR | 36,41 |
stopped | стопт | USbi58FR | fsR | 25,22 |
stroller | строллер | USbi79MR | isR | 51,8 |
Truck | трак | USbi52MR | fsR | 77,39 |
turn | торн | USbi74MR | isR | 36,32 |
Vans | вэнс | USbi59FR | isR | 12,59 |
WhatsApp | воцап | USbi52MR | isR | |
-
if a conventionalized spelling in Russian already exists for a foreign word, the conventionalized spelling should be preferred
Example:
Messenger = мессенджер
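To keep the spellings identical across transcribers, the shared list can be read as a simple lookup. The sketch below contains only a few rows from the table above; мессенджер illustrates the preference for conventionalized Russian spellings stated above.

```python
# Lookup sketch for the shared list of russified lexicals (subset of the table above).
RUSSIFIED = {
    "Autowerkstatt": "аутоверкштат",
    "Truck": "трак",
    "Ort": "орт",
    "REWE": "реве",
    "Renault": "рено",
    "also": "алзо",
    "WhatsApp": "воцап",
    "Messenger": "мессенджер",   # conventionalized Russian spelling is preferred
}

def russify(word: str) -> str:
    """Return the agreed Cyrillic spelling, or the word itself if it is not listed yet."""
    return RUSSIFIED.get(word, word)

assert russify("Truck") == "трак"
```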
11. Notations of non-verbal material, uninterpretable material and background noise
-
non-verbal events like laughing or coughing are noted in square brackets on the speaker tier and always belong to the CU in question
-
general notation
- [throatclearing]
- [coughing]
- [laughing]
- [pfing] ==> for a sound like "pff"
- [sighing]
- [sniffing]
- [tongueclicking] ==> includes tsking as disapproval, tsking while thinking, and simply opening the mouth with a click
- [yawning]
- [gulping]
- [whispering]
- [breathing]
-
if the participant speaks and makes a non-verbal event at the same time, it is noted as:
- [[coughing]word]
- [[laughing]word]
- [[sighing]word]
- [[tisking]word]
- [[yawning]word]
- [[gulping]word]
- [[whispering]word]
Example from DEbi52FR_isR:
ты [[laughing]знаешь] что сегодня случилось
-
uninterpretable material is to be marked as (UNK) on speaker-tier
-
if it is not clear to which CU the UNK belongs, create a separate event ==> that means: write the UNK between two boundaries
-
if the UNK is longer than two seconds, measure the time and write the time together with UNK in one bracket ==> e.g.: (UNK, 2.1)
-
assumed content is noted in parentheses, each token separately ==> e.g.: (assumed) (content)
-
background noise such as traffic noise, phone ringing or computer noise should be noted as pauses
12. Table of symbols
Symbol | Meaning |
---|---|
<Q> speech </Q> ==> e.g.: <Q> можно я ещё раз </Q> | for questions to the procedure on the part of the participant or for verbal interventions on the part of the elicitor |
(-) | for pauses 0.2-1 second |
(--) | for pauses 1-3 seconds |
(time) ==> e.g.: (3.1) | for pauses longer than 3 seconds |
(UNK) | for uninterpretable material |
(UNK, time) ==> e.g.: (UNK, 2.1) | for uninterpretable material longer than 2 seconds |
(assumed word) | for assumed material |
[non-verbal action] ==> e.g.: [laughing] | for non-verbal material |
[[non-verbal action]word] ==> e.g.: [[laughing]знаешь] | for a non-verbal & verbal event |
: | for an unusually long vowel or consonant (under 2 seconds) |
:: | for an unusually long vowel or consonant (longer than 2 seconds) |
= | for merged words |
/ | for interruption of a word |
% | for doubled syllables |
{...} ==> e.g.: {schoolname}шуле | for anonymised places |
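The symbol table can also be used to sanity-check transcriptions semi-automatically. The following sketch classifies single tokens with regular expressions; the category names are shorthand introduced here, not part of the guidelines.

```python
import re

# Classify one transcription token according to the symbol table above.
PATTERNS = [
    ("pause",      re.compile(r"^\((-|--|\d+\.\d+)\)$")),
    ("unknown",    re.compile(r"^\(UNK(, ?\d+\.\d+)?\)$")),
    ("nonverbal",  re.compile(r"^\[\[?[a-z]+\]")),   # [laughing], [[laughing]знаешь]
    ("anonymized", re.compile(r"\{[a-z]+\}")),       # {schoolname}шуле
    ("truncated",  re.compile(r".+/$")),             # авто/
    ("merged",     re.compile(r".+=.+")),            # муж=женой
]

def classify(token: str) -> str:
    for label, pattern in PATTERNS:
        if pattern.search(token):
            return label
    return "word"

assert classify("(--)") == "pause"
assert classify("(UNK, 2.1)") == "unknown"
assert classify("муж=женой") == "merged"
```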
Russian Normalization
0. General information
-
the conversion of the participant's speech 1 into a correct orthographic form 2 that corresponds to the applicable linguistic norm of the respective language (in this case Russian) is called normalization
-
during the normalization of the participant's speech, morphosyntactic (grammatical) mistakes are not taken into account / are not changed into their grammatically correct forms
Example:
Participant: | Я калаской видел | ==> Normalization: | Я коляской видел | 3
-
the normalization is carried out manually in the application program EXMARaLDA
1. Structure of normalization in EXMARaLDA
- on the normalization level, EXMARaLDA is structured into the following layers:
- dipl-[dipl]-layer
- norm-[norm]-layer
- dipl-[language]-layer
- dipl-[CU]-layer 4
Layer | Function |
---|---|
dipl-[dipl] | shows the original speech of the participant, which never gets corrected - neither grammatically nor orthographically |
norm-[norm] | here the participant's speech is normalized manually (corrected orthographically - not grammatically) |
dipl-[language] | shows, to which language the concerned word belongs 5 |
dipl-[CU] | here the participant's speech is segmented into communication units (one of the results of the transcription); for written files, whoever normalizes the file first has to add the dipl-[CU]-layer (because it is absent) 6 and then segment the participant's speech into CUs. |
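Since the exb files are XML, the layer structure described above can be inspected programmatically. A minimal sketch, assuming the usual EXMARaLDA basic-transcription format with <tier category="..."> elements containing <event> children; the file name is a placeholder:

```python
import xml.etree.ElementTree as ET

# List which annotation layers (tiers) an exb file contains and how many events each has.
def list_tiers(exb_path):
    root = ET.parse(exb_path).getroot()
    for tier in root.iter("tier"):
        category = tier.get("category")
        events = len(tier.findall("event"))
        print(f"{category}: {events} events")

# list_tiers("DEbi52FR_isR.exb")   # e.g. prints lines such as: dipl: ..., norm: ..., CU: ...
```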
2. The subjects of normalization are ...
- ... files from DEbi---R; USbi---R and RUmo---R with the following symbols at the end:
- _fsR (formal spoken Russian)
- _fwR (formal written Russian)
- _isR (informal spoken Russian)
- _iwR (informal written Russian)
3. Steps of procedure
1. Push/Pull/Fetch in GitHub
2. Open the EXMARaLDA Partitur-Editor
3. File ==> Open ==> rueg repository ==> GitHub ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
4. For spoken files, verify that the CUs on the dipl-[CU]-layer follow the CU guidelines - if not, correct them; for written files, add an extra CU layer and segment the participant's speech into CUs according to the CU guidelines
5. Verify that every single word is assigned the correct language on the dipl-[language]-layer - if not, correct it
6. Normalize according to the orthographic rules of the Малый академический словарь (MAC) ==> transfer the participant's speech from the dipl-[dipl]-layer into its orthographically correct form on the norm-[norm]-layer and delete all phenomena that are not needed for the following annotation levels (lemmatization, POS tagging etc.) ==> which phenomena this concerns in practice can be taken from the table in 4. Normalization guidelines and problems
7. Save your results
8. Go to GitHub ==> submit your file ==> push/pull/fetch
4. Normalization guidelines and problems
Phenomenon/Problem | What to do on norm-[norm]-layer | Example dipl-[dipl]-layer | result on norm-[norm]-layer |
---|---|---|---|
pauses | delete them 7 | привет (-) ты не (--) ты не поверишь что случилось | привет ты не ты не поверишь что случилось |
hesitation markers | delete them 8 | там короче эм шла женщина с коляской и э | там короче шла женщина с коляской и |
phonetic markers | delete them 9 | мальчик играл с мячиком и:: и потом мячик покатился на дорогу | мальчик играл с мячиком и и потом мячик покатился на дорогу |
emojis | they are not removed and each receives its own event | / :-P привет / я / сегодня / увидел / аварию :-( / | / :-P / привет / я / сегодня / увидел / аварию / :-( / |
punctuation marks (in general) | they are not removed and each receives its own event 10 | / мячик / покатился / на / дорогу . / потом / собака / начала / лаять . / | / мячик / покатился / на / дорогу / . / потом / собака / начала / лаять / . / |
punctuation marks (several in immediate succession without spaces between them) | keep them in one event | / это / просто / был / кошмар / !!!!! | / это / просто / был / кошмар / !!!!! / |
punctuation marks (several in immediate succession with spaces between them) | every punctuation mark receives its own event | / это / просто / был / кошмар / ! ! ! ! ! | / это / просто / был / кошмар / ! / ! / ! / ! / ! / |
merges / clitics 11 | separate and normalize, if possible | муж=женой | муж с женой |
repetitions | they are not removed | / потом / мяч / покатился / на / на / дорогу / | / потом / мяч / покатился / на / на / дорогу / |
self-corrections of whole words | they are not removed and each receives its own event | / парень / играл / с / в / мячик / | / парень / играл / с / в / мячик / |
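Most of the deletions in the table are mechanical. The sketch below pre-computes a norm value for a single dipl event under these assumptions: the hesitation-marker list follows section 4 of the transcription guidelines, and lengthening colons are only stripped from Cyrillic tokens so that emojis such as :-P stay untouched. The manual check in EXMARaLDA is still required.

```python
import re

HESITATIONS = {"э", "э:", "эм", "э:м", "хм"}
PAUSE = re.compile(r"^\((-|--|\d+\.\d+)\)$")
LENGTHENING = re.compile(r"^([а-яё]+):{1,2}([а-яё]*)$", re.IGNORECASE)

def normalize_event(dipl_text):
    """Return the norm value for one dipl event, or None if the event is to be emptied."""
    if PAUSE.match(dipl_text) or dipl_text in HESITATIONS:
        return None                      # pauses and hesitation markers are deleted
    lengthened = LENGTHENING.match(dipl_text)
    if lengthened:                       # phonetic lengthening: и:: -> и
        return lengthened.group(1) + lengthened.group(2)
    return dipl_text                     # everything else is kept

assert normalize_event("(-)") is None
assert normalize_event("и::") == "и"
assert normalize_event(":-P") == ":-P"
```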
5. Language values
Each token on the dipl-[dipl]-layer has to be marked with an abbreviation indicating the language it belongs to (language value). In general this is done automatically, but in some cases you have to change the language value - for example for foreign words in the participant's speech on the dipl-[dipl]-layer. If this occurs, please change the language value (abbreviation). If you do not know the abbreviation of the respective language, use the ISO 639-2 language code. A list of valid codes can be found here.
Language | language value on dipl-[language]-layer |
---|---|
Russian | rus |
German | ger |
English | eng |
6. Comments
1 The original speech of the participant, which is never changed (neither during normalization nor during lemmatization or POS tagging), is located on the dipl-[dipl]-layer in EXMARaLDA.
2 The orthographically corrected speech of the participant is located on the norm-[norm]-layer in EXMARaLDA. In most cases the events (gaps) on this layer are already filled in, but the orthographic accuracy of the words should be checked manually each time.
3 This CU is transferred into its orthographically correct form (калаской ==> коляской). In contrast, the grammatical (morphological and syntactic) incorrectness of the CU (the correct government of the verb видеть is видеть кого/что? (Acc.) ==> Я видел коляску, and the more correct word order in this case would be Я вижу калаской) is not taken into account or changed at all.
4 For written files (_fwR, _iwR) a dipl-[CU]-layer has to be added (for spoken files the dipl-[CU]-layer already exists) and the participant's speech from the dipl-[dipl]-layer has to be segmented into CUs on the added dipl-[CU]-layer.
5 Keep in mind that foreign words are transliterated into the Russian alphabet but remain foreign words (even if they are written in Cyrillic). Therefore, foreign words have to be marked on the dipl-[language]-layer with the abbreviation of the language they originally come from (usually ger or eng).
6 You can add a new layer in EXMARaLDA through the menu bar (Tier ==> Add Tier). Then choose the following settings:
- Speaker: dipl [dipl]
- Type: A(nnotation)
- Category: CU
Because it depends on the diplomatic tokenization, the tier has one timeslot per event on the dipl-[dipl]-layer. To annotate a CU, merge all timeslots on the added dipl-[CU]-layer that overlap the tokens on the dipl-[dipl]-layer belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "CU" to the newly created span.
The CU tier does not explicitly contain any text. Because it overlaps its respective tokens, the entire text of a CU is already defined and can be searched.
7 If you delete pauses on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.
8 If you delete hesitation markers on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.
9 If you delete phonetic markers on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.
10 General rule: we do not add or correct any punctuation marks. Only when a punctuation mark is attached to a word do we have to correct it on the norm-[norm]-layer.
11 Clitics are synsemantic/function words (= words without a semantic meaning of their own [==> they therefore carry a grammatical meaning]) that phonologically "fuse" with their immediate "neighbour" words (which are in most cases nouns). That means that clitics (in most cases prepositions) and their immediate neighbour words (nouns) are articulated as one (phonological) word. Depending on their position (before or after the noun to which they belong), clitics are classified as proclitics (if the clitic stands in front of its noun) or enclitics (if the clitic stands behind its noun).
Example:
работать из (=clitic) дому (=noun) ==> articulation: издому, врач на (=clitic) дом (=noun) ==> articulation: надом
Russian POS and Lemma
0. General information
Lemmatization
- the term lemma may be defined as the base form of a word
- the base form of a word is the form you can usually find in a dictionary
- for verbs the base form correlates with the infinitive, for nouns with the nominative and for adjectives with the nominative in its masculine form
- the conversion of a word into its base form is called lemmatization
- the lemmatization is carried out semi-automatically in the application program EXMARaLDA using the two POS and lemma taggers U-POS and MyStem; however, the accuracy of the taggers should be checked manually each time
- the lemmas or base forms of the words can be found in MyStem on the norm[mystem_lex] and in U-POS on the norm[lemma] layer
POS-Tagging
- the term tagging means that each word of the participant is attributed with its part of speech (POS)
- the tagging is carried out in the application program EXMARaLDA by semi-automatic U-POS and MyStem taggers, but the accuracy of the taggers should be checked manually each time
- there are two taggers in EXMARaLDA which assume the task of POS-tagging - U-POS and MyStem
- keep in mind that these two taggers are similar to each other, but not absolutely identical 1
1. Structure of POS-Tagging in EXMARaLDA
U-POS-Layers
- the layers from norm[Animacy] to norm[voice] as well as the norm[lemma] and norm[pos] layers belong to the U-POS tagger
- each layer in U-POS (and MyStem) correlates with a grammatical category
- the meaning of each grammatical category in U-POS is explained in the following table:
Layer | Grammatical category | Grammeme | Part of speech |
---|---|---|---|
norm[Animacy] | Одушевлённость | Одушевлённость (Anim); Неодушевлённость (Inan) | concerns only nouns |
norm[Aspect] | Вид | Cовершенный вид [что сделать?] (Perf); Несовершенный вид [что делать?] (Imp) | concerns only verbs |
norm[Case] | Падеж | им.п. (Nom); род.п. (Gen); дат.п. (Dat); вин.п. (Acc); твор.п. (Ins); предл.п. (Loc); зват.п. (Voc) | concerns all nominal categories of POS |
norm[Degree] | Степень сравнения | положительная (Pos); сравнительная (Cmp); превосходная (Sup) | concerns adjectives and adverbs |
norm[Foreign] | Определение заимствования | (Yes) | concerns all parts of speech |
norm[Gender] | Род | муж.р. (Masc); жен.р. (Fem); сред.р. (Neut) | concerns only nouns, adjectives and pronouns |
norm[Mood] | Наклонение | изъяв.н. (Ind); услов.н. (Cnd); повел.н. (Imp) | concerns only verbs |
norm[Number] | Число | Единственное (Sing); Множественное (Plur) | concerns nouns, adjectives, personal pronouns and verbs |
norm[Person] | Лицо | Первое лицо (1); Второе лицо (2); Третье лицо (3) | concerns personal pronouns and verbs |
norm[Tense] | Время | Настоящее (Pres); Прошедшее (Past); Будущее (Fut) | concerns verbs and participles |
norm[VerbForm] | Форма глагола | Начальная форма (Inf); Спрягаемая форма (Fin) | concerns verbs |
norm[voice] | Залог | Действительный (Act); middle voice (Mid); Страдательный (Pas) | concerns verbs and participles |
norm [lemma] | Начальная форма слова | ------ | concerns all parts of speech |
norm[pos] | Определение части речи данного слова | существительное (NOUN); глагол (VERB); прилагательное (ADJ) ... | concerns all parts of speech |
MyStem-Layers
- the norm[mystem_gr] and norm[mystem_lex] layers belong to the MyStem tagger
- each layer in MyStem (and U-POS) correlates with a grammatical category
- the meaning of each grammatical category in MyStem is explained in the following table:
Layer | Grammatical category | Grammeme | Part of speech |
---|---|---|---|
norm[mystem_gr] | POS determination of the given word | every redundant grammeme gets deleted, except the first grammeme and - if they appear - the grammemes of transitivity (tran/intr) 2 and parenthesis (parenth) | concerns all parts of speech |
norm[mystem_lex] | Base form of a word | should conform with the base form in U-POS | concerns all parts of speech |
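MyStem output of this shape can be obtained, for example, with the pymystem3 package (a Python wrapper around Yandex MyStem). Whether this matches the project's actual tagging pipeline is an assumption of this sketch; in any case the values still have to be checked and pruned manually in EXMARaLDA as described below.

```python
from pymystem3 import Mystem

mystem = Mystem()

def mystem_tags(norm_text):
    """Yield (token, lemma, grammemes) for the analyzable tokens of a normalized CU."""
    for item in mystem.analyze(norm_text):
        analyses = item.get("analysis")
        if not analyses:                      # whitespace, punctuation, unknown tokens
            continue
        best = analyses[0]                    # take the first (most probable) analysis
        yield item["text"], best.get("lex"), best.get("gr")

# for token, lex, gr in mystem_tags("мальчик играл с мячиком"):
#     print(token, lex, gr)                   # e.g. мальчик / мальчик / S,муж,од=им,ед
```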
2. The subjects of lemmatization and POS-Tagging are ...
- ... files from DEbi---R; USbi---R and RUmo---R with following symbols at the end:
- _fsR (formal spoken Russian)
- _fwR (formal written Russian)
- _isR (informal spoken Russian)
- _iwR (informal written Russian)
3. Steps of procedure
1. Push/Pull/Fetch in GitHub
2. Open the EXMARaLDA Partitur-Editor
3. File ==> Open ==> rueg repository ==> GitHub ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
4. Verify that the CUs in every file follow the CU guidelines - if not, correct them
5. Verify that every word is assigned the correct language on the dipl[language]-layer - if not, correct it
6. POS tagging ==> verify the accuracy of the POS tagging software (U-POS and MyStem)
7. Delete all features from the norm[mystem_gr]-layer except the first one and - if available - the features of transitivity, parenthesis and other features which are not redundant with the U-POS features
8. Save your results
9. Go to GitHub ==> submit your file ==> push/pull/fetch
4. Tagging-Guidelines and problems
Phenomenon/Problem | Solution | Example |
---|---|---|
personal pronouns (личные местоимения) | norm[mystem_gr]: SPRO; norm[pos]: PRON | ----- |
ф | all grammemes on the U-POS layers are deleted; dipl[language]: rus; norm[mystem_gr]: S,persn; norm[pos]: PROPN | ф шестнадцать |
его, её, их as possessive pronouns | norm[case]: Gen | Он уронил её пакет |
тут, там, так | norm[mystem_gr]: ADVPRO; norm[pos]: ADV | ----- |
words with the letter ё | ё is written on all layers except dipl ==> norm[norm]: …ё…; norm[lemma]: …ё…; norm[mystem_lex]: …ё… | ----- |
значит | norm[mystem_gr]: ADV, parenth; norm[pos]: INTJ | значит, он уронил всё и пошёл |
короче | norm[mystem_gr]: ADV, parenth; norm[pos]: INTJ | Короче, там шла женщина с коляской |
окей | norm[mystem_gr]: PART; norm[pos]: PART | ----- |
всё (ещё) | norm[mystem_gr]: ADV; norm[pos]: ADV | ----- |
блин | norm[mystem_gr]: INTJ; norm[pos]: INTJ | ----- |
то есть | norm[mystem_gr]: СONJ (то) PART (есть); norm[pos]: СONJ (то) PART (есть) | ----- |
спасибо | norm[mystem_gr]: PART; norm[pos]: PART | ----- |
этот | norm[mystem_gr]: APRO; norm[pos]: DET | ----- |
свой | norm[mystem_gr]: APRO; norm[pos]: PRON | ----- |
привет, здравствуйте | norm[mystem_gr]: INTJ; norm[pos]: INTJ; norm[lemma]: привет, здравствуйте; norm[mystem_lex]: привет, здравствуйте | ----- |
играть | norm[mystem_gr]: tran | мальчик играл с мячом 3 |
5. Comments
1 U-POS and MyStem use partly different features for the POS-tagging of words.
Example: U-POS classifies the Russian personal pronoun я simply as a pronoun (PRON); no further specification is given in this context. MyStem, in contrast, specifies the pronoun further and classifies я as a noun-pronoun (SPRO).
4 In general, reflexive verbs in Russian can be identified by the verb postfix -ся. But not all verbs that end in -ся are reflexive: verbs with a transitive word stem and the postfix -ся are not reflexive verbs but verbs in the passive voice. When in doubt, check the Russian verb by translating it into German. If you can translate the Russian verb with sich..., it is very likely a reflexive verb. If that is not possible and you have to translate the verb with the passive construction wird/werden ... ge-..., it is very likely a transitive verb in its passive form.
Example: Книга читается.
Das Buch liest sich. ==> This translation would not make sense (except in fairy tales), because a book cannot usually read itself.
Das Buch wird gelesen. ==> This translation is more plausible (assuming the context is not a fairy tale), because the word stem is a transitive verb carrying the postfix -ся. The verb therefore expresses the passive and can be understood as: the book is read by someone who is unknown or who is not meant to be mentioned.
Example: Человек развивается.
Der Mensch wird entwickelt. ==> Развивать is a transitive verb, and the postfix -ся could lead to the conclusion that we are dealing with the passive voice. That is possible in principle and, without context, difficult to rule out. Since there is no context here, orient yourself on the general meaning of this sentence, which is the one used most often.
Der Mensch entwickelt sich. ==> This is the usual meaning of the sentence; in this reading the verb has a reflexive rather than a passive meaning. Prefer this reading in cases where the context is missing or unclear.
2 Transitive verbs are verbs that govern direct objects (objects in the accusative without a preposition); there is no preposition between the verb and the accusative object. Only transitive verbs can form the passive voice. The passive voice can be recognized by the word stem of a transitive verb + the postfix -ся.
Example: Мальчик читает книгу. Книга читается мальчиком.
Junge (Nom) liest (tran. verb) Buch (acc. obj. without preposition). Buch (Nom) wird gelesen (passive voice of a tran. verb) vom Jungen (Inst).
Intransitive verbs are verbs that govern indirect objects (objects in the accusative with a preposition or objects in other grammatical cases); a preposition can appear between verb and object. The objects can appear in the accusative with a preposition, in the dative, genitive or instrumental with or without a preposition, and in the locative with a preposition (objects in the locative always take a preposition, which is why the Russian locative is also called the prepositional case). Intransitive verbs cannot form the passive voice.
Example: Папа звонит маме. *Мама звонится папой.
Papa (Nom) ruft (intr. verb) Mama (Dat) an. *Mama wird von Papa angerufen.
3 In this context the verb играть is intransitive, because the Russian preposition с requires the instrumental here. However, there are cases in which играть can be used as a transitive verb.
Example: Вася играет дурака в этом спектакле.
Vasja (Nom) spielt (tran. verb) den Dummen (acc. object without a preposition between verb and object) in diesem Stück (Loc).
Therefore, all verbs that can have a transitive meaning in other contexts have to be marked as transitive on the MyStem layer, even if the verb is used intransitively in the current context! The reason is that a verb which can (in principle) be used transitively is always treated as a verb with a transitive basic meaning, regardless of whether this transitive meaning surfaces in the current situation or not.
6. Useful links
- If you have problems deciding which part of speech the current word belongs to, look the word up in the Национальный корпус русского языка and check their analysis. But keep in mind that they have analyzed the speech of their participants partly under different conditions and assumptions.
- All U-POS features are available here: Universal features part 1 and Universal features part 2
- If you have problems deciding whether the current word is transitive or intransitive, or if you simply do not know which grammatical case a word appears in, use Викисловарь
Transcription Decisions Turkish
Basics
Format
- create a TextGrid in Praat
- import the TextGrid into EXMARaLDA
Tiers
- speaker tier (e.g. TUmo01MT; type: transcription)
- optional tier for segmentation in Intonation Phrases (IP)
- Normalization in EXMARaLDA
Segmentation
- According to Communication Units (CU) Communication_unit__P4_10.12.2018.pdf
- No punctuation
Anonymisation
- Replace the name of the participant with the respective code (e.g. TUmo01MT)
- If whole names or surnames of friends are mentioned, replace them with the participant code + _P (e.g. TUmo02FT_P)
- Places that could lead to the identification of a participant are anonymized (e.g. Atatürk okulunda = Axxx{schoolname} okulunda, Kızılay caddesi = Kxxx{streetname} caddesi)
- if a phone number is mentioned, please anonymize it as {phonenumber}
Transcription
'Unwanted' material (if applicable)
- If this is not possible mark those passages as:
<Q> communication with elicitor </Q>
Merged forms
- Merged forms are transcribed as they are articulated, but with an equal sign linking the merged elements
- Examples from TUmo10MT_isT: n=apıyorsun (= ne yapıyorsun), TUmo11MT_isT: n=aber (= ne haber)
Tag Questions
- tag questions (de mi) do not constitute a separate CU
Reduced syllables
- reduced syllables are transcribed as articulated
- Examples: bi tane (= bir tane), gidiyo (= gidiyorsun), yakıyosun (= yakıyorsun), içbiri (= hiçbiri)
- Use / to mark unfinished words, e.g. “Çarb/ çarptı derken oldu bitti“
Accents and dialects
- pronounced sounds are transcribed as articulated (e.g. gardaşım (= kardeşim)), but sounds which are not typical for Turkish are not represented.
Pauses
- 0.2 - 1 sec: (-)
- 1-3 secs: (--)
- More than 3 secs: (5.5) to be measured
- Word-internal pauses are marked as follows: top(-)la - no space between the parts.
Long vowels & consonants
- vowels pronounced longer than normal (under 2 sec) are marked with : (e.g. canı:m)
- vowels that are pronounced extremely long (2 sec and more) are marked with :: (e.g. canı::m)
- also possible for consonants (e.g. tamam:)
- doubling of vowels/syllables is marked with % (e.g. ba%ay)
Non-verbal material
-
non-verbal events such as a participant laughing or coughing are noted in square brackets on the speaker tier, e.g. [laughing], [whispering], [clears throat], [sighs], [sniffs], [snapsfingers]
-
if participants speak and laugh at the same time, it is noted as: [[laughing]speech]
Uninterpretable material
- uninterpretable material is to be marked as (UNK) on Speaker-tier
- longer than 2secs: (UNK, 2.1)
- assumed content in brackets, each token separated: (assumed) (content)
Hesitation markers / Interjections / Reception markers
- e (short "e"), ee (long "ee"), ı (short "ı"), ııı (long "ııı")
- thinking: "hmm, eem, ımm"
- agreement: "hıhı"
- negation: "ı ıh"
- disappointment: "tüh"
Foreign language material
- original spelling will be kept.
Proper/Brand names
- Keep conventionalized spelling (e.g. Renault = renault)
Numerals
- Numbers are spelled out (e.g. 155 = yüz elli beş)
Table of symbols
Symbols | Meaning |
---|---|
<Q> araştırmacıyla iletişim </Q> | instances of questions concerning the procedure and/or verbal interventions of elicitators |
(-) | pauses 0.2 - 1 sec |
(--) | pauses 1-3secs |
(3.2) | pauses longer than 3secs |
(UNK) | uninterpretable material |
(UNK, 2.2) | uninterpretable material longer than 2secs |
(assumption) | assumed material |
[gülüşmeler/fısıldaşmalar] | non-verbal material |
[[gülüşme]konuşma] | non-verbal & verbal event |
: | unusually long vowel or consonant (under 2 secs) |
:: | unusually long vowel or consonant (longer than 2 secs) |
= | merged forms |
/ | interruption of a word |
% | doubled syllables |
{...} | specification of an anonymised place |
Turkish Normalization
Basics
- Orthographic normalization.
- No changes in grammar (case, tense etc).
- Pauses are deleted on a norm layer.
- Non-verbal material (laughing, coughing) is deleted on a norm level.
Anonymized material
-
Participant codes are deleted on the norm layer.
-
Anonymized material should be left the same way as on a dipl layer (e.g., kxxx{streetname}da).
- if an anonymized token is followed only by an ending, the ending is kept in one box together with the token;
- if an anonymized token is followed by a word, like sokakta, mahallede etc, these words are placed in the next box.
Capitalization
- Proper names are capitalized on a norm layer.
- Names of countries, nations are capitalized (e.g., Almanya, Almanlar);
- Names of days, months are capitalized (e.g., Salı, Ekim);
- Nouns like sokak, cadde, mahalle are capitalized in case they are used together with proper names (e.g., Necatibey Sokağı'nda)
- German nouns are capitalized (e.g., Auto)
- For the capitalization norms of the word "Allah", check TDK Allah (Atasözü, deyim ve birleşik fiiller)
- If a common noun is capitalized although it is not at the beginning of a sentence (e.g., Futbol Topu), it is corrected on the norm layer (e.g., futbol topu).
- if the whole text or parts of it were written in CAPS, they are changed to lower case, except where Turkish standard orthography demands capitalization.
Apostrophe
- Proper names (e.g., Starbucks), country names (e.g., Almanya) and nouns like sokak, cadde, mahalle require an apostrophe when they are used together with proper names and appear in the dative, accusative or locative case (e.g., Necatibey Sokağı'nda);
Dates and Numbers
- Code 'ef on altı'/ 'fe on altı' is represented as F16 (written in one box);
- If a date is written as 15.10.2018, it stays as 15.10.2018 on a norm level;
- But if a date is spelled out (e.g., on beş ekim), the day is represented with numbers, but the month is spelled out and capitalized (e.g., 15 Ekim).
Cancellations, Repairs
- If a canceled/repaired token is a meaningful word, it stays on the norm layer, but without the slash (e.g., çarpma/ çarpıştı --> çarpma çarpıştı -- in separate boxes).
- if a canceled/repaired token is a word, but is not meaningful according to the context, we delete it on a norm layer (e.g., arı/ araba geliyordu --> araba geliyordu);
- if a canceled/repaired token is not a word (e.g., kö/ köpek havladı), we delete it on a norm layer (e.g., kö/ köpek havladı --> köpek havladı).
Assumed material
- if on a diplomatic layer, a word is placed in parentheses as assumed material, we leave the parentheses out and keep only the word itself (e.g., (muhtemelen) --> muhtemelen);
- if the assumed material is not a word (e.g., trafik kasağı), we try to guess the closest meaningful word and normalize it as trafik kazası.
- if the meaning of the assumed material cannot be guessed, we leave it empty on a norm layer.
- orthographic mistakes (especially in written data) are corrected (for example, qma --> ama, cüpike --> köpek).
Foreign Material
- Foreign material is normalized according to the rules of the language it is taken from.
Hesitation Markers
dipl layer | norm layer |
---|---|
e, ee, eee | e |
ı, ıı, ııı | e |
em, hm | e |
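The same collapse can be applied mechanically; a minimal sketch of the table above:

```python
# All Turkish filled-pause variants collapse to "e" on the norm layer.
TURKISH_HESITATIONS = {"e", "ee", "eee", "ı", "ıı", "ııı", "em", "hm"}

def normalize_hesitation_tr(token: str) -> str:
    return "e" if token.lower() in TURKISH_HESITATIONS else token

assert normalize_hesitation_tr("ııı") == "e"
```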
Punctuation
- Each punctuation mark is placed in a separate box.
- A triple-dot punctuation mark (...) is placed in one box.
Material which is not found in TDK (please update the list during the normalization)
dipl layer | norm layer |
---|---|
laylaylom/lay lay lom | laylaylom |
boooah | boah |
off | of |
Long Vowels & Consonants
- For long vowels & consonants (whether they are represented with a colon or with doubled letters), the words are spelled according to the orthographic norms (e.g., ya: --> ya)
Emojis
- Emojis are represented on a norm layer as they are on a dipl layer;
- One emoji is placed into one box.
Dipl[lang] Layer
- The code for Turkish is tur
- if we see UNKNOWN instead of tur on the dipl[lang] layer, we change it to tur manually.
- we delete the language code for pauses, non-verbal material (like laughing, coughing) and punctuation marks.
- we change the language code for foreign material, BUT only when the word has not been adopted into Turkish yet. E.g., hasar is still tur and tişört is tur, but bye bye is eng.
code | language |
---|---|
tur | Turkish |
deu | German |
eng | English |
ara | Arabic |
kur | Kurdish |
Dipl[line] and dipl[message] Layers
- disregard these layers.