Turkish Normalization
Basics
- Orthographic normalization.
- No changes in grammar (case, tense, etc.).
- Pauses are deleted on a norm layer.
- Non-verbal material (laughing, coughing) is deleted on a norm level.
Anonymized material
- Participants' codes are deleted on a norm layer.
- Anonymized material should be left the same way as on a dipl layer (e.g., kxxx{streetname}da).
- if an anonymized token is followed by an ending only, the ending stays in the same box as the token;
- if an anonymized token is followed by a word, like sokakta, mahallede etc., that word is placed in the next box.
Capitalization
- Proper names are capitalized on a norm layer.
- Names of countries, nations are capitalized (e.g., Almanya, Almanlar);
- Names of days, months are capitalized (e.g., Salı, Ekim);
- Nouns like sokak, cadde, mahalle are capitalized when they are used together with proper names (e.g., Necatibey Sokağı'nda);
- German nouns are capitalized (e.g., Auto)
- For the capitalization norms of the word "Allah", check the TDK entry for Allah (Atasözü, deyim ve birleşik fiiller);
- If a common noun is capitalized although it is not at the beginning of a sentence (e.g., Futbol Topu), it is lowercased on a norm level (e.g., futbol topu);
- If the whole text or some of its parts were written in CAPS, they are changed to lower case except where Turkish standard orthography demands capitalization.
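
The CAPS rule can be sketched in Python. This is only an illustration, not part of the guidelines: the helper name `tr_lower` is hypothetical, and it assumes the text is Turkish (tur) material. It exists because plain `str.lower()` does not follow the Turkish I/ı and İ/i pairs. Restoring capitals where Turkish orthography demands them (proper names, days, months) is still a manual step.

```python
def tr_lower(text: str) -> str:
    """Lowercase Turkish text, respecting the dotted and dotless i.

    Hypothetical helper: apply it only to tokens tagged as Turkish,
    since German or English "I" must not become "ı".
    """
    # Map the two Turkish capital i letters by hand first. Plain
    # str.lower() would turn "I" into "i" and "İ" into "i" + U+0307.
    return text.replace("I", "ı").replace("İ", "i").lower()


print(tr_lower("KAPIYI AÇTIM"))
print(tr_lower("İSTANBUL"))
```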
Apostrophe
- Proper names (e.g., Starbucks), country names (e.g., Almanya) and nouns like sokak, cadde, mahalle that are used together with proper names require an apostrophe when they take a dative, accusative, or locative case ending (e.g., Necatibey Sokağı'nda);
Dates and Numbers
- Code 'ef on altı'/ 'fe on altı' is represented as F16 (written in one box);
- If a date is written as 15.10.2018, it stays as 15.10.2018 on a norm level;
- But if a date is spelled out (e.g., on beş ekim), the day is represented with numbers, but the month is spelled out and capitalized (e.g., 15 Ekim).
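
As a rough sketch of the date rules above, the following Python function turns a spelled-out day and month into the norm form and leaves everything else, including numeric dates, unchanged. The function names and word lists are assumptions made for this example; it expects lowercase input and covers only days 1 to 31.

```python
UNITS = {"bir": 1, "iki": 2, "üç": 3, "dört": 4, "beş": 5,
         "altı": 6, "yedi": 7, "sekiz": 8, "dokuz": 9}
TENS = {"on": 10, "yirmi": 20, "otuz": 30}
MONTHS = {"ocak", "şubat", "mart", "nisan", "mayıs", "haziran",
          "temmuz", "ağustos", "eylül", "ekim", "kasım", "aralık"}


def parse_day(words):
    """Return the day number for one or two number words, else None."""
    if len(words) == 1:
        return TENS.get(words[0]) or UNITS.get(words[0])
    if len(words) == 2 and words[0] in TENS and words[1] in UNITS:
        return TENS[words[0]] + UNITS[words[1]]
    return None


def normalize_spelled_date(text: str) -> str:
    """'on beş ekim' becomes '15 Ekim'; other input is returned as is."""
    words = text.split()
    if len(words) < 2 or words[-1] not in MONTHS:
        return text  # e.g. 15.10.2018 stays 15.10.2018
    day = parse_day(words[:-1])
    if day is None or not 1 <= day <= 31:
        return text
    month = words[-1]
    # No month name starts with i or ı, so upper() is safe here.
    return f"{day} {month[0].upper()}{month[1:]}"


print(normalize_spelled_date("on beş ekim"))
```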
Cancellations, Repairs
- If a canceled/repaired token is a meaningful word, then it stays on a norm level, but without a slash (e.g., çarpma/ çarpıştı --> çarpma çarpıştı -- in separate boxes).
- if a canceled/repaired token is a word, but is not meaningful according to the context, we delete it on a norm layer (e.g., arı/ araba geliyordu --> araba geliyordu);
- if a canceled/repaired token is not a word (e.g., kö/ köpek havladı), we delete it on a norm layer (e.g., kö/ köpek havladı --> köpek havladı).
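
The three cases above can be illustrated with a small Python sketch. Whether a cancelled token is a meaningful word in its context is the annotator's judgment, so the hypothetical function `normalize_repairs` takes that decision as input (the `meaningful` set) rather than trying to compute it.

```python
def normalize_repairs(tokens, meaningful=frozenset()):
    """Build the norm-layer tokens from dipl-layer tokens.

    tokens:     dipl-layer tokens; cancelled ones end with a slash.
    meaningful: cancelled words the annotator judged to be meaningful
                in context (an assumption of this sketch).
    """
    norm = []
    for token in tokens:
        if token.endswith("/"):
            word = token[:-1]
            if word in meaningful:
                norm.append(word)  # kept, but without the slash
            # otherwise the cancelled token is deleted
        else:
            norm.append(token)
    return norm


print(normalize_repairs(["çarpma/", "çarpıştı"], meaningful={"çarpma"}))
print(normalize_repairs(["kö/", "köpek", "havladı"]))
```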
Assumed material
- if on a diplomatic layer, a word is placed in parentheses as assumed material, we leave the parentheses out and keep only the word itself (e.g., (muhtemelen) --> muhtemelen);
- if the assumed material is not a word, we try to guess the closest meaningful word and normalize to it (e.g., trafik kasağı --> trafik kazası);
- if the meaning of the assumed material cannot be guessed, we leave it empty on a norm layer.
- orthographic mistakes (especially in written data) are corrected (e.g., qma --> ama, cüpike --> köpek).
Foreign Material
- Foreign material is normalized according to the rules of the language it is taken from.
Hesitation Markers
dipl layer | norm layer |
---|---|
e, ee, eee | e |
ı, ıı, ııı | e |
em, hm | e |
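
The table above is a plain lookup, which could be written in Python as follows. The names are hypothetical, and the sketch assumes it is applied only to tokens that were already identified as hesitation markers; forms that are not listed are returned unchanged.

```python
# dipl-layer form -> norm-layer form, copied from the table above
HESITATION_NORM = {
    "e": "e", "ee": "e", "eee": "e",
    "ı": "e", "ıı": "e", "ııı": "e",
    "em": "e", "hm": "e",
}


def normalize_hesitation(token: str) -> str:
    """Return the norm-layer form of a hesitation marker."""
    return HESITATION_NORM.get(token, token)


print(normalize_hesitation("ııı"))
```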
Punctuation
- Each punctuation mark is placed in a separate box.
- A triple-dot punctuation mark (...) is placed in one box.
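
A minimal sketch of this box splitting in Python, using a regular expression. Keeping a word-internal apostrophe inside the word box (as in Sokağı'nda) is an assumption of this example, not something the guidelines state; the function name `split_boxes` is hypothetical.

```python
import re

# Order matters: try the triple dot first, then words (optionally with
# an internal apostrophe), then any single remaining punctuation mark.
BOX_PATTERN = re.compile(r"\.\.\.|\w+(?:'\w+)*|[^\w\s]")


def split_boxes(text: str) -> list[str]:
    """Split text into boxes: one per word and one per punctuation mark."""
    return BOX_PATTERN.findall(text)


print(split_boxes("Geldi... ama, gitti."))
```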
Material which is not found in TDK (please update the list during the normalization)
dipl layer | norm layer |
---|---|
laylaylom/lay lay lom | laylaylom |
boooah | boah |
off | of |
Long Vowels & Consonants
- Words with lengthened vowels or consonants (whether the length is marked with a colon or by doubling the letter) are spelled according to the orthographic norms (e.g., ya: --> ya)
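
Only the colon notation lends itself to a mechanical rule; doubled letters need a dictionary check, because words like saat legitimately contain two identical vowels. A hedged Python sketch for the colon case, with a hypothetical function name:

```python
import re


def strip_length_colon(token: str) -> str:
    """Remove colons that mark lengthening after a letter (ya: -> ya).

    A colon that stands alone in its box is punctuation and is kept.
    """
    return re.sub(r"(?<=\w):+", "", token)


print(strip_length_colon("ya:"))
```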
Emojis
- Emojis are represented on a norm layer as they are on a dipl layer;
- One emoji is placed into one box.
Dipl[lang] Layer
- The code of Turkish is tur
- if we see UNKNOWN instead of tur on a dipl[lang] layer, we change it into tur manually.
- we delete the language code in case of pauses, non-verbal material (like laughing, coughing), punctuation marks.
- we change the language code for foreign material, but only when the word has not yet been adopted into Turkish. E.g., hasar and tişört are still tur, but bye bye is eng.
code | language |
---|---|
tur | Turkish |
deu | German |
eng | English |
ara | Arabic |
kur | Kurdish |
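
The mechanical part of these rules can be sketched in Python. The token kinds ("word", "pause", "nonverbal", "punctuation") are labels invented for this example, and the sketch assumes that a token tagged UNKNOWN is in fact Turkish. Deciding whether a foreign word has already been adopted into Turkish remains a manual decision.

```python
# Token kinds that carry no language code (labels are hypothetical)
NO_CODE_KINDS = {"pause", "nonverbal", "punctuation"}


def fix_lang_code(code: str, kind: str = "word") -> str:
    """Return the dipl[lang] value for one token."""
    if kind in NO_CODE_KINDS:
        return ""  # the language code is deleted
    if code == "UNKNOWN":
        return "tur"  # assumption: the token is Turkish
    return code  # tur, deu, eng, ara, kur stay as annotated


print(fix_lang_code("UNKNOWN"))
```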
Dipl[line] and dipl[message] Layers
- disregard these layers.