Russian Normalization

0. General information

  • normalization is the conversion of the participant's speech 1 into an orthographically correct form 2 that conforms to the standard norm of the respective language (here: Russian)

  • morphosyntactic (grammatical) mistakes in the participant's speech are left as they are during normalization; they are not changed into their grammatically correct forms

    Example:

    Participant: | Я калаской видел | ==> Normalization: | Я коляской видел | 3

  • normalization is carried out manually in the application EXMARaLDA

1. Structure of normalization in EXMARaLDA

  • at the normalization level, an EXMARaLDA file is structured in the following layers:
    • dipl-[dipl]-layer
    • norm-[norm]-layer
    • dipl-[language]-layer
    • dipl-[CU]-layer 4
| Layer | Function |
|---|---|
| dipl-[dipl] | shows the original speech of the participant, which is never corrected, neither grammatically nor orthographically |
| norm-[norm] | here the speech of the participant is normalized manually (corrected orthographically, not grammatically) |
| dipl-[language] | shows which language the respective word or emoji belongs to 5 |
| dipl-[CU] | here the speech of the participant is segmented into communication units (CUs), which is one of the results of the transcription; in written files this layer is absent, so the person normalizing the file first has to add the dipl-[CU]-layer 6 and then segment the participant's speech into CUs manually |
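
The layer overview above can be checked mechanically before a file is opened in the editor. The following is a minimal sketch, assuming that each layer is stored as a `tier` element whose `speaker` and `category` attributes carry the two parts of the layer name (for example `dipl` and `CU`); the sample XML is hypothetical and heavily shortened, and real files may use different speaker IDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened .exb content. Real files also contain
# metadata, a timeline and the events of each tier.
SAMPLE_EXB = """<basic-transcription>
  <basic-body>
    <tier id="TIE0" speaker="dipl" category="dipl" type="t" display-name="dipl [dipl]"/>
    <tier id="TIE1" speaker="norm" category="norm" type="a" display-name="norm [norm]"/>
    <tier id="TIE2" speaker="dipl" category="language" type="a" display-name="dipl [language]"/>
  </basic-body>
</basic-transcription>"""

# The four layers named in this section, as (speaker, category) pairs.
# Assumption: the speaker attribute holds "dipl"/"norm" directly.
EXPECTED_LAYERS = [
    ("dipl", "dipl"),
    ("norm", "norm"),
    ("dipl", "language"),
    ("dipl", "CU"),
]


def missing_layers(exb_xml):
    """Return the expected layers that have no matching tier element."""
    root = ET.fromstring(exb_xml)
    present = {(t.get("speaker"), t.get("category")) for t in root.iter("tier")}
    return [layer for layer in EXPECTED_LAYERS if layer not in present]


print(missing_layers(SAMPLE_EXB))  # [('dipl', 'CU')]
```

For a written file the result would typically be the CU layer, which then has to be added by hand as described in comment 6.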

2. The subjects of normalization are ...

  • ... files from DEbi---R, USbi---R and RUmo---R whose names end in one of the following suffixes:
    • _fsR (formal spoken Russian)
    • _fwR (formal written Russian)
    • _isR (informal spoken Russian)
    • _iwR (informal written Russian)
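
As an illustration of how these suffixes can be read off a file name, here is a small sketch. The file name used in the example and the `.exb` extension in the pattern are assumptions; only the four suffixes themselves come from the list above.

```python
import re

# Matches the four suffixes from the list above at the end of a file name,
# optionally followed by ".exb" (the extension is an assumption).
SUFFIX_RE = re.compile(r"_(?P<formality>[fi])(?P<mode>[sw])R(?:\.exb)?$")


def classify(filename):
    """Decode the suffix of a file name, or return None if it has none."""
    match = SUFFIX_RE.search(filename)
    if match is None:
        return None
    return {
        "formality": "formal" if match["formality"] == "f" else "informal",
        "mode": "spoken" if match["mode"] == "s" else "written",
        # Written files have no dipl-[CU]-layer yet (see section 1).
        "needs_cu_layer": match["mode"] == "w",
    }


# "DEbi01FR_iwR.exb" is a hypothetical file name used only for illustration.
print(classify("DEbi01FR_iwR.exb"))
```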

3. Steps of procedure

    1. step: Push/pull/fetch in GitHub
    2. step: Open the EXMARaLDA Partitur-Editor
    3. step: File ==> Open ==> rueg repository ==> GitHub ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
    4. step: In spoken files, verify that the CUs on the dipl-[CU]-layer conform to the CU guidelines and correct them if they do not; in written files, add a CU layer first and then segment the participant's speech into CUs according to the CU guidelines
    5. step: Verify that every single word is assigned to the right language on the dipl-[language]-layer and correct the value if it is not
    6. step: Normalize according to the orthographic rules of the Малый академический словарь (MAC) ==> transfer the participant's speech from the dipl-[dipl]-layer to the norm-[norm]-layer in its orthographically correct form and delete all phenomena that are not needed for the following annotation levels (lemmatization, POS tagging etc.) ==> which phenomena this concerns in practice is listed in the table in 4. Normalization guidelines and problems
    7. step: Save your results
    8. step: Go to GitHub ==> submit your file ==> push/pull/fetch

4. Normalization guidelines and problems

| Phenomenon/Problem | What to do on norm-[norm]-layer | Example on dipl-[dipl]-layer | Result on norm-[norm]-layer |
|---|---|---|---|
| pauses | delete them 7 | привет (-) ты не (--) ты не поверишь что случилось | привет ты не ты не поверишь что случилось |
| hesitation markers | delete them 8 | там короче эм шла женщина с коляской и э | там короче шла женщина с коляской и |
| phonetic markers | delete them 9 | мальчик играл с мячиком и:: и потом мячик покатился на дорогу | мальчик играл с мячиком и и потом мячик покатился на дорогу |
| emojis | they don't get removed and receive their own event | / :-P привет / я / сегодня / увидел / аварию :-( / | / :-P / привет/ я / сегодня / увидел / аварию / :-( / |
| punctuation marks (in general) | they don't get removed and receive their own event 10 | / мячик / покатился / на / дорогу . / потом / собака / начала / лаять . / | / мячик / покатился / на / дорогу /. / потом / собака / начала / лаять / . / |
| punctuation marks (many in immediate succession without spaces between them) | keep them in one event | /это / просто / был / кошмар / !!!!! / | /это / просто / был / кошмар / !!!!! / |
| punctuation marks (many in immediate succession with spaces between them) | every punctuation mark receives its own event | /это / просто / был / кошмар / ! ! ! ! ! / | /это / просто / был / кошмар / ! / ! / ! / ! / ! / |
| merges / clitics 11 | separate and normalize, if possible | муж=женой | муж с женой |
| repetitions | they don't get removed | / потом / мяч / покатился /на /на/ дорогу / | / потом / мяч / покатился /на / на / дорогу / |
| self-correction in whole words | they don't get removed and each of them receives its own event | / парень / играл / с/ в /мячик / | / парень / играл / с / в /мячик / |
| unknown material | delete it 12 | UNK | |
| interaction/communication with the elicitor | delete it 13 | <Q> можно я ещё раз </Q> | |
| interrupted speech | delete it 14 | пошёл чтобы пс/ посмотреть | пошёл чтобы посмотреть |
| non-verbal material | delete it 15 | первая машина [tongueclicking] тормозила | первая машина тормозила |
| foreign material | normalize according to agreement and adapt the language from which the word originally comes | | |
| capitalization (in written data ==> iw, fw) | do not correct it at the beginning of a new sentence, but correct it in the middle of a sentence if it doesn't conform to the Russian orthographic rules | / . / В / семье / был / Муж / , / Жена / и / их / ребёнок / . / | / . / В / семье / был / муж / , / жена / и / их / ребёнок / . / |
| Abbreviations | (merge and) normalize according to standard orthography | 100 км/ч | сто / км/ч |
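
The "delete" rows of the table can be pictured as simple string operations. The sketch below is only an illustration and does not replace manual normalization in EXMARaLDA: it works on plain strings, it does not touch the dipl-[language]-layer, and the marker shapes (`(-)`, `::`, `[...]`, `<Q>...</Q>`, a trailing `/` for interrupted words) as well as the hesitation list are assumptions read off the examples above.

```python
import re

# Assumed list of hesitation markers; only these two occur in the table.
HESITATIONS = {"эм", "э"}


def strip_deletable_material(dipl):
    """Remove the material that the table marks as 'delete' (spoken data)."""
    text = re.sub(r"<Q>.*?</Q>", " ", dipl, flags=re.DOTALL)  # elicitor talk
    text = re.sub(r"\[[^\]]*\]", " ", text)                    # non-verbal
    text = re.sub(r"\(-+\)", " ", text)                        # pauses
    text = re.sub(r"(?<=\w):+", "", text)                      # lengthening
    kept = []
    for token in text.split():
        if token in HESITATIONS or token == "UNK":
            continue  # hesitation marker or unknown material
        if len(token) > 1 and token.endswith("/"):
            continue  # interrupted word such as "пс/"
        kept.append(token)
    return " ".join(kept)


print(strip_deletable_material("пошёл чтобы пс/ посмотреть"))
# пошёл чтобы посмотреть
```

Note that the lengthening rule strips every colon that directly follows a letter, so a sketch like this would only be suitable for spoken transcripts, where a colon is not used as punctuation.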

5. Language values

Each token on the dipl-[dipl]-layer has to be marked with an abbreviation that indicates which language it belongs to (its language value). In general this is done automatically, but in some cases you have to change the language value, for example when the participant's speech on the dipl-[dipl]-layer contains foreign words. If this occurs, please change the language value (abbreviation). If you don't know the abbreviation of the respective language, use the ISO 639-2 language code. A list of valid codes can be found here.

| Language | Language value on dipl-[language]-layer |
|---|---|
| Russian | rus |
| German | ger |
| English | eng |
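
A quick consistency check of the language values could look like the sketch below. It assumes that tokens and language values are available as two parallel lists, which is a simplification of the tier structure, and it only tests the shape of a code (three lowercase Latin letters), not whether the code really exists in ISO 639-2.

```python
import re

# Values from the table above.
KNOWN_VALUES = {"Russian": "rus", "German": "ger", "English": "eng"}
CODE_SHAPE = re.compile(r"[a-z]{3}")


def check_language_values(tokens, values):
    """Return a list of problems; an empty list means nothing was found."""
    problems = []
    if len(tokens) != len(values):
        problems.append(f"{len(tokens)} tokens but {len(values)} language values")
    for token, value in zip(tokens, values):
        if not CODE_SHAPE.fullmatch(value):
            problems.append(f"{token!r}: {value!r} is not shaped like an ISO 639-2 code")
        elif value not in KNOWN_VALUES.values():
            problems.append(f"{token!r}: {value!r} is not in the table, check the ISO 639-2 list")
    return problems


# Hypothetical utterance with one German word.
print(check_language_values(["я", "видел", "Kinderwagen"], ["rus", "rus", "ger"]))  # []
```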

6. Additional examples

Common short or colloquial forms that are acceptable according to Малый академический словарь or other academic dictionaries shown on https://rus-academic-dict.slovaronline.com:

| normalized | short or colloquial form |
|---|---|
| этот etc. | тот, того, та, той, то, те, тех |
| здравствуйте | здрасте |
| сейчас/час | щас |
| привет | здарова/здорова |

Common short or colloquial forms that are not acceptable according to Малый академический словарь:

| normalized | short or colloquial form |
|---|---|
| только | тока |

7. Comments

1 The original speech of the participant is located on the dipl-[dipl]-layer in EXMARaLDA. It is never changed, neither during normalization nor during lemmatization or POS tagging.

2 The orthographically corrected speech of the participant is located on the norm-[norm]-layer in EXMARaLDA. In most cases the events (gaps) on this layer are already filled in, but the orthography of every word should still be checked manually each time.

3 This CU is converted into its orthographically correct form (калаской ==> коляской). Its grammatical (morphological and syntactic) errors, by contrast, are not changed at all: the verb видеть governs the accusative (видеть кого/что? ==> Я видел коляску), and the more correct word order in this case would be Я вижу калаской.

4 In written files (_fwR, _iwR) a dipl-[CU]-layer has to be added (in spoken files it already exists), and the participant's speech from the dipl-[dipl]-layer has to be segmented into CUs on the added dipl-[CU]-layer.

5 Keep in mind that foreign words or emojis (e.g. ХД) may be written in the Russian alphabet but still remain foreign words. They therefore have to be marked on the dipl-[language]-layer with the abbreviation of the language they originally come from (usually ger or eng).

6 You can add a new layer in EXMARaLDA through the menu bar (Tier ==> Add Tier). Then choose the following settings:

  • Speaker: dipl [dipl]
  • Type: A(nnotation)
  • Category: CU

Because the tier depends on the diplomatic tokenization, it has one timeslot per event on the dipl-[dipl]-layer. To annotate a CU, merge all timeslots on the added dipl-[CU]-layer that overlap the tokens on the dipl-[dipl]-layer belonging to the CU (including sentence-final punctuation). Finally, assign the simple annotation value "CU" to the newly created span.

The CU tier does not explicitly contain any text. By overlapping its respective tokens, the entire text of a CU is already defined and can be searched.
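
The idea that a CU carries no text of its own but is defined by the tokens it overlaps can be pictured with a small data structure. This is only a conceptual sketch: the class and function names are hypothetical, and token indices stand in for the timeslots of the real tier.

```python
from dataclasses import dataclass


@dataclass
class CUSpan:
    """A CU annotation: it stores only which tokens it covers, not their text."""
    start: int  # index of the first covered token
    end: int    # index one past the last covered token
    value: str = "CU"


def cu_text(tokens, span):
    """Recover the text of a CU from the tokens it overlaps."""
    return " ".join(tokens[span.start:span.end])


def find_cus_containing(tokens, spans, word):
    """Search CUs by content, although the spans themselves hold no text."""
    return [s for s in spans if word in tokens[s.start:s.end]]


# Tokens taken from the punctuation example in section 4.
tokens = ["мячик", "покатился", "на", "дорогу", ".",
          "потом", "собака", "начала", "лаять", "."]
spans = [CUSpan(0, 5), CUSpan(5, 10)]  # sentence-final punctuation included

print(cu_text(tokens, spans[0]))  # мячик покатился на дорогу .
```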

7 If you delete pauses on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.

8 If you delete hesitation markers on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.

9 If you delete phonetic markers on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.

10 General rule: we neither add nor correct punctuation marks. The only exception is a punctuation mark that sticks to a word; in that case it has to be corrected on the norm-[norm]-layer.
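
The punctuation and emoji rows of the table in section 4, together with the rule in comment 10, can be sketched as follows. The sketch assumes that events are given as plain strings and that a "sticking" punctuation mark is any run of non-letter, non-space characters at the end of a word; leading punctuation such as opening quotes is not handled.

```python
import re

# A word (ending in a letter or digit) followed by a run of punctuation.
TRAILING_PUNCT = re.compile(r"(.*\w)([^\w\s]+)")


def split_events(events):
    """Give every word, punctuation run and emoji its own event."""
    result = []
    for event in events:
        for piece in event.split():  # spaces always separate events
            match = TRAILING_PUNCT.fullmatch(piece)
            if match:
                # Punctuation sticking to a word: detach it as one event.
                result.extend([match.group(1), match.group(2)])
            else:
                # Plain words, "!!!!!", ":-(" and "км/ч" stay in one piece.
                result.append(piece)
    return result


print(split_events(["кошмар", "! ! ! ! !"]))
# ['кошмар', '!', '!', '!', '!', '!']
```

Runs without spaces such as "!!!!!" stay in one event because they contain no letter for the pattern to anchor on.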

11 Clitics are synsemantic (function) words, i.e. words that carry a grammatical rather than a lexical meaning, which phonologically "fuse" with their immediate neighbour word (in most cases a noun). This means that the clitic (in most cases a preposition) and its neighbour word are articulated as one phonological word. Depending on their position, clitics are classified as proclitics (the clitic stands in front of the word it belongs to) or enclitics (the clitic stands behind the word it belongs to).

Example:

работать из (=clitic) дому (=noun) ==> articulation: издому, врач на (=clitic) дом (=noun) ==> articulation: надом

12 If you delete unknown material on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.

13 If you delete the interaction/communication with the elicitor on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.

14 If you delete interrupted speech on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.

15 If you delete non-verbal material on the norm-[norm]-layer, you also have to delete the corresponding events (gaps) on the dipl-[language]-layer.

8. Useful links

  • to normalize the participant's speech on the norm-[norm]-layer according to the orthographic rules of Russian, use the Малый академический словарь (MAC) or Викисловарь