Russian POS and Lemma

0. General information

Lemmatization

  • the term lemma may be defined as the base form of a word
  • the base form of a word is the form you can usually find in a dictionary
  • for verbs the base form is the infinitive, for nouns the nominative singular, and for adjectives the nominative singular masculine
  • the conversion of a word into its base form is called lemmatization
  • lemmatization is carried out semi-automatically in EXMARaLDA with two POS and lemma taggers, U-POS and MyStem; their output must be checked manually every time
  • the lemmas (base forms) are stored on the norm[mystem_lex] layer for MyStem and on the norm[lemma] layer for U-POS

POS-Tagging

  • tagging means that each word produced by the participant is assigned its part of speech (POS)
  • tagging is carried out semi-automatically in EXMARaLDA by two taggers, U-POS and MyStem; their output must be checked manually every time
  • keep in mind that the two taggers are similar, but not identical (see comment 1 in section 5)

1. Structure of POS-Tagging in EXMARaLDA

U-POS-Layers

  • the U-POS tagger uses the layers from norm[Animacy] to norm[voice] as well as the norm[lemma] and norm[pos] layers
  • each layer in U-POS (and MyStem) corresponds to a grammatical category
  • the following table explains each grammatical category in U-POS:
Layer | Grammatical category | Grammemes | Applies to
norm[Animacy] | animacy (одушевлённость) | animate (Anim); inanimate (Inan) | nouns only
norm[Aspect] | aspect (вид) | perfective [что сделать?] (Perf); imperfective [что делать?] (Imp) | verbs only
norm[Case] | case (падеж) | nominative (Nom); genitive (Gen); dative (Dat); accusative (Acc); instrumental (Ins); prepositional (Loc); vocative (Voc) | all nominal parts of speech
norm[Degree] | degree of comparison (степень сравнения) | positive (Pos); comparative (Cmp); superlative (Sup) | adjectives and adverbs
norm[Foreign] | marks a borrowing (определение заимствования) | (Yes) | all parts of speech
norm[Gender] | gender (род) | masculine (Masc); feminine (Fem); neuter (Neut) | nouns, adjectives and pronouns only
norm[Mood] | mood (наклонение) | indicative (Ind); conditional (Cnd); imperative (Imp) | verbs only
norm[Number] | number (число) | singular (Sing); plural (Plur) | nouns, adjectives, personal pronouns and verbs
norm[Person] | person (лицо) | first person (1); second person (2); third person (3) | personal pronouns and verbs
norm[Tense] | tense (время) | present (Pres); past (Past); future (Fut) | verbs and participles
norm[VerbForm] | verb form (форма глагола) | infinitive (Inf); finite form (Fin) | verbs
norm[voice] | voice (залог) | active (Act); middle (Mid); passive (Pas) | verbs and participles
norm[lemma] | base form of the word (начальная форма слова) | ------ | all parts of speech
norm[pos] | part of speech of the given word (определение части речи данного слова) | noun (NOUN); verb (VERB); adjective (ADJ) ... | all parts of speech
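
The value inventory in the table can be turned into a simple consistency check. The following is a minimal sketch and not part of the project tooling: it assumes that the annotation of one token is available as a plain dictionary from layer name to value (a hypothetical representation), and the names ALLOWED_VALUES and invalid_values are made up for illustration. Only the layers with a closed set of values are checked.

```python
# Allowed grammemes per U-POS layer, copied from the table above.
ALLOWED_VALUES = {
    "norm[Animacy]": {"Anim", "Inan"},
    "norm[Aspect]": {"Perf", "Imp"},
    "norm[Case]": {"Nom", "Gen", "Dat", "Acc", "Ins", "Loc", "Voc"},
    "norm[Degree]": {"Pos", "Cmp", "Sup"},
    "norm[Foreign]": {"Yes"},
    "norm[Gender]": {"Masc", "Fem", "Neut"},
    "norm[Mood]": {"Ind", "Cnd", "Imp"},
    "norm[Number]": {"Sing", "Plur"},
    "norm[Person]": {"1", "2", "3"},
    "norm[Tense]": {"Pres", "Past", "Fut"},
    "norm[VerbForm]": {"Inf", "Fin"},
    "norm[voice]": {"Act", "Mid", "Pas"},
}


def invalid_values(annotation):
    """Return (layer, value) pairs whose value is not listed for that layer.

    `annotation` is a hypothetical dict: layer name -> value for one token.
    Layers without a closed value set (norm[lemma], norm[pos]) and empty
    values are ignored.
    """
    problems = []
    for layer, value in annotation.items():
        allowed = ALLOWED_VALUES.get(layer)
        if allowed is not None and value and value not in allowed:
            problems.append((layer, value))
    return problems


# Hypothetical annotation of a verb form with one typo ("Prs").
token = {
    "norm[pos]": "VERB",
    "norm[Aspect]": "Imp",
    "norm[Tense]": "Prs",
    "norm[Number]": "Sing",
}
print(invalid_values(token))
```

A check like this only catches values that do not exist at all; whether an existing value is the right one for the word still has to be decided manually.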

MyStem-Layers

  • the MyStem tagger uses the norm[mystem_gr] and norm[mystem_lex] layers
  • each layer in MyStem (and U-POS) corresponds to a grammatical category
  • the following table explains each grammatical category in MyStem:
Layer | Grammatical category | Grammemes | Applies to
norm[mystem_gr] | part of speech of the given word | every redundant grammeme is deleted; only the first grammeme and, if present, the grammemes of transitivity (tran/intr; see comment 2 in section 5) and parenthesis (parenth) are kept | all parts of speech
norm[mystem_lex] | base form of the word | should match the base form in U-POS | all parts of speech
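
Because norm[mystem_lex] should match the lemma on the U-POS side, disagreements between the two layers are a natural thing to list for manual review. A minimal sketch, assuming both layers have already been exported as parallel lists of strings (a hypothetical representation; reading the EXMARaLDA files is not shown). The name lemma_mismatches is illustrative.

```python
def lemma_mismatches(upos_lemmas, mystem_lemmas):
    """Return (index, upos_lemma, mystem_lemma) for every token where the
    two lemma layers disagree. Both arguments are parallel lists of strings."""
    if len(upos_lemmas) != len(mystem_lemmas):
        raise ValueError("layers must contain the same number of tokens")
    return [
        (i, u, m)
        for i, (u, m) in enumerate(zip(upos_lemmas, mystem_lemmas))
        if u != m
    ]


# Hypothetical layer contents; the last lemma lacks ё on one layer.
norm_lemma = ["женщина", "коляска", "ребёнок"]
norm_mystem_lex = ["женщина", "коляска", "ребенок"]
for index, upos, mystem in lemma_mismatches(norm_lemma, norm_mystem_lex):
    print(index, upos, mystem)
```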

2. The subjects of lemmatization and POS-Tagging are ...

  • ... files from DEbi---R, USbi---R and RUmo---R whose names end in one of the following suffixes:
    • _fsR (formal spoken Russian)
    • _fwR (formal written Russian)
    • _isR (informal spoken Russian)
    • _iwR (informal written Russian)
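
Selecting the relevant files can be sketched as a filename filter. This is only an illustration: it assumes that the files carry the extension .exb and that the register suffix comes directly before the extension. The part of the name between the group prefix and the suffix is matched loosely because its exact format is not specified here. The name is_tagging_subject is hypothetical.

```python
import re
from pathlib import Path

# Group prefix, anything in between, then one of the four register suffixes.
SUBJECT_PATTERN = re.compile(r"^(DEbi|USbi|RUmo).*_(fsR|fwR|isR|iwR)$")


def is_tagging_subject(filename):
    """True if the file name (without its extension) matches the pattern."""
    return SUBJECT_PATTERN.match(Path(filename).stem) is not None


# Placeholder names, for illustration only ("---" stands for the elided part).
for name in ["DEbi---R_fsR.exb", "RUmo---R_iwR.exb", "other_file.exb"]:
    print(name, is_tagging_subject(name))
```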

3. Steps of procedure

    1. step: Push/Pull/Fetch in GitHub
    2. step: open the EXMARaLDA Partitur-Editor
    3. step: File ==> Open ==> rueg repository ==> GitHub ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
    4. step: verify that the CUs in every file conform to the CU guidelines; if not, correct them
    5. step: verify that every word is assigned the right language on the dipl[language] layer; if not, correct it
    6. step: POS-Tagging ==> verify the accuracy of the POS taggers (U-POS and MyStem)
    7. step: delete all features from the norm[mystem_gr] layer except the first one and, if present, the features of transitivity and parenthesis and any other features that are not redundant with U-POS features
    8. step: save your results
    9. step: go to GitHub ==> submit your file ==> push/pull/fetch
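
The reduction of the norm[mystem_gr] layer can be sketched as a small string operation. The sketch below assumes that the layer content is a feature string in which features are separated by commas, "=", "|" or parentheses (an assumption about the tagger output format). The function name reduce_mystem_gr and the extra_keep parameter are hypothetical, and the result still has to be checked manually.

```python
import re

# Features that are always kept in addition to the first one.
ALWAYS_KEEP = {"tran", "intr", "parenth"}


def reduce_mystem_gr(gr, extra_keep=()):
    """Keep the first feature plus the transitivity/parenthesis features.

    `gr` is assumed to be a feature string such as "V,ipf,tran=inf".
    `extra_keep` lists further features that are not redundant with U-POS
    and should survive as well (e.g. "persn", as used in section 4).
    """
    features = [f.strip() for f in re.split(r"[,=|()]", gr) if f.strip()]
    if not features:
        return ""
    keep = ALWAYS_KEEP | set(extra_keep)
    kept = [features[0]]
    for feature in features[1:]:
        if feature in keep and feature not in kept:
            kept.append(feature)
    return ",".join(kept)


print(reduce_mystem_gr("V,ipf,tran=inf"))
print(reduce_mystem_gr("ADV,parenth"))
print(reduce_mystem_gr("S,persn,m,anim=nom,sg", extra_keep=["persn"]))
```

Passing the non-redundant features explicitly through extra_keep keeps the decision about what counts as "not redundant" with the annotator rather than hiding it in the function.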

4. Tagging-Guidelines and problems

Phenomenon/Problem | Solution | Example
personal pronouns | norm[mystem_gr]: SPRO; norm[pos]: PRON | -----
ф | all grammemes on the U-POS layers are deleted; dipl[language]: rus; norm[mystem_gr]: S,persn; norm[pos]: PROPN | ф шестнадцать
его, её, их as possessive pronouns | norm[Case]: Gen | Он уронил её пакет
тут, там, так | norm[mystem_gr]: ADVPRO; norm[pos]: ADV | -----
words containing the letter ё | ё is written on all layers except dipl ==> norm[norm]: …ё…; norm[lemma]: …ё…; norm[mystem_lex]: …ё… | -----
значит | norm[mystem_gr]: ADV, parenth; norm[pos]: INTJ | значит, он уронил всё и пошёл
короче | norm[mystem_gr]: ADV, parenth; norm[pos]: INTJ | Короче, там шла женщина с коляской
окей | norm[mystem_gr]: PART; norm[pos]: PART | -----
всё (ещё) | norm[mystem_gr]: ADV; norm[pos]: ADV | -----
блин | norm[mystem_gr]: INTJ; norm[pos]: INTJ | -----
то есть | norm[mystem_gr]: CONJ (то) PART (есть); norm[pos]: CONJ (то) PART (есть) | -----
спасибо | norm[mystem_gr]: PART; norm[pos]: PART | -----
этот | norm[mystem_gr]: APRO; norm[pos]: DET | -----
свой | norm[mystem_gr]: APRO; norm[pos]: PRON | -----
привет, здравствуйте | norm[mystem_gr]: INTJ; norm[pos]: INTJ; norm[lemma]: привет, здравствуйте; norm[mystem_lex]: привет, здравствуйте | -----
играть | norm[mystem_gr]: tran | мальчик играл с мячом (see comment 3 in section 5)
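
The fixed tag pairs in the table lend themselves to an automatic reminder. A minimal sketch, assuming each token is available as a lemma together with its norm[mystem_gr] and norm[pos] values (a hypothetical representation). Only some single-word rows of the table are included, and the names GUIDELINE_TAGS and check_guideline are made up for illustration.

```python
# lemma -> (expected norm[mystem_gr], expected norm[pos]), from the table above.
GUIDELINE_TAGS = {
    "тут": ("ADVPRO", "ADV"),
    "там": ("ADVPRO", "ADV"),
    "так": ("ADVPRO", "ADV"),
    "окей": ("PART", "PART"),
    "блин": ("INTJ", "INTJ"),
    "спасибо": ("PART", "PART"),
    "этот": ("APRO", "DET"),
    "свой": ("APRO", "PRON"),
    "привет": ("INTJ", "INTJ"),
}


def check_guideline(lemma, mystem_gr, pos):
    """Return None if the token agrees with the table or is not covered by it,
    otherwise the expected (mystem_gr, pos) pair."""
    expected = GUIDELINE_TAGS.get(lemma)
    if expected is None or expected == (mystem_gr, pos):
        return None
    return expected


print(check_guideline("этот", "APRO", "PRON"))  # deviates from the table
print(check_guideline("блин", "INTJ", "INTJ"))  # agrees with the table
```

A reported deviation is only a hint: words whose tagging depends on the context still need a manual decision.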

5. Comments

1 U-POS and MyStem use partly different features for the POS-tagging of words.

Example: U-POS classifies the Russian personal pronoun я simply as a pronoun (PRON) and gives no further specification. MyStem, in contrast, is more specific and classifies я as a noun-pronoun (SPRO).

4 In general, reflexive verbs in Russian can be identified by the postfix -ся. However, not every verb ending in -ся is reflexive: a verb with a transitive stem and the postfix -ся can instead be in the passive voice. When in doubt, translate the Russian verb into German. If it can be translated with sich ..., it is very likely a reflexive verb. If that is not possible and the translation requires the passive construction wird/werden ... ge-..., it is very likely a transitive verb in its passive form.

Example: Книга читается.
Das Buch liest sich. ==> This translation does not make sense (except in fairy tales), because a book cannot normally read itself.
Das Buch wird gelesen. ==> This translation is more plausible (assuming the context is not a fairy tale), because the stem is a transitive verb with the postfix -ся. The verb therefore expresses the passive: the book is read by someone who is unknown or is not mentioned.

Example: Человек развивается.
Der Mensch wird entwickelt. ==> Развивать is a transitive verb, so the postfix -ся could suggest the passive voice. This reading is possible in principle, and without context it is hard to decide. If no context is available, go by the most common meaning of the sentence.
Der Mensch entwickelt sich. ==> This is the common meaning of the sentence. Here the verb has a reflexive rather than a passive meaning. Prefer this reading whenever the context is missing or unclear.
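
The decision between a reflexive and a passive reading depends on meaning and cannot be automated reliably, but the candidates can at least be collected for manual review. A minimal sketch, assuming tokens are available as pairs of word form and POS tag (a hypothetical representation) and treating -сь as the variant of -ся that appears after vowels. The name sya_candidates is illustrative.

```python
def sya_candidates(tokens):
    """Return the verb forms ending in -ся/-сь, which need a manual
    reflexive-vs-passive decision. `tokens` is a list of (form, pos) pairs."""
    return [
        form
        for form, pos in tokens
        if pos == "VERB" and form.lower().endswith(("ся", "сь"))
    ]


# Hypothetical tokens built from example sentences in these comments.
tokens = [
    ("Книга", "NOUN"), ("читается", "VERB"),
    ("Человек", "NOUN"), ("развивается", "VERB"),
    ("Мальчик", "NOUN"), ("читает", "VERB"),
]
print(sya_candidates(tokens))
```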

2 Transitive verbs are verbs that govern a direct object (an object in the accusative without a preposition). Only transitive verbs can form the passive voice. The passive voice can be recognized by the stem of a transitive verb + the postfix -ся.

Example: Мальчик читает книгу. Книга читается мальчиком. Junge (Nom) liest (tran. verb) Buch (Acc. obj. without preposition). Buch (Nom) wird gelesen (pass. voice of a tran. verb) vom Jungen (Ins).

Intransitive verbs are verbs that govern indirect objects (objects in the accusative with a preposition, or objects in other cases). A preposition can appear between the verb and the object. The object can be in the accusative with a preposition, in the dative, genitive or instrumental with or without a preposition, or in the locative with a preposition (objects in the locative always take a preposition, which is why the Russian locative is called the prepositional case). Intransitive verbs cannot form the passive voice.

Example: Папа звонит маме. *Мама звонится папой. Papa (Nom) ruft (intr.verb) an Mama (Dat.). *Mama wird angerufen von Papa.

3 In this context the verb играть is intransitive, because its object is introduced by the preposition с, which usually requires the instrumental. However, there are cases in which играть can be used as a transitive verb.

Example: Вася играет дурака в этом спектакле. Vasja (Nom) spielt (tran.verb) den Dummen (acc.object without a preposition between verb and object) in diesem Stück (Loc).

Therefore, every verb that can have a transitive meaning in other contexts must be tagged as transitive on the MyStem layer, even if it is used intransitively in the current context. The reason is that a verb which can, in principle, be used transitively is always treated as having a transitive basic meaning, regardless of whether that meaning appears in the current context.
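
Treating transitivity as a property of the verb rather than of the current sentence amounts to a lookup by lemma. A minimal sketch with a tiny hand-made lexicon that contains only the verbs discussed in these comments. The lexicon, the name transitivity_feature and the use of None for "decide manually" are assumptions for illustration.

```python
# Lexical transitivity of the verbs discussed in the comments above.
TRANSITIVITY = {
    "читать": "tran",
    "развивать": "tran",
    "играть": "tran",   # transitive in "играет дурака", so always "tran"
    "звонить": "intr",
}


def transitivity_feature(lemma):
    """Return "tran"/"intr" for a known verb lemma, regardless of context,
    or None if the verb is not in the lexicon and must be checked manually."""
    return TRANSITIVITY.get(lemma)


# "мальчик играл с мячом": intransitive use, but the feature stays "tran".
print(transitivity_feature("играть"))
print(transitivity_feature("гулять"))  # not in the lexicon
```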

6. Useful links