Russian POS and Lemma
0. General information
Lemmatization
- the term lemma may be defined as the base form of a word
- the base form of a word is the form you can usually find in a dictionary
- for verbs the base form corresponds to the infinitive, for nouns to the nominative singular, and for adjectives to the nominative singular masculine
- the conversion of a word into its base form is called lemmatization
- lemmatization is carried out semi-automatically in EXMARaLDA by two POS and lemma taggers, U-POS and MyStem; the taggers' output must be checked manually every time
- the lemmas (base forms) are on the norm[mystem_lex] layer for MyStem and on the norm[lemma] layer for U-POS
POS-Tagging
- tagging means that each word produced by a participant is assigned its part of speech (POS)
- tagging is carried out semi-automatically in EXMARaLDA by the U-POS and MyStem taggers; their output must be checked manually every time
- both taggers in EXMARaLDA, U-POS and MyStem, perform POS tagging
- keep in mind that the two taggers are similar but not identical (see comment 1 in section 5)
1. Structure of POS-Tagging in EXMARaLDA
U-POS-Layers
- the U-POS tagger comprises the layers from norm[Animacy] to norm[voice] as well as the norm[lemma] and norm[pos] layers
- each layer in U-POS (and MyStem) corresponds to a grammatical category
- the following table explains each grammatical category in U-POS:
Layer | Grammatical category | Grammeme | Part of speech |
---|---|---|---|
norm[Animacy] | Animacy (одушевлённость) | animate (Anim); inanimate (Inan) | nouns only |
norm[Aspect] | Aspect (вид) | perfective [что сделать?] (Perf); imperfective [что делать?] (Imp) | verbs only |
norm[Case] | Case (падеж) | nominative (Nom); genitive (Gen); dative (Dat); accusative (Acc); instrumental (Ins); prepositional (Loc); vocative (Voc) | all nominal parts of speech |
norm[Degree] | Degree of comparison (степень сравнения) | positive (Pos); comparative (Cmp); superlative (Sup) | adjectives and adverbs |
norm[Foreign] | Marks a borrowed word (определение заимствования) | (Yes) | all parts of speech |
norm[Gender] | Gender (род) | masculine (Masc); feminine (Fem); neuter (Neut) | nouns, adjectives and pronouns only |
norm[Mood] | Mood (наклонение) | indicative (Ind); conditional (Cnd); imperative (Imp) | verbs only |
norm[Number] | Number (число) | singular (Sing); plural (Plur) | nouns, adjectives, personal pronouns and verbs |
norm[Person] | Person (лицо) | first person (1); second person (2); third person (3) | personal pronouns and verbs |
norm[Tense] | Tense (время) | present (Pres); past (Past); future (Fut) | verbs and participles |
norm[VerbForm] | Verb form (форма глагола) | infinitive (Inf); finite form (Fin) | verbs |
norm[voice] | Voice (залог) | active (Act); middle (Mid); passive (Pas) | verbs and participles |
norm[lemma] | Base form of the word (начальная форма слова) | ------ | all parts of speech |
norm[pos] | Part of speech of the given word | noun (NOUN); verb (VERB); adjective (ADJ) ... | all parts of speech |
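When checking the U-POS layers by hand, it can help to flag values that do not appear in the table at all. The following is a minimal sketch of such a check. The names `UPOS_VALUES` and `find_invalid_values` are hypothetical, and the sketch assumes that the annotations of one token are available as a plain dictionary from layer name to value, which EXMARaLDA does not provide in this form out of the box.

```python
# Allowed values per U-POS layer, copied from the table above.
# The dictionary layout and all names here are illustrative assumptions.
UPOS_VALUES = {
    "norm[Animacy]": {"Anim", "Inan"},
    "norm[Aspect]": {"Perf", "Imp"},
    "norm[Case]": {"Nom", "Gen", "Dat", "Acc", "Ins", "Loc", "Voc"},
    "norm[Degree]": {"Pos", "Cmp", "Sup"},
    "norm[Foreign]": {"Yes"},
    "norm[Gender]": {"Masc", "Fem", "Neut"},
    "norm[Mood]": {"Ind", "Cnd", "Imp"},
    "norm[Number]": {"Sing", "Plur"},
    "norm[Person]": {"1", "2", "3"},
    "norm[Tense]": {"Pres", "Past", "Fut"},
    "norm[VerbForm]": {"Inf", "Fin"},
    "norm[voice]": {"Act", "Mid", "Pas"},
}


def find_invalid_values(token_layers):
    """Return {layer: value} for every value that is not listed in the table."""
    invalid = {}
    for layer, value in token_layers.items():
        allowed = UPOS_VALUES.get(layer)
        # Layers outside the table (e.g. norm[lemma], norm[pos]) and empty
        # cells are skipped.
        if allowed is not None and value and value not in allowed:
            invalid[layer] = value
    return invalid
```

The table may not list every value the tagger can produce, so a flagged value is a prompt to look at the token again, not proof of an error.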
MyStem-Layers
- the MyStem tagger comprises the norm[mystem_gr] and norm[mystem_lex] layers
- each layer in MyStem (and U-POS) corresponds to a grammatical category
- the following table explains each grammatical category in MyStem:
Layer | Grammatical category | Grammeme | Part of speech |
---|---|---|---|
norm[mystem_gr] | Part of speech of the given word | Every redundant grammeme is deleted; only the first grammeme and, if present, the grammemes of transitivity (tran/intr; see comment 2 in section 5) and parenthesis (parenth) are kept | all parts of speech |
norm[mystem_lex] | Base form of the word | should match the base form in U-POS (norm[lemma]) | all parts of speech |
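The clean-up rule for norm[mystem_gr] can be sketched as a small function. This is only an illustration: the name `reduce_mystem_gr` is hypothetical, and the sketch assumes that the grammemes in a cell are separated by "," or "=" as in raw MyStem output. It does not decide which other features count as "not redundant with U-POS"; that remains a manual decision.

```python
import re

# Grammemes that are kept in addition to the first one (the POS tag).
# "parent" is accepted as a variant spelling of "parenth" (assumption).
KEEP = {"tran", "intr", "parenth", "parent"}


def reduce_mystem_gr(gr):
    """Keep the first grammeme plus transitivity and parenthesis grammemes."""
    # Assumption: grammemes are separated by "," or "=".
    parts = [p.strip() for p in re.split(r"[,=]", gr) if p.strip()]
    if not parts:
        return ""
    kept = [parts[0]] + [p for p in parts[1:] if p in KEEP]
    return ",".join(kept)
```

For example, `reduce_mystem_gr("V,ipf,tran=inf")` returns `"V,tran"`. The result is joined with a bare comma; whether a space follows the comma in the corpus is a convention this sketch does not settle.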
2. The subjects of lemmatization and POS-Tagging are ...
- ... files from DEbi---R, USbi---R and RUmo---R whose names end with one of the following:
  - _fsR (formal spoken Russian)
  - _fwR (formal written Russian)
  - _isR (informal spoken Russian)
  - _iwR (informal written Russian)
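To pick out the relevant files from a folder listing, the naming pattern above can be checked mechanically. The sketch below is an assumption-laden illustration: the function name `is_tagging_candidate` is hypothetical, and only the group prefix and the register suffix are tested, because the middle part of the file names is not spelled out in these guidelines.

```python
from pathlib import Path

# Group prefixes and register suffixes taken from the list above.
PREFIXES = ("DEbi", "USbi", "RUmo")
SUFFIXES = ("_fsR", "_fwR", "_isR", "_iwR")


def is_tagging_candidate(filename):
    """True if the file name (without extension) matches prefix and suffix."""
    stem = Path(filename).stem
    # The part between prefix and suffix is deliberately not checked.
    return stem.startswith(PREFIXES) and stem.endswith(SUFFIXES)
```

A call such as `is_tagging_candidate(path.name)` for every file in the exb folder would give a first list of candidates, which should still be compared against the list of files actually assigned to you.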
3. Steps of procedure
- Step 1: Push/pull/fetch in GitHub
- Step 2: Open the EXMARaLDA Partitur-Editor
- Step 3: File ==> Open ==> rueg repository ==> GitHub ==> rueg-corpus ==> exb ==> P3 ==> 1, 2, 3 …
- Step 4: Check that the CUs in every file follow the CU guidelines; if they do not, correct them
- Step 5: Check that every word is assigned the right language on the dipl[language] layer; if it is not, correct it
- Step 6: POS tagging ==> check the accuracy of the POS taggers (U-POS and MyStem)
- Step 7: On the norm[mystem_gr] layer, delete all features except the first one and, if present, the features of transitivity and parenthesis and any other features that are not redundant with U-POS features
- Step 8: Save your results
- Step 9: Go to GitHub ==> submit your file ==> push/pull/fetch
4. Tagging-Guidelines and problems
Phenomenon/Problem | Solution | Example |
---|---|---|
personal pronouns | norm[mystem_gr]: SPRO; norm[pos]: PRON | ----- |
ф | all grammemes on the U-POS layers are deleted; dipl[language]: rus; norm[mystem_gr]: S,persn; norm[pos]: PROPN | ф шестнадцать |
его, её, их as possessive pronouns | norm[Case]: Gen | Он уронил её пакет |
тут, там, так | norm[mystem_gr]: ADVPRO; norm[pos]: ADV | ----- |
words containing the letter ё | ё is written on all layers except dipl ==> norm[norm]: …ё…; norm[lemma]: …ё…; norm[mystem_lex]: …ё… | ----- |
значит | norm[mystem_gr]: ADV, parenth; norm[pos]: INTJ | значит, он уронил всё и пошёл |
короче | norm[mystem_gr]: ADV, parenth; norm[pos]: INTJ | Короче, там шла женщина с коляской |
окей | norm[mystem_gr]: PART; norm[pos]: PART | ----- |
всё (ещё) | norm[mystem_gr]: ADV; norm[pos]: ADV | ----- |
блин | norm[mystem_gr]: INTJ; norm[pos]: INTJ | ----- |
то есть | norm[mystem_gr]: CONJ (то) PART (есть); norm[pos]: CONJ (то) PART (есть) | ----- |
спасибо | norm[mystem_gr]: PART; norm[pos]: PART | ----- |
этот | norm[mystem_gr]: APRO; norm[pos]: DET | ----- |
свой | norm[mystem_gr]: APRO; norm[pos]: PRON | ----- |
привет, здравствуйте | norm[mystem_gr]: INTJ; norm[pos]: INTJ; norm[lemma]: привет, здравствуйте; norm[mystem_lex]: привет, здравствуйте | ----- |
играть | norm[mystem_gr]: tran | мальчик играл с мячом (see comment 3 in section 5) |
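Some rows of the table above can be written down as data and used to spot tokens whose tags deviate from the guideline. The sketch below is illustrative only: `RULES` and `check_token` are hypothetical names, only a few rows are included, and the token's annotations are assumed to be available as a dictionary from layer name to value. Word forms such as значит or блин also occur with other meanings, so a reported mismatch is a candidate for manual review, not an automatic correction.

```python
# A few rows of the table above as data, keyed by word form.
# The remaining rows can be added in the same way.
RULES = {
    "значит": {"norm[mystem_gr]": "ADV, parenth", "norm[pos]": "INTJ"},
    "короче": {"norm[mystem_gr]": "ADV, parenth", "norm[pos]": "INTJ"},
    "окей": {"norm[mystem_gr]": "PART", "norm[pos]": "PART"},
    "блин": {"norm[mystem_gr]": "INTJ", "norm[pos]": "INTJ"},
    "спасибо": {"norm[mystem_gr]": "PART", "norm[pos]": "PART"},
    "этот": {"norm[mystem_gr]": "APRO", "norm[pos]": "DET"},
    "свой": {"norm[mystem_gr]": "APRO", "norm[pos]": "PRON"},
}


def check_token(form, layers):
    """Return (layer, expected, found) for each layer that deviates from RULES."""
    expected = RULES.get(form.lower(), {})
    mismatches = []
    for layer, want in expected.items():
        found = layers.get(layer, "")
        # Compare without spaces so "ADV, parenth" equals "ADV,parenth".
        if found.replace(" ", "") != want.replace(" ", ""):
            mismatches.append((layer, want, found))
    return mismatches
```

Word forms that have no entry in `RULES` produce an empty list, so the check can be run over every token of a file without extra filtering.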
5. Comments
1 U-POS and MyStem use partly different features for POS tagging.
Example: U-POS tags the Russian personal pronoun я simply as a pronoun (PRON) and gives no further specification. MyStem, in contrast, specifies the type of pronoun and tags я as a noun-pronoun (SPRO).
4 In general, reflexive verbs in Russian can be identified by the postfix -ся, but not every verb ending in -ся is reflexive. A verb with a transitive stem and the postfix -ся can instead be a passive form. When in doubt, translate the Russian verb into German. If it can be translated with sich ..., it is very likely a reflexive verb. If that is not possible and it has to be translated with the passive construction wird/werden ... ge-..., it is very likely a transitive verb in its passive form.
Example: Книга читается.
- Das Buch liest sich. ==> This translation makes no sense (except in fairy tales), because a book cannot normally read itself.
- Das Buch wird gelesen. ==> This translation is more plausible (assuming the context is not a fairy tale), because the stem is a transitive verb with the postfix -ся. The verb therefore expresses the passive: the book is read by someone who is unknown or is not mentioned.
Example: Человек развивается.
- Der Mensch wird entwickelt. ==> Развивать is a transitive verb, so the postfix -ся could suggest the passive voice. This reading is possible and, without context, hard to rule out. If there is no context, go by the usual meaning of the sentence.
- Der Mensch entwickelt sich. ==> This is the usual, frequently used meaning of the sentence. Here the verb has a reflexive meaning, not a passive one. Prefer this reading when the context is missing or unclear.
2 Transitive verbs govern a direct object, i.e. an object in the accusative with no preposition between verb and object. Only transitive verbs can form the passive voice. The passive voice can be recognized by the stem of a transitive verb plus the postfix -ся.
Example: Мальчик читает книгу. Книга читается мальчиком.
Junge (Nom) liest (tran. verb) Buch (Acc. obj. without preposition). Buch (Nom) wird gelesen (pass. voice of a tran. verb) vom Jungen (Ins).
Intransitive verbs govern indirect objects, i.e. objects in the accusative with a preposition or objects in another grammatical case. A preposition may stand between verb and object. The object can be in the accusative with a preposition; in the dative, genitive or instrumental with or without a preposition; or in the locative with a preposition (the locative always takes a preposition, which is why it is called the prepositional case in Russian). Intransitive verbs cannot form the passive voice.
Example: Папа звонит маме. *Мама звонится папой.
Papa (Nom) ruft (intr. verb) an Mama (Dat). *Mama wird angerufen von Papa.
3 In this context the verb играть is intransitive, because its object follows the preposition с, which usually requires the instrumental. However, there are cases in which играть is used as a transitive verb.
Example: Вася играет дурака в этом спектакле.
Vasja (Nom) spielt (tran. verb) den Dummen (acc. object with no preposition between verb and object) in diesem Stück (Loc).
Therefore, every verb that can have a transitive meaning in some other context must be marked as transitive on the MyStem layer, even if it is used intransitively in the current context. The reason is that a verb which can in principle be used transitively is always treated as having a transitive basic meaning, whether or not that meaning appears in the current sentence.
6. Useful links
- If you cannot decide which part of speech a word belongs to, look it up in the Национальный корпус русского языка (Russian National Corpus) and check their analysis. Keep in mind that they analyzed the speech of their participants under partly different conditions and assumptions.
- All U-POS features are available here: Universal features part 1 and Universal features part 2
- If you cannot decide whether a word is transitive or intransitive, or if you do not know which grammatical case a word is in, use Викисловарь (Russian Wiktionary)