Annotation Step 3: POS and Lemma
Tools involved
- we prepare lemma, pos and feature annotations for you (automatically)
- you will correct these annotations in EXMARaLDA
- you will find the data in the same subfolder of exb/ as during normalization
If you ever see the following error message when opening an EXMARaLDA file:
Tier ... is not stratified. Please choose a method for stratifying the tier:
choose "Stratify by deletion".
Lemmatization
- binds several occurrences within the corpus to a common type
- makes it possible to search for lexical items
- lemma: standardized form which might occur in different realizations / variants
- lemma usually (but not necessarily!) determined by paradigms, i.e.: inflected form → prototype (a single inflected form within the paradigm, or the uninflected form) = lemma
- lemmatization varies across annotation schemes and languages
- lemmatization is linked to part of speech
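The idea of binding occurrences to a common type, and the link between lemma and part of speech, can be sketched in a few lines of Python. This is a toy illustration only: the lexicon `TOY_LEXICON`, the helper names, and the example words are hypothetical and not taken from any project guideline or tool.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (word form, POS) -> lemma.
# The entries are illustrative and not part of any annotation scheme.
TOY_LEXICON = {
    ("goes", "VERB"): "go",
    ("went", "VERB"): "go",
    ("going", "VERB"): "go",
    ("houses", "NOUN"): "house",
    ("houses", "VERB"): "house",
}


def lemmatize(form, pos):
    """Look up the lemma for a form/POS pair; fall back to the lowercased form."""
    return TOY_LEXICON.get((form.lower(), pos), form.lower())


def group_by_lemma(tokens):
    """Bind occurrences (form, pos) to a common type, keyed by (lemma, pos)."""
    groups = defaultdict(list)
    for form, pos in tokens:
        groups[(lemmatize(form, pos), pos)].append(form)
    return dict(groups)


tokens = [("went", "VERB"), ("goes", "VERB"), ("houses", "NOUN"), ("houses", "VERB")]
groups = group_by_lemma(tokens)
```

Keying the groups by both lemma and POS keeps the noun "houses" and the verb "houses" apart, which is one reason lemma and POS are corrected together.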
Part of Speech (POS)
- we would like to group lemmata into more abstract categories
- factors for determining a category can be:
  - syntactic / distributional
  - semantic
  - morphological
  - graphical
  - ...
Know the guidelines and document difficult cases
- part of speech is the baseline for many further annotations
- we need correct annotations
- we will measure agreement
- POS tagging is highly connected to lemmatization and is thus subject to the same restrictions and parameters of variation
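The text does not say which agreement measure will be used. As one common option for two annotators, the following minimal sketch computes Cohen's kappa over two parallel lists of POS tags. The function name and the example tags are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty token list")
    n = len(labels_a)
    # Observed agreement: share of tokens with identical tags.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's tag distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical tag throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Unlike raw percentage agreement, kappa corrects for the agreement two annotators would reach by chance given how often each of them uses each tag.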
Rule of thumb in cases of doubt (Lemma & POS)
Always trust the guidelines more than your “grammatical intuition”, but in cases of doubt consult both.
Language-specific POS vs. UD-POS
- language-specific: you might have to find new ways / rules for undescribed phenomena (please document them!)
- UD: stick strictly to the UD guidelines for your language; please do not decide by what seems more logical to you
POS – Schemes by Language
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
British National Corpus Part of Speech Tagset | STTS 2.0 | Universal POS tags | MyStem Morphology | MULTILIT |
All languages
Universal Dependencies POS-tags, lemmas, and features
- https://universaldependencies.org/
- https://universaldependencies.org/u/feat/
- https://universaldependencies.org/u/pos/
POS – Who does what?
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
correct BNC-POS, lemma, features (?) | correct STTS-POS, lemma, features (?) | correct UD-POS, UD features, lemma | correct MyStem-POS, lemma, features, and UD-POS | correct MULTILIT-POS, lemma, features (?), and UD-POS (?) |
UD-POS by language
English | German | Greek | Russian | Turkish |
---|---|---|---|---|
derivable | derivable | needs manual correction | needs manual correction (?) | derivable (?) |
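"Derivable" means that the UD-POS tag can be computed from the corrected language-specific tag. A minimal sketch for German follows, assuming a simple lookup table. The mapping `STTS_TO_UPOS` is partial and illustrative: it uses classic STTS tag names (whether they all carry over to STTS 2.0 is an assumption) and it is not the project's actual conversion table. Tags without an entry are reported for manual review instead of being guessed.

```python
# Partial, illustrative STTS -> UD-POS mapping (an assumption, not the
# project's conversion table). Some tags, such as auxiliary verbs, may
# need context-dependent rules and are left out on purpose.
STTS_TO_UPOS = {
    "NN": "NOUN",
    "NE": "PROPN",
    "ADJA": "ADJ",
    "ADJD": "ADJ",
    "ART": "DET",
    "CARD": "NUM",
    "VVFIN": "VERB",
    "VVINF": "VERB",
    "APPR": "ADP",
    "ADV": "ADV",
    "KON": "CCONJ",
    "KOUS": "SCONJ",
    "ITJ": "INTJ",
}


def derive_upos(stts_tags):
    """Return (derived UD tags, indices of tokens that need manual review)."""
    derived = []
    needs_review = []
    for index, tag in enumerate(stts_tags):
        upos = STTS_TO_UPOS.get(tag)  # None if the tag has no mapping
        derived.append(upos)
        if upos is None:
            needs_review.append(index)
    return derived, needs_review


upos_tags, review = derive_upos(["ART", "NN", "VVFIN", "XYZ"])
```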
Remarks
- delete tier norm [CU]
- feel free to move the tiers up and down
- rename pos_1_1 to pos_lang
- look out for errors in the data and report them immediately
General decisions for all languages:
- lemma "F16" is tagged as a proper noun with the respective tag from the specific tagset you are using (e.g. "PROPN" from the Universal POS tags)
- all kinds of greetings should be treated as interjections and tagged with the respective tag from the specific tagset you are using (e.g. "INTJ" from the Universal POS tags), unless there is a specific tag for greetings in the language-specific tagset you are using
- regarding the lemmatization of informal greetings: you can just copy the word form from the norm/dipl-layer
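The first two decisions above can be checked mechanically. The following is a minimal sketch, assuming tokens are available as dictionaries with the keys `norm`, `lemma` and `pos`. That data layout, the function name, and the greeting list `GREETINGS` are hypothetical. The tag parameters default to the UD tags and can be swapped for the tags of a language-specific tagset, including a dedicated greeting tag if the tagset has one.

```python
# Hypothetical list of greeting forms; extend it for your language.
GREETINGS = {"hi", "hello", "hey", "bye"}


def check_general_decisions(tokens, propn_tag="PROPN", greeting_tag="INTJ"):
    """Return (index, message) pairs for tokens that violate the decisions."""
    problems = []
    for index, token in enumerate(tokens):
        # Decision 1: lemma "F16" is a proper noun.
        if token["lemma"] == "F16" and token["pos"] != propn_tag:
            problems.append((index, f"lemma F16 should be tagged {propn_tag}"))
        # Decision 2: greetings are interjections (or the tagset's greeting tag).
        if token["norm"].lower() in GREETINGS and token["pos"] != greeting_tag:
            problems.append((index, f"greeting should be tagged {greeting_tag}"))
    return problems


tokens = [
    {"norm": "F16", "lemma": "F16", "pos": "NOUN"},
    {"norm": "Hi", "lemma": "Hi", "pos": "INTJ"},
    {"norm": "hello", "lemma": "hello", "pos": "NOUN"},
]
problems = check_general_decisions(tokens)
```

The lemma of a greeting is deliberately not checked, since copying the word form from the norm/dipl layer is permitted but not required.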