Annotation Step 3: POS and Lemma

Tools involved

  • we automatically prepare lemma, POS and feature annotations for you
  • you will correct these annotations in EXMARaLDA
  • you will find the data in the same subfolder of exb/ as during normalization

If you ever face the following error message when opening an EXMARaLDA file:

Tier ... is not stratified. Please choose a method for stratifying the tier:

Choose "Stratify by deletion".

Lemmatization

  • binds several occurrences within the corpus to a common type

  • facilitates / enables searching for lexical items

  • lemma: standardized form which might occur in different realizations / variants

  • lemma is usually (but not necessarily!) determined by paradigms, i.e.:

inflected form → prototype = a single inflected form within paradigm or uninflected form = lemma

  • lemmatization varies across annotation schemes and languages
  • lemmatization is linked to part of speech
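
The idea of binding several occurrences to a common type can be sketched in a few lines of Python. This is only an illustration: the word forms, the lemmas, and the helper name `group_by_lemma` are hypothetical and are not taken from the project data or tools.

```python
from collections import defaultdict

# Hypothetical (word form, lemma) pairs, as they might appear on two tiers.
tokens = [
    ("went", "go"),
    ("goes", "go"),
    ("going", "go"),
    ("houses", "house"),
    ("house", "house"),
]

def group_by_lemma(pairs):
    """Collect all word forms that share the same lemma."""
    forms = defaultdict(list)
    for form, lemma in pairs:
        forms[lemma].append(form)
    return dict(forms)

index = group_by_lemma(tokens)

# Searching for the lemma finds every realization, whatever its inflection.
print(index["go"])
```

With such an index, one query for a lemma retrieves all its variants, which is what makes lemma annotation useful for corpus search.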

Part of Speech (POS)

  • we would like to group several lemmata into more abstract categories
  • factors for determining a category can be:
      • syntactic / distributional
      • semantic
      • morphological
      • graphical
      • ...

Know the guidelines and document difficult cases

  • part of speech is the baseline for many further annotations

  • we need correct annotations

  • we will measure agreement

  • POS is closely connected to lemmatization and is thus subject to the same restrictions and parameters of variation
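
The slides do not say which agreement measure will be used. As one common possibility, the sketch below computes Cohen's kappa for two annotators who tagged the same tokens. The function name `cohen_kappa` and the example tags are assumptions made for illustration only.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two equally long sequences of labels."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(tags_a)
    # Observed agreement: share of tokens with identical tags.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Expected chance agreement from each annotator's tag distribution.
    count_a = Counter(tags_a)
    count_b = Counter(tags_b)
    expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same tag throughout
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of four tokens by two annotators.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on three of four tokens (0.75), but kappa is lower because some of that agreement would be expected by chance.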

Rule of thumb in cases of doubt (Lemma & POS)

Always trust the guidelines more than your “grammatical intuition”, but in cases of doubt consult both.

Language-specific POS vs. UD-POS

  • language-specific: you might have to find new ways / rules for undescribed phenomena (Please document!)
  • UD: stick strictly to the UD guidelines for your language; please do not decide by what seems more logical to you

POS – Schemes by Language

  • English: British National Corpus Part of Speech Tagset
  • German: STTS 2.0
  • Greek: Universal POS tags
  • Russian: MyStem Morphology
  • Turkish: MULTILIT

All languages

Universal Dependencies POS-tags, lemmas, and features

POS – Who does what?

  • English: correct BNC-POS, lemma, features (?)
  • German: correct STTS-POS, lemma, features (?)
  • Greek: correct UD-POS, UD features, lemma
  • Russian: correct MyStem-POS, lemma, features, and UD-POS
  • Turkish: correct MULTILIT-POS, lemma, features (?), and UD-POS (?)

UD-POS by language

  • English: derivable
  • German: derivable
  • Greek: needs manual correction
  • Russian: needs manual correction (?)
  • Turkish: derivable (?)
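
"Derivable" here means that the UD tag can be looked up from the language-specific tag. A minimal sketch for a handful of STTS tags follows. The table is a deliberately small, simplified illustration and not the project's conversion table; the names `STTS_TO_UPOS` and `derive_upos` are hypothetical.

```python
# Simplified, incomplete illustration of an STTS -> UD-POS lookup.
# Tags whose UD counterpart depends on context are deliberately left out.
STTS_TO_UPOS = {
    "NN": "NOUN",     # common noun
    "NE": "PROPN",    # proper noun
    "ADJA": "ADJ",    # attributive adjective
    "ART": "DET",     # article
    "APPR": "ADP",    # preposition
    "CARD": "NUM",    # cardinal number
    "KON": "CCONJ",   # coordinating conjunction
    "ITJ": "INTJ",    # interjection
    "VVFIN": "VERB",  # finite full verb
}

def derive_upos(stts_tag):
    """Return the UD-POS tag, or None if the tag needs a manual decision."""
    return STTS_TO_UPOS.get(stts_tag)

for tag in ["NN", "ITJ", "XYZ"]:
    print(tag, "->", derive_upos(tag))
```

Returning `None` for unknown tags keeps the cases visible that still need manual correction instead of silently guessing a category.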

Remarks

  • delete tier norm [CU]
  • feel free to move the tiers up and down
  • rename pos_1_1 to pos_lang
  • look out for errors in the data and report them immediately

General decisions for all languages:

  • lemma "F16" is tagged as a proper noun with the respective tag from the specific tagset you are using (e.g. "PROPN" from the Universal POS tags)
  • all kinds of greetings should be treated as interjections and tagged with the respective tag from the specific tagset you are using (e.g. "INTJ" from the Universal POS tags), unless the language-specific tagset you are using has a specific tag for greetings
  • regarding the lemmatization of informal greetings: you can just copy the word form from the norm/dipl-layer
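
The general decisions above could be checked semi-automatically on the UD-POS layer. The sketch below is an assumption-laden illustration: the greeting list `GREETINGS`, the function names, and the (lemma, POS) input format are all hypothetical, and language-specific tagsets with their own greeting tag are not covered.

```python
# Hypothetical greeting lemmas; extend or replace for your language.
GREETINGS = {"hello", "hi", "hey"}

def expected_upos(lemma):
    """UD-POS required by the general decisions, or None if no rule applies."""
    if lemma == "F16":
        return "PROPN"
    if lemma.lower() in GREETINGS:
        return "INTJ"
    return None

def find_violations(tokens):
    """Return (index, lemma, found tag, expected tag) for every mismatch."""
    problems = []
    for i, (lemma, pos) in enumerate(tokens):
        want = expected_upos(lemma)
        if want is not None and pos != want:
            problems.append((i, lemma, pos, want))
    return problems

# Hypothetical (lemma, UD-POS) pairs from one transcript.
sample = [("F16", "NOUN"), ("hello", "INTJ"), ("hi", "NOUN"), ("car", "NOUN")]
violations = find_violations(sample)
for item in violations:
    print(item)
```

Such a check only flags candidates; the final decision still follows the guidelines and should be made by the annotator.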