Appendix T: Technical details
Annotation Layers
In CoNLL-U
The following list describes how each of the 10 CoNLL-U columns is used in our annotation format.
- 1 – ID
- 2 – allomorph
  - the surface form of the morpheme
  - automatically derived; needs manual correction
- 3 – morpheme
  - the lemma of the morpheme
  - automatically derived from SMOR; usually no correction necessary
- 4 – category of the morpheme
- 5 – category of the formation of all joint dependents (empty)
  - used only in the automatic part of the annotation; this column can be disregarded during manual annotation
- 6 – features (empty)
- 7 – head
  - ID of the formation head, i.e. the ID of the morpheme on which the morpheme in this row depends
  - both the head and the dependent are part of the same word-formation process
- 8 – type of word-formation process
  - derivation
  - determinative compounds: cdet
    - also includes the particle verb formation process!
  - other compounds: ccop & cphras
  - member
    - for circumfixation
    - head: suffix
    - dependent: prefix
  - deep feature: @flat
    - if there is no clear head-dependent structure, this label must be added to the word-formation process label
  - the other WF processes (conv, trans, der) are annotated in column 10, since they involve transformations of the morpheme itself rather than its combination as a dependent with other morphemes
- 9 – (empty)
- 10 – pre- & post-operations
  - definitely needs to be corrected!
  - preop
    - an operation on the morpheme itself before it joins a WF process with other morpheme(s)
  - postop
    - an operation on the morpheme itself after it has joined a WF process with other morpheme(s)
  - both pre- and post-operations use the following labels: trans, conv, der
  - multiple operations can be assigned to one morpheme; they are separated by commas
    - the order is relevant!
  - if a morpheme has both pre- and post-operations, the two groups are separated by the pipe symbol “|”
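The column 10 conventions listed above (comma-separated operations, order preserved, pipe between the pre- and post-operation groups) can be sketched as a small parser. This is only an illustration, not part of the annotation tooling: the function name `parse_operations` is hypothetical, and the sketch assumes that post-operations are written as `postop=...` by analogy with the `preop=...` form used in this appendix. The second input string is an invented example, not taken from annotated data.

```python
def parse_operations(field: str) -> dict[str, list[str]]:
    """Split a column-10 value into ordered pre- and post-operation lists."""
    ops: dict[str, list[str]] = {"preop": [], "postop": []}
    if field in ("", "_"):  # "_" marks an empty column
        return ops
    for group in field.split("|"):  # the pipe separates preop from postop
        key, sep, value = group.partition("=")
        if not sep or key not in ops:
            raise ValueError(f"unexpected operation group: {group!r}")
        # Commas separate multiple operations; list order keeps their order.
        ops[key] = [label.strip() for label in value.split(",") if label.strip()]
    return ops


# Form used in this appendix.
print(parse_operations("preop=conv:N"))
# Hypothetical value with both groups (assumed "postop=" syntax).
print(parse_operations("preop=conv:N|postop=der:A,conv:N"))
```

Keeping the operations in a list rather than a set matters here, because the order of the labels is part of the annotation.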
The CoNLL-U annotation format is illustrated in the table below:
lemma = "Augenzeugenbericht"
1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: (empty) | 6: (empty) | 7: head ID | 8: type of WF process | 9: (empty) | 10: pre- & post-operations |
---|---|---|---|---|---|---|---|---|---|
1 | auge | Auge | N | _ | _ | 2 | cdet:N | _ | _ |
2 | zeuge | Zeuge | N | _ | _ | 3 | cdet:N | _ | _ |
3 | bericht | berichten | V | _ | _ | _ | _ | _ | preop=conv:N |
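To make the column layout concrete, the rows for "Augenzeugenbericht" can be read into simple records and the head links followed. This is a minimal sketch under assumptions: the class `Morpheme` and the function `parse_row` are hypothetical names, and the columns are assumed to be tab-separated as in standard CoNLL-U files (the pipes in the table are for display only).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Morpheme:
    id: int
    allomorph: str             # column 2
    lemma: str                 # column 3
    category: str              # column 4
    head: Optional[int]        # column 7; None when the column is "_"
    process: Optional[str]     # column 8
    operations: Optional[str]  # column 10


def _none_if_empty(value: str) -> Optional[str]:
    return None if value == "_" else value


def parse_row(line: str) -> Morpheme:
    cols = line.rstrip("\n").split("\t")  # assumption: tab-separated columns
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    return Morpheme(
        id=int(cols[0]),
        allomorph=cols[1],
        lemma=cols[2],
        category=cols[3],
        head=int(cols[6]) if cols[6] != "_" else None,
        process=_none_if_empty(cols[7]),
        operations=_none_if_empty(cols[9]),
    )


ROWS = [
    "1\tauge\tAuge\tN\t_\t_\t2\tcdet:N\t_\t_",
    "2\tzeuge\tZeuge\tN\t_\t_\t3\tcdet:N\t_\t_",
    "3\tbericht\tberichten\tV\t_\t_\t_\t_\t_\tpreop=conv:N",
]
morphemes = {m.id: m for m in map(parse_row, ROWS)}

for m in morphemes.values():
    if m.head is None:
        print(f"{m.lemma}: no head, operations={m.operations}")
    else:
        print(f"{m.lemma} --{m.process}--> {morphemes[m.head].lemma}")
```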
In PTB
In PTB, the annotations are fairly straightforward. They follow the bracket notation of syntactic derivation trees and therefore have root, branch, and terminal nodes. The morpheme lemmas of the word appear in the terminal nodes, whereas the word-formation processes (incl. pre-/post-operations) form the root and branch nodes.
This is illustrated in the diagram below, which corresponds to the following bracketed expression:
lemma = "Tathergang"
:small_blue_diamond: (N:cdet (N Tat) (N:der (V:cdet (VPART her) (V gehen))))
graph TD;
A{{N:cdet}}---B(N <br> Tat);
A---C{{N:der}};
C---D{{V:cdet}};
D---E(VPART <br> her);
D---F(V <br> gehen);
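The relation between the bracketed expression and its tree can also be shown programmatically. The following is a minimal sketch of a recursive-descent reader for such expressions; the names `Node`, `parse`, and `terminals` are hypothetical, and the sketch assumes well-formed input (balanced brackets, lemmas without spaces or parentheses).

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                   # e.g. "N:cdet" or "VPART"
    lemma: Optional[str] = None  # set on terminal nodes only
    children: List["Node"] = field(default_factory=list)


def parse(text: str) -> Node:
    """Read one bracketed expression into a tree of Node objects."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node() -> Node:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        node = Node(label=tokens[pos])
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read_node())
            else:
                node.lemma = tokens[pos]  # terminal: morpheme lemma
                pos += 1
        pos += 1  # consume ")"
        return node

    root = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return root


def terminals(node: Node) -> List[Tuple[str, Optional[str]]]:
    """Collect (category, lemma) pairs of the terminal nodes, left to right."""
    if not node.children:
        return [(node.label, node.lemma)]
    return [pair for child in node.children for pair in terminals(child)]


tree = parse("(N:cdet (N Tat) (N:der (V:cdet (VPART her) (V gehen))))")
print(tree.label)
print(terminals(tree))
```

For "Tathergang", the root and branch nodes carry the word-formation labels (N:cdet, N:der, V:cdet), and the terminal nodes carry the morpheme lemmas Tat, her, and gehen.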