Nominal Word Formation in the RUEG Corpus (German) -- Preliminary Annotation Guidelines

Please note that these guidelines are still updated regularly and represent work in progress, so they may be incomplete. A final version will be presented later and linked in this section.

The guidelines build on previous work of Lüdeling, Lukassek & Akbari (to appear) [LukassekAkbariLuedeling2023].

Authors

  • Nadine Zürn (Universität Mannheim)
  • Martin Klotz -- martin.klotz (at) hu-berlin (dot) de (Humboldt-Universität zu Berlin)
  • Stella Baumann (Universität Mannheim)

Annotation Purpose

The purpose of this annotation is to gain insight into the composition of morphologically complex nouns. The annotations identify transparent(1) word-formation (WF) processes and their constituents. Therefore, they contain four levels of information:

  1. each individual morpheme (lemma for free forms, form representation for bound morphemes) within the complex noun
  2. each word-formation process of the intermediate complex constituent(s) up to the complex lexeme
  3. the category of each word-formation process
  4. the order and hierarchy of all involved word-formation processes

(1) Even when taking a synchronic perspective on word formation, one must at times consider the diachronic perspective and break down the etymology of a word. The difficulty then is not to carry the analysis too far down the historical road. Considering only synchronically transparent formation processes and morphemes helps with this issue. Borderline cases are discussed and documented (see e.g. lemma = "Fischen").

Basic annotation principles

There has to be at least one input morpheme or already complex form. Each input constituent has a lexical category. There is a word-formation process applied to the input constituent(s). This process has a label and a single output of a certain category as well. The output of a word-formation process can then again be an input to a subsequent formation process. Available lexical categories of process inputs and outputs as well as a classification of formation processes are described below.

Here are three examples:

(1) Autounfall is of category noun and the outcome of the process determinative compounding, which is applied to the two inputs auto (noun) and unfall (noun)

(2) (das) Fahren is a noun that originates in the verbal root fahren via transposition (recategorization of a syntactic form)

(3) Spaziergang (noun) is a compound of spazieren (verb) and gang (noun). The latter morpheme results from the implicit derivation of gehen (verb)
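The principles above (categorized inputs, a labeled process with a single categorized output, and outputs that can feed later processes) can be sketched as a small recursive data structure. This is only an illustration: the class name `Node`, its fields, and the method `to_ptb` are assumptions made for this sketch, and the label cdet for the compound in example (3) is assumed as well. The output uses the bracket notation that the section on formation processes introduces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A morpheme (leaf) or the output of a word-formation process (hypothetical model)."""
    category: str                    # lexical category, e.g. "N", "V", "A"
    form: Optional[str] = None       # lemma or form representation; leaves only
    process: Optional[str] = None    # process label, e.g. "cdet", "der"; outputs only
    inputs: List["Node"] = field(default_factory=list)

    def to_ptb(self) -> str:
        # Leaves are written as (category form).
        if self.process is None:
            return f"({self.category} {self.form})"
        # Process outputs are written as (category:process child child ...).
        children = " ".join(child.to_ptb() for child in self.inputs)
        return f"({self.category}:{self.process} {children})"


# Example (3): "gang" results from the implicit derivation of "gehen" ...
gang = Node("N", process="der", inputs=[Node("V", form="gehen")])
# ... and then serves as an input to the compound (label "cdet" assumed here).
spaziergang = Node("N", process="cdet", inputs=[Node("V", form="spazieren"), gang])

print(spaziergang.to_ptb())  # (N:cdet (V spazieren) (N:der (V gehen)))
```

Modeling the output of one process as an ordinary input of the next keeps the order and hierarchy of processes explicit in the nesting.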

Lexical categories of morphemes and complex forms

  • "A" = adjective
  • "ADP" = adposition
  • "ADV" = adverb
  • "CIRCFX" = circumfix (left and right part)
  • "N" = noun
  • (A/N/V)PFX = (adjectival/nominal/verbal) prefix
  • "PRON" = pronoun
  • "PROPN" = proper noun
  • (A/N/V)SFX = (adjectival/nominal/verbal) suffix
  • "V" = verb
  • "VPART" = verb particle

Formation processes

The following section describes the word-formation processes we distinguish in our data. For the purpose of explanation, we use two ways of representing word-formation processes: a parentheses-based representation borrowed from the Penn Treebank format (ptb) and a table-style representation (CoNLL). Each representation has its own advantages and disadvantages, which are discussed below in the technical section.

The morpheme forms (lemma for free forms, form representation for bound morphemes) are represented along with their lexical category:

  • ptb (in a single bracket with a space in between): ({lexical_category} {morpheme_form})
  • CoNLL style (tab-separated): {morpheme_form} \t {lexical_category}

The word-formation process outcome is coded together with the process class using the following forms:

  • ptb: {category_of_outcome}:{process_class_label}
  • CoNLL style: {process_class_label}:{category_of_outcome}

The distinction between formats helps readability in the individual format.
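Since the two label orders mirror each other, converting between them is mechanical. A minimal sketch, assuming labels always contain exactly one category and one process class separated by a colon; the function names `ptb_to_conll` and `conll_to_ptb` are hypothetical and not part of any existing tooling:

```python
def ptb_to_conll(label: str) -> str:
    """Turn a ptb label such as 'N:cdet' into the CoNLL order 'cdet:N'."""
    category, process = label.split(":", 1)
    return f"{process}:{category}"


def conll_to_ptb(label: str) -> str:
    """Turn a CoNLL label such as 'cdet:N' into the ptb order 'N:cdet'."""
    process, category = label.split(":", 1)
    return f"{category}:{process}"


print(ptb_to_conll("N:cdet"))  # cdet:N
print(conll_to_ptb("conv:A"))  # A:conv
```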

Representing categories and process classes this way helps determine the stage of the word-formation processes further. Additionally, it helps create a uniform annotation scheme and facilitates consistency in the search queries later on. Scheme uniformity can lead to redundant representations of information, for instance regarding the formation of participles, comparatives and superlatives. Nonetheless, uniformity is the priority. It does not hold that any combination of categories and process labels is grammatically possible in German.

Note that simplex nouns are represented by just their category label and the respective morpheme (lemma for free forms, form representation for bound morphemes). Hence, they are recognizable by their lack of WF tag in the WF tree.



adopt

The label adopt is used for all WF processes within or between morphemes that are ambiguous or unknown. Ideally, this label does not exist after the manual correction of the automatic WF parsing.



conv

The label conv is used in the following case:

  • There is a change in the morphosyntactic category, e.g. from V to N.
  • The lemma occurs in its stem form, e.g. lemma "prallen", stem form "Prall".

conv:A

lemma = "schmuck"

:small_blue_diamond: (A:conv (N Schmuck))

graph TD;
  A{{A:conv}}---B(N Schmuck);
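
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | schmuck | Schmuck | N | _ | _ | _ | _ | _ | preop=conv:A |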

conv:N

lemma = "Zusammenprall"

:small_blue_diamond: (N:conv (V:cdet (VPART zusammen) (V prallen)))

graph TD;
  A{{N:conv}}---B{{V:cdet}};
  B---C(VPART <br> zusammen);
  B---D(V <br> prallen);
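
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | zusammen | zusammen | VPART | _ | _ | 2 | cdet:V | _ | _ |
| 2 | prall | prallen | V | _ | _ | _ | _ | _ | postop=conv:N |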

conv:V

lemma = "Fischen"

:small_blue_diamond: (N:trans (V:conv (N Fisch)))

graph TD;
  A{{N:trans}}---B{{V:conv}};
  B---C(N <br> Fisch);
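
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | fischen | Fisch | N | _ | _ | _ | _ | _ | preop=conv:V,trans:N |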


der


explicit derivation

The label der is used in cases when there is either affixation (explicit derivation) or a stem vowel change (implicit derivation).


der:A

lemma = "Gehässige"

:small_blue_diamond: (N:der (A:der (CIRCFX ge) (V hassen) (CIRCFX ig)))

graph TD;
  A{{N:der}}---B{{A:der}};
  B---C(CIRCFX <br> ge);
  B---D(V <br> hassen);
  B---E(CIRCFX <br> ig);
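
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ge | ge | CIRCFX | _ | _ | 3 | member | _ | _ |
| 2 | häss | hassen | V | _ | _ | 3 | der:A | _ | _ |
| 3 | ige | ig | CIRCFX | _ | _ | _ | _ | _ | postop=der:N |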

lemma = "schmerzhaft"

:small_blue_diamond: (A:der (N Schmerz) (ASFX haft))

graph TD;
  A{{A:der}}---B(N <br> Schmerz);
  A---C(ASFX <br> haft);
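
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | schmerz | Schmerz | N | _ | _ | 2 | der:A | _ | _ |
| 2 | haft | haft | ASFX | _ | _ | _ | _ | _ | _ |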

der:N

lemma = "Unterhaltung"

:small_blue_diamond: (N:der (V unterhalten) (NSFX ung))

graph TD;
  A{{N:der}}---B(V <br> unterhalten);
  A---C(NSFX <br> ung);
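
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | unterhalt | unterhalten | V | _ | _ | 2 | der:N | _ | _ |
| 2 | ung | ung | NSFX | _ | _ | _ | _ | _ | _ |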

lemma = "Gebremse"

:small_blue_diamond: (N:der (CIRCFX Ge) (V bremsen) (CIRCFX e))

graph TD;
  A{{N:der}}---B(CIRCFX <br> Ge);
  A---C(V <br> bremsen);
  A---D(CIRCFX <br> e);
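
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Ge | Ge | CIRCFX | _ | _ | 3 | member | _ | _ |
| 2 | brems | bremsen | V | _ | _ | 3 | der:N | _ | _ |
| 3 | e | e | CIRCFX | _ | _ | _ | _ | _ | _ |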

der:V

lemma = "Vorgang"

:small_blue_diamond: (N:der (V:der (VPFX vor) (V gehen)))

graph TD;
  A{{N:der}}---B{{V:der}};
  B---C(VPFX <br> vor);
  B---D(V <br> gehen);
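
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | vor | vor | VPFX | _ | _ | 2 | der:V | _ | _ |
| 2 | gang | gehen | V | _ | _ | _ | _ | _ | postop=der:N |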


implicit derivation

The label der is used implicitly as part of a pre- or post-operation (in CoNLL-U) if the noun is derived from a verb by changing the stem vowel of the stem form. See also section "participle: present & past".

lemma = "Kurzschlussreaktion"

:small_blue_diamond: (N:cdet (N:der (V:cdet (A kurz) (V schließen))) (N:der (V:der (VPFX re) (V agieren)) (NSFX ion)))

graph TD;
  A{{N:cdet}}---B{{N:der}};
  B---C{{V:cdet}};
  C---D(A <br> kurz);
  C---E(V <br> schließen);
  A---F{{N:der}};
  F---G{{V:der}};
  G---H(VPFX <br> re);
  G---I(V <br> agieren);
  F---J(NSFX <br> ion);
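
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kurz | kurz | A | _ | _ | 2 | cdet:V | _ | _ |
| 2 | schluss | schließen | V | _ | _ | 5 | cdet:N | _ | postop=der:N |
| 3 | re | re | VPFX | _ | _ | 4 | der:V | _ | _ |
| 4 | akt | agieren | V | _ | _ | 5 | der:N | _ | _ |
| 5 | ion | ion | NSFX | _ | _ | _ | _ | _ | _ |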

lemma = "Tränken"

:small_blue_diamond: (N:trans (V:der (V trinken)))

graph TD;
  A{{N:trans}}---B{{V:der}};
  B---E(V <br> trinken);
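
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | tränken | trinken | V | _ | _ | _ | _ | _ | preop=der:V,trans:N |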


@flat

This label is used for compounds in case it is unclear which morpheme is the head and which is the dependent. This is particularly useful for compounds that are comprised of more than two morphemes. The right-most morpheme functions as the head of the compound since in German, compounds are usually right-headed constructions.

In the ptb-format, all ambiguous morphemes are simply placed within the same bracket as the phrasal head.

lemma = "Windschutzscheibe"

:small_blue_diamond: (N:cdet (N Wind) (N:der (V schützen)) (N Scheibe))

graph TD;
  A{{N:cdet}}---B(N <br> Wind);
  A---C{{N:der}}
  C---D(V <br> schützen);
  A---E(N <br> Scheibe);
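
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | wind | Wind | N | _ | _ | 3 | cdet:N@flat | _ | _ |
| 2 | schutz | Schutz | N | _ | _ | 3 | cdet:N@flat | _ | _ |
| 3 | scheibe | Scheibe | N | _ | _ | _ | _ | _ | _ |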

Here, it is not entirely clear whether “Wind + Schutz” or “Schutz + Scheibe” first enter the compounding process. Hence, @flat is attached to cdet:N for both “Wind” and “Schutz”. The head is chosen to be “Scheibe”.
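When processing column 8 automatically, the @flat feature has to be separated from the process label and its category. A minimal sketch, assuming that a label consists of a process, an optional category after a colon, and an optional feature after "@"; the names `Relation` and `parse_relation` are hypothetical:

```python
from typing import NamedTuple


class Relation(NamedTuple):
    process: str    # e.g. "cdet", "der", "member"
    category: str   # e.g. "N"; empty string if the label has no category
    flat: bool      # True if the label carries the @flat feature


def parse_relation(label: str) -> Relation:
    """Split a column-8 label such as 'cdet:N@flat' into its parts."""
    core, _, feature = label.partition("@")
    process, _, category = core.partition(":")
    return Relation(process, category, feature == "flat")


print(parse_relation("cdet:N@flat"))  # Relation(process='cdet', category='N', flat=True)
print(parse_relation("der:A"))        # Relation(process='der', category='A', flat=False)
```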



cdet

The relation label cdet is used for all kinds of determinative compounds, as well as particle verbs.

The label has the following specific use cases (see Falko Guidelines, p. 3):

  • the word is formed out of a head infinitive and verb phrase components in a non-head position :arrow_right: A & N subrelations are possible
  • the word form is a compound in which the head governs the non-head :arrow_right: all subrelations are possible
  • the word form is a compound where the non-head is a numeral :arrow_right: all subrelations are possible
  • the word form is deverbal and preceded by a preposition and there is no verb with the preposition as a prefix :arrow_right: A & N subrelations are possible

cdet:A

lemma = "Merkwürdiges"

:small_blue_diamond: (N:trans (A:cdet (V merken) (A:der (N Würde) (ASFX ig))))

graph TD;
  A{{N:trans}}---B{{A:cdet}};
  B---C(V <br> merken);
  B---D{{A:der}};
  D---E(N <br> Würde);
  D---F(ASFX <br> ig);
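
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | merk | merken | V | _ | _ | 3 | cdet:A | _ | _ |
| 2 | würd | Würde | N | _ | _ | 3 | der:A | _ | _ |
| 3 | iges | ig | ASFX | _ | _ | _ | _ | _ | postop=trans:N |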

cdet:N

lemma = "Augenzeugenbericht"

:small_blue_diamond: (N:cdet (N:cdet (N Auge) (N Zeuge)) (N:conv (V berichten)))

graph TD;
  A{{N:cdet}}---B{{N:cdet}};
  A---C{{N:conv}};
  C---D(V <br> berichten);
  B---E(N <br> Auge);
  B---F(N <br> Zeuge);
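
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: (empty) | 6: (empty) | 7: head ID | 8: type of WF process | 9: (empty) | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | auge | Auge | N | _ | _ | 2 | cdet:N | _ | _ |
| 2 | zeuge | Zeuge | N | _ | _ | 3 | cdet:N | _ | _ |
| 3 | bericht | berichten | V | _ | _ | _ | _ | _ | preop=conv:N |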

cdet:V

The label cdet:V is used both for verbal compounds in the more general sense (see tables 1 and 2) and for particle verbs, which are closer to phrases than to compounds due to their syntactic mobility (see table 3).

lemma = "Stillstand"

:small_blue_diamond: (N:der (V:cdet (A still) (V stehen)))

graph TD;
  A{{N:der}}---B{{V:cdet}};
  B---C(A <br> still);
  B---D(V <br> stehen);
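
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | still | still | A | _ | _ | 2 | cdet:V | _ | _ |
| 2 | stand | stehen | V | _ | _ | _ | _ | _ | postop=der:N |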

lemma = "Spazierengehen"

:small_blue_diamond: (N:trans (V:cdet (V spazieren) (V gehen)))

graph TD;
  A{{N:trans}}---B{{V:cdet}};
  B---C(V <br> spazieren);
  B---D(V <br> gehen);
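
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | spazieren | spazieren | V | _ | _ | 2 | cdet:V | _ | _ |
| 2 | gehen | gehen | V | _ | _ | _ | _ | _ | postop=trans:N |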

lemma = "Zusammenstoß"

:small_blue_diamond: (N:conv (V:cdet (VPART zusammen) (V stoßen)))

graph TD;
  A{{N:conv}}---B{{V:cdet}};
  B---C(VPART <br> zusammen);
  B---D(V <br> stoßen);
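
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | zusammen | zusammen | VPART | _ | _ | 2 | cdet:V | _ | _ |
| 2 | stoß | stoßen | V | _ | _ | _ | _ | _ | postop=conv:N |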


ccop

The label ccop is used for copular compounds, which consist of two semantically and hierarchically equal free morphemes.

lemma = "schwarzweiß"

:small_blue_diamond: (A:ccop (A schwarz) (A weiß))

graph TD;
  A{{A:ccop}}---B(A <br> schwarz);
  A---C(A <br> weiß);
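
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | schwarz | schwarz | A | _ | _ | 2 | ccop:A | _ | _ |
| 2 | weiß | weiß | A | _ | _ | _ | _ | _ | _ |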


cphras

The label cphras is used for phrasal compounds (PCs), which have a lexical head (right-most constituent) and a phrasal non-head (left constituent).

The head of the left constituent receives the label cphras with the corresponding subrelation (A, N, or V).

The elements within the phrasal non-head are annotated in a flat manner with the label member. The head of the left constituent is the phrasal head, and not strictly the right-most element. This helps us distinguish which type of PC we are dealing with.

  • For instance, for PCs with a verb phrase as the left constituent, the verb heading the verb phrase is selected as the head of the left constituent (as seen in the example below, line 1). If the left constituent consists of a prepositional phrase, the preposition forms the head of the left constituent (e.g. "Zwischen-den-Mahlzeiten-Imbisse", Lawrenz 1996).
  • All other elements within the left constituents are attached to the phrasal head with the label member within the CoNLL-U format.

In the ptb-format, they are simply placed within the same bracket as the phrasal head (see section "member").

lemma = "komm-wie-du-bist-Hochzeit"

:small_blue_diamond: (N:cphras (V kommen) (PRON wie) (PRON du) (V sein) (N:cdet (A hoch) (N Zeit)))

graph TD
    A{{N:cphras}} --- B(V <br> kommen)
    A --- C(PRON <br> wie)
    A --- D(PRON <br> du)
    A --- E(V <br> sein)
    A --- F{{N:cdet}}
    F --- G(A <br> hoch)
    F --- H(N <br> Zeit)
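
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | komm | kommen | V | _ | _ | 6 | cphras:N | _ | _ |
| 2 | wie | wie | PRON | _ | _ | 1 | member | _ | _ |
| 3 | du | du | PRON | _ | _ | 1 | member | _ | _ |
| 4 | bist | sein | V | _ | _ | 1 | member | _ | _ |
| 5 | hoch | hoch | A | _ | _ | 6 | cdet:N | _ | _ |
| 6 | zeit | Zeit | N | _ | _ | _ | _ | _ | _ |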


member

In the CoNLL-U format, the label member is used to connect constituents that perform the same function within a word-formation process, e.g. circumfixes in derivation. It has two use cases:

  • derivation
    • Here, the label follows the right-hand head principle. In other words, the right-most member of the circumfixation is the head, whereas the members to its left carry the relation label member
  • phrasal compounds (see section "cphras" for reference)
    • Here, the label is used for each constituent that is dependent on the phrasal head (the left constituent of the PC)

In the ptb-format, all member-elements are simply placed within the same bracket as their head.

lemma = "Gebremse"

:small_blue_diamond: (N:der (CIRCFX Ge) (V bremsen) (CIRCFX e))

graph TD;
  A{{N:der}}---B(CIRCFX <br> Ge);
  A---C(V <br> bremsen);
  A---D(CIRCFX <br> e);
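
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Ge | Ge | CIRCFX | _ | _ | 3 | member | _ | _ |
| 2 | brems | bremsen | V | _ | _ | 3 | der:N | _ | _ |
| 3 | e | e | CIRCFX | _ | _ | _ | _ | _ | _ |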
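
The right-hand head principle for circumfixes lends itself to an automatic consistency check. The following is a minimal sketch under the assumption that rows are reduced to ID, category, head ID and column-8 label; the type alias `Row` and the function `check_circumfix_members` are hypothetical. Phrasal-compound members are deliberately ignored, since their head need not be to the right.

```python
from typing import List, Optional, Tuple

# Simplified row for illustration: (ID, category, head ID or None, column-8 label)
Row = Tuple[int, str, Optional[int], str]


def check_circumfix_members(rows: List[Row]) -> List[str]:
    """Report CIRCFX rows labelled 'member' that do not point to a CIRCFX row to their right."""
    by_id = {row[0]: row for row in rows}
    problems = []
    for row_id, category, head, label in rows:
        if category != "CIRCFX" or label != "member":
            continue  # only circumfix members are checked here
        target = by_id.get(head)
        if target is None or target[1] != "CIRCFX" or target[0] <= row_id:
            problems.append(f"row {row_id}: 'member' should point to the right-hand circumfix part")
    return problems


gebremse: List[Row] = [
    (1, "CIRCFX", 3, "member"),
    (2, "V", 3, "der:N"),
    (3, "CIRCFX", None, "_"),
]
print(check_circumfix_members(gebremse))  # []
```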


mov

The label mov refers to derivational processes differentiating the grammatical gender (cf. Movierung).

lemma = "Fahrerin"

:small_blue_diamond: (N:mov (N:der (V fahren) (NSFX er)) (NSFX in))

graph TD;
  A{{N:mov}}---B{{N:der}};
  B---C(V <br> fahren);
  B---D(NSFX <br> er);
  A---E(NSFX <br> in);
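
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | fahr | fahren | V | _ | _ | 2 | der:N | _ | _ |
| 2 | er | er | NSFX | _ | _ | 3 | mov:N | _ | _ |
| 3 | in | in | NSFX | _ | _ | _ | _ | _ | _ |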


trans

The label trans is used when there is a morphosyntactic reclassification without a semantic reclassification. A further criterion is that the morpheme form is not changed, i.e. trans is used neither for conversions into the stem form (see section "conv") nor for affixations or stem vowel changes (see section "der").

In contrast to the Falko guidelines, trans is also used if the noun resembles an infinitive but, unlike actual transpositions, has masculine grammatical gender.

  • Note: this can be systematically checked by searching for all masculine nouns that are part of the case annotations of P5/P11
    • search query: canon:Gender="Masc" _o_ pos_lang=/N./ _o_ lemma

trans:A

lemma = "feind"

:small_blue_diamond: (A:trans (N Feind))

graph TD;
  A{{A:trans}}---B(N <br> Feind);
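
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | feind | Feind | N | _ | _ | _ | _ | _ | preop=trans:A |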

trans:N

lemma = "Überqueren"

:small_blue_diamond: (N:trans (V:der (VPFX über) (V queren)))

graph TD;
  A{{N:trans}}---B{{V:der}};
  B---C(VPFX <br> über);
  B---D(V <br> queren);
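
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | über | über | VPFX | _ | _ | 2 | der:V | _ | _ |
| 2 | queren | queren | V | _ | _ | _ | _ | _ | postop=trans:N |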

trans:V

(to do or to delete)



between inflection and derivation

We treat forms where it is unclear if we are dealing with an inflectional or derivational process as processes of word-formation separate from derivation.

adjectives: comparative & superlative

Here, we use the labels comp for the comparative and sup for the superlative form.

lemma = "Liebste"

:small_blue_diamond: (N:trans (A:sup (A lieb)))

graph TD;
  A{{N:trans}}---B{{A:sup}};
  B---C(A <br> lieb);
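
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | liebste | lieb | A | _ | _ | _ | _ | _ | preop=sup:A,trans:N |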

participle: present & past

Here, we use the labels PPres for the present participle and PPast for the past participle.

lemma = "Fahrende"

:small_blue_diamond: (N:trans (A:trans (V:PPres (V fahren))))

graph TD;
  A{{N:trans}}---B{{A:trans}};
  B---C{{V:PPres}};
  C---D(V <br> fahren);
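
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | fahren | fahren | V | _ | _ | _ | _ | _ | preop=PPres:V,trans:A,trans:N |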

lemma = "Unfallverursachende"

:small_blue_diamond: (N:trans (A:cdet (N Unfall) (A:trans (V:PPres (V verursachen)))))

graph TD;
  A{{N:trans}}---B{{A:cdet}};
  B---E(N <br> Unfall);
  B---C{{A:trans}};
  C---D{{V:PPres}};
  D---F(V <br> verursachen);
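
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | unfall | Unfall | N | _ | _ | 2 | cdet:A | _ | _ |
| 2 | verursachende | verursachen | V | _ | _ | _ | _ | _ | preop=PPres:V,trans:A\|postop=trans:N |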

lemma = "Reingefahrene"

:small_blue_diamond: (N:trans (A:trans (V:PPast (V:cdet (VPART rein) (V fahren)))))

graph TD;
  A{{N:trans}}---B{{A:trans}};
  B---C{{V:PPast}};
  C---D{{V:cdet}};
  D---E(VPART <br> rein);
  D---F(V <br> fahren);
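
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | rein | rein | VPART | _ | _ | 2 | cdet:V | _ | _ |
| 2 | gefahrene | fahren | V | _ | _ | _ | _ | _ | postop=PPast:V,trans:A,trans:N |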


Special Cases

Here are some cases where we have to pay attention when annotating.


diminutives (work in progress, to be discussed)

Diminutives are a type of derivation, see grammis: Systematische Grammatik/.../Die Transposition:

The counterpart to transposition is determination: derivatives such as Kindchen, gelblich and hypergemein involve determination. For example, the suffix -chen specifies the base Kind semantically more closely as 'a child, namely a particularly small, lovable, cute one'.

The stem vowel may change in this derivational process. This is indicated in columns 2 and 3.

lemma = "Pünktchen"

:small_blue_diamond: (N:der (N Punkt) (SFX chen))

graph TD;
  A{{N:der}}---B(N <br> Punkt);
  A---C(SFX <br> chen);
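
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | pünkt | Punkt | N | _ | _ | 2 | der:N | _ | _ |
| 2 | chen | chen | SFX | _ | _ | _ | _ | _ | _ |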

non-native WF (work in progress, to be discussed)

To the best of our ability, we annotate this with the label set of native WF and by using the German orthography instead of the non-native one. See, for instance, the deverbal noun "Reaktion": We use the verb lemma "agieren" instead of "agere (āctum)".

lemma = "Reaktion"

:small_blue_diamond: (N:der (V:der (VPFX re) (V agieren)) (NSFX ion))

graph TD;
  A{{N:der}}---B{{V:der}};
  B---C(VPFX <br> re);
  B---D(V <br> agieren);
  A---E(NSFX <br> ion);
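
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | re | re | VPFX | _ | _ | 2 | der:V | _ | _ |
| 2 | akt | agieren | V | _ | _ | 3 | der:N | _ | _ |
| 3 | ion | ion | NSFX | _ | _ | _ | _ | _ | _ |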

prefixed deverbal noun, yet not a prefix verb

lemma = "Rücksicht"

:small_blue_diamond: (N:cdet (PFX rück) (N:der (V sehen)))

graph TD;
  A{{N:cdet}}---B(PFX <br> rück);
  A---C{{N:der}};
  C---D(V <br> sehen);
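
| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: empty | 6: empty | 7: head ID | 8: type of WF process | 9: empty | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | rück | zurück | PFX | _ | _ | 2 | cdet:N | _ | _ |
| 2 | sicht | sehen | V | _ | _ | _ | _ | _ | preop=der:N |
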
  • Digitales Wörterbuch der deutschen Sprache
    • helpful for looking up the morphological composition of words
      • contains a hyperlink to each morphological component, reports the semantic(s) of the word, etc.
    • tied to the etymological dictionary of German (see link below)
  • Etymologisches Wörterbuch des Deutschen
    • helpful for conducting etymological research of word forms
      • e.g. concerning morphological composition, function, origin

Appendix T: Technical details

Annotation Layers

In CoNLL-U

The following describes the function of the 10 columns in the CoNLL-U format specifically for our annotation format.

  • 1 – ID
  • 2 – allomorph
    • this refers to the form of the morpheme
    • automatically derived and needs correction
  • 3 – morpheme
    • this refers to the lemma of the morpheme
    • automatically derived from SMOR, usually no correction necessary
  • 4 – category of the morphemes
  • 5 – category of the formation of all joint dependents (empty)
    • this is only for the automatic part of the annotation; we can disregard this column in the manual annotation process
  • 6 – features (empty)
  • 7 – head
    • ID of the formation head, i.e. the ID of the morpheme on which the morpheme of this row depends
    • both the head and the dependent are part of the same word-formation process
  • 8 – the type of word-formation process
    • derivation
    • determinative compounds (cdet also includes the particle verb formation process!)
    • other compounds
      • ccop & cphras
    • member
      • for circumfixation
        • head: suffix
        • dependent: prefix
    • deep feature @flat
      • if there is no clear head-dependent structure, this label needs to be added to the word-formation process label
    • the other WF processes (conv, trans, der) are annotated in column 10 since they involve transformations of the morpheme itself and not its composition as a dependent with other morphemes
  • 9 – (empty)
  • 10 – pre- & post-operations
    • definitely needs to be corrected!
    • preop
      • an operation on the morpheme itself before it joins a WF process with other morpheme(s)
    • postop
      • operations on the morpheme itself after it joined a WF process with other morpheme(s)
    • both pre- and post-operations include the following labels:
      • trans, conv, der
    • multiple operations can be listed; they are separated by commas
      • the order is relevant!
    • if there exist both pre- and post-operations for the same morpheme, they are separated with the pipe-symbol “|”
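
The syntax of column 10 described above can be read mechanically. A minimal sketch, assuming that "_" marks an empty column and that only the keys preop and postop occur; the function name `parse_operations` is hypothetical. Labels are not validated, since the examples in these guidelines also use labels such as PPres and sup in this column.

```python
from typing import Dict, List


def parse_operations(column10: str) -> Dict[str, List[str]]:
    """Parse column 10, e.g. 'preop=PPres:V,trans:A|postop=trans:N'."""
    result: Dict[str, List[str]] = {"preop": [], "postop": []}
    if column10 in ("", "_"):
        return result  # no operations annotated
    # Pre- and post-operations are separated by the pipe symbol.
    for part in column10.split("|"):
        key, _, value = part.partition("=")
        if key not in result:
            raise ValueError(f"unexpected operation type: {key!r}")
        # The order of the comma-separated operations is relevant, so it is kept.
        result[key] = value.split(",")
    return result


print(parse_operations("preop=PPres:V,trans:A|postop=trans:N"))
# {'preop': ['PPres:V', 'trans:A'], 'postop': ['trans:N']}
```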

The CoNLL-U annotation format is illustrated in the table below:

lemma = "Augenzeugenbericht"

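| 1: ID | 2: allomorph form | 3: morpheme lemma | 4: morpheme category | 5: (empty) | 6: (empty) | 7: head ID | 8: type of WF process | 9: (empty) | 10: pre- & post-operations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | auge | Auge | N | _ | _ | 2 | cdet:N | _ | _ |
| 2 | zeuge | Zeuge | N | _ | _ | 3 | cdet:N | _ | _ |
| 3 | bericht | berichten | V | _ | _ | _ | _ | _ | preop=conv:N |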
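
To illustrate how such rows can be read and checked, here is a minimal sketch. It assumes tab-separated rows with exactly ten columns and "_" for empty cells; the class `Morpheme` and the functions `parse_rows` and `check_heads` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Morpheme:
    id: int
    form: str             # column 2: allomorph form
    lemma: str            # column 3: morpheme lemma
    category: str         # column 4: morpheme category
    head: Optional[int]   # column 7: head ID, None when the cell is "_"
    relation: str         # column 8: type of WF process
    operations: str       # column 10: pre- & post-operations


def parse_rows(text: str) -> List[Morpheme]:
    rows = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        head = None if cols[6] == "_" else int(cols[6])
        rows.append(Morpheme(int(cols[0]), cols[1], cols[2], cols[3], head, cols[7], cols[9]))
    return rows


def check_heads(rows: List[Morpheme]) -> List[str]:
    """Return a list of problems: self-referencing or missing head IDs."""
    ids = {row.id for row in rows}
    problems = []
    for row in rows:
        if row.head is None:
            continue
        if row.head == row.id:
            problems.append(f"row {row.id} is its own head")
        elif row.head not in ids:
            problems.append(f"row {row.id} points to missing head {row.head}")
    return problems


EXAMPLE = (
    "1\tauge\tAuge\tN\t_\t_\t2\tcdet:N\t_\t_\n"
    "2\tzeuge\tZeuge\tN\t_\t_\t3\tcdet:N\t_\t_\n"
    "3\tbericht\tberichten\tV\t_\t_\t_\t_\t_\tpreop=conv:N\n"
)
rows = parse_rows(EXAMPLE)
print(check_heads(rows))  # []
```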

In PTB

In ptb, the annotations are fairly straightforward. They follow the bracket system of syntactic derivation trees. As such, they have root, branch, and terminal nodes. The morpheme lemmas within the word are portrayed within the terminal nodes, whereas the word-formation processes (incl. pre-/post-operations) constitute the root and branch nodes.

This is illustrated in the diagram below, which corresponds to the following bracketed expression:

lemma = "Tathergang"

:small_blue_diamond: (N:cdet (N Tat) (N:der (V:cdet (VPART her) (V gehen))))

graph TD;
  A{{N:cdet}}---B(N <br> Tat);
  A---C{{N:der}};
  C---D{{V:cdet}};
  D---E(VPART <br> her);
  D---F(V <br> gehen);
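
Bracketed expressions of this kind are easy to mistype, in particular with respect to closing brackets. The following minimal sketch parses an expression into nested lists and rejects unbalanced input. It assumes that labels and morpheme forms contain no whitespace or brackets; the names `parse_ptb` and `leaves` are hypothetical.

```python
import re
from typing import List, Union

Tree = List[Union[str, "Tree"]]  # [label, child, ...] or [category, form]


def parse_ptb(text: str) -> Tree:
    """Parse a bracketed WF expression into nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node() -> Tree:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node: Tree = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing bracket
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the closing bracket")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Collect the morpheme forms of the terminal nodes, left to right."""
    children = tree[1:]
    if all(isinstance(child, str) for child in children):
        return [" ".join(children)]
    return [form for child in children if isinstance(child, list) for form in leaves(child)]


tree = parse_ptb("(N:cdet (N Tat) (N:der (V:cdet (VPART her) (V gehen))))")
print(leaves(tree))  # ['Tat', 'her', 'gehen']
```

Calling `parse_ptb` on an expression with a missing closing bracket raises a `ValueError`, which makes the sketch usable as a quick sanity check during manual correction.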

Bibliography

[LukassekAkbariLuedeling2023] Anke Lüdeling, Julia Lukassek, Roodabeh Akbari. Richtlinie zur morphologischen Annotation von Nomina in Falko. To appear.
