German Normalisation

(in German)

Basics

The general normalisation guidelines of the RUEG corpus apply: Step 2: Normalisation

The following repeats a few of these principles and adds language-specific decisions.

  • orthographic normalisation
  • no normalisation in the syntax
  • no grammatical normalisation
  • discontinuities and pauses are erased and receive an empty event (on the language level, the value for discontinuities and pauses is also erased -> empty event), EXCEPT word-internal discontinuities (dipl: vorbeige$le$ rollt, norm: vorbeigerollt)
  • repetitions remain
  • spoken: non-verbal material, such as [laughing], is not transferred -> empty event
  • punctuation marks are not included

normalisation of pronunciation phenomena

  • this also applies to the written files (change 7.10.2019)

reductions of determiners, adjectives and nouns are not normalised with respect to case and gender information*:

dipl                norm
schön guten tag     schön guten Tag
mit ein hund        mit ein Hund
so ein klein hund   so ein klein Hund

reductions and slips of the tongue in verbs, nouns etc. that are not related to case and gender marking are normalised:

dipl norm
is ist
Umfall Unfall
gesprung gesprungen

short forms of indefinite articles are normalised as:

dipl       norm
n          ein, einen
nen        ein, einen
ne         eine
eim, nem   einem
  • normalise 'nen' as 'ein' in cases of nominative masculine and nominative/accusative neuter, e.g.:

    • dipl: "da is nen auto um die ecke gebogen". norm: "da ist ein auto..."
    • dipl: "ich hab nen auto gesehen". norm: "ich habe ein auto..."
    • BUT not in, e.g., dipl: "nen ne vollbremsung". norm: "einen eine vollbremsung" ('nen' stays 'einen')
  • so=n either as "so ein" or as "so einen", depending on the context:

    • dipl: "so=n typ hat mitm ball...". norm: "so ein Typ hat mitm Ball"
    • dipl: "die frau hat so=n hund dabei". norm: "die Frau hat so einen Hund dabei"

According to the Duden and scholarly literature (e.g. Vogel 2006, Schäfer & Sayatz 2014), 'n' and 'nen' can each stand for both 'ein' and 'einen'. Normalisation follows the principle of minimal deviation from the standardly expected form.
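
The decision procedure for short indefinite articles could be sketched roughly as follows. This is only an illustration, not a tool used in the project: the function name, the gender and case labels, and the fallback to 'einen' for all remaining contexts are assumptions, and gender and case have to be judged by the annotator from the context.

```python
# Illustrative sketch only; all names here are hypothetical, not part of the guidelines.
UNAMBIGUOUS = {"ne": "eine", "eim": "einem", "nem": "einem"}
AMBIGUOUS = {"n", "nen"}  # each may stand for 'ein' or 'einen'


def normalise_article(dipl: str, gender: str, case: str) -> str:
    """Map a short indefinite article to its norm form.

    gender ('masc', 'fem', 'neut') and case ('nom', 'acc', 'dat') describe the
    standardly expected form and are supplied by the annotator.
    """
    # 'so=n' is treated as 'so' plus the short article 'n'.
    if dipl.startswith("so="):
        return "so " + normalise_article(dipl[3:], gender, case)
    if dipl in UNAMBIGUOUS:
        return UNAMBIGUOUS[dipl]
    if dipl in AMBIGUOUS:
        # 'ein' for nominative masculine and nominative/accusative neuter.
        if (gender == "masc" and case == "nom") or (
            gender == "neut" and case in ("nom", "acc")
        ):
            return "ein"
        # Assumption: every other context falls back to 'einen'.
        return "einen"
    return dipl  # not a short article: leave unchanged
```

For example, normalise_article("nen", "neut", "acc") yields "ein" (as in "ich hab nen auto gesehen"), while normalise_article("so=n", "masc", "acc") yields "so einen".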

hesitation markers / filled pauses

  • all hesitation markers are normalised as „äh“; these include äh, ähm, öh, hm etc.
dipl                   norm
äh, öh, ähm, hm etc.   äh
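
As a minimal sketch, this rule amounts to a lookup against a closed set of markers. The set below contains only the four forms named above; the source list ends in "etc.", so it would need extending, and the function name is hypothetical.

```python
# Only the markers named in the guideline; the list there is open-ended ("etc.").
HESITATION_MARKERS = {"äh", "ähm", "öh", "hm"}


def normalise_hesitation(token: str) -> str:
    """Return 'äh' for any known hesitation marker, otherwise the token unchanged."""
    return "äh" if token.lower() in HESITATION_MARKERS else token
```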

no lexical changes

  • when meaning is clearly constant, determine and document a standard, such as:
dipl                                        norm
aufgrund, auf Grund                         aufgrund
andren, anderen                             anderen
bro, brother                                brother (lang=eng)
Dicker, Digger                              Dicker (29.05.2019)
etwas, was                                  etwas
grad, grade, gerade                         gerade
gern, gerne                                 gerne
habe, hab                                   habe
hey, hi, hei (as greeting, not as outcry)   hi
langlaufen                                  entlanglaufen
mache, mach (imperative)                    mach
nichts, nix                                 nichts
noch mal, nochmal                           nochmal (28.05.2019)
rumspielen                                  herumspielen
rum                                         herum
runterfallen                                herunterfallen
sodass, so dass (when conjunction)          sodass
vorn, vorne                                 vorne
  • when a change of meaning is possible or when the context is restricted, leave the lexemes as they are; the variation remains:
    • daran, dran
    • darin, drin, drinnen
    • drauf, darauf
    • sone (as in "sone autos", so only for plural nouns), solche
    • reinfahren, hereinfahren, hineinfahren
    • auffahren, rauffahren, drauffahren
    • reinpacken, einpacken, hineinpacken
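
The two lists above could be kept as a documented lookup, sketched here under some assumptions. Only context-independent pairs from the table are included; entries that need an annotator's judgement (e.g. 'was' for 'etwas', 'hey' as greeting, 'so dass' as conjunction, imperative 'mach') are deliberately left out. The names STANDARD_FORMS, KEEP_AS_IS and normalise_lexeme are hypothetical.

```python
# Variants whose meaning is clearly constant: dipl form -> documented standard.
STANDARD_FORMS = {
    "auf Grund": "aufgrund",
    "andren": "anderen",
    "Digger": "Dicker",
    "grad": "gerade",
    "grade": "gerade",
    "gern": "gerne",
    "hab": "habe",
    "langlaufen": "entlanglaufen",
    "nix": "nichts",
    "noch mal": "nochmal",
    "rumspielen": "herumspielen",
    "rum": "herum",
    "runterfallen": "herunterfallen",
    "vorn": "vorne",
}

# Variants that may differ in meaning: these must never receive a mapping.
KEEP_AS_IS = {"daran", "dran", "darin", "drin", "drinnen", "drauf", "darauf"}

# Guard against accidentally standardising a protected variant.
assert KEEP_AS_IS.isdisjoint(STANDARD_FORMS)


def normalise_lexeme(token: str) -> str:
    """Return the documented standard form, or the token itself if none exists."""
    return STANDARD_FORMS.get(token, token)
```

Keeping the protected variants in their own set makes the second rule explicit: a later edit that adds, say, "dran" to the mapping fails immediately instead of silently changing the annotation.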

Foreign Language Material (FM) and translingual elements

  • FM with German inflection, e.g.:
dipl norm lang
gecrasht gecrasht eng/deu
  • material that is included in the Duden, such as sorry, Van etc., is marked as deu on the language level. The online Duden serves as the reference; the date on which it was consulted must be documented (list FM).

numbers...

  • up to twelve: spelled out
  • from 13 onwards: numerals
  • in the written texts, keep the variation the subject chose
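
For the spoken data, the threshold could be sketched as below. The function name is hypothetical, and the choice of 'null' and 'eins' as the spelled-out forms for 0 and 1 is an assumption; the guideline only fixes the boundary between twelve and 13. Written texts are not covered, since they keep the subject's own variant.

```python
# Spelled-out forms for 0..12; 'null' and 'eins' are assumed here.
NUMBER_WORDS = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs",
    "sieben", "acht", "neun", "zehn", "elf", "zwölf",
]


def normalise_number(value: int) -> str:
    """Spell out numbers up to twelve; use numerals from 13 onwards."""
    if 0 <= value <= 12:
        return NUMBER_WORDS[value]
    return str(value)
```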

individual choices

dipl norm
pekawe PKW
ef sechzehn F16
  • gender gap
    • dipl: Fußgänger innen; norm: Fußgänger_innen

Language Values

dipl norm
deu deutsch
eng englisch
ara arabisch
tur türkisch
spa spanisch

written texts

  • include CU level

  • the conjunction 'dass', if spelled 'das', is normalised to 'dass'

  • punctuation marks:

    • do not add any and do not correct any, except where a space is missing:
    dipl             norm
    eingepackt.auf   eingepackt / . / auf
    • several consecutive punctuation marks without spaces in between, e.g. three dots (...), are left in one event:
    dipl   norm
    ...    / ... /
    • if there are spaces in between, these are kept as well, so that each mark receives its own event:
    dipl    norm
    . . .   / . / . / . /
  • Emojis

    • include Emojis such as :) on norm
  • abbreviations/acronyms

    • conventionalised abbreviations are left on norm
    • unconventionalised abbreviations or acronyms are spelled out, e.g. dipl: kp norm: kein | Plan
    • acronyms that are also "action words" (e.g., lol) are left this way on norm
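
The punctuation rules for written texts given above can be approximated with a single regular expression. This is a rough sketch under stated assumptions, not the project's tooling: the function name is hypothetical, and the pattern would also split conventionalised abbreviations that contain dots, so those would need separate handling.

```python
import re

# A run of word characters, or a run of adjacent punctuation characters.
EVENT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def split_events(dipl: str) -> list[str]:
    """Split a written dipl string into norm events.

    Punctuation glued to a word becomes its own event, adjacent punctuation
    marks stay together, and marks separated by spaces stay separate.
    """
    return EVENT_PATTERN.findall(dipl)
```

With this pattern, "eingepackt.auf" yields three events, "..." stays a single event, and ". . ." yields three separate ones. An emoji such as ":)" also stays in one event.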