English Normalization

Individual acronyms and letters are represented with lowercase letters.

  • ef > f
  • pe:em > pm
  • be:em:doubleyou > bmw
  • aysap > asap
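
The mappings above can be sketched as a small lookup, purely as an illustration. The table names, the function name, and the assumption that spelled-out letters are joined with ":" on the input side are hypothetical; the tables contain only the entries attested in the examples.

```python
# Hypothetical lookup tables, limited to the entries shown in the examples.
LETTER_NAMES = {"ef": "f", "pe": "p", "em": "m", "be": "b", "doubleyou": "w"}
SPOKEN_ACRONYMS = {"aysap": "asap"}


def normalize_acronym(token: str) -> str:
    """Return the lowercase normalized form of a spelled-out letter sequence."""
    # Acronyms pronounced as a word are looked up directly.
    if token in SPOKEN_ACRONYMS:
        return SPOKEN_ACRONYMS[token]
    # Letter-by-letter spellings are assumed to be separated by ':'.
    parts = token.split(":")
    if all(part in LETTER_NAMES for part in parts):
        return "".join(LETTER_NAMES[part] for part in parts)
    # Anything else is not a known letter sequence and is left unchanged.
    return token


print(normalize_acronym("be:em:doubleyou"))
```

A real table would need an entry for every letter name used in the transcripts; unknown tokens are passed through so that a missing entry never corrupts the text.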

Numbers up to and including twelve are written out, while numbers greater than twelve are represented as numerals.

  • ef sixteen > F16 (one token)
  • nine | one | one > 911 (one token)
  • emoticons such as -_- are to be left as one token
  • consecutive punctuation such as "..." is to be left as one token, unless it is already separated on the dipl layer.
  • twelve > twelve
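
The basic threshold rule for number words can be sketched as follows, with twelve itself kept written out as in the "twelve > twelve" example. The names are hypothetical and the word list only reaches twenty. Merged tokens such as F16 or 911 are not handled here, since they depend on the neighbouring tokens.

```python
# Hypothetical word list; the index of each word is its numeric value.
NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
]
WORD_TO_VALUE = {word: value for value, word in enumerate(NUMBER_WORDS)}


def normalize_number(word: str) -> str:
    """Keep number words up to twelve; render larger ones as numerals."""
    value = WORD_TO_VALUE.get(word)
    if value is None:
        return word  # not a known number word; leave unchanged
    return word if value <= 12 else str(value)


print(normalize_number("twelve"), normalize_number("sixteen"))
```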

Foreign language material

  • preevyet (hello) > privet

Abbreviations are to be written out in full, even if they are in the dictionary

  • sec > second
  • prob > "probably" or "problem", depending on the context
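
One possible way to support this rule in a tool is to list the candidate expansions and flag the ambiguous ones for the annotator, since the choice depends on context. This is only a sketch: the table and function names are hypothetical, and the table holds just the two abbreviations from the examples.

```python
# Hypothetical expansion table, limited to the abbreviations shown above.
EXPANSIONS = {
    "sec": ["second"],
    "prob": ["probably", "problem"],
}


def expansion_candidates(token: str) -> list[str]:
    """Return the possible full forms of a token (the token itself if unknown)."""
    return EXPANSIONS.get(token, [token])


def needs_context(token: str) -> bool:
    """True if the annotator has to choose between several full forms."""
    return len(expansion_candidates(token)) > 1


for token in ("sec", "prob"):
    print(token, expansion_candidates(token), needs_context(token))
```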

Capitalization

  • I
  • Renault, Volkswagen, Golf

Other

  • rear end (if noun), rear-end (if adjective or verb), e.g. "rear-end collision", "rear-ended", "bumped into the rear end"