KOMeT: Corpus linguistics methods for ePhilology with TEI XML

KOMeT (Korpuslinguistische Methoden für ePhilologie mit TEI) is a research group project at the Department of Corpus Linguistics in the Institute for German Language and Linguistics, Humboldt-Universität zu Berlin. The group is funded by the German Federal Ministry of Education and Research (BMBF).

The project aims to apply computational corpus linguistics methods to ancient texts encoded in TEI XML, focusing initially on richly annotated corpora of Sahidic Coptic, the language of Hellenistic Ancient Egypt and the early Coptic church.

We aim to bring together researchers in any area that deals with textual resources from the ancient world, including linguistics, Coptology/Egyptology, history of religion, classical studies and more. The pilot phase in 2014 is dedicated exclusively to Coptic, with the possibility of expanding to other ancient languages later on.

KOMeT Standoff

KOMeT is developing an annotation format, based on PAULA XML, for extending existing TEI XML projects with standoff annotation. For many purposes, TEI XML documents need to be enriched with further linguistic annotations that are not adequately covered by the TEI standard. At the same time, we wish to avoid editing the TEI document or converting it to another format, since this would prevent the creators of the corpus or other researchers from extending the document with TEI-based tools. We therefore offer a standard for annotating TEI documents from the outside, using separate external XML files (standoff annotation).
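The general idea of standoff annotation can be illustrated with a minimal Python sketch. This is not the KOMeT or PAULA XML format: the `standoff` and `span` elements, the `target`, `start`, `end` and `pos` attributes, the placeholder English sentence and the function name `resolve_standoff` are all assumptions made for illustration. The sketch only shows the principle that an external layer points into an untouched TEI document by element ID and character offsets.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal TEI-like source with placeholder English text.
# It is only read, never modified.
TEI_SOURCE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p xml:id="p1">the abbot wrote two letters</p></body></text>
</TEI>"""

# Hypothetical standoff layer that would live in a separate file.
# Element and attribute names are illustrative, not the KOMeT/PAULA schema.
STANDOFF = """<standoff target="p1">
  <span start="0" end="3" pos="DET"/>
  <span start="4" end="9" pos="NOUN"/>
  <span start="10" end="15" pos="VERB"/>
</standoff>"""

# ElementTree exposes xml:id under the fixed XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_standoff(tei_xml, standoff_xml):
    """Return (text, pos) pairs by resolving offsets against the TEI text."""
    tei_root = ET.fromstring(tei_xml)
    layer = ET.fromstring(standoff_xml)

    # Find the TEI element that the standoff layer points at.
    target_id = layer.get("target")
    target = next(
        (el for el in tei_root.iter() if el.get(XML_ID) == target_id), None
    )
    if target is None:
        raise ValueError(f"no element with xml:id={target_id!r}")

    # Offsets refer to the concatenated text content of the target element.
    base_text = "".join(target.itertext())

    resolved = []
    for span in layer.iter("span"):
        start, end = int(span.get("start")), int(span.get("end"))
        resolved.append((base_text[start:end], span.get("pos")))
    return resolved


if __name__ == "__main__":
    for text, pos in resolve_standoff(TEI_SOURCE, STANDOFF):
        print(text, pos)
```

Because the annotation layer only refers to the TEI file, further layers can be added as additional external files without changing the original markup.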

Demo Corpus

The demo corpus for the KOMeT annotation standards is called Besa.letters and consists of two letters by Besa, the 5th-century abbot of Atripe. It contains the following letters:

Both documents are available in KOMeT standoff, using EpiDoc TEI for the structural markup and standoff XML for the additional linguistic annotations. We are grateful to the Coptic SCRIPTORIUM project for collaborating on the tools to digitize and annotate these documents. Long-term archiving of the corpus is also planned together with the LAUDATIO project at Humboldt-Universität zu Berlin.

[Download Corpus].




I have moved to Georgetown University. You can contact me at the address below:

Prof. Amir Zeldes
Department of Linguistics
Georgetown University

Poulton Hall, Room 257
1421 37th St. NW
Washington, DC 20057

E-Mail: amir(dot)zeldes at georgetown(dot)edu