Tokenizer

java.lang.Object
- de.hu_berlin.german.korpling.saltnpepper.salt.saltCommon.sDocumentStructure.tokenizer.Tokenizer

```
public class Tokenizer
extends java.lang.Object
```
This class tokenizes a given text the same way the TreeTagger tool does. It returns a list of tokens together with their text anchors (start and end position) in the original text. It is a Java reimplementation, made with permission, of the original TreeTagger tokenizer written in Perl by Helmut Schmid (see http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/).

The implementation uses sets of abbreviations to recognize tokens that are abbreviations in a specific language. You can therefore supply a file containing abbreviations in order to use abbreviations other than the default ones. Because abbreviations are language dependent, you can also set a language so that only that language's set of abbreviations is used. The current version of the Tokenizer supports abbreviations for English, French, Italian and German. If no language is set, all available abbreviations are used.
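
To illustrate why a set of abbreviations matters for tokenization, here is a minimal, self-contained sketch. It is not the Salt or TreeTagger implementation: the class `AbbreviationTokenizerSketch`, its `Anchor` type and the splitting rule (split on whitespace, then detach a final period unless the chunk is a known abbreviation) are assumptions made for illustration only. End offsets are treated as exclusive here, which is also an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the Salt API.
public class AbbreviationTokenizerSketch {

    /** A token with its anchor in the original text (end is exclusive). */
    public static final class Anchor {
        public final String text;
        public final int start;
        public final int end;

        Anchor(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and detaches a final period unless the chunk is a known abbreviation. */
    public static List<Anchor> tokenize(String input, Set<String> abbreviations) {
        List<Anchor> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip whitespace between chunks.
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            // Find the end of the current whitespace-delimited chunk.
            int j = i;
            while (j < input.length() && !Character.isWhitespace(input.charAt(j))) {
                j++;
            }
            String chunk = input.substring(i, j);
            if (chunk.length() > 1 && chunk.endsWith(".") && !abbreviations.contains(chunk)) {
                // Not a known abbreviation: the final period becomes its own token.
                tokens.add(new Anchor(chunk.substring(0, chunk.length() - 1), i, j - 1));
                tokens.add(new Anchor(".", j - 1, j));
            } else {
                // Known abbreviation or plain word: keep the chunk as one token.
                tokens.add(new Anchor(chunk, i, j));
            }
            i = j;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>();
        abbreviations.add("Aug.");
        for (Anchor a : tokenize("Due in Aug. for review.", abbreviations)) {
            System.out.println(a.text + " [" + a.start + "," + a.end + ")");
        }
    }
}
```

In this sketch "Aug." stays a single token because it is in the abbreviation set, while the period after "review" is split off as a separate token.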

Author:

Amir Zeldes, Florian Zipser

Constructor Summary

Constructors
Constructor and Description

Tokenizer()
Initializes a new Tokenizer object.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`addAbbreviation(com.neovisionaries.i18n.LanguageCode language, java.io.File abbreviationFile)` Adds the content of the given file as a list of abbreviations to the internal map for the given language.
`void`	`addAbbreviation(com.neovisionaries.i18n.LanguageCode language, java.util.HashSet<java.lang.String> abbreviations)` Adds the given set of abbreviations to the internal map for the given language.
`com.neovisionaries.i18n.LanguageCode`	`checkLanguage(java.lang.String text)` Tries to detect the language and returns its ISO 639-2 language code.
`java.util.HashSet<java.lang.String>`	`getAbbreviations(com.neovisionaries.i18n.LanguageCode language)` Returns a list of abbreviations corresponding to the given language.
`SDocumentGraph`	`getsDocumentGraph()`
`com.neovisionaries.i18n.LanguageCode`	`mapISOLanguageCode(java.lang.String language)` Maps the knallgrau `TextCategorizer` language description codes to ISO 639 codes.
`void`	`setsDocumentGraph(SDocumentGraph sDocumentGraph)`
`org.eclipse.emf.common.util.EList<SToken>`	`tokenize(STextualDS sTextualDSs)` Sets the `STextualDS` to be tokenized.
`org.eclipse.emf.common.util.EList<SToken>`	`tokenize(STextualDS sTextualDSs, com.neovisionaries.i18n.LanguageCode language)` Sets the `STextualDS` to be tokenized and the language of the text.
`org.eclipse.emf.common.util.EList<SToken>`	`tokenize(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, java.lang.Integer startPos, java.lang.Integer endPos)` Sets the `STextualDS` to be tokenized and the language of the text.
`java.util.List<java.lang.String>`	`tokenizeToString(java.lang.String strInput, com.neovisionaries.i18n.LanguageCode language)` Tokenizes a given text the same way the TreeTagger tool does and returns the tokens as strings.
`org.eclipse.emf.common.util.EList<SToken>`	`tokenizeToToken(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, java.lang.Integer startPos, java.lang.Integer endPos)` Tokenizes a given text the same way the TreeTagger tool does and returns the tokens with their text anchors.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - Tokenizer
```
public Tokenizer()
```
    Initializes a new Tokenizer object.
- Method Detail
  - setsDocumentGraph
```
public void setsDocumentGraph(SDocumentGraph sDocumentGraph)
```
  - getsDocumentGraph
```
public SDocumentGraph getsDocumentGraph()
```
  - tokenize
```
public org.eclipse.emf.common.util.EList<SToken> tokenize(STextualDS sTextualDSs)
```
    Sets the STextualDS to be tokenized. Its language will be detected automatically if possible.
    
    Parameters:
    sTextualDSs - STextualDS object containing the text to be tokenized
  - tokenize
```
public org.eclipse.emf.common.util.EList<SToken> tokenize(STextualDS sTextualDSs,
                                                 com.neovisionaries.i18n.LanguageCode language)
```
    Sets the STextualDS to be tokenized and the language of the text. If language is null, it will be detected automatically if possible.
    
    Parameters:
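    sTextualDSs - STextualDS object containing the text to be tokenized
    language - language of the text; if null, the language will be detected automatically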
  - tokenize
```
public org.eclipse.emf.common.util.EList<SToken> tokenize(STextualDS sTextualDS,
                                                 com.neovisionaries.i18n.LanguageCode language,
                                                 java.lang.Integer startPos,
                                                 java.lang.Integer endPos)
```
    Sets the STextualDS to be tokenized and the language of the text. If language is null, it will be detected automatically if possible.
    
    Parameters:
    sTextualDS - STextualDS object containing the text to be tokenized
    language - language of the text; if null, the language will be detected automatically
    startPos - start position, if the text to be tokenized is a subset (0 is assumed if set to null)
    endPos - end position, if the text to be tokenized is a subset (the length of the text is assumed if set to null)
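
    The null defaults for startPos and endPos described above can be sketched in plain Java. This is a hypothetical helper, not code from the Salt library: the class name `RangeDefaultsSketch`, the method `selectRange`, the exclusive end position and the bounds check are all assumptions for illustration.

```java
// Hypothetical illustration only; not part of the Salt API.
public class RangeDefaultsSketch {

    /**
     * Returns the part of the text selected by startPos and endPos.
     * A null startPos is treated as 0, a null endPos as the length of the text.
     * Assumption: endPos is exclusive, as in String.substring.
     */
    public static String selectRange(String text, Integer startPos, Integer endPos) {
        int start = (startPos == null) ? 0 : startPos;
        int end = (endPos == null) ? text.length() : endPos;
        if (start < 0 || end > text.length() || start > end) {
            // Assumption: invalid ranges are rejected rather than clamped.
            throw new IllegalArgumentException("Invalid range: " + start + ".." + end);
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(selectRange(text, null, null));
        System.out.println(selectRange(text, 6, null));
        System.out.println(selectRange(text, null, 5));
    }
}
```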
  - checkLanguage
```
public com.neovisionaries.i18n.LanguageCode checkLanguage(java.lang.String text)
```
    Tries to detect the language and returns its ISO 639-2 language code.
    
    Parameters:
    text - text whose language is to be detected
    
    Returns:
  - mapISOLanguageCode
```
public com.neovisionaries.i18n.LanguageCode mapISOLanguageCode(java.lang.String language)
```
    Maps the knallgrau TextCategorizer language description codes to ISO 639 codes.
    
    Returns:
  - addAbbreviation
```
public void addAbbreviation(com.neovisionaries.i18n.LanguageCode language,
                   java.util.HashSet<java.lang.String> abbreviations)
```
    Adds the given set of abbreviations to the internal map for the given language.
    
    Parameters:
    language - language the abbreviations belong to
    abbreviations - abbreviations to add
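
    A map from language to abbreviation set, as described above, might look like the following sketch. It is not the Salt implementation: the class `AbbreviationMapSketch` is hypothetical, languages are plain strings instead of `LanguageCode` so that the example needs no external library, and both the merge-on-add behavior and the "null language returns all abbreviations" rule are assumptions modeled on the class description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical illustration only; not part of the Salt API.
public class AbbreviationMapSketch {

    // Language identifier (a plain string here) mapped to its abbreviations.
    private final Map<String, HashSet<String>> byLanguage = new HashMap<>();

    /** Adds abbreviations for a language, merging with any already registered (assumption). */
    public void addAbbreviation(String language, HashSet<String> abbreviations) {
        byLanguage.computeIfAbsent(language, k -> new HashSet<>()).addAll(abbreviations);
    }

    /** Returns a copy of the abbreviations for a language, or of all languages if language is null. */
    public HashSet<String> getAbbreviations(String language) {
        HashSet<String> result = new HashSet<>();
        if (language == null) {
            // No language set: use every available abbreviation.
            for (HashSet<String> set : byLanguage.values()) {
                result.addAll(set);
            }
        } else {
            HashSet<String> set = byLanguage.get(language);
            if (set != null) {
                result.addAll(set);
            }
        }
        return result;
    }
}
```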
  - addAbbreviation
```
public void addAbbreviation(com.neovisionaries.i18n.LanguageCode language,
                   java.io.File abbreviationFile)
```
    Adds the content of the given file as a list of abbreviations to the internal map for the given language. Form of the file (one abbreviation per line):
    Adm.
    Ala.
    Ariz.
    Ark.
    Aug.
    Ave.
    Bancorp.
    
    Parameters:
    language - language the abbreviations belong to
    abbreviationFile - file containing the abbreviations
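
    Reading a file in the form shown above into a set could be done roughly as follows. This is a sketch under assumptions, not the library's own loading code: the class `AbbreviationFileSketch` and the method `readAbbreviations` are hypothetical, and UTF-8 encoding, trimming of surrounding whitespace and skipping of blank lines are choices made here for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashSet;

// Hypothetical illustration only; not part of the Salt API.
public class AbbreviationFileSketch {

    /** Reads one abbreviation per line; blank lines are ignored (assumption). */
    public static HashSet<String> readAbbreviations(File file) throws IOException {
        HashSet<String> abbreviations = new HashSet<>();
        // Assumption: the file is encoded in UTF-8.
        for (String line : Files.readAllLines(file.toPath(), StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                abbreviations.add(trimmed);
            }
        }
        return abbreviations;
    }
}
```

    The resulting `HashSet<String>` has the type accepted by the `addAbbreviation(LanguageCode, HashSet<String>)` overload documented above.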
  - getAbbreviations
```
public java.util.HashSet<java.lang.String> getAbbreviations(com.neovisionaries.i18n.LanguageCode language)
```
    Returns a list of abbreviations corresponding to the given language.
    
    Parameters:
    language -
    
    Returns:
  - tokenizeToToken
```
public org.eclipse.emf.common.util.EList<SToken> tokenizeToToken(STextualDS sTextualDS,
                                                        com.neovisionaries.i18n.LanguageCode language,
                                                        java.lang.Integer startPos,
                                                        java.lang.Integer endPos)
```
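    Tokenizes a given text the same way the TreeTagger tool does. Returns a list of tokens with their text anchors (start and end position) in the original text.
    
    Parameters:
    sTextualDS - STextualDS object containing the original text
    language - language of the text
    startPos - start position, if the text to be tokenized is a subset
    endPos - end position, if the text to be tokenized is a subset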
    
    Returns:
    tokenized text fragments and their position in the original text
  - tokenizeToString
```
public java.util.List<java.lang.String> tokenizeToString(java.lang.String strInput,
                                                com.neovisionaries.i18n.LanguageCode language)
```
    Tokenizes a given text the same way the TreeTagger tool does. Returns a list of tokenized text fragments.
    
    Parameters:
    strInput - original text
    language - language of the text
    
    Returns:
    tokenized text fragments
