public class Tokenizer
extends java.lang.Object

The Tokenizer supports abbreviations for English, French, Italian and German. If no language is set, all available abbreviations will be used.

Constructor and Description |
---|
Tokenizer() Initializes a new Tokenizer object. |
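
The class description states that all available abbreviations are used when no language is set. The following is a minimal sketch of how such a fallback could work, not the Tokenizer's actual implementation. The class `AbbreviationLookup` and its methods are hypothetical, and plain strings stand in for `com.neovisionaries.i18n.LanguageCode` so the sketch has no external dependencies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an internal abbreviation map keyed by language.
// Plain strings replace LanguageCode to keep the sketch dependency-free.
final class AbbreviationLookup {
    private final Map<String, Set<String>> byLanguage = new HashMap<>();

    // Adds abbreviations for one language, merging with any already stored.
    void add(String language, Set<String> abbreviations) {
        byLanguage.computeIfAbsent(language, k -> new HashSet<>()).addAll(abbreviations);
    }

    // Returns a copy of the abbreviations for the given language, or the
    // union over all languages when language is null ("no language set").
    Set<String> lookup(String language) {
        if (language != null) {
            return new HashSet<>(byLanguage.getOrDefault(language, Set.of()));
        }
        Set<String> all = new HashSet<>();
        for (Set<String> set : byLanguage.values()) {
            all.addAll(set);
        }
        return all;
    }

    public static void main(String[] args) {
        AbbreviationLookup lookup = new AbbreviationLookup();
        lookup.add("en", Set.of("e.g.", "Dr."));
        lookup.add("de", Set.of("z.B.", "Dr."));
        System.out.println(lookup.lookup("de"));  // German abbreviations only
        System.out.println(lookup.lookup(null));  // union of all languages
    }
}
```

Returning a copy keeps callers from modifying the stored sets by accident; whether the real class does this is not stated in this documentation.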
Modifier and Type | Method | Description |
---|---|---|
void | addAbbreviation(com.neovisionaries.i18n.LanguageCode language, java.io.File abbreviationFile) | Adds the content of the given file as a list of abbreviations to the internal map corresponding to the given language. |
void | addAbbreviation(com.neovisionaries.i18n.LanguageCode language, java.util.HashSet<java.lang.String> abbreviations) | Adds the given list of abbreviations to the internal map corresponding to the given language. |
com.neovisionaries.i18n.LanguageCode | checkLanguage(java.lang.String text) | Tries to detect the language and returns its ISO 639-2 language code. |
java.util.HashSet<java.lang.String> | getAbbreviations(com.neovisionaries.i18n.LanguageCode language) | Returns a list of abbreviations corresponding to the given language. |
SDocumentGraph | getsDocumentGraph() | |
com.neovisionaries.i18n.LanguageCode | mapISOLanguageCode(java.lang.String language) | Maps the knallgrau TextCategorizer language description codes to ISO 639 codes. |
void | setsDocumentGraph(SDocumentGraph sDocumentGraph) | |
org.eclipse.emf.common.util.EList<SToken> | tokenize(STextualDS sTextualDSs) | Sets the STextualDS to be tokenized. |
org.eclipse.emf.common.util.EList<SToken> | tokenize(STextualDS sTextualDSs, com.neovisionaries.i18n.LanguageCode language) | Sets the STextualDS to be tokenized and the language of the text. |
org.eclipse.emf.common.util.EList<SToken> | tokenize(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, java.lang.Integer startPos, java.lang.Integer endPos) | Sets the STextualDS to be tokenized and the language of the text. |
java.util.List<java.lang.String> | tokenizeToString(java.lang.String strInput, com.neovisionaries.i18n.LanguageCode language) | The general task of this class is to tokenize a given text in the same order as the TreeTagger tool does. |
org.eclipse.emf.common.util.EList<SToken> | tokenizeToToken(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, java.lang.Integer startPos, java.lang.Integer endPos) | The general task of this class is to tokenize a given text in the same order as the TreeTagger tool does. |
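
The method summary says that `addAbbreviation(LanguageCode, File)` adds the content of a file as a list of abbreviations, but it does not specify the file format. The sketch below assumes one abbreviation per line in UTF-8, which is an assumption rather than documented behaviour. The helper class `AbbreviationFileReader` is hypothetical and only shows how such a file could be turned into the `HashSet<String>` accepted by the other `addAbbreviation` overload.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashSet;

// Hypothetical helper: reads an abbreviation file into a set.
// Assumed format (not confirmed by the documentation): one abbreviation per line, UTF-8.
final class AbbreviationFileReader {
    static HashSet<String> read(File abbreviationFile) {
        HashSet<String> abbreviations = new HashSet<>();
        try {
            for (String line : Files.readAllLines(abbreviationFile.toPath(), StandardCharsets.UTF_8)) {
                String trimmed = line.trim();
                // Skip blank lines; duplicates collapse automatically in the set.
                if (!trimmed.isEmpty()) {
                    abbreviations.add(trimmed);
                }
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to handle I/O errors inline.
            throw new UncheckedIOException(e);
        }
        return abbreviations;
    }
}
```

Under these assumptions, the returned set could be passed on to `addAbbreviation(LanguageCode, HashSet<String>)`.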
public void setsDocumentGraph(SDocumentGraph sDocumentGraph)

public SDocumentGraph getsDocumentGraph()

public org.eclipse.emf.common.util.EList<SToken> tokenize(STextualDS sTextualDSs)
Sets the STextualDS to be tokenized. Its language will be detected automatically if possible.
sTextualDSs - STextualDS object containing the text to be tokenized

public org.eclipse.emf.common.util.EList<SToken> tokenize(STextualDS sTextualDSs, com.neovisionaries.i18n.LanguageCode language)
Sets the STextualDS to be tokenized and the language of the text. If language is null, it will be detected automatically if possible.
sTextualDSs - STextualDS object containing the text to be tokenized

public org.eclipse.emf.common.util.EList<SToken> tokenize(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, java.lang.Integer startPos, java.lang.Integer endPos)
Sets the STextualDS to be tokenized and the language of the text. If language is null, it will be detected automatically if possible.
sTextualDS - STextualDS object containing the text to be tokenized
language - language of the text; if null, the language will be detected automatically
startPos - start position, if the text to be tokenized is a subset (0 is assumed if set to null)
endPos - end position, if the text to be tokenized is a subset (the length of the text is assumed if set to null)

public com.neovisionaries.i18n.LanguageCode checkLanguage(java.lang.String text)
Tries to detect the language and returns its ISO 639-2 language code.
text - the text whose language is to be detected

public com.neovisionaries.i18n.LanguageCode mapISOLanguageCode(java.lang.String language)
Maps the knallgrau TextCategorizer language description codes to ISO 639 codes.

public void addAbbreviation(com.neovisionaries.i18n.LanguageCode language, java.util.HashSet<java.lang.String> abbreviations)
Adds the given list of abbreviations to the internal map corresponding to the given language.
language - the language the abbreviations belong to
abbreviations - the abbreviations to add

public void addAbbreviation(com.neovisionaries.i18n.LanguageCode language, java.io.File abbreviationFile)
Adds the content of the given file as a list of abbreviations to the internal map corresponding to the given language.
language - the language the abbreviations belong to
abbreviationFile - the file containing the abbreviations to add

public java.util.HashSet<java.lang.String> getAbbreviations(com.neovisionaries.i18n.LanguageCode language)
Returns a list of abbreviations corresponding to the given language.
language - the language whose abbreviations are returned

public org.eclipse.emf.common.util.EList<SToken> tokenizeToToken(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, java.lang.Integer startPos, java.lang.Integer endPos)
The general task of this class is to tokenize a given text in the same order as the TreeTagger tool does.
sTextualDS - STextualDS object containing the original text

public java.util.List<java.lang.String> tokenizeToString(java.lang.String strInput, com.neovisionaries.i18n.LanguageCode language)
The general task of this class is to tokenize a given text in the same order as the TreeTagger tool does.
strInput - original text
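
Two behaviours described above can be illustrated together: the null defaults for startPos and endPos (0 and the length of the text), and the role of an abbreviation list during tokenization. The sketch below is a deliberately simplified stand-in and does not attempt to reproduce TreeTagger's rules or this class's real algorithm. The class `SimpleTokenizerSketch`, its splitting rule (whitespace plus trailing punctuation) and its punctuation set are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified tokenizer; not the algorithm used by Tokenizer or TreeTagger.
final class SimpleTokenizerSketch {
    // Tokenizes text[startPos, endPos): splits on whitespace and separates
    // trailing punctuation, unless the whole chunk is a known abbreviation.
    static List<String> tokenize(String text, Set<String> abbreviations,
                                 Integer startPos, Integer endPos) {
        // Null positions fall back to the full text, as described for tokenize().
        int start = (startPos == null) ? 0 : startPos;
        int end = (endPos == null) ? text.length() : endPos;

        List<String> tokens = new ArrayList<>();
        for (String chunk : text.substring(start, end).split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            if (abbreviations.contains(chunk)) {
                tokens.add(chunk);  // keep "Dr." as one token
                continue;
            }
            // Find where the run of trailing punctuation begins.
            int cut = chunk.length();
            while (cut > 0 && ".,;:!?".indexOf(chunk.charAt(cut - 1)) >= 0) {
                cut--;
            }
            if (cut > 0) {
                tokens.add(chunk.substring(0, cut));
            }
            // Each trailing punctuation character becomes its own token.
            for (int i = cut; i < chunk.length(); i++) {
                tokens.add(String.valueOf(chunk.charAt(i)));
            }
        }
        return tokens;
    }
}
```

With the abbreviation set containing "Dr.", the input "Dr. Smith arrived, finally." yields the tokens Dr. / Smith / arrived / , / finally / . in this sketch; with an empty set, "Dr." is split into "Dr" and ".".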