Apache OpenNLP Developer Documentation
All OpenNLP tools share a similar command-line structure and options. To discover a tool's options, run it with no parameters:
$ opennlp ToolName
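The help output that would normally appear at this point did not survive extraction. Reconstructed from the options named in the surrounding text (a sketch, not verbatim tool output), the first block of the help message looks roughly like:

```
$ opennlp TokenizerTrainer
Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] ... -model modelFile ...
```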
The first block of the help message describes the general structure of the tool's command line: the obligatory tool name (TokenizerTrainer), the optional format parameters ([.namefinder|.conllx|.pos]), the optional parameters ([-abbDict path] ...), and the obligatory parameters (-model modelFile ...).
The format parameters enable direct processing of data in non-native formats without prior conversion. Each format may have its own parameters, which are displayed if the tool is executed without parameters or with the help parameter:
Arguments description:
-abbDict path
...
To switch the tool to a specific format, add a dot and the format name after
the tool name:
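For example, switching the trainer to one of the formats listed above might look like the following sketch (the remaining arguments are elided, as in the synopsis):

```
$ opennlp TokenizerTrainer.conllx -model modelFile ...
```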
The second block of the help message describes the individual arguments:
Arguments description:
-type maxent|perceptron|perceptron_sequence
-dict dictionaryPath
...
When a tool is executed this way, the model is loaded and the tool waits for input from standard input. This input is processed and printed to standard output.
Alternatively, and most commonly, shell input and output redirection is used to provide input and output files:
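A typical redirected invocation might look like the following sketch (the model and file names are placeholders, and ToolName stands for any processing tool):

```
$ opennlp ToolName lang-model-name.bin < input.txt > output.txt
```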
Most tools for model training must first be given a model name, optionally some training options (such as the model type or the number of iterations), and then the data. For the data, one has to specify its location (a filename) and often the language and encoding.
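Putting these pieces together, a generic training invocation might look like this sketch (the trainer name, the model and data file names, the language, and the encoding are all placeholders):

```
$ opennlp ToolNameTrainer -model en-model-name.bin -lang en -data input.train -encoding UTF-8
```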
or with a format:
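A training invocation with a format parameter might look like this sketch (all names are placeholders; .conll03 is one example of a format name, and not every trainer supports it):

```
$ opennlp ToolNameTrainer.conll03 -model en-model-name.bin -lang en -data input.train -encoding UTF-8
```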
Most tools for model evaluation are similar to those for task execution, and must first be given a model name, optionally some evaluation options (such as whether to print misclassified samples), and then the test data. A generic example of a command line to launch an evaluation tool might be:
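Such an evaluation command might look like the following sketch (all names are placeholders):

```
$ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -encoding UTF-8
```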
https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html
8/10/2021 Apache OpenNLP Developer Documentation
Chapter 2. Language Detector
Table of Contents
Classifying
Language Detector Tool
Language Detector API
Training
Training Tool
Training with Leipzig
Training API
Classifying
The OpenNLP Language Detector classifies a document into ISO-639-3 languages according to the model capabilities. A model can be trained with the Maxent, Perceptron, or Naive Bayes algorithm. By default, the text is normalized and the context generator extracts n-grams of sizes 1, 2, and 3. The n-gram sizes, the normalization, and the context generator can be customized by extending LanguageDetectorFactory.
Table 2.1. Normalizers

Normalizer                      Description
EmojiCharSequenceNormalizer     Replaces emojis by a blank space.
UrlCharSequenceNormalizer       Replaces URLs and e-mails by a blank space.
TwitterCharSequenceNormalizer   Replaces hashtags and Twitter user names by blank spaces.
NumberCharSequenceNormalizer    Replaces number sequences by blank spaces.
ShrinkCharSequenceNormalizer    Shrinks characters that repeat three or more times to only two repetitions.
The input is read from standard input and output is written to standard output, unless they are redirected or piped.
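An invocation of the Language Detector command-line tool might look like the following sketch (the model and input file names are placeholders for a trained language detector model and a text document):

```
$ opennlp LanguageDetector model.bin < article.txt
```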
First you need to grab the bytes from the serialized model on an InputStream - we'll leave it to you to do that, since you were the one who serialized it to begin with. Now for the easy part:
InputStream is = ...
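The steps that follow can be sketched with the opennlp.tools.langdetect API (a sketch, not verbatim manual code; the sample sentence is arbitrary):

```java
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

// Deserialize the model from the stream obtained above.
LanguageDetectorModel model = new LanguageDetectorModel(is);

// Create a detector backed by the loaded model.
LanguageDetector detector = new LanguageDetectorME(model);

// Predict all supported languages, ordered by confidence.
Language[] languages = detector.predictLanguages("estava em uma marcenaria na Rua Bruno");
System.out.println(languages[0].getLang());
```

The first element of the returned array is the most probable language.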
Note that both the API and the CLI consider the complete text when choosing the most probable languages. To handle mixed-language text, one can analyze smaller chunks of the text to find language regions.
Training
The Language Detector can be trained on annotated training material. The data can be in the OpenNLP Language Detector training format: one document per line, containing the ISO-639-3 language code and the text, separated by a tab. Other formats may also be available.
The following sample shows data in the required format.
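The sample itself appears not to have survived extraction; the following runnable sketch creates a small file in the described format (the sentences and the file name langdetect.train are invented for illustration):

```shell
# Each line: ISO-639-3 language code, a tab, then the document text.
printf 'spa\tA la fiesta de Marcelo no puedo ir.\n' > langdetect.train
printf 'deu\tDieser Satz ist auf Deutsch geschrieben.\n' >> langdetect.train
cat langdetect.train
```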