
OpenNLP Developer 4

This document summarizes how to use the Apache OpenNLP language detector tool and API. It describes how to classify languages using a pre-trained model, and how to train new models using annotated training data. The language detector normalizes text and extracts n-grams to classify documents into ISO-639-3 languages using maximum entropy, perceptron, or naive bayes algorithms. The document provides examples of using the command line tool and Java API to load models and predict languages. It also outlines how to train new models by specifying options like the model name, language, data, and encoding.

Uploaded by

Safdar Husain

8/10/2021 Apache OpenNLP Developer Documentation

OpenNLP tools share a similar command line structure and options. To discover a
tool's options, run it with no parameters:

$ opennlp ToolName

The tool will output two blocks of help.

The first block describes the general structure of this tool command line:

Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] ... -model modelFile ...

The general structure of this tool command line includes the obligatory tool name
(TokenizerTrainer), the optional format parameters ([.namefinder|.conllx|.pos]),
the optional parameters ([-abbDict path] ...), and the obligatory parameters
(-model modelFile ...).

The format parameters enable direct processing of non-native data without conversion.
Each format might have its own parameters, which are displayed if the tool is
executed without parameters or with the help parameter:

$ opennlp TokenizerTrainer.conllx help

Usage: opennlp TokenizerTrainer.conllx [-abbDict path] [-alphaNumOpt isAlphaNumOpt] ...

Arguments description:

-abbDict path

abbreviation dictionary in XML format.

...

To switch the tool to a specific format, add a dot and the format name after
the tool name:

$ opennlp TokenizerTrainer.conllx -model en-pos.bin ...
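The "tool name, dot, format name" convention can be illustrated with a small
sketch in plain Java. This is not how the OpenNLP command line is implemented;
the class ToolSpec and its fields are hypothetical names used only to show how
such an argument splits into its two parts.

```java
import java.util.Optional;

/** Illustrative only: splits "ToolName.format" as described in the usage text. */
final class ToolSpec {
    final String toolName;
    final Optional<String> format;

    private ToolSpec(String toolName, Optional<String> format) {
        this.toolName = toolName;
        this.format = format;
    }

    /** Splits at the first dot; an argument without a dot has no format part. */
    static ToolSpec parse(String arg) {
        int dot = arg.indexOf('.');
        if (dot < 0) {
            return new ToolSpec(arg, Optional.empty());
        }
        return new ToolSpec(arg.substring(0, dot), Optional.of(arg.substring(dot + 1)));
    }

    public static void main(String[] args) {
        ToolSpec spec = ToolSpec.parse("TokenizerTrainer.conllx");
        // prints: TokenizerTrainer / conllx
        System.out.println(spec.toolName + " / " + spec.format.orElse("(native format)"));
    }
}
```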

The second block of the help message describes the individual arguments:

Arguments description:

-type maxent|perceptron|perceptron_sequence

The type of the token name finder model. One of maxent|perceptron|perceptron_sequence.

-dict dictionaryPath

The XML tag dictionary file

...

Most processing tools need to be given at least a model:

$ opennlp ToolName lang-model-name.bin

When a tool is executed this way, the model is loaded and the tool waits for
input from standard input. This input is processed and printed to standard
output.

The alternative, and most commonly used, way is to use console input and output
redirection to provide an input file and an output file:

$ opennlp ToolName lang-model-name.bin < input.txt > output.txt

Most model training tools need to be given a model name first, then optionally
some training options (such as the model type or the number of iterations),
and then the data.

A model name is just a file name.

Training options often include the number of iterations, the cutoff, an
abbreviations dictionary or something else. Sometimes it is possible to provide
these options via a training options file. In this case the options given on the
command line are ignored and the ones from the file are used.
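The precedence rule described above can be sketched in plain Java. This is only
an illustration of the rule, not OpenNLP's own code: the class
TrainingOptionsSketch, the method effectiveOptions and the option names
"Iterations" and "Cutoff" are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Illustrative only: a training options file replaces the command line options. */
final class TrainingOptionsSketch {

    /**
     * Returns the options that would take effect. If a file was given
     * (fileOptions != null), only its entries are used and the command line
     * options are ignored; otherwise the command line options are used.
     */
    static Map<String, String> effectiveOptions(Map<String, String> cliOptions,
                                                Properties fileOptions) {
        Map<String, String> result = new HashMap<>();
        if (fileOptions != null) {
            for (String name : fileOptions.stringPropertyNames()) {
                result.put(name, fileOptions.getProperty(name));
            }
        } else {
            result.putAll(cliOptions);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> cli = new HashMap<>();
        cli.put("Iterations", "100"); // assumed option name
        cli.put("Cutoff", "5");       // assumed option name

        Properties file = new Properties();
        file.setProperty("Iterations", "200");

        // prints: {Iterations=200}
        System.out.println(effectiveOptions(cli, file));
    }
}
```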

For the data one has to specify the location of the data (filename) and often
language and encoding.

A generic example of a command line to launch a tool trainer might be:

$ opennlp ToolNameTrainer -model en-model-name.bin -lang en -data input.train -encoding UTF-8

or with a format:

$ opennlp ToolNameTrainer.conll03 -model en-model-name.bin -lang en -data input.train \
    -types per -encoding UTF-8

Most tools for model evaluation are similar to those for task execution, and
need to be given a model name first, then optionally some evaluation options
(such as whether to print misclassified samples), and then the test data. A
generic example of a command line to launch an evaluation tool might be:
https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html 6/64

$ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -encoding UTF-8

Chapter 2. Language Detector
Table of Contents

Classifying
Language Detector Tool
Language Detector API
Training

Training Tool
Training with Leipzig
Training API

Classifying
The OpenNLP Language Detector classifies a document into ISO-639-3 languages according
to the model capabilities. A model can be trained with the Maxent, Perceptron or
Naive Bayes algorithms. By default, the detector normalizes the text and the context
generator extracts n-grams of size 1, 2 and 3. The n-gram sizes, the normalization and
the context generator can be customized by extending the LanguageDetectorFactory.
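To make the n-gram sizes concrete, here is a minimal sketch in plain Java that
lists the n-grams of size 1 to 3 of a short string. It assumes the n-grams are
taken over characters and it is not OpenNLP's context generator; NGramSketch
and charNGrams are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: enumerates character n-grams of the given sizes. */
final class NGramSketch {

    /** Returns all n-grams of text with minSize <= n <= maxSize, smallest size first. */
    static List<String> charNGrams(CharSequence text, int minSize, int maxSize) {
        List<String> grams = new ArrayList<>();
        for (int size = minSize; size <= maxSize; size++) {
            // Slide a window of the current size over the text.
            for (int start = 0; start + size <= text.length(); start++) {
                grams.add(text.subSequence(start, start + size).toString());
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints: [a, b, c, ab, bc, abc]
        System.out.println(charNGrams("abc", 1, 3));
    }
}
```

A real detector would count how often each n-gram occurs and feed those
features to the trained model; the sketch only shows which substrings are
produced.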

The default normalizers are:

Table 2.1. Normalizers

Normalizer                       Description
EmojiCharSequenceNormalizer      Replaces emojis with blank spaces.
UrlCharSequenceNormalizer        Replaces URLs and e-mail addresses with a blank space.
TwitterCharSequenceNormalizer    Replaces hashtags and Twitter user names with blank spaces.
NumberCharSequenceNormalizer     Replaces number sequences with blank spaces.
ShrinkCharSequenceNormalizer     Shrinks characters that repeat three or more times to only two repetitions.
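
The effect of several of these normalizers can be approximated with regular
expressions. The sketch below is plain Java and only mimics the descriptions in
the table; the patterns are rough assumptions and not the expressions OpenNLP
uses, the emoji normalizer is left out, and NormalizerSketch is a hypothetical
class name.

```java
import java.util.regex.Pattern;

/** Illustrative only: regex approximations of the normalizers described above. */
final class NormalizerSketch {
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern EMAIL = Pattern.compile("\\S+@\\S+\\.\\S+");
    private static final Pattern TWITTER = Pattern.compile("[#@]\\S+");
    private static final Pattern NUMBERS = Pattern.compile("\\d+");
    private static final Pattern REPEATS = Pattern.compile("(.)\\1{2,}");

    /** Replaces URLs and e-mail addresses with a blank space. */
    static String replaceUrlsAndEmails(String s) {
        return EMAIL.matcher(URL.matcher(s).replaceAll(" ")).replaceAll(" ");
    }

    /** Replaces hashtags and user names with a blank space. */
    static String replaceTwitter(String s) {
        return TWITTER.matcher(s).replaceAll(" ");
    }

    /** Replaces each run of digits with a blank space. */
    static String replaceNumbers(String s) {
        return NUMBERS.matcher(s).replaceAll(" ");
    }

    /** Shrinks a character repeated three or more times to two repetitions. */
    static String shrink(String s) {
        return REPEATS.matcher(s).replaceAll("$1$1");
    }

    /** Applies the steps in sequence: URLs/e-mails, Twitter, numbers, shrink. */
    static String normalize(String s) {
        return shrink(replaceNumbers(replaceTwitter(replaceUrlsAndEmails(s))));
    }

    public static void main(String[] args) {
        // prints: lool
        System.out.println(shrink("loooool"));
        // prints: yess!!
        System.out.println(normalize("yesss!!!! 42").trim());
    }
}
```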

Language Detector Tool


The easiest way to try out the language detector is the command line tool. The tool is
only intended for demonstration and testing. The following command shows how to use the
language detector tool.

$ bin/opennlp LanguageDetector model

The input is read from standard input and output is written to standard output, unless
they are redirected or piped.

Language Detector API


To perform classification you will need a machine learning model. These are
encapsulated in the LanguageDetectorModel class of OpenNLP tools.

First you need to grab the bytes from the serialized model on an InputStream. We'll
leave it to you to do that, since you were the one who serialized it to begin with.
Now for the easy part:

InputStream is = ...
LanguageDetectorModel m = new LanguageDetectorModel(is);

With the LanguageDetectorModel in hand we are just about there:

String inputText = ...
LanguageDetector myCategorizer = new LanguageDetectorME(m);

// Get the most probable language
Language bestLanguage = myCategorizer.predictLanguage(inputText);
System.out.println("Best language: " + bestLanguage.getLang());
System.out.println("Best language confidence: " + bestLanguage.getConfidence());

// Get an array with the most probable languages
Language[] languages = myCategorizer.predictLanguages(inputText);

Note that both the API and the CLI consider the complete text when choosing the most
probable languages. To handle mixed-language text, one can analyze smaller chunks of
text to find language regions.
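
One way to analyze smaller chunks is to split the text into sentence-like pieces
and classify each piece separately. The sketch below is plain Java under stated
assumptions: the split on sentence punctuation is a crude heuristic, the
detector is passed in as a function from text to language code, and
ChunkedDetectionSketch is a hypothetical class name. The toy detector in main
is a stand-in, not a trained model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustrative only: classifies a text chunk by chunk instead of as a whole. */
final class ChunkedDetectionSketch {

    /** Splits after '.', '!' or '?' followed by whitespace (a crude heuristic). */
    static List<String> splitIntoChunks(String text) {
        List<String> chunks = new ArrayList<>();
        for (String part : text.split("(?<=[.!?])\\s+")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(trimmed);
            }
        }
        return chunks;
    }

    /** Returns one "language<TAB>chunk" entry per chunk. */
    static List<String> detectPerChunk(String text, Function<String, String> detector) {
        List<String> result = new ArrayList<>();
        for (String chunk : splitIntoChunks(text)) {
            result.add(detector.apply(chunk) + "\t" + chunk);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in detector for the demo only; a real one would wrap a trained model.
        Function<String, String> toyDetector =
            chunk -> chunk.toLowerCase().contains("the") ? "eng" : "und";
        for (String line : detectPerChunk("The cat sat. Le chat dort!", toyDetector)) {
            System.out.println(line);
        }
    }
}
```

With the API shown earlier, the function could be written as
chunk -> myCategorizer.predictLanguage(chunk).getLang(). Very short chunks give
the model little evidence, so the chunk size is a trade-off.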

Training
The Language Detector can be trained on annotated training material. The data
can be in the OpenNLP Language Detector training format: one document per line,
containing the ISO-639-3 language code and the text separated by a tab. Other
formats may also be available.
The following sample shows the sample from above in the required format.
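
The line layout described above (language code, tab, text) can be read with a
few lines of plain Java. This is a minimal sketch and not OpenNLP's own sample
reader; the class TrainingLineSketch is a hypothetical name, and the two lines
used in main are invented for illustration and are not the manual's sample.

```java
/** Illustrative only: one training document in "code<TAB>text" form. */
final class TrainingLineSketch {
    final String lang;
    final String text;

    private TrainingLineSketch(String lang, String text) {
        this.lang = lang;
        this.text = text;
    }

    /** Splits at the first tab; everything after it is the document text. */
    static TrainingLineSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            throw new IllegalArgumentException(
                "expected <language code><TAB><text>: " + line);
        }
        return new TrainingLineSketch(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        // Invented example lines, one document per line.
        String[] lines = {
            "eng\tThis is an English sentence.",
            "por\tEsta é uma frase em português."
        };
        for (String line : lines) {
            TrainingLineSketch doc = parse(line);
            System.out.println(doc.lang + " -> " + doc.text);
        }
    }
}
```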
