IRST Language Modeling Toolkit User Manual
Version 5.60.01
M. Federico, N. Bertoldi, M. Cettolo
FBK-irst, Trento, Italy
This manual, together with source code, executables, examples and regression tests, can be accessed on the
official website of the IRSTLM Toolkit:
http://hlt.fbk.eu/en/irstlm
1 Introduction
This manual illustrates the functionalities of the IRST Language Modeling (LM) toolkit. It should quickly
put you in the condition of:
• pruning a LM
Besides state-of-the-art n-gram smoothing techniques, the IRST LM toolkit features very efficient data
structures to handle very large LMs, as well as adaptation methods that can be effective when limited
task-related data are available.
The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store,
and access very large LMs. Our software has been integrated into a popular open source SMT decoder called
Moses (http://www.statmt.org/moses/), and is compatible with LMs created with other tools, such as the
SRILM Toolkit (http://www.speech.sri.com/projects/srilm).
The toolkit is described in the following paper:
M. Federico, N. Bertoldi, M. Cettolo, IRSTLM: an Open Source Toolkit for Handling Large
Scale Language Models, Proceedings of Interspeech, Brisbane, Australia, 2008.
2 Installation
In order to install IRSTLM on your machine, please perform the following steps.
Warning: run with the "--force" parameter if you want to recreate all links to the autotools.
Run the following command to get more details on the compilation options:
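The individual commands are not reproduced in this copy; the following is a minimal sketch of a typical
autotools build of IRSTLM. The script and option names (in particular regenerate-makefiles.sh) are
assumptions and should be checked against the README shipped with the sources.
$> sh regenerate-makefiles.sh                # script name assumed; add --force to recreate all autotools links
$> ./configure --help                        # list the available compilation options
$> ./configure --prefix=/path/where/to/install --enable-caching
$> make
$> make install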
The IRSTLM library and commands are generated, respectively, under the directories
/path/where/to/install/lib and /path/where/to/install/bin.
If enabled and PdfLatex is installed, this user manual (in pdf) is generated under the directory
/path/where/to/install/doc.
Although caching is not enabled by default, it is highly recommended to activate it through the compilation
flag "--enable-caching".
3 Getting started
Environment Settings We assume that all steps for installation described in Section 2 have been performed
correctly. In particular, the environment variable IRSTLM must be set, and the environment variable
PATH must include the command directory IRSTLM/bin.
Data sets used in the examples can be found in an archive you can download from the official website of
IRSTLM toolkit.
Preparation of Training Data In order to estimate a Language Model, you first need to prepare your
training corpus. The corpus simply consists of plain text. We assume that the text is already preprocessed according
to the user's needs; this means that lowercasing, uppercasing, tokenization, and any other text transformation
has to be performed beforehand with other tools.
The only choice left to you is whether the IRSTLM toolkit should be aware of sentence boundaries, i.e.
where each sentence starts and ends. Otherwise, the toolkit considers the corpus as one continuous stream of
text and does not identify sentence splits. The following script adds start and end symbols (<s> and </s>,
respectively, which should be considered reserved symbols, i.e. used only as delimiters) to all sentences
in your training corpus.
$> add-start-end.sh < your-text-file
IRSTLM toolkit does not compute probabilities for cross-sentence n-grams, i.e. any n-gram including the
pair </s> <s>.
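For instance, on an illustrative input line, the expected effect of the script is something like the following
(the example text is purely hypothetical):
$> echo "this is a sentence" | add-start-end.sh
<s> this is a sentence </s>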
Training our first LM We are now ready to estimate a 3-gram (trigram) LM by running the command:
$> tlm -tr="gunzip -c train.gz" -n=3 -lm=wb -te=test
which produces the output:
n=49984 LP=301734.5406 PP=418.4772517 OOVRate=0.05007602433
The output shows the number of words in the test set, the LM log-probability, the LM perplexity and the
out-of-vocabulary rate of the test set.
If you need to train and test different language models on the same data, a more efficient way to proceed is
to first create an n-gram table of the training data:
$> ngt -i="gunzip -c train.gz" -n=3 -o=train.www -b=yes
The command ngt reads an input text file, creates an n-gram table of the specified order (-n=3), and saves it in
binary format (-b=yes) into the specified output file.
Now, the LM can be estimated and evaluated more quickly:
$> tlm -tr=train.www -n=3 -lm=wb -te=test
Once estimated, a LM can also be saved in the standard ARPA text format:
$> tlm -tr=train.www -n=3 -lm=wb -o=train.lm
or in a binary format
$> tlm -tr=train.www -n=3 -lm=wb -obin=train.blm
(Remark: the binary format formerly used by the IRST speech recognizer is still available through the
option -oasr <filename>, but it is no longer supported.)
4 More on ngt
Given a text corpus we can compute its dictionary and word frequencies with the command:
$> dict -i="gunzip -c train.gz" -o=train.dict -f=yes
For speech recognition applications, it is often necessary to limit the LM dictionary to the most frequent
words, let us say the top 10K. We can obtain such a list by:
$> dict -i="gunzip -c train.gz" -o=top10k -pr=10000
Notice: the list will also include the start/end-sentence symbols.
An alternative pruning strategy is to filter out words occurring no more than a specified number of times. The
following example removes words occurring ≤ 5 times and keeps at most the top frequent 10K of the
others:
$> dict -i="gunzip -c train.gz" -o=top10k5f -pr=10000 -pf=5
Statistics about the frequency of words inside a corpus can be gathered through the command dict with
the option -Curve=yes, while out-of-vocabulary rate statistics over a test set can be computed with the
option -TestFile=<sample>. The following example illustrates both features:
$> dict -i="gunzip -c train.gz" -Curve=yes -TestFile=test
A new n-gram table for the limited dictionary can be computed with ngt by specifying the sub-dictionary:
$> ngt -i=train.www -sd=top10k -n=3 -o=train.10k.www -b=yes
The command replaces all words outside top10k with the special out-of-vocabulary symbol unk.
Another useful feature of ngt is the merging of two n-gram tables. Assume that we have split our training
corpus into two files, text-a and text-b, and have computed n-gram tables for both files; we can merge
them with the option -aug:
$> ngt -i="gunzip -c text-a.gz" -n=3 -o=text-a.www -b=yes
$> ngt -i="gunzip -c text-b.gz" -n=3 -o=text-b.www -b=yes
$> ngt -i=text-a.www -aug=text-b.www -n=3 -o=text.www -b=yes
Warning: Note that if the concatenation of text-a.gz and text-b.gz is equal to train.gz, the
resulting n-gram tables text.www and train.www can slightly differ. This happens because, during the
construction of each single n-gram table, a few n-grams are automatically added to make it consistent for
further computation.
5 More on tlm
Language models have to cope with out-of-vocabulary words, which are internally represented by the word
class unk. In order to compare the perplexity of LMs having different vocabulary sizes, it is better to define a
conventional dictionary size, or dictionary upper bound, through the parameter -dub. In the following
example, we compare the perplexity of the full-vocabulary LM against the perplexity of the LM estimated
over the 10K most frequent words. In our comparison, we assume a dictionary upper bound of one million
words.
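The commands of this comparison are not reproduced in this copy; a sketch, reusing the n-gram tables built
above and Witten-Bell smoothing as an illustrative choice (outputs omitted), is:
$> tlm -tr=train.www -n=3 -lm=wb -te=test -dub=1000000
$> tlm -tr=train.10k.www -n=3 -lm=wb -te=test -dub=1000000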
The large difference in perplexity between the two LMs is explained by the significantly higher OOV rate
of the 10K-word LM.
N-gram LMs generally apply frequency smoothing techniques, and combine smoothed frequencies accord-
ing to two main schemes: interpolation and back-off. The toolkit uses interpolation by default. The
back-off scheme is computationally more costly but often provides better performance. It can be activated
with the option -bo=yes, e.g.:
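A sketch of such an invocation, again with Witten-Bell smoothing as an illustrative choice:
$> tlm -tr=train.www -n=3 -lm=wb -bo=yes -te=test -dub=1000000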
This toolkit implements several frequency smoothing methods, which are selected through the parameter -lm.
Three methods are particularly recommended (example invocations are sketched after this list):
a) Modified shift-beta, also known as "improved Kneser-Ney smoothing". This smoothing scheme gives
top performance when training data is not very sparse, but it is more time and memory consuming
during the estimation phase.
b) Witten-Bell smoothing. This is an excellent smoothing method which works well in every data
condition and is much less time and memory consuming.
c) Shift-beta smoothing. This smoothing method is a simpler and cheaper version of the modified
shift-beta method and sometimes works better than the Witten-Bell method.
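As a sketch, the three methods could be invoked as follows. The -lm values wb and msb appear elsewhere
in this manual; the value sb for shift-beta is an assumption to be checked against the tlm help.
$> tlm -tr=train.www -n=3 -lm=msb -te=test -dub=1000000   # modified shift-beta
$> tlm -tr=train.www -n=3 -lm=wb -te=test -dub=1000000    # Witten-Bell
$> tlm -tr=train.www -n=3 -lm=sb -te=test -dub=1000000    # shift-beta (value assumed)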
Moreover, the non-linear smoothing parameter β can be specified with the option -beta:
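For example (the value 0.1 is purely illustrative, and -lm=sb is the same assumption as above):
$> tlm -tr=train.www -n=3 -lm=sb -beta=0.1 -te=test -dub=1000000   # beta value illustrative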
This could be helpful in case we need to use language models with very limited frequency smoothing.
Limited Vocabulary
Using an n-gram table with a fixed or limited dictionary will cause some performance degradation, as the LM
smoothing statistics become slightly distorted. A valid alternative is to estimate the LM on the full dictionary
of the training corpus and to use a limited dictionary just when saving the LM on a file. This can be achieved
with the option -d (or -dictionary):
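A sketch, reusing the top10k word list built in Section 4 (the output file name is illustrative):
$> tlm -tr=train.www -n=3 -lm=wb -d=top10k -o=train.10k.lm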
6 LM Adaptation
Language model adaptation can be applied when little training data is given for the task at hand, but much
more data from other less related sources is available. tlm supports two adaptation methods.
Warning: modified shift-beta smoothing cannot be applied in open-vocabulary mode (-ao=yes). If this
is the case, you should either change the smoothing method or simply add the adaptation text to the background
LM (using the -aug parameter of ngt). In general, the latter solution should provide better performance.
$> ngt -i=train.www -aug=adapt -o=train-adapt.www -n=3 -b=yes
$> tlm -tr=train-adapt.www -lm=msb -n=3 -te=test -dub=1000000 -ad=adapt -ar=0.8
n=49984 LP=312276.1746 PP=516.7311396 OOVRate=0.04193341869
$> tlm -tr=train.www -lm=mix -slmi=sublm -n=3 -te=test -dub=1000000
n=49984 LP=307199.3273 PP=466.8244383 OOVRate=0.04193341869
Warning: for computational reasons it is expected that the n-gram table specified by -tr contains AT
LEAST the n-grams of the last table specified in the slmi file, i.e. train.www in the example. Faster
computations are achieved by putting the largest dataset as the last sub-model in the list and the union of all
data sets as training file.
It is also IMPORTANT that a large -dub value is specified so that probabilities of sub-LMs can be correctly
computed in case of out-of-vocabulary words.
7 Estimating Gigantic LMs
LM estimation starts with the collection of n-grams and their frequency counters. Then, smoothing param-
eters are estimated for each n-gram level; infrequent n-grams are possibly pruned and, finally, a LM file is
created containing n-grams with probabilities and back-off weights. This procedure can be very demanding
in terms of memory and time if it is applied to huge corpora. We provide here a way to split LM training
into smaller and independent steps that can be easily distributed among independent processes. The
procedure relies on a training script that makes little use of computer RAM and implements the Witten-Bell
smoothing method in an exact way.
Before starting, let us create a working directory under examples, as many files will be created:
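The commands are not reproduced in this copy; a sketch, assuming the option letters of the build-lm.sh
script (-i, -n, -o, -k) and an illustrative directory name:
$> mkdir stat                                                  # working directory (name illustrative)
$> build-lm.sh -i "gunzip -c train.gz" -n 3 -o train.ilm.gz -k 5   # option letters assumed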
The script splits the estimation procedure into 5 distinct jobs, which are explained in the following section.
There are other options that can be used. We recommend, for instance, pruning singletons to obtain
smaller LM files. Notice that build-lm.sh produces a LM file train.ilm.gz that is NOT in the
final ARPA format, but in an intermediate format called iARPA, which is recognized by the compile-lm
command and by the Moses SMT decoder running with IRSTLM. To convert the file into the standard ARPA
format you can use the command:
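A sketch of the conversion; the --text option name is an assumption to be checked against compile-lm --help:
$> compile-lm train.ilm.gz --text yes lm-final   # --text option name assumed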
this will create the proper ARPA file lm-final. To create a gzipped file you might also use:
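For instance (same assumption on the --text option):
$> compile-lm train.ilm.gz --text yes /dev/stdout | gzip -c > train.lm.gz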
In the following sections, we will discuss LM file formats, the compilation of LMs into a more compact and
efficient binary format, and the querying of LMs.
7.1 Estimating a LM with a Partial Dictionary
A sub-dictionary can be defined by taking only words occurring more than 5 times (-pf=5) and, at most, the
5000 most frequent words (-pr=5000):
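A sketch of the dict invocation (the output file name sdict is illustrative):
$> dict -i="gunzip -c train.gz" -o=sdict -pr=5000 -pf=5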
The LM can be restricted to the defined sub-dictionary with the command build-lm.sh by using the
option -d:
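A sketch, reusing the sub-dictionary built above (output file name illustrative, other option letters as assumed earlier):
$> build-lm.sh -i "gunzip -c train.gz" -n 3 -o train.sd.ilm.gz -k 5 -d sdict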
Notice that all words outside the sub-dictionary will be mapped into the <unk> class, the probability of
which will be directly estimated from the corpus statistics. A preferable alternative to this approach is to
estimate a large LM and then to filter it according to a list of words (see Filtering a LM).
8 LM File Formats
The toolkit supports three output formats for LMs. These formats have the purpose of permitting the use of
LMs by external programs. External programs could in principle estimate the LM from an n-gram table
before using it, but this would take much more time and memory! So the best thing to do is to first estimate
the LM and then compile it into a binary format that is more compact and that can be quickly loaded and
queried by the external program.
9 LM Pruning
Large LM files can be pruned in a smart way by means of the command prune-lm, which removes n-grams
for which resorting to the back-off results in a small loss. The IRSTLM toolkit implements a method similar to
the Weighted Difference Method described in the paper Scalable Backoff Language Models by Seymore and
Rosenfeld.
The syntax is as follows:
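The command line is not reproduced in this copy; a sketch with one threshold per pruned level (2-grams and
3-grams), using the ARPA LM built in Section 3:
$> prune-lm --threshold=1e-6,1e-6 train.lm train.plm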
Thresholds for each n-gram level, from 2-grams upward, are based on empirical evidence. A threshold of zero results
in no pruning. If fewer thresholds are specified, the rightmost one is applied to the higher levels. Hence, in the
above example we could have just specified one threshold, namely --threshold=1e-6. The effect of
pruning is shown in the following messages of prune-lm:
The saved LM table train.plm contains about 3% fewer bigrams and 34% fewer trigrams. Notice that the
output of prune-lm is an ARPA LM file, while the input can be either an ARPA or binary LM. In order to
measure the loss in accuracy introduced by pruning, perplexity of the resulting LM can be computed (see
below).
Warning: The IRSTLM toolkit does not provide a reliable probability for the special 1-gram composed of the
sentence start symbol (<s>), because no one should ever ask for it. However, this pruning method requires
the computation of the probability of this 1-gram. Hence, (only) in this case the probability of this special
1-gram is arbitrarily set to 1.
10 LM Quantization
A language model file in ARPA format, created with the IRST LM toolkit or with other tools, can be quan-
tized and stored in a compact data structure, called language model table. Quantization can be performed
by the command:
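A sketch, assuming the quantize-lm command shipped with the toolkit and the ARPA LM train.lm built earlier
(argument order to be checked against the command help):
$> quantize-lm train.lm train.qlm   # command name and argument order assumed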
which generates the quantized version train.qlm that encodes all probabilities and back-off weights in
8 bits. The output is in a modified ARPA format, called qARPA. Notice that quantized LMs reduce memory
consumption at the cost of some loss in performance. Moreover, probabilities of quantized LMs are not
guaranteed to be properly normalized.
11 LM Compilation
LMs in ARPA, iARPA, and qARPA format can be stored in a compact binary table through the command:
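A sketch, using the ARPA LM built in Section 3:
$> compile-lm train.lm train.blm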
which generates the binary file train.blm that can be quickly loaded into memory. If the LM is really very
large, compile-lm can avoid creating the binary LM directly in memory through the option -memmap 1,
which exploits the memory mapping mechanism in order to work as much as possible on disk rather than
in RAM.
This option clearly pays a price in terms of speed, but it is often the only way to proceed. It is also recommended
that the hard disk storing the LM be local to the computer on which the compilation is performed.
Notice that most of the functionalities of compile-lm (see below) apply to binary and quantized models.
By default, the command uses the directory "/tmp" for storing intermediate results. For huge LMs, the
temporary files can grow dramatically, causing a "disk full" system error. It is possible to explicitly set the
directory used for temporary computation through the parameter "--tmpdir".
12 Filtering a LM
A large LM can be filtered according to a word list through the command:
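A sketch; the --filter option name and the argument order are assumptions to be checked against
compile-lm --help (list is a plain word list, one word per line):
$> compile-lm train.blm --filter=list filtered.lm   # option name and argument order assumed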
The resulting LM will only contain n-grams inside the provided list of words, with the exception of the
1-gram level, which by default is preserved identical to the original LM. This behavior can be changed
by setting the option --keepunigrams no. LM filtering can be useful when a very large LM has to be
specialized in advance to work on a particular portion of language. If the original LM is in binary format
and is very large, compile-lm can avoid loading it into memory through the memory mapping option
-memmap 1.
13 LM Interface
LMs are useful when they can be queried through another application in order to compute perplexity scores
or n-gram probabilities. IRSTLM provides two possible interfaces:
• at the command level, through stand-alone programs such as compile-lm and interpolate-lm
• at the C++ library level, mainly through methods of the class lmtable
In the following, we will only focus on the command-level interface. Details about the C++ library interface
will be provided in a future version of this manual.
To compute the perplexity directly from the LM on disk, we can use the command:
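A sketch, mirroring the pruned-LM evaluation shown later in this section:
$> compile-lm train.lm --eval test --dub 10000000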
Notice that PPwp reports the contribution of OOV words to the perplexity. Each OOV word is indeed
penalized by dividing the LM probability of the unk word by the quantity
DictionaryUpperBound - SizeOfDictionary
The OOV penalty can be modified by changing the DictionaryUpperBound with the parameter --dub
(whose default value is set to 10^7).
Interestingly, a slightly better value is obtained, which could be explained by the fact that pruning has
removed many infrequent trigrams and has redistributed their probabilities over more frequent bigrams.
Notice that PPwp reports the perplexity with a fixed dictionary upper bound of 10 million words. Indeed:
Again, if the LM is in binary format and is very large, compile-lm can avoid loading it into memory through
the memory mapping option -memmap 1.
By enabling the option “--sentence yes”, compile-lm computes perplexity and related figures
(OOV rate, number of backoffs, etc.) for each input sentence. The end of a sentence is identified by a
given symbol (</s> by default).
$> compile-lm train.plm --eval test --dub 10000000 --sentence yes
Finally, tracing information for the --eval option is shown by setting debug levels from 1 to 4
(--debug):
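For instance:
$> compile-lm train.plm --eval test --dub 10000000 --debug 1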
The command reports the currently observed n-gram, including unk words, a dummy constant frequency
1, the log-probability of the n-gram, and the number of back-offs performed by the LM.
Warning: All cross-sentence n-grams are skipped. 1-grams consisting of the sentence start symbol are also
skipped. In an n-gram, all words before the sentence start symbol are removed. For n-grams whose size is
smaller than the LM order, the probability is not computed, but a NULL value is returned.
14 LM Interpolation
We provide a convenient tool to estimate mixtures of LMs that have already been created in one of the
available formats. The tool allows one to estimate interpolation weights through the EM algorithm, to compute
perplexity, and to query the interpolated LM.
Data used in these examples can be found in the directory example/interpolateLM/, which is the
relative path assumed for all file parameters of the commands below.
Interpolated LMs are defined by a configuration file in the following format:
3
0.3 lm-file1
0.3 lm-file2
0.4 lm-file3
The first number indicates the number of LMs to be interpolated; then each LM is specified by its weight
and its file (either in ARPA or binary format). Notice that you can interpolate LMs with different orders.
Given an initial configuration file lmlist.init (with arbitrary weights), new weights can be estimated
through Expectation-Maximization on some text sample test by running the command:
$> interpolate-lm lmlist.init --learn test
New weights will be written in the updated configuration file, called by default lmlist.init.out. You
can also specify the name of the updated configuration file as follows:
$> interpolate-lm lmlist.init --learn test lmlist.final
Similarly to compile-lm, interpolated LMs can be queried through the option --score
$> interpolate-lm lmlist.final --score yes < test
and can return the perplexity of a given input text (“--eval text-file”), optionally at sentence level
by enabling the option “--sentence yes”,
$> interpolate-lm lmlist.final --eval test
$> interpolate-lm lmlist.final --eval test --sentence yes
If there are binary LMs in the list, interpolate-lm can avoid loading them into memory through the
memory mapping option -memmap 1.
The full list of options is:
--learn text-file learn optimal interpolation for text-file
--order n order of n-grams used in --learn (optional)
--eval text-file compute perplexity on text-file
--dub dict-size dictionary upper bound (default 10^7)
--score [yes|no] compute log-probs of n-grams from stdin
--debug [1-3] verbose output for --eval option (see compile-lm)
--sentence [yes|no] compute perplexity at sentence level (identified
through the end symbol)
--memmap 1 use memory map to read a binary LM
15 Parallel Computation
This package provides facilities to build a gigantic LM in parallel in order to reduce computation time. The
script implementing this feature is based on the SUN Grid Engine software (http://www.sun.com/software/gridware).
To apply the parallel computation run the following script (instead of build-lm.sh):
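A sketch, assuming the parallel script distributed with the toolkit is named build-lm-qsub.sh and accepts
the same basic options as build-lm.sh:
$> build-lm-qsub.sh -i "gunzip -c train.gz" -n 3 -o train.ilm.gz -k 5   # script name assumed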
Besides the options of build-lm.sh, parameters for the SGE manager can be provided through the
following one:
The script performs the same split-and-merge policy described in Section 7, but some computation is
performed in parallel (instead of sequentially), distributing the tasks over several machines.
16 Class and Chunk LMs
The IRSTLM toolkit allows the use of class and chunk LMs, and special handling of input tokens which are
concatenations of N ≥ 1 fields separated by the character #, e.g.
word#lemma#part-of-speech#word-class
The processing is guided by the format of the file passed to Moses or compile-lm: if it contains just the
LM, either in textual or binary format, it is treated as usual; otherwise, it is supposed to have the following
format:
where:
The various cases are discussed with examples in the following. Data used in these examples can be found in
the directory example/chunkLM/, which is the relative path assumed for all file parameters of the
commands. Note that the texts with different tokens (words, POS, word#POS pairs, ...), used either as input or
for training LMs, are all derived from the same multifield texts in order to allow direct comparison of results.
Examples:
$> compile-lm --eval test/test.w-micro cfgfile/cfg.1stfield
%% Nw=126 PP=9.71 PPwp=0.00 Nbo=76 Noov=0 OOV=0.00%
The result of the latter case is identical to that obtained with the standard configuration involving just words:
w1 class(w1)
w2 class(w2)
...
wM class(wM)
The map is applied to each component of n-grams before the LM query. Examples:
Such a sequence is collapsed into a single chunk label (let us say CHNK) as long as TAG(, TAG+ and
TAG) are all mapped into the same label CHNK. Mapping them to different labels, or a different use/position
of the characters (, + and ) in the lexicon of tags, prevents the collapsing operation even if <collapse> is set to
true. Of course, if <collapse> is false, no collapsing is attempted.
Warning: In this context, the parameter <lmmacroSize> plays an important role: it defines the
size of the n-gram before the collapsing operation, that is, the number of microtags of the actually processed
sequence. <lmmacroSize> should be large enough to ensure that, after the collapsing operation, the
resulting n-gram of chunks is at least of the size of the LM to be queried (the <lmfilename>). As an
example, assuming <lmmacroSize>=6, <selectedField>=1, <collapse>=true and 3 as the size
of the chunk LM, the following input
will result in querying the LM with just the bigram (PP,NP), instead of a more informative trigram; for
this particular case, the value 6 for <lmmacroSize> is not enough. On the other hand, for efficiency
reasons, it cannot be set to an unlimited value. A reasonable value can be derived from the average number
of microtags per chunk (2-3), which means setting <lmmacroSize> to two to three times the size of the LM
in <lmfilename>. Examples:
Note that the configuration (16.3.c) gives the same result as that in example (16.2.b), as they are equivalent.
16.3.d) As an actual example related to the "warning" note reported above, the following configuration with
the usual LM:
does not necessarily yield the same log-likelihood (logPr) nor the same perplexity (PP) as case (16.3.a). In
fact, concerning PP, the length of the input sequence is definitely different (126 tokens before collapsing,
73 after). Even the logPr is different (-33.29979642 vs. -33.28748842) because in (16.3.a) some
6-grams (<lmmacroSize> is set to 6) after collapsing reduce to n-grams of size smaller than 3 (the size
of lm/train.macro.blm). By setting <lmmacroSize> to a larger value (e.g. 8), the same logPr would be
computed.
A Reference Material
The following books contain basic introductions to statistical language modeling:
• Speech and Language Processing, by Dan Jurafsky and Jim Martin, chapter 6.
The following papers describe specific aspects of the IRSTLM toolkit:
Marcello Federico and Mauro Cettolo, Efficient Handling of N-gram Language Models for
Statistical Machine Translation, In Proc. of the Second Workshop on Statistical Machine
Translation, pp. 88-95, ACL, Prague, Czech Republic, 2007.
Marcello Federico and Nicola Bertoldi, How Many Bits Are Needed To Store Probabilities
for Phrase-Based Translation?, In Proc. of the Workshop on Statistical Machine Translation,
pp. 94-101, NAACL, New York City, NY, 2006.
Marcello Federico and Nicola Bertoldi, Broadcast news LM adaptation over time, Computer
Speech and Language, 18(4): pp. 417-435, October 2004.
B Release Notes
B.1 Version 3.2
• Quantization of probabilities
• Scripts and data structures for the estimation and handling of gigantic LMs
• Bug fixes
B.5 Version 5.05
• (Optional) computation of OOV penalty in terms of single OOV word instead of OOV class
• Updated documentation
B.10 Version 5.30
• Support for a safe management of LMs with a total amount of n-grams larger than 250 million
• Use of a new parameter to specify a directory for temporary computation, because the default ("/tmp")
could be too small