Docent Configuration
Docent's configuration is specified in an XML file with a root tag <docent> enclosing five subsections named <random>, <state-generator>, <search>, <models> and <weights>, all of which are mandatory and must be present in this order. Some of the tags described below can take extra parameters, which can be supplied using <p name="..."> subtags. You can find some example configuration files in the tests/config subdirectory.
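For orientation, a minimal sketch of such a file, following the structure just described, looks as follows (section contents elided; the configuration files in tests/config show complete, working setups):

```xml
<docent>
  <random/>
  <state-generator>
    <!-- initial state and operations, see below -->
  </state-generator>
  <search>
    <!-- search algorithm and its parameters -->
  </search>
  <models>
    <!-- feature function definitions -->
  </models>
  <weights>
    <!-- one weight per feature score -->
  </weights>
</docent>
```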
The <random> section controls the initialisation of the random number generator. Typically, it will be left empty (<random/>). This means that the random number generator should be seeded with a non-predictable value which will be read from /dev/urandom or, if this fails, from the system clock. The actual seed value that was used will be output in the decoder log. If the <random> tag is non-empty, its contents must be an unsigned 32-bit integer which will be used to seed the random number generator. This is useful for starting predictable runs when debugging the decoder.
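For example, to make debugging runs reproducible, a fixed seed can be given directly as the tag's contents:

```xml
<random>42</random>
```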
The <state-generator> section controls how the decoder state is initialised (<initial-state> tag) and which operations can be applied to the state in each step (<operation> tags). The initialisation options and the available operations are described in our EMNLP 2012 paper.
- <initial-state type="monotonic"> specifies uninformed monotonic initialisation with randomly selected phrase translations.
- <initial-state type="beam-search"> requests state initialisation by dynamic programming beam search. It takes one mandatory parameter called ini, a pointer to a moses.ini file (see the sketch after this list).
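Using the <p name="..."> parameter syntax introduced above, the beam-search initialisation could be requested like this (the path is a placeholder, and the exact element nesting is our assumption; compare with the files in tests/config):

```xml
<initial-state type="beam-search">
  <p name="ini">/path/to/moses.ini</p>
</initial-state>
```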
State operations are specified with a set of <operation type="..." weight="..."> tags. The operations of type change-phrase-translation, swap-phrases and resegment are described in the paper. The weight attribute gives the probability with which each operation is selected. Some of the operations take additional parameters controlling the range over which they are likely to be applied; stick to the defaults if unsure.
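As a sketch, a set of operations might be declared as follows; the weight values here are illustrative only, not recommended settings:

```xml
<operation type="change-phrase-translation" weight="0.8"/>
<operation type="swap-phrases" weight="0.1"/>
<operation type="resegment" weight="0.1"/>
```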
The <search> section configures the search algorithm to be used. Docent currently implements two search algorithms, simulated-annealing and local-beam-search. Simulated annealing takes three parameters named max-steps, max-rejected and schedule. The first two define the step limit and the rejection limit discussed in our paper. With the cooling schedule hill-climbing, simulated annealing is equivalent to local beam search with beam size 1; this is the setup we currently recommend. Other cooling schedules are implemented, but they are difficult to parametrise and not well tested.
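Putting this together, a hill-climbing setup of the kind we recommend might be configured roughly as follows. The algorithm attribute and the literal parameter values are assumptions made for illustration; the files in tests/config are authoritative:

```xml
<search algorithm="simulated-annealing">
  <!-- step and rejection limits; values are placeholders, not recommendations -->
  <p name="max-steps">100000</p>
  <p name="max-rejected">10000</p>
  <!-- hill-climbing makes annealing equivalent to local beam search with beam size 1 -->
  <p name="schedule">hill-climbing</p>
</search>
```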
The <models> section defines the feature functions. For users familiar with Moses, the options in the example files should be fairly self-explanatory. Each model needs an id attribute by which it can be referred to in the <weights> section. The following models are currently supported (a configuration sketch follows the list):
- geometric-distortion-model: The standard simple unlexicalised distortion cost model. It provides a second score counting violations of a given maximum distortion distance, which can be used to implement a distortion limit like the one commonly used in DP beam search.
- word-penalty: Word count feature.
- oov-penalty: Out-of-vocabulary word count feature. Note that, unlike in Moses, this must be specified explicitly in Docent if you want to use it.
- ngram-model: Standard n-gram language model. The parameter lm-file specifies the language model file. Set the parameter annotation-level to use a language model over annotations. Acceptable formats include models built with the KenLM or SRILM language modelling toolkits.
- phrase-table: The phrase table. The parameter file specifies the location of the phrase table. Set the parameter load-alignments to true if your binary phrase table contains phrase-internal word alignments.
- semantic-space-language-model: The semantic language model described in our EMNLP 2012 paper.
- sentence-parity-model: A proof-of-concept model enforcing sentence length parity; it exists mainly to demonstrate how to implement a simple feature function.
- bleu-model: A model whose score is the BLEU score of the output against a set of reference translations, allowing the decoder to maximise BLEU directly. The reference-file parameter specifies where the reference translations are found (plain text format). Note that currently only one reference translation per sentence is supported.
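As an illustration, a <models> section combining several of the feature functions above might look roughly like this. The <model> element name and attribute layout are assumptions based on the description above, and paths and ids are placeholders; the example files in tests/config are authoritative:

```xml
<models>
  <!-- standard n-gram language model built with KenLM or SRILM -->
  <model id="lm" type="ngram-model">
    <p name="lm-file">/path/to/lm.blm</p>
  </model>
  <!-- the phrase table -->
  <model id="pt" type="phrase-table">
    <p name="file">/path/to/phrase-table</p>
  </model>
  <!-- word count feature -->
  <model id="wp" type="word-penalty"/>
</models>
```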
The <weights> section contains the feature weight for each score. Features are referred to by their id attributes. If a model produces multiple scores, these are numbered starting from zero.
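A hypothetical <weights> section matching the models sketched above could then look like this; the <weight> tag and its attributes are our assumption for illustration, and the weight values are arbitrary:

```xml
<weights>
  <weight model="lm">0.5</weight>
  <weight model="wp">-1.0</weight>
  <!-- a model with several scores gets one entry per score, numbered from zero -->
  <weight model="pt" score="0">0.2</weight>
  <weight model="pt" score="1">0.2</weight>
</weights>
```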
To use an annotated model with Docent, you must first create a phrase table that contains annotations. An example of how to do this is as follows:
Given a parallel corpus corpus.xx and corpus.yy, where xx is the source language and yy the target, first annotate corpus.yy by placing a '|' symbol after each token, followed by the annotation (e.g. the POS tag, as in the|DT cat|NN). Docent currently only handles annotations on the target side. Then train a model in Moses, making sure to pass the annotated corpus as well as the flag --translation-factors 0-0,1. The resulting phrase table should then be filtered and binarised and specified in the phrase-table model in the configuration file as usual. The parameter annotation-count must also be set to 1 within the phrase-table model.
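After filtering and binarisation, the annotated phrase table might be declared like this (the element layout, path and id are placeholders following the sketches above; annotation-count is the parameter documented here):

```xml
<model id="pt" type="phrase-table">
  <p name="file">/path/to/annotated/phrase-table</p>
  <p name="annotation-count">1</p>
</model>
```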
A language model should then be created to make use of the annotations. If using POS tags, for example, this can be achieved by extracting the tags from a large tagged monolingual corpus and running standard language modelling software such as KenLM over them (the resulting model should also be binarised to speed up simulations). In the Docent configuration file, a new ngram-model should then be specified, with the lm-file parameter pointing to this model and annotation-level set to 0.
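The annotation language model could then be added as a second ngram-model along the following lines (path and id are placeholders). Remember that every model score also needs a corresponding entry in the <weights> section:

```xml
<model id="pos-lm" type="ngram-model">
  <p name="lm-file">/path/to/pos-lm.blm</p>
  <p name="annotation-level">0</p>
</model>
```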