Machine Learning in Translation Corpora Processing

A SCIENCE PUBLISHERS BOOK
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers,
MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety
of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment
has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Acknowledgements

First of all, I want to thank everyone without whom this work would not have been accomplished. Especially, much gratitude goes to my family—mostly to my
wife Agnieszka, for her help, patience and support. Additionally, I dedicate
this diploma to my children, Kamil, Jagoda and Zuzanna, because they
provided me with motivation and time for this work.
Special thanks go to my parents, for their effort in raising me and having
faith in me and in my actions. Without them, I would not be in the place where
I am now.
Lastly, I thank my supervisor and mentor, Professor Krzysztof Marasek,
for all the help and hard work he gave to support me—especially for his
valuable comments and support in my moments of doubt.
Preface
The main problem investigated here was how to improve the statistical machine translation of speech between Polish and English. While excellent statistical
translation systems exist for many popular languages, it is fair to say that
the development of such systems for Polish and English has been limited.
Research has been conducted mostly on dictionary-based [145], rule-based
[146], and syntax-based [147] machine translation techniques. The most
popular methodologies and tools are not well-suited for the Polish language
and therefore require adaptation. Polish language resources are lacking in
parallel and monolingual data. Therefore, the main objective of the present
study was to develop an automatic and robust Polish-to-English translation
system in order to meet specific translation requirements and to develop
bilingual textual resources by mining comparable corpora.
Experiments were conducted mostly on casual human speech, consisting
of lectures [15], movie subtitles [14], European Parliament proceedings [36]
and European Medicines Agency texts [35]. The aims were to rigorously analyze the problems and to improve the quality of baseline systems, i.e., to adapt techniques and training parameters in order to increase the Bilingual Evaluation Understudy (BLEU) [27] score for maximum performance.
performance. A further aim was to create additional bilingual and monolingual
data resources by using available online data and by obtaining and mining
comparable corpora for parallel sentence pairs. For this task, a methodology
employing a Support Vector Machine and the Needleman-Wunsch algorithm
was used [19], along with a chain of specialized tools. The resulting data was used to enrich the information in translation systems, and it was adapted to specific text domains by linear interpolation [82] and Modified Moore-Lewis filtering [60].
Another indirect goal was to analyze available data and improve its quality,
thus enhancing the quality of machine language system output. A specialized
tool for aligning and filtering parallel corpora was developed.
This work was innovative for the Polish language, given the paucity
of good quality data and machine translation systems available, and meets
pressing translation requirements. While great progress has been made, it
1 www.iwslt.org
Contents
Acknowledgements iii
Preface v
Abbreviations and Definitions xiii
Overview xv
1. Introduction 1
1.1 Background and context 2
1.1.1 The concept of cohesion 4
1.2 Machine translation (MT) 5
1.2.1 History of statistical machine translation (SMT) 5
1.2.2 Statistical machine translation approach 6
1.2.3 SMT applications and research trends 8
based machine translation system that outperformed any previous systems for
Wall Street Journal text translation. The authors were able to achieve BLEU
scores as high as 41 by applying various techniques for improving machine
translation quality. The authors adapted an alignment symmetrization method
to the needs of the Czech language and used the lemmatized forms of words.
They proposed a post-processing step for correct translation of numbers and
enhanced parallel corpora quality and quantity.
In [162], the author prepared a factored, phrase-based translation system
from Czech to English and an experimental system using tree-based transfer
at a deep syntactic layer. In [163], the authors presented another factored translation model that uses multiple factors taken from the annotation of input and output tokens. Experiments were conducted on news texts that were previously lemmatized and tagged. A similar system, in terms of text domain and main ideas, is also presented for Russian [164].
Some interesting translation systems between Czech and languages other than English (German, French, Spanish) were recently presented in [165]. The specific properties of these languages were described and exploited in order to avoid language-specific errors. In addition, an experiment was conducted on translation using English as a pivot language, because of the great disparity in available parallel data.
A comparison of how the relatedness of two languages influences the
performance of statistical machine translation is described in [166]. The
comparison was made between the Czech, English, and Russian languages.
The authors proved that translation between related languages provides better
translations. They also concluded that, when dealing with closely-related
languages, machine translation is improved if systems are enriched with
morphological tags, especially for morphologically-rich languages [166].
Such systems are urgently required for many purposes, including web,
medical text and international translation services, for example, for the error-
free, real-time translation of European Parliament Proceedings (EUP) [36].
source language into the target language. Translations are built on gigantic
dictionaries and sophisticated linguistic rules. Users can improve out-of-the-
box translation quality by adding their own terminology into the translation
process. They create user-defined dictionaries, which override the system’s
default settings [137]. The rule-based machine translation (RBMT) process is more complex than simple substitution of one word for another. For such systems, it is necessary to prepare linguistic rules that allow words to be put in different places and to change their meaning depending on context. The RBMT methodology applies a set of linguistic rules in three phases: syntax and semantic analysis, transfer, and syntax and semantic generation [168].
An MT system, when given a source text, first tries to segment it, for
example, by expanding elisions or marking set phrases. The system then
searches for such segments in a dictionary. If a segment is found, the search
will return the base form and accompanying tags for all matches (using
morphological analysis). Next, the system tries to resolve ambiguous segments
(e.g., terms that have more than one match) by choosing only one. RBMT systems sometimes add an extra lexical selection step that allows choosing between alternative meanings.
One of the main problems with the RBMT approach to translation is
making a choice between correct meanings. This involves a disambiguation,
or classification, problem. For example, to improve accuracy, it is possible to
disambiguate the meanings of single words. The biggest problem with such
systems is the fact that the construction of rules requires a great effort. The
same goes for any sort of modifications to the rules that do not necessarily
improve overall system accuracy.
At the most basic level, an MT system simply substitutes words in one
language for words in another. However, this process cannot produce a good
translation of a text, because recognition of whole phrases and their closest
counterparts in the target language is needed. Solving this problem with
corpus and statistical techniques is a rapidly growing field that is leading to
better translations [85].
SMT approaches the translation of natural language as a machine-learning
problem and is characterized by the use of machine learning methods. By
investigating and examining many examples of human-produced translation,
SMT algorithms automatically learn how to translate. This means that a
learning algorithm is applied to a large body of previously-translated text,
known as a parallel corpus. The “learned” system is then able to translate
previously-unseen sentences. With an SMT toolkit and enough parallel text, a
MT system for a new language pair can be developed in a short period of time.
The correctness of these systems depends significantly on the quantity, quality
and domain of the available data.
2 http://www.skype.com/en/translator-preview/
3 http://www.ustar-consortium.com/
4 http://www.eu-bridge.eu/
5 https://pl.wikipedia.org/wiki/N-gram
2.2.1 Words
Naturally, the fundamental unit of written language (in European languages) that native speakers can identify is a word, which is also used in SMT (in spoken language, the basic units are syllables and phrases). A good example is the word “house,” which means a dwelling that serves as living quarters for one or more families. In terms of structure, it may refer to a building that has a roof and walls and stands more or less permanently in one place. While the word (house) may be used in different contexts that surround the language unit and help to determine its interpretation, the word more often than not conveys the same meaning [85].
That aside, there are smaller units of language used within a word, such as in “houses.” More precisely, the “s” may not add much weight to the word, even though the word can still be understood by native English speakers. All in all, words are separated to form a coherent meaning by the way they appear
2.2.2 Sentences
Sentences are strings of words that satisfy the grammatical rules of a language.
Below is an example:
Fox News was ranked among the top networks during the last
American general elections
The sentence above is complete, having a subject, verb and place.
Be that as it may, stylistic fragments can be used in creative writing, which is less restrictive than formal writing [85].
RUN-ON—Also known as fused sentences, run-on sentences are typified by
putting two complete sentences together without a break or punctuation. An
example of a run-on sentence is:
striking idiosyncrasy of language called recursion. Note that this can also be
applied to sentences. This exceptional sentence may be extended by modifiers
of Jane, who, starting late, won the lottery and acquired the house that was just
put on the market [85].
This provides an opportunity to refine the considered sentence. A sentence may involve one or more clauses, each of which embodies a verb with its arguments and adjuncts. Clauses may be arguments and adjuncts themselves, as in: “I proposed to set out for some swimming.” The recursive refinement of sentences with subtle improvements raises a huge number of issues for the analysis of text. One especially grievous issue for specific sentences is structural ambiguity. Consider the three sentences:
• Joe eats steak with a knife.
• Jim eats steak with ketchup.
• Jane watches the man with the telescope.
Each of the three sentences ends in a prepositional phrase. In the first sentence, the phrase modifies the verb (the eating happens with a knife). In the second sentence, it modifies the object (the steak has ketchup on it). But what about the third sentence? Does Jane use a telescope to watch the man, or does the man she watches have a telescope? Structural ambiguity is introduced not only by such modifiers, for example, the prepositional phrase issue mentioned previously. Connectives likewise add ambiguity, as their scope is not always clear. Consider the sentence: Jim washes the dishes and watches television with Jane. Is Jane assisting with the dishes, or is she simply joining Jim for TV [85]?
The reason why structural ambiguity is such a difficult issue for automatic natural language processing systems is that the ambiguity is usually resolved semantically. People are not confused when reading in context. In addition, speakers and authors use ambiguity when the precise intention does not matter. However, what is clear to a human is not evident to a machine. How should your desktop computer know that “steak and knife” do not make a delicious dinner [85]?
2.2.3 Corpora
The multilingual nature of the world makes translation a crucial requirement
today. Parallel dictionaries constructed by humans are a widely-available
resource, but they are limited and do not provide enough coverage for good
quality translation purposes, due to out-of-vocabulary words and neologisms.
This motivates the use of statistical translation systems, which are dependent
on the quantity and quality of training data. A corpus is a large collection
of texts, stored on a computer. Text collections are called corpora. The term
6 http://www.statmt.org/moses/?n=Moses.Background
p(e | f) = φ(f | e)^weightφ × LM(e)^weightLM × D(e, f)^weightd × W(e)^weightW  (3)
2.4.1 Tokenization

Among the most important steps in text translation is tokenization, the division of raw text into words. For languages and dialects that use the Latin alphabet, this is basically an issue of splitting text at the spaces between words. It is a much more difficult task for writing systems that do not provide spaces between words. This section discusses some of the issues that must be addressed in tokenization.
Sentences are a fundamental unit of text. Determining where sentences
end requires recognition of punctuation marks, very often the period (“.”).
Unfortunately, the period presents a high degree of ambiguity. For example,
it can serve as the end of a sentence, part of an abbreviation, an English
decimal point, an abbreviation and also the end of a sentence, or be followed
by quotation marks [215].
Recognizing abbreviations, which can come from a long list of potential
abbreviations, is a related problem, since the period often appears in them.
There are also many potential acronyms that can occur in text and must be
detected. No universal standard for abbreviations or acronyms exists [215].
Hyphens and dashes present another tokenization challenge. A hyphen may be part of a token (e.g., “forty-five”), or it may not be (e.g., “Paris-based”). A hyphen is also used for splitting words at the end of a line and within a single word following certain suffixes [216].
A variety of non-lexical expressions, which may include alphanumeric
characters and symbols that also serve as punctuation marks, can confuse a
tokenizer. Some examples include phone numbers, email addresses, dates,
serial numbers, numeric quantities with units, and times [216, 217].
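To make these issues concrete, here is a minimal tokenizer sketch in Python; the abbreviation list and the regular expression are illustrative assumptions rather than the rules of any production tool:

import re

# A small list of known abbreviations; a real tokenizer would use a much
# longer, language-specific list (this one is a toy assumption).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}

def tokenize(text):
    # Match multi-period abbreviations (e.g.), decimals (3.50), words with
    # internal hyphens/apostrophes (forty-five, doesn't), then punctuation.
    pattern = r"(?:[A-Za-z]\.){2,}|\d+\.\d+|\w+(?:[-']\w+)*\.?|[^\w\s]"
    tokens = []
    for tok in re.findall(pattern, text):
        if tok.endswith(".") and tok.lower() not in ABBREVIATIONS \
                and not re.fullmatch(r"(?:[A-Za-z]\.){2,}|\d+\.\d+", tok):
            # A trailing period that is not part of an abbreviation or a
            # decimal is split off as sentence-final punctuation.
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Dr. Smith paid forty-five dollars, e.g. 3.50 per item."))
# ['Dr.', 'Smith', 'paid', 'forty-five', 'dollars', ',', 'e.g.', '3.50',
#  'per', 'item', '.']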
One tokenization issue is words that occur in lowercase or uppercase in a text. For example, the text “house,” “House,” and “HOUSE” may all occur in a large corpus, in the midst of a sentence, at the beginning of a sentence, or, in rare cases, independently. It is the same word, so it is necessary to standardize case, typically by lowercasing or truecasing. Truecasing applies uppercase to names, allowing differentiation between phrases such as “Mr. Fisher” and “a fisher” [85].
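As a rough illustration of truecasing, the sketch below learns the most frequent casing of each word from non-sentence-initial positions and applies it to new text; the training sentences and the overall approach are simplifying assumptions, not a description of any specific truecaser:

from collections import Counter

def train_truecaser(sentences):
    # Count casings of each word in non-sentence-initial positions,
    # where the surface form is most informative.
    counts = {}
    for sent in sentences:
        for word in sent.split()[1:]:  # skip the ambiguous initial word
            counts.setdefault(word.lower(), Counter())[word] += 1
    # The most frequent observed casing becomes the word's "true" case.
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(sentence, model):
    return " ".join(model.get(w.lower(), w.lower()) for w in sentence.split())

model = train_truecaser([
    "He met Mr. Fisher yesterday .",
    "A fisher sold his catch .",
    "The fisher mended his net .",
])
print(truecase("FISHER is a common name .", model))
# 'fisher is a common name .'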
Tokenization of modern text must also handle emoticons, which are often used in social media and email messages. There are many different emoticons. They can be complex and may be easily confused with other text. For example, any of the following emoticon symbols may indicate a frown7:

>:[ :-( :( :-c :c :-< :< :っC :-[ :[ :{

7 https://en.wikipedia.org/wiki/List_of_emoticons
Much work has gone into the separation of Polish content. Such tasks require a dictionary of possible words and a system to discover, learn and determine them with certainty, usually by consideration of ambiguities between alternative divisions (e.g., favoring more frequent words). At the same time, there are some tokenization issues for English that do not have simple answers:

• What should be done with the possessive marker in “Joe’s,” or a contraction, for example, “doesn’t”? These are typically separated. For example, the possessive marker “s” is viewed as a word in its own right.

• How should hyphenated words be treated, for example, “co-work,” “purported,” or “high-risk”? There is, by all accounts, little advantage in separating “co-work.” However, “high-risk” is truly comprised of two words.

When tokenizing content for machine translation, the primary guiding rule is that content must be reduced to a sequence of tokens from a small set. It is preferred not to learn diverse translations of “house,” contingent upon whether it is trailed by a comma (“house,”) or surrounded by quotes (“house”) [85].
2.4.2 Compounding

A few languages, including Polish, support making new words by compounding existing words. Some of these words—przedszkole, metamaskarada, etc.—have even found their way into the English language. Compounding is amazingly conducive to the creation of new words. Breaking up compound words may be part of the tokenization stage for a couple of languages [12].
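A greedy, dictionary-based compound splitter along these lines could be sketched as follows; the toy vocabulary and the minimum component length are hypothetical parameters chosen for illustration:

def split_compound(word, vocabulary, min_part=3):
    # Greedy recursive split: return component words if the whole compound
    # can be decomposed into known vocabulary items, else the word itself.
    if word in vocabulary:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if head in vocabulary:
            rest = split_compound(tail, vocabulary, min_part)
            if all(part in vocabulary for part in rest):
                return [head] + rest
    return [word]

# 'przedszkole' decomposes into 'przed' + 'szkole'; the tiny vocabulary
# here is an assumption for the example only.
vocab = {"przed", "szkole", "szkoła"}
print(split_compound("przedszkole", vocab))  # ['przed', 'szkole']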
Different words serve different functions. Nouns are used for actual,
concrete objects (e.g., house) or abstractions (e.g., opportunity). Words can
The language model typically does much more than enable fluent output.
It supports difficult decisions about word order and word translation. For
instance, a probabilistic language model pLM should prefer correct word order
over incorrect word order:
pLM (the house is small) > pLM (small the is house) (5)
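The preference in (5) can be reproduced with even a toy bigram model. The sketch below estimates add-α smoothed bigram probabilities from a three-sentence corpus; the corpus and α are assumptions for illustration:

import math
from collections import Counter

corpus = ["<s> the house is small </s>", "<s> the house is big </s>",
          "<s> a house is a home </s>"]

bigrams, unigrams = Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def log_p(sentence, alpha=0.1, v=len(unigrams)):
    # Add-alpha smoothed bigram log-probability: a simple stand-in for
    # the smoothing methods discussed below.
    words = f"<s> {sentence} </s>".split()
    return sum(math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * v))
               for a, b in zip(words, words[1:]))

print(log_p("the house is small") > log_p("small the is house"))  # True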
this special token before n-gram counts are determined. Using this approach,
it is possible to estimate the transition probabilities of n-grams involving out-
of-vocabulary words [172].
Another recently introduced solution to the OOV problem is unsupervised
transliteration. In accordance with [80], character-based translation models
(transliteration models) were shown to be quite useful in MT when translating
OOV words is necessary, for disambiguation and for translating closely-
related languages. The solution presented in [80] extracts a transliteration
corpus from the parallel data and builds a transliteration model from it. The
transliteration model can then be used to translate OOV words or named-
entities.
The proposed transliteration-mining model is actually a mixture of two
sub-models (a transliteration model and a non-transliteration sub-model).
The transliteration model assigns higher probabilities to transliteration pairs
compared to the probabilities assigned by a non-transliteration model to the
same pairs. If a word pair is defined as (e, f), the transliteration probability for
them is defined as:
Ptr(e, f) = Σ_{a∈Align(e, f)} Π_{j=1}^{|a|} p(qj)  (6)
Σz p(az) = 1  (10)

If Z is the set of all words in the vocabulary, Z0 is the set of all words with c(az) = 0, Z1 is the set of all words with c(az) > 0, and f(az) is given, bow(a_) can be calculated as:

Σz p(az) = 1  (11)
8 https://cxwangyi.wordpress.com/2010/07/28/backoff-in-n-gram-language-models/
bow(a_) = (1 − Σz∈Z1 f(az)) / Σz∈Z0 p(_z)
        = (1 − Σz∈Z1 f(az)) / (1 − Σz∈Z1 p(_z))  (13)
        = (1 − Σz∈Z1 f(az)) / (1 − Σz∈Z1 f(_z))
When Moses is used with the SRILM tool [8], one of the most basic
smoothing methods is Ney’s absolute discounting. This method subtracts a
constant value D, which has a value between 0 and 1. Assuming that Z1 is the set of all words z with c(az) > 0:
f(az) = (c(az) − D) / c(a_)  (14)
bow(a_) = (1 − Σz∈Z1 f(az)) / (1 − Σz∈Z1 f(_z))  (16)
The suggested factor for discounting is equal to:
D = n1 / (n1 + 2n2)  (17)
In this equation, n1 and n2 are the total numbers of n-grams that have
exactly one and two counts, respectively [85].
Kneser-Ney discounting is similar to absolute discounting if subtracting a
constant D from the n-gram counts is considered. The main idea is to use the
modified probability estimation of lower-order n-grams that is used for the
back-off. More specifically, it is taken with the intention to be proportional to
the number of unique words that were found in the training text [142].
Chen and Goodman’s modification to Kneser-Ney discounting differs by
using three discounting constants for each of the n-gram orders, defined as
follows:
D1 = 1 − 2Y(n2 / n1)  (18)

D2 = 2 − 3Y(n3 / n2)  (19)

D3 = 3 − 4Y(n4 / n3)  (20)

where Y = n1 / (n1 + 2n2).
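A minimal sketch of these discounting formulas, assuming hypothetical bigram counts, might look as follows; it computes f(az) with Ney's suggested D from Eq. (17) and the Chen-Goodman constants from Eqs. (18)-(20):

from collections import Counter

def discounted_prob(bigram_counts, context, word, D=None):
    # f(az) = (c(az) - D) / c(a_) as in Eq. (14); Ney's suggested discount
    # D = n1 / (n1 + 2*n2) from Eq. (17), estimated from count-of-counts.
    if D is None:
        n = Counter(bigram_counts.values())
        D = n[1] / (n[1] + 2 * n[2])
    c_az = bigram_counts.get((context, word), 0)
    c_a = sum(c for (a, _), c in bigram_counts.items() if a == context)
    return max(c_az - D, 0) / c_a if c_a else 0.0

def chen_goodman_discounts(n):
    # D1..D3 from Eqs. (18)-(20) with Y = n1 / (n1 + 2*n2).
    Y = n[1] / (n[1] + 2 * n[2])
    return (1 - 2 * Y * n[2] / n[1],
            2 - 3 * Y * n[3] / n[2],
            3 - 4 * Y * n[4] / n[3])

# Hypothetical bigram counts and count-of-counts for illustration.
counts = {("the", "house"): 4, ("the", "cat"): 1, ("the", "dog"): 2,
          ("a", "house"): 1, ("a", "home"): 2}
print(discounted_prob(counts, "the", "house"))
print(chen_goodman_discounts({1: 100, 2: 40, 3: 20, 4: 12}))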
(Figure: the alignment step)
In this equation, the alignment function a maps each output word position
j to a foreign input position a(j) [177].
The fertility of words problem described above is addressed in IBM
Model 3. Fertility is modeled using a probability distribution defined as [180]:
n(φ|f) (25)
For each foreign word f, such a distribution indicates how many output
words (φ) it usually translates. Such a model deals with the dropping of input words, because it allows φ to be equal to 0. However, there is still an issue
when adding words. For example, the English word “do” is often inserted
when negating in English. To handle this, a special NULL token is introduced, whose fertility can also be modeled using a conditional distribution defined as:
n(φ|NULL) (26)
(Figure: the fertility step for the sentence “ja nie pójdę tak do domu”)
where Φi represents the fertility of ei, each source word s is assigned a fertility
distribution n, and I and J refer to the absolute lengths of the target and source
sentences, respectively [181].
A main objective of IBM Model 4 is to improve word reordering. In
Model 4, each word is dependent on the previously-aligned word and on the
word classes of the surrounding words. Some words tend to be reordered
more than others during translation (e.g., adjective–noun inversion when
translating Polish to English). Adjectives often get moved backwards if a noun
precedes them. The word classes introduced in Model 4 solve this problem by
conditioning the probability distributions of these classes. The result of such a
distribution is a lexicalized model.
Cept πi                  π1      π2      π3      π4        π5
Foreign position [i]     1       2       3       4         5
Foreign word f[i]        Ja      nie     idę     do        domu
English words {ej}       I       go      not     to, the   house
English positions {j}    1       4       3       5, 6      7
Center of cepts ⊙i       1       4       3       6         7
where the functions A(f) and B(e) map words to their word classes, and ej and f[i − 1] are English and foreign words, respectively. A typical way to define A(f) and B(e) is to use POS tag annotations on the text. A uniform distribution is used for a word generated for a translation that has no equivalent in the source language [85].
Table 2 shows the relative distortion for our example.
Table 2. Relative distortion for the example.

j                1        2      3        4        5        6         7
ej               I        do     not      go       to       the       house
In cept πi,k     π1,0     π0,0   π3,0     π2,0     π4,0     π4,1      π5,0
⊙i−1             0        –      4        1        3        –         6
j − ⊙i−1         +1       –      −1       +3       +2       –         +1
distortion       d1(+1)   1      d1(−1)   d1(+3)   d1(+2)   d>1(+2)   d1(+1)
It must be noted that both Model 3 and Model 4 ignore whether an input position was already chosen, and probability mass may be reserved for input positions outside the sentence boundaries. This is the reason that the probabilities of all correct alignments do not sum to unity in these two models [179]. These are called deficient models, because probability mass is wasted.
IBM Model 5 reformulates IBM Model 4 by enhancing its alignment
model with more training parameters in order to overcome this model
deficiency [182]. During translation in Model 3 and Model 4, there are no
heuristics that would prohibit the placement of an output word in an already-
taken position. In practice, this is not possible. So, in Model 5, words are only
placed in free positions. This is accomplished by tracking the number of free
positions and allowing placement only in such places. The distortion model is
similar to that of IBM Model 4, but it is based on free positions. If vj denotes
the number of free positions in output, IBM Model 5 distortion probabilities
would be defined as [183]:
For the initial word in a cept:
d1(vj | B(ej), v⊙i−1, vmax)  (30)

For additional words:

d>1(vj − vπi,k−1 | B(ej), v′max)  (31)
(Figure: the fertility, NULL insertion and distortion steps for the sentence “chciałbym wyraźnie usłyszeć jakie jest pana zdanie w tej kwestii”)
Och and Ney [178] reported that Model 6 produces the best alignments
among the IBM models, with lower alignment error rates. This was observed
with varying sizes of training corpora. They also noted that the performance
of all the alignment models is significantly affected by the training method
used [178].
represented by non-terminals and such rules are best processed with a search
algorithm that is similar to syntactic chart parsing, such models fall into the
class of tree-based or grammar-based models [85].
For example, we can take the following sentences and rules:
Input: drzwi otwierają się szybko
Rules: drzwi → the door
szybko → quickly
otwierają X1 się → opens X1
X1 X2 → X1 X2
The translation that would be produced is “the door opens quickly.”

(Figure: the derivation tree, with non-terminals X1 to X4 combining the partial translations)
First, the simple phrase mappings “drzwi” to “the door” and “szybko” to
“quickly” are made. This allows for the application of the more complex rule
“otwierają X1 się” to “opens X1.” Note that, at this point, the non-terminal
X, which covers the input span over “szybko,” is replaced by a known
translation, “quickly.” Finally, the glue rule “X1 X2 → X1 X2” combines the two fragments into a complete sentence.
The target syntax model applies linguistic annotation to non-terminals in
hierarchical models. This requires running a syntactic parser on the text. The
Collins statistical natural language parser [211] is widely used for syntactic
parsing. This parser decomposes generation of a parse tree into a sequence
of decisions that consider probabilities associated with lexical headwords
(considering their parts of speech) at the nonterminal nodes of the parse tree
[211]. Bikel [74] describes a number of preprocessing steps for training the
Collins parsing model. Detailed descriptions of the parser may be found in
[211] and [74].
reordered with higher frequency than others. For instance, adjectives are
often switched with preceding nouns. A lexicalized reordering model that is
based on conditions determined from actual phrases is more appropriate for
MT. The problem of sparse data also arises in such a scenario. For example,
phrases may occur only a few times in the training texts. That would make the
probability distributions unreliable. It is possible in Moses to choose one out
of three possible orientation types for reordering (m—monotone order, s—
switch with previous phrase, or d—discontinuous) [85]. Figure 8 illustrates
the three orientation types on some sample alignments, designated by dark
rectangles [184].
The Moses reordering model tries to predict the orientation type (m, s, or
d) for the phrase pair being translated in accordance with:
po(orientation | f, e), orientation ∈ {m, s, d}  (34)
Such a probability distribution can be extracted from the word alignment.
To be more exact, while extracting a phrase pair, its orientation type is also
extracted for each specific occurrence. It is possible to detect the orientation
type in a word alignment if the top left or top right point of the extracted
phrase pair is checked. If the alignment point refers to the top left, it means
that the preceding E word is aligned to the preceding F word. In the opposite
situation, a preceding E word is aligned to the following F word. Words F and
E can be from any two languages [85]. Figure 9 illustrates both situations:
(Figure 9: detection of the orientation type from the word alignment; u and v mark a source-side word range, s and t a target-side range, and bi a block)
Black squares represent word alignments, and gray squares represent blocks
identified by phrase-extract [115]. It is assumed that the variables s and t
define a word range in a target sentence, and u and v define a word range in a
source sentence.
Unfortunately, it is impossible for this approach to detect swaps
with phrases that do not contain a word with such an alignment, as shown in
Figure 11 [8].
The usage of phrase orientation rather than word orientation in the statistics
may solve this problem. Phrase orientation should be used for languages that
require a large number of word reordering operations. The phrase-based
orientation model was introduced by Tillmann [114]. It uses phrases in both
the training and decoding phase. When using the phrase-based orientation
model, the second case will be properly detected and counted as a swap.
This method was improved in the hierarchical orientation model introduced
by Galley and Manning [115]. The hierarchical model has the ability to detect
swaps and monotone arrangements between blocks that can be longer than
the length limit applied to phrases during the training or decoding step. It can
detect the swaps presented in Figure 12 [8].
(Figure 12: sample alignments with swaps detectable by the hierarchical model, e.g., “swoją pozycję ⬄ his position” in “he took advantage of his position,” “john nie mieszka tutaj ⬄ john does not live here,” and “jim kicked the bucket”)
2.4.6.1 Interpolation
This research provides a few approaches to estimating the probability distribution of a random variable. The most common situation is that a huge corpus of general data is easy to obtain, along with only a few domain-specific examples, but the general estimate is less dependable for the domain. So, typically there are two ways to estimate the probability distribution, resulting in two functions p1 and p2 [9].

Using interpolation, it is possible to combine the two probability estimates p1 and p2 for the same random variable X by giving each a fixed weight and adding them. If the first is given a weight 0 ≤ λ ≤ 1, then 1 – λ is left for the second:
p(x) = λ p1(x) + (1 – λ) p2(x) (38)
A typical first application arises from data sets that consider distinct conditions. For example, it may be required to predict tomorrow’s weather M based on today’s weather T and the day D (to consider run-of-the-mill, seasonal weather fluctuations) [85].

It is then necessary to estimate the conditional probability distribution p(m|t, d) from weather statistics. Modeling based on particular days yields little data for use in this estimation. (As a case in point, there may be scarcely any blustery days on August 1st in Los Angeles.) Thus, it might make more sense to interpolate this distribution with the more robust p(m|t):
pinterpolated (m|t, d) = λ p(m|t, d) + (1 – λ) p(m|t) (39)
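A minimal sketch of this interpolation, with hypothetical weather estimates standing in for p(m|t, d) and p(m|t), could look as follows:

def interpolate(p1, p2, lam):
    # p(x) = lam * p1(x) + (1 - lam) * p2(x), as in Eq. (38): a sparse but
    # specific estimate backed off toward a robust general estimate.
    return lambda x: lam * p1(x) + (1 - lam) * p2(x)

# Hypothetical estimates: p_specific plays p(m|t, d), p_general plays p(m|t).
p_specific = lambda m: {"rain": 0.0, "sun": 1.0}[m]   # few observations
p_general = lambda m: {"rain": 0.2, "sun": 0.8}[m]    # many observations
p = interpolate(p_specific, p_general, lam=0.3)
print(p("rain"))  # 0.3*0.0 + 0.7*0.2 = 0.14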
Final Score = 2 / 7 = 0.2857
Another problem with BLEU scoring is that it tends to favor translations of short phrases (on the other hand, SMT usually returns longer translations than a human translator would [197]), due to dividing by the total number of words in the test phrase. For example, consider these translations for the above example [27]:
Test Phrase: “the cat” : score = (1 + 1)/2 = 1
Test Phrase: “the” : score = 1/1 = 1
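These scores can be reproduced with a few lines of code. The sketch below computes BLEU's clipped unigram precision, assuming the reference sentence “the cat is on the mat” used in the classic example from [27]; the brevity penalty and higher-order n-grams are omitted:

from collections import Counter

def modified_precision(candidate, reference):
    # Clipped unigram precision as in BLEU: each candidate word counts
    # at most as many times as it appears in the reference.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / sum(cand.values())

reference = "the cat is on the mat"  # assumed reference, as in [27]
print(modified_precision("the the the the the the the", reference))  # 2/7
print(modified_precision("the cat", reference))                      # 1.0
print(modified_precision("the", reference))                          # 1.0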
9 http://www.kantanmt.com/
rarer words. The final NIST score is calculated using the arithmetic mean of
the n-gram matches between SMT and reference translations. In addition, a
smaller brevity penalty is used for smaller variations in phrase lengths. The
reliability and quality of the NIST metric have been shown to be superior to those of the BLEU metric [29].
Translation Error Rate (TER) was designed in order to provide a very
intuitive SMT evaluation metric, requiring less data than other techniques
while avoiding the labor intensity of human evaluation. It calculates the
number of edits required to make a machine translation match exactly to the
closest reference translation in fluency and semantics [8, 30].
Calculation of the TER metric is defined in [8]:
TER = E / wR  (42)
where E represents the minimum number of edits required for an exact match,
and the average length of the reference text is given by wR. Edits may include
the deletion of words, word insertion and word substitutions, as well as
changes in word or phrase order [8]. It is similar to the Levenshtein distance [85] calculation; however, plain edit distance penalizes a misplaced word twice, through its deletion and re-insertion. Adding an extra editing step that allows the movement of word sequences from one part of the output to another solves this. This is something a human post-editor would do with the cut-and-paste function in a text processor, and this is proposed in the TER metric as well [188].
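A simplified sketch of Eq. (42) is shown below; it computes the word-level edit distance divided by the reference length and, unlike full TER, omits the block-shift edit just described:

def ter_no_shifts(hypothesis, reference):
    # Simplified TER: edit distance over words divided by the reference
    # length. Real TER also counts block shifts as single edits; they are
    # omitted here for brevity.
    h, r = hypothesis.split(), reference.split()
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(r) + 1)]
         for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)] / len(r)

print(ter_no_shifts("the house is very small", "the house is small"))  # 0.25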
The Metric for Evaluation of Translation with Explicit Ordering
(METEOR) is intended to more directly take into account several factors
that are indirectly considered in BLEU. Recall (the proportion of matched
n-grams to total reference n-grams) is used directly in this metric, to make
up for BLEU’s inadequate penalty for lack of recall. In addition, METEOR
explicitly measures higher order n-grams, explicitly considers word-to-
word matches and applies arithmetic averaging for a final score. Arithmetic
averaging enables meaningful scores on a sentence level, which can indicate
metric quality. METEOR uses the best matches against multiple reference
translations [31].
The METEOR method uses a sophisticated and incremental word
alignment method that starts by considering exact word-to-word matches,
word stem matches and synonym matches. Alternative word order similarities
are then evaluated based on those matches. It must be noted that the METEOR metric was not adapted to the Polish language within this research [189]. Such an adaptation was done in [219], but this was not known at the time of conducting the experiments.
NSR = (ρ + 1) / 2  (45)

The normalized Kendall’s coefficient is determined by:

NKT = (τ + 1) / 2  (46)
These measures can be combined with precision P and modified to avoid
overestimating the correlation of only words that directly correspond in the
SMT and reference translations:
NSR · Pα and NKT · Pα
where α is a parameter in the range 0 < α < 1 [34].
was good enough at the first stages of the annotation, but the compounding effect of disagreements reduced the effective final IAA to 0.44 for German and to 0.59 for English. The efficiency of HMEANT was stated to be reasonably good, but it was not compared to other manual metrics.

Secondly, the annotators need to align the elements of frames. They must link both actions and roles, and mark them as “Correct” or “Partially Correct” (depending on the equivalency of their meaning). In this research, we used the original guidelines for the SRL and alignment described in [220].
P = Σmatched i (#Fi / #MTi)  (47)

R = Σmatched i (#Fi / #REFi)  (48)

Ppart = Σmatched i (#Fi(partial) / #MTi)  (49)

Rpart = Σmatched i (#Fi(partial) / #REFi)  (50)

Ptotal = (P + w · Ppart) / Nmt  (51)

Rtotal = (R + w · Rpart) / Nref  (52)

HMEANT = 2 · Ptotal · Rtotal / (Ptotal + Rtotal)  (53)
In accordance with Lo and Wu [220] and Birch et al. [214], the IAA was studied as well. It is defined as an F1-measure in which one of the annotators is considered to be a gold standard:

IAA = 2 · P · R / (P + R)  (54)
where P is precision (the number of labels [roles, predicates or alignments] that were matched between the annotators) and recall (R) is defined as the number of matched labels divided by the total number of labels. In accordance with Birch et al. [214], only exact word span matches are considered. The stages of the annotation process described in [214] were adopted as well (role identification, role classification, action identification, role alignment, action alignment). Disagreements were analyzed by calculating the IAA for each stage separately.
H1: F1 ≠ F2 (56)
In this test, as in the case of the t-test, a third variable is used. The third
variable specifies the absolute value of the difference between the values of
the paired observations. The Wilcoxon test involves ranking differences in
measurement for subsequent observations. First, the differences between
measurements 1 and 2 are calculated, then the differences are ranked (the
results are arranged from lowest to highest), and subsequent ranks are
assigned to them. The sum of ranks is then calculated for differences that
were negative and for those that were positive (results showing no differences
are not significant here). Subsequently, the bigger sum (of negative or positive differences) is chosen, and this sum constitutes the Wilcoxon test statistic if the number of observations does not exceed 25.
For bigger samples, it is possible to use the asymptotic convergence of the test statistic (assuming that H0 is true) to the normal distribution N(m, s), where:
m = n(n + 1) / 4  (57)

s = n(n + 1)(2n + 1) / 24  (58)
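In practice, the test can be run with an off-the-shelf implementation. The sketch below applies scipy.stats.wilcoxon to hypothetical paired BLEU scores of two systems:

from scipy.stats import wilcoxon

# Paired BLEU scores of two systems on the same test documents
# (hypothetical values for illustration).
system_a = [21.3, 18.7, 24.1, 19.9, 22.5, 20.8, 23.0, 19.2]
system_b = [20.1, 18.9, 22.7, 19.0, 21.8, 20.2, 22.1, 18.5]

stat, p_value = wilcoxon(system_a, system_b)
print(f"W = {stat}, p = {p_value:.4f}")  # reject H0 if p < 0.05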
10 iwslt.org
11 http://www.statmt.org/wmt14/
Many good quality translation systems already exist for languages that
are, to some extent, similar in structure to Polish. When considering such
SMT systems for Czech, Slovakian and Russian, it is possible to refer, for
example, to a recent Russian-English translation system for news texts that
scored almost 32 points in BLEU score [191]. In addition, systems that
translated medical texts from Czech to English and from English to Czech,
scoring around 37 and 23 BLEU points, respectively, were introduced in [75].
The authors of [193] report BLEU scores of 34.7, 33.7, and 34.3 for Polish,
Czech and Russian, respectively, in the movie subtitles domain. The authors
of [166] state that they obtained a BLEU score of 14.58 when translating news
from English to Czech. It must be noted that those results are not directly comparable with the results presented in this monograph, because the experiments were conducted on different corpora.
The development of SMT systems for Polish speech has progressed
rapidly in recent years. The tools used for mainstream languages were not
adapted for the Polish language, as the training and tuning settings for most
popular languages are not suited for Polish. In this research, the trained
systems are compared not only to the baseline systems, but also to those of
the IWSLT Evaluation Campaign, as a reference point [154]. The data used
in the IWSLT are multiple-subject talks of low quality, with many editorial
and alignment errors. More information about this domain can be found in
Section 4.7. The progress made on PL translation systems during 2012–2014,
as well as their quality in comparison with other languages, is shown in
Figures 16 and 17. Figure 16 depicts the state of Polish translation systems
in comparison to those of other languages in the evaluation campaign (light
gray), and the progress made for these systems between 2012 and 2013 (dark
gray).

(Figure 16: IWSLT 2012-2013 progress by language: Portuguese +0.26%, French +5.33%, Dutch +0.61%, German +9.84%, Romanian +9.79%, Arabic +4.34%, Slovak +4.38%, Russian +13.11%, Turkish +10.31%, Polish +18.67%, Chinese +45.87%; BLEU scale 0-45)

(Figure 17: IWSLT 2013-2014 progress by language: Portuguese +1.39%, Arabic +3.25%, German +11.90%, Russian +2.19%, Polish +23.85%, Chinese +12.06%, Farsi +21.01%; Spanish, Italian, Dutch, Romanian, Slovenian and Turkish are shown without labeled values; BLEU scale 0-40)

The SMT system for Polish was one of the worst (mostly because the
baseline is rather low); however, through this research, the progress that was
made is similar to that of more popular languages. SMT system progress is
described by the percentage value beside the bars.
Figure 17 provides similar statistics, but it shows significant progress
between 2013 and 2014. In fact, this research resulted in the largest percentage
progress during the IWSLT 2014 evaluation campaign. In this campaign, the
baseline Polish SMT system advanced from almost last place, and the progress
bar indicates that the system could be compared to systems for mainstream
languages such as German. It must be noted that the test sets for each IWSLT campaign are different, which is why the baseline system scores differ from year to year.
languages must first be created. It should be noted that the standard Yalign
implementation allows the building of only bilingual corpora.
The Yalign tool was implemented using a sentence similarity metric that
produces a rough estimate (a number between 0 and 1) of how likely it is for
two sentences to be a translation of each other. It also uses a sequence aligner,
which produces an alignment that maximizes the sum of the individual (per
sentence pair) similarities between two documents. Yalign’s initial algorithm
is actually a wrapper around a standard sequence alignment algorithm [19].
For sequence alignment, Yalign uses a variation of the Needleman-
Wunsch algorithm [20] in order to find an optimal alignment between the
sentences in two selected documents. The algorithm has a polynomial time
worst-case complexity, and it produces an optimal alignment. Unfortunately,
it cannot handle alignments that cross each other or alignments from two
sentences into a single one [20].
Since sentence similarity is a computationally-expensive operation, the
variation of the Needleman-Wunsch algorithm that was implemented uses an
A* approach in order to explore the search space, instead of using the classical
dynamic programming method that would require N * M calls to the sentence
similarity matrix.
After alignment, only sentences that have a high probability of being
translations of each other are included in the final alignment. The result is
filtered so as to deliver high-quality alignments. To do this, a threshold value
is used. If the sentence similarity metric is low enough, the pair is excluded.
For the sentence similarity metric, the algorithm uses a statistical
classifier’s likelihood output and normalizes it to the 0–1 range.
The classifier must be trained to determine whether or not sentence pairs
are translations of each other. A Support Vector Machine (SVM) classifier
was used in the Yalign project. Besides being an excellent classifier, an SVM
can provide a distance to the separation hyperplane during classification,
and this distance can be easily modified using a Sigmoid function to return a
likelihood between 0 and 1 [21]. The use of a classifier means that the quality
of the alignment depends not only on the input but also on the quality of the
trained classifier.
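A minimal sketch of this similarity metric is given below; the toy features and training pairs are assumptions, and the real feature extraction (word correspondences, sentence lengths, etc.) is assumed to happen elsewhere:

import math
from sklearn import svm

def similarity(classifier, feature_vector):
    # Map the signed distance to the separating hyperplane into the 0-1
    # range with a sigmoid, in the spirit of the metric described above.
    distance = classifier.decision_function([feature_vector])[0]
    return 1.0 / (1.0 + math.exp(-distance))

# Toy training data: 2-dimensional features of sentence pairs labeled as
# translations (1) or not (0); purely hypothetical numbers.
X = [[0.9, 0.8], [0.8, 0.7], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
clf = svm.SVC(kernel="linear").fit(X, y)
print(similarity(clf, [0.85, 0.75]))  # close to 1: likely a translation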
The quality of alignments is defined by a tradeoff between precision and
recall. Yalign has two configurable variables:
• Threshold: The confidence threshold to accept an alignment as “good.”
A lower value means more precision and less recall. The “confidence”
is a probability estimated from a support vector machine, classifying a
portion of text as “is a translation” or “is not a translation” of a portion of
source text.
(Figure: flowchart of the A* search over the alignment matrix M. Starting from Mcurrent = M0,0, the current element is moved from the Mopen set to the Mclosed set, each neighbor Mtemp = Mcurrent.neigh(i) is scored via score(path(Mtemp)) and added to Mopen, and Mcurrent = bestpath(Mopen) is selected until Mcurrent = Mn,m.)
Mopen represents the set of all open paths that are represented by their final
point. Mclosed represents the set of all M matrix elements that are no longer the
final point of a path, but rather are in the middle of longer paths. Mcurrent.neigh(i)
is the i-th neighbor of Mcurrent. The minimal total length of the path that ends
in Mtemp is calculated by score(path(Mtemp)). Consideration must also be given
to the heuristic for the remaining distance calculation. The bestpath(Mopen)
function gives back the M element that corresponds to the ending point of the
minimal length path between all possible paths defined by score(path(...)) in
the Mopen set. An example of this algorithm is presented in Figure 19.
The second step is the definition of the gap penalty. This is necessary in the case in which one element of a sequence must be associated with a gap in the other sequence; such a step incurs a penalty (p) [20]. In the algorithm, the occurrence of a large gap is considered more likely to be the result of one large deletion than of multiple single deletions. While an alignment match in an element is given +1 point and a mismatch is given −1 point, each gap between elements is assigned 0 points.
For example, let Ai be the elements of the first sequence, and Bi the
elements of the second sequence.
Sequence 1:
A1, A2, gap, A3, A4
Sequence 2:
B1, gap, B2, B3, B4
Since two gaps were introduced into the alignment in these sequences
and each incurs a gap penalty p, the total gap penalty is 2p (counted as 0 for 2
elements in the alignment).
The calculation of the M matrix (containing alignments of phrases) is performed starting from the M(0,0) element, which is, by definition, equal to 0. The first row and column of the matrix are defined at the start as:

Mi,0 = −i * p  (59)

M0,j = −j * p  (60)
After the first row and column are initialized, the algorithm iterates through the other elements of the M matrix, starting from the upper-left side to the bottom-right side, making the calculation:

Mi,j = max(Mi−1,j−1 + S(Ai, Bj), Mi−1,j − p, Mi,j−1 − p)  (61)
Mup−left | Mup
Mleft    | max(Mup−left + S(Acurrent, Bcurrent), Mup − p, Mleft − p)

If one wanted to calculate the (i, j) element of the M matrix, having already calculated Mup−left = 1.3, Mup = 3.1, Mleft = 1.7, and p = 1.5, with S(Ai, Bj) = 0.1:

1.3 | 3.1
1.7 | max(1.3 + 0.1, 3.1 − 1.5, 1.7 − 1.5) = 1.6
The final value of the Mn,m (bottom-right of the matrix M) element will be
the best score of the alignment algorithm. Finally, in order to find the actual
alignment, it is necessary to backtrack on the path in the inverse direction
from Mn,m to M0,0 [20].
This algorithm will always find the best solution, but the quality of the result depends on the definitions of the similarity matrix and the gap penalty, which may or may not be well chosen. The computation time is proportional to n * m * tS, where tS is the time needed for the pair similarity calculation (S(Ai, Bj)) [20].
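For reference, a direct implementation of Eqs. (59)-(61) with backtracking might look as follows; the similarity function and the gap penalty here are toy values:

def needleman_wunsch(A, B, S, p):
    # Fill the M matrix per Eqs. (59)-(61), then backtrack from M[n][m]
    # toward M[0][0] to recover the alignment; S(a, b) is the pair
    # similarity and p the gap penalty.
    n, m = len(A), len(B)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = -i * p
    for j in range(1, m + 1):
        M[0][j] = -j * p
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + S(A[i - 1], B[j - 1]),
                          M[i - 1][j] - p, M[i][j - 1] - p)
    # Backtracking: follow the choice that produced each cell.
    i, j, pairs = n, m, []
    while i > 0 and j > 0:
        if M[i][j] == M[i - 1][j - 1] + S(A[i - 1], B[j - 1]):
            pairs.append((A[i - 1], B[j - 1])); i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] - p:
            i -= 1  # gap in B
        else:
            j -= 1  # gap in A
    return M[n][m], list(reversed(pairs))

# Toy similarity: +1 for a match, -1 for a mismatch, gap penalty 0.5.
score, pairs = needleman_wunsch("ADCDE", "ADEGF",
                                lambda a, b: 1 if a == b else -1, 0.5)
print(score, pairs)  # the best score and the aligned element pairs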
12 https://github.com/rsennrich/bleualign
(Figure: flowchart of the alignment matrix computation: the first row and column are initialized with Mi,0 = −i·p and M0,j = −j·p, after which the matrix is filled cell by cell.)
13 http://mokk.bme.hu/resources/hunalign/
14 http://www.abbyy.com/aligner/
15 http://www-igm.univ-mlv.fr/~unitex/
4. Author’s Solutions to PL-EN Corpora Processing Problems
In this chapter, the author’s solutions to some of the problems posed by this
research and their implementation are described. First, improvements to
the alignment method used in Yalign [19] are presented. In addition, a new
method for mining parallel data from comparable corpora using a pipeline of
tools is described. The adaptation of the BLEU evaluation metric for the needs
of PL-EN translation is presented. Lastly, the author’s method for aligning
and filtering corpora at the sentence level is discussed.
For the experiments in statistical speech translation, the TED talks domain
that was prepared for the IWSLT 2014 evaluation campaign by the FBK16 was
chosen. This domain is very wide and covers many unrelated subject areas.
The data contains almost 2.5 M untokenized words [15]. Narrower domains
were selected from the data for use in experiments. The first parallel corpus
is composed of PDF documents from the European Medicines Agency
(EMEA) and medicine leaflets [16]. The Polish data was a corpus created by the EMEA. Its size was about 80 MB, and it included 1,044,764 sentences built from 11.67 M untokenized words. The vocabularies consisted of 148,170 unique Polish word forms and 109,326 unique English word forms.
The disproportionate vocabulary sizes are also a challenge, especially in
translation from English to Polish.
The second corpus was extracted from the proceedings of the European
Parliament (EUP) by Philipp Koehn (University of Edinburgh) [14]. In
addition, experiments on the Basic Travel Expression Corpus (BTEC), a
multilingual speech corpus containing tourism-related sentences similar
to those usually found in phrase books for tourists going abroad, were also
conducted [17]. Lastly, a large corpus obtained from the OpenSubtitles.org
web site was used as an example of human dialogs. Table 4 provides details on
the number of unique tokens (TOKENS), as well as the number of bilingual
sentence pairs (PAIRS).
Table 4. Corpora specification.
The solution can be divided into three main steps. First, data is collected,
then it is aligned at the article level, and finally the aligned results are mined
for parallel sentences. The last two steps are not trivial, because there are great
disparities between Wikipedia documents. Based on Wikipedia statistics, it is
known that articles on the PL Wiki contain an average of 379 words, whereas
articles on the EN Wiki contain an average of 590 words. This is most likely
why sentences in the raw Wiki corpus are mostly misaligned, with translation
lines whose placement does not correspond to any text lines in the source language. Moreover, some sentences have no corresponding translations in the corpus at all. The corpus might also contain poor or indirect translations, making alignment difficult. Thus, alignment is crucial for accuracy. Sentence alignment must also be computationally feasible to be of practical use in various applications.

16 https://wit3.fbk.eu/
The procedure that was applied in this monograph starts with a specialized
web crawler. Because the PL Wiki contains less data, with almost all its
articles having corresponding EN Wiki articles, the program crawls the data
starting from the non-English site. The crawler can extract and save bilingual
articles of any language supported by Wikipedia. The tool requires archives of
at least two Wikipedia data sets in different languages and information about
language links between the articles in the data sets.
A web crawler was implemented and used for Euronews.com. This web
crawler was designed to use the Euronews.com archive page. In the first
phase, the crawler generates a database of parallel articles in two selected
languages, in order to obtain comparable data. The Wikipedia articles were
analyzed offline using current Wikipedia dumps.17
Before a mining tool processes the data, it must be prepared. First, all
the data is saved in a local database implemented using MongoDB.18 Second,
the tool aligns article pairs and removes articles that do not exist in both
languages from the database. These topic-aligned articles are filtered to
remove any HTML tags, XML tags or noisy data (tables, references, figures,
etc.). Finally, bilingual documents are tagged with a unique ID as a topic-
aligned, comparable corpus.
To extract the parallel sentence pairs, two different strategies were
employed. The first strategy uses methodology inspired by the Yalign Tool,19
and the second is based on a pipeline of specialized tools adapted for Polish
language needs. The MT results presented in this chapter were obtained
using the first strategy. This decision was motivated by the quality of Yalign
documentation, experience with the algorithms, optimizations that were
applied to it within the scope of this monograph, the computational feasibility
of an improved Yalign method, and the unknown outcome of experiments
with the tools pipeline described in Section 4.5.1. The second method is
still under development; nevertheless, the initial results are promising and worth mentioning, especially since there are many opportunities for improvement.

17 https://dumps.wikimedia.org/
18 https://www.mongodb.org/
19 https://github.com/machinalis/yalign
Next, the parallelization of the fifth diagonal will start, and so forth.
(Figure: flowchart of the anti-diagonal (ad) parallelization: for each anti-diagonal of the alignment matrix, up to max(m, n) threads compute its cells concurrently.)
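A sketch of this anti-diagonal scheduling is given below. Cells with equal i + j are mutually independent, so each anti-diagonal can be dispatched to a thread pool; with CPython's GIL this illustrates the scheduling rather than a true speed-up, and compute_cell is a hypothetical per-cell routine:

from concurrent.futures import ThreadPoolExecutor

def fill_by_antidiagonals(n, m, compute_cell):
    # Cells on the same anti-diagonal (i + j = ad) have no mutual
    # dependencies, so each diagonal is processed in parallel once the
    # previous diagonals are complete.
    with ThreadPoolExecutor() as pool:
        for ad in range(2, n + m + 1):
            cells = [(i, ad - i)
                     for i in range(max(1, ad - m), min(n, ad - 1) + 1)]
            list(pool.map(lambda ij: compute_cell(*ij), cells))

# Hypothetical cell computation that just records the visit order.
visited = []
fill_by_antidiagonals(3, 3, lambda i, j: visited.append((i, j)))
print(visited)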
(Figure: an example alignment matrix for the sequences “a d c d e” and “a d e g f”; X marks candidate alignments between elements.)
Classifier   Improvement in %
BTEC         13.2
TED          12.5
EMEA         17
EUP          5
OPEN         11.4
For testing purposes, 100 random article pairs were first taken from the Wikipedia comparable corpus and aligned by a human translator. Second, a tuning script was run using classifiers trained on the previously-described text domains. A percentage change in the quantity of mined data was calculated for each classifier.
Good-quality Wikipedia articles are well referenced. Sentences are more
likely to be cross-lingual equivalents if they are referenced with the same
publication. Such analysis, joined with other comparison techniques, can
lead to better accuracy in parallel text recognition.
Method  Count
Y       4,192
YMOD    5,289
DICT    868
DICTC   685
The results mean that the improved method using additional information
sources mined an additional 1,097 parallel alignments. Among them, it is
possible to identify 868 single words, which means that, in fact, 229 new
sentences were obtained. The growth in obtained data was 5.5%. After manual
analysis of the dictionary, 685 words were identified as proper translations.
This means that the accuracy of the dictionary was about 79%. In summary,
the author’s method, though time-consuming, may produce additional results
that would greatly help when dealing with text domains with insufficient
textual resources.
20 http://mokk.bme.hu/resources/hunalign/
The Levenshtein metric was used in this part of the present research for
distance calculation. The metric was trialed in two ways: applied directly
to the characters of a sentence, or treating each word in the sentence as an
individual symbol and calculating the Levenshtein distance between the
symbol-coded sentences. The latter approach was adopted because it had
been tested earlier on the Chinese and Japanese languages [195], which use
symbols to represent entire words.
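For illustration, a minimal sketch of the symbol-coded variant: the standard Levenshtein recurrence applied to word sequences rather than character sequences.

    # Each word is treated as one symbol; the distance counts word-level
    # insertions, deletions and substitutions.
    def levenshtein(seq_a, seq_b):
        prev = list(range(len(seq_b) + 1))
        for i, x in enumerate(seq_a, 1):
            curr = [i]
            for j, y in enumerate(seq_b, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (x != y)))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("poproszę koc i poduszkę".split(),
                      "poproszę bilet".split()))  # -> 3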
After clustering, data from the clusters are compared to each other in order to
find similarities between them. For four sentences forming the proportional analogy:
A : B :: C : D
the algorithm looks for E and F such that:
C : D :: E : F and E : F :: A : B
However, none were found in our corpus; therefore, the experiments were
constrained to small clusters with two pairs of sentences. Matching sentences
from the parallel corpus were identified in every cluster. This allowed the
generation of new sentences similar to those present in the training corpus.
For each of the sequential analogies that were identified, a rewriting model
was constructed by string manipulation. Common prefixes and suffixes for
each of the sentences were calculated using the Longest Common
Subsequence (LCS) method [196].
An example of the rewriting model is:
Poproszę koc i poduszkę. ⬄ A blanket and a pillow, please.
Czy mogę poprosić o śmietankę i cukier? ⬄ Can I have cream and sugar?
The rewriting model consists of a prefix, a suffix, and their translation.
It is possible to construct a parallel corpus from a non-parallel, monolingual
source. Each sentence in the corpus is tested for a match with the model. If the
sentence contains a prefix and a suffix, it is considered a matching sentence.
Poproszę bilet.⬄A unknown, please.
In the matched sentence, some of the words remained untranslated, but
the general meaning of the sentence is conveyed. Remaining words may be
translated word-for-word, while the translated sentence remains grammatically
correct.
bilet ⬄ ticket
Substituting unknown words with translated ones allowed the creation of
a parallel corpus entry.
Poproszę bilet.⬄ A ticket, please.
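The following is a minimal sketch of how such a rewriting model can be applied; the model shape and the one-entry dictionary are simplified assumptions for illustration, not the author's exact implementation.

    def apply_rewriting_model(sentence, model, dictionary):
        pl_prefix, pl_suffix, en_prefix, en_suffix = model
        if sentence.startswith(pl_prefix) and sentence.endswith(pl_suffix):
            middle = sentence[len(pl_prefix):len(sentence) - len(pl_suffix)]
            # translate the remaining words word-for-word where possible
            translated = dictionary.get(middle.strip(), "unknown")
            return en_prefix + translated + en_suffix
        return None  # the sentence does not match the model

    model = ("Poproszę ", ".", "A ", ", please.")
    print(apply_rewriting_model("Poproszę bilet.", model, {"bilet": "ticket"}))
    # -> A ticket, please.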
21 https://wordnet.princeton.edu/
a. s = s + log(Bi)
b. Ci = exp(s/i)
where Bi is the default BLEU score, and Ci is the cumulative score.
In addition, knowing that:
exp(log(a) + log(b)) = a * b   (64)
and:
exp(log(a)/b) = a ^ (1/b)   (65)
it follows that:
C1 = B1
C2 = (B1 * B2) ^ (1/2)   (66)
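A short worked sketch of the recurrence shows that Ci is simply the geometric mean of B1 … Bi (the per-order scores below are made-up values):

    import math

    def cumulative_scores(B):
        s, C = 0.0, []
        for i, b in enumerate(B, 1):
            s += math.log(b)            # step a: s = s + log(Bi)
            C.append(math.exp(s / i))   # step b: Ci = exp(s / i)
        return C

    B = [0.8, 0.6, 0.4]
    print(cumulative_scores(B))
    # C2 == (B1 * B2) ** 0.5, the geometric mean, as in equation (66)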
The Wilcoxon signed-rank test was used to determine whether or not the
differences between metrics were statistically significant. Table 12 shows
the significance results for Polish-to-English translation, and Table 13 for
English-to-Polish translation.
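As an illustration, such a two-tailed test over paired metric scores can be run as follows (the score values are hypothetical, and SciPy is assumed; the original analysis may have used different tooling):

    from scipy.stats import wilcoxon

    ebleu_scores = [21.3, 18.7, 25.1, 19.9, 22.4, 20.8, 23.5, 17.6]
    bleu_scores  = [20.1, 18.9, 24.0, 18.7, 21.9, 20.2, 22.8, 17.1]

    stat, p = wilcoxon(ebleu_scores, bleu_scores)  # two-sided by default
    print(stat, p)  # p < 0.05 suggests a statistically significant difference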
Corr(X, Y) = Σ(x − x̄)(y − ȳ) / sqrt( Σ(x − x̄)² Σ(y − ȳ)² )   (71)
The correlation output table for the metrics is:
Table 14. Correlation for Polish to English.
Table 14 shows that the NIST metric is more strongly correlated with
EBLEU than with BLEU. The new metric shows a more negative correlation
with TER than does BLEU. The new metric shows a stronger correlation with
METEOR than does BLEU.
Figure 29 shows the data trends, as well as the association of different
variables.
[Figure 29: scores of the NEW, BLEU, NIST, TER, MET and RIBES metrics plotted across the test samples.]
Table 15 shows a stronger correlation between NIST and RIBES and the
new metric than between NIST or RIBES and BLEU. The enhanced metric
has a more negative correlation with TER than does BLEU. Lastly, the new
metric has a stronger correlation with METEOR than does BLEU.
λ(C|R) = (Σi ri − r) / (n − r)   (73)
with:
var = (n − Σi ri)(Σi ri + r − 2 Σi (ri | li = l)) / (n − r)³   (74)
where:
ri = maxj(nij)   and   r = maxj(n·j)   (75)
The lambda results confirm that correlation is very strong for each metric.
In the case of METEOR, there is perfect correlation.
22 http://www-01.ibm.com/software/analytics/spss/
Lastly, the Spearman correlation [38] was evaluated. Its coefficient is often
denoted by ρ (rho) or as rs. It is a nonparametric measure of statistical
dependence between two variables. It enables assessment of how well the
relationship between two variables can be described using a monotonic
function. If no repeated data values are found, then a perfect Spearman
correlation of +1 or −1 occurs.
The Pearson correlation is unduly influenced by outliers, unequal variances,
non-normality and nonlinearity. The Spearman correlation, by contrast, is
calculated by applying the Pearson23 correlation formula to the ranks of the
data rather than to the actual data values themselves. In so doing, many of the
distortions that plague the Pearson correlation are reduced considerably.
Pearson correlation measures the strength of a linear relationship between
X and Y. In the case of nonlinear but monotonic relationships, a useful
measure is Spearman’s rank correlation coefficient, Rho, which is a Pearson’s
type correlation coefficient computed on the ranks of the X and Y values. It is
computed by the following formula:
Rho = 1 − [6 Σ(di)²] / [n(n² − 1)]   (76)
where:
di is the difference between the ranks of Xi and Yi.
rs = +1 if there is a perfect agreement between the two sets of ranks.
rs = –1 if there is a complete disagreement between the two sets of ranks.
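A minimal sketch of equation (76), cross-checked against SciPy (the score values are hypothetical, with no repeated values, so the no-ties form of the formula applies):

    from scipy.stats import rankdata, spearmanr

    def spearman_rho(x, y):
        rx, ry = rankdata(x), rankdata(y)        # ranks of the raw values
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        n = len(x)
        return 1 - 6 * d2 / (n * (n ** 2 - 1))   # equation (76)

    x = [21.3, 18.7, 25.1, 19.9, 22.4]
    y = [58.2, 61.5, 51.0, 60.3, 55.7]
    rho, _ = spearmanr(x, y)                     # library cross-check
    print(spearman_rho(x, y), rho)               # both -> -1.0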
Spearman’s coefficient, like any correlation calculation, is appropriate for
both continuous and discrete variables, including ordinal variables. Table 17
shows the two-tailed Spearman’s correlation for the EBLEU metric in the
Correlation Coefficient row. The Sigma row represents the error rate (should
be less than 0.05), and N is number of samples used in the experiment.
Table 18 provides the results of Spearman’s correlation for the BLEU metric.
The sign of the Spearman correlation coefficient indicates the direction
of association between X (the independent variable) and Y (the dependent
variable). If Y tends to increase when X increases, the Spearman correlation
coefficient is positive. If Y tends to decrease when X increases, the Spearman
correlation coefficient is negative. A Spearman correlation of zero indicates
that there is no tendency for Y to either increase or decrease when X increases.
The Spearman correlation increases in magnitude as X and Y come closer to
being perfect monotonic functions of each other. When X and Y are perfectly
monotonically related, the Spearman correlation coefficient is 1.
23 http://onlinestatbook.com/2/describing_bivariate_data/pearson.html
For example, –0.951 for TER and EBLEU shows a strong negative
correlation between these values. Other results also confirm strong correlations
between the measured metrics. Correlation between EBLEU and BLEU is
0.947, EBLEU and NIST is 0.940, EBLEU and TER is –0.951, and EBLEU
and METEOR is 0.891. This shows strong associations between these
variables. The results for the RIBES metric show moderate rather than very
strong correlation.
On the other hand, for the BLEU metric, the following correlation
coefficient results were obtained: BLEU and NIST is 0.912, BLEU and TER
is –0.939, and BLEU and METEOR is 0.897. This shows a strong association
among the metrics as well, with EBLEU showing the strongest association.
Low correlation for RIBES occurs for each kind of translation.
In this research, it was shown by measuring correlations that the author’s
enhanced BLEU metric is more trustworthy than normal BLEU. Its scores
do not deviate from the measurements of the other metrics. Moreover, the new
method of evaluation is more similar to human evaluation. The experiments
lead to the conclusion that the new evaluation metric provides better precision, especially
for Polish and other Slavic languages. As anticipated, the correlation between
the new metric and RIBES is not very strong. The focus of the RIBES metric
is word order, which is almost free in the Polish language. To be more precise,
it uses rank correlation coefficients based on word order to compare SMT and
reference translations. As word order is not strict in Polish, having a rather
weak correlation with RIBES is a good indication.
The enhanced BLEU metric can deal with disparity of vocabularies
between language pairs and the free-word order that occurs in some non-
positional languages. The metric tool provides an opportunity for future research.
Line 1   PL: W przeciwnym razie alternatywa zdań jest fałszywa.
         EN: In all other cases it is true.
Line 2   PL: ISBN 8389795299.
         EN: ISBN 1-55164-250-6.
Line 3   PL: ISBN 8311068119.
         EN: ISBN 978-1-55164-311-3.
Line 4   PL: ISBN 8322924984.
         EN: ISBN 9780691155296.
Line 5   PL: ISBN 9788361182085.
         EN: ISBN 0-14-022697-4.
Line 6   PL: ASN.1 (skrót od „Abstract Syntax Notation One”—abstrakcyjna notacja składniowa numer jeden) jest to standard służący do opisu struktur przeznaczonych do reprezentacji, kodowania, transmisji i dekodowania danych.
         EN: Abstract Syntax Notation One (ASN.1) is a standard and notation that describes rules and structures for representing, encoding, transmitting, and decoding data in telecommunications and computer networking.
Line 7   PL: Standard ASN.1 określa jedynie składnię abstrakcyjną informacji, nie określa natomiast sposobu jej kodowania w pliku.
         EN: ASN.1 defines the abstract syntax of information but does not restrict the way the information is encoded.
Line 8   PL: Metody kodowania informacji podanych w składni ASN.1 zostały opisane w kolejnych standardach ITU-T/ISO.
         EN: X.683 | ISO/IEC 8824-4 (Parameterization of ASN.1 specifications) Standards describing the ASN.1 encoding rules: * ITU-T Rec.
Line 9   PL: pierwszego).
         EN: One criterion.
Line 10  PL: problemy nierozstrzygalne.
         EN: .''.
Line 11  PL: Jeżeli dany algorytm da się wykonać na maszynie o dostępnej mocy obliczeniowej i pamięci oraz akceptowalnym czasie, to mówi się, że jest obliczalny.
         EN: I prefer to call it merely a logical-diagram machine ... but I suppose that it could do very completely all that can be rationally expected of any logical machine”.
Line 12  PL: temperaturę) w optymalnym zakresie.
         EN: “Algorithmic theories”...
Line 13  PL: Na początku lat 30.
         EN: U.S. Dept.
For this part of the research, the Wikipedia corpus was extracted from
a comparable corpus generated from Wikipedia articles. It was about 104
MB in size and contained 475,470 parallel sentences. Its first version was
acknowledged as permissible data for the IWSLT 2014 evaluation campaign.
The TED Talks were chosen as a representative sample of noisy parallel
corpora. The Polish data in the TED talks (about 15 MB) include almost
2 million words that are not tokenized. The transcripts themselves are provided
as pure text encoded in UTF-8 format [27]. In addition, they are separated
into sentences (one per line) and aligned in language pairs. However, some
discrepancies in the text parallelism are present. These discrepancies are
mainly repetitions of Polish text not included in the parallel English text.
Another problem was that the TED 2013 data contained many errors. This
data set had spelling errors that artificially increased the dictionary size and
made the statistics unreliable. A very large Polish dictionary [40], consisting
jak siedząc przy biurku pomyślałem, dobrze, wiem to. To jest wielkie naukowe
odkrycie.”
Another serious problem discovered (especially for statistical machine
translation) was that English sentences were translated in an improper manner.
There were four main problems:
1. Repetitions—part of the text is repeated several times after translation,
for example:
a. EN: Sentence A. Sentence B.
b. PL: Translated Sentence A. Translated Sentence B. Translated
Sentence B. Translated Sentence B.
2. Wrong usage of words—when one or more words used for the Polish
translation slightly change the meaning of the original English sentence,
for example:
a. EN: We had these data a few years ago.
b. PL (the proper meaning of the Polish sentence): We’ve been delivered
these data a few years ago.
3. Indirect translations or usage of metaphors—when the Polish translation
uses a different vocabulary to preserve the meaning of the original
sentence, especially when the exact translation would result in a sentence
that makes no sense. Many metaphors are translated this way.
4. Translations that are not precise enough—when the translated fragment
does not contain all the details of the original sentence, but only its overall
meaning is the same.
[Figure: the comparison pipeline: the original Polish text is translated to English by online translation engines, and the translated text is compared with the original English text using comparison heuristics.]
24 https://docs.python.org/2/library/difflib.html
with the next lines of src.trans enables making the best possible selection in
the alignment process.
There are additional complexities that must be addressed. Comparing the
src.trans lines with the src.en lines is not easy, and it becomes harder when
the similarity rate must be used to choose the correct, real-world translation.
There are many strategies for comparing two sentences. It is possible
to split each sentence into its words and find the number of common words
in both sentences. However, this approach has some problems. For example,
let us compare “It is origami” to these sentences: “The common theme what
makes it origami is folding is how we create the form” and “This is origami.”
With this strategy, the first sentence is more similar, because it contains
all 3 words. However, it is clear that the second sentence is the correct choice.
This problem can be solved by dividing the number of words common to both
sentences by the total number of words in the sentences. However, counting stop words
in the intersection of sentences sometimes causes incorrect results. So, stop
words are removed before comparing two sentences.
Another problem is that inflected forms of the same stem can appear in the
sentences, for example, “boy” and “boys.” Despite the fact that these two
words should be counted as similar, they are not counted by this strategy.
The next comparison problem is the word order in sentences. In Python,
there are other ways for comparing strings that are better than counting
intersection lengths. The Python “difflib” library for string comparison
contains a function that first finds matching blocks of two strings. For example,
difflib can be used to find matching blocks in the strings “abxcd” and “abcd.”
Difflib’s “ratio” function divides the length of matching blocks by the
length of two strings and returns a measure of the sequences’ similarity as
a float value in the range [0, 1]. This measure is 2.0 * M/T, where T is the
total number of elements in both sequences, and M is the number of matches.
Note that this measure is 1.0 if the sequences are identical, and 0.0 if they
have nothing in common. Using this function to compare strings instead of
counting similar words helps to solve the problem of the similarity of “boy”
and “boys.” It also solves the problem of considering the position of words in
sentences.
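For illustration, the ratio-based comparison described above can be reproduced directly:

    from difflib import SequenceMatcher

    def similarity(a, b):
        # ratio() returns 2.0 * M / T: M matched elements, T the total
        # number of elements in both sequences
        return SequenceMatcher(None, a, b).ratio()

    print(similarity("It is origami", "This is origami"))   # high
    print(similarity("It is origami",
                     "The common theme what makes it origami is folding "
                     "is how we create the form"))           # much lower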
Another problem in comparing lines is synonyms. For example, consider
these sentences: “I will call you tomorrow”; “I would call you tomorrow.” If
it is essential to know if these sentences are the same, the algorithm should
“know” that “will” and “would” can be used interchangeably.
The NLTK Python module and WordNet were used to find synonyms
for each word, and these synonyms were used in comparing sentences.
Creating every combination of synonyms for all the words of a sentence is
not computationally feasible, so one synonym is substituted at a time, each
substitution producing a new sentence. For example, suppose that the word
“game” has the synonyms “play”, “sport”, “fun”, “gaming”, “action” and
“skittle.” Used in the sentence “I do not like game”, they create the sentences:
“I do not like play”; “I do not like sport”; “I do not like fun”; “I do not like
gaming”; “I do not like action”; and “I do not like skittle.”
Next, the algorithm should try to find the best score by comparing all
these sentences, instead of just comparing the main sentence. One issue is that
this type of comparison takes a long time, because the algorithm needs to do
many comparisons for each selection.
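A minimal sketch (assuming NLTK with the WordNet data installed) of the one-word-at-a-time synonym expansion described above:

    from nltk.corpus import wordnet

    def synonym_variants(sentence):
        words = sentence.split()
        variants = [sentence]
        for i, word in enumerate(words):
            synonyms = {lemma.name().replace("_", " ")
                        for syn in wordnet.synsets(word)
                        for lemma in syn.lemmas()} - {word}
            for s in sorted(synonyms):
                # one substitution per variant
                variants.append(" ".join(words[:i] + [s] + words[i + 1:]))
        return variants

    for v in synonym_variants("I do not like game")[:8]:
        print(v)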
Difflib has other functions (in the SequenceMatcher and Differ classes) to
compare strings that are faster than the described solution, but their accuracy
is worse. To overcome all these problems and obtain the best results, two
criteria should be considered: The speed of the comparison function and the
comparison acceptance rate.
To obtain the best results, the script provides users with the ability to
specify multiple functions with multiple acceptance rates. Fast functions with
lower-quality results are tested first. If they can find results with a very high
acceptance rate, the tool should accept their selection. If the acceptance rate
is not sufficient, it can use slower but more accurate functions. The user can
configure these rates manually and test the resulting quality to get the best
results. All are well-described in documentation [44].
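A sketch of such a cascade, under assumed function names and thresholds (quick_ratio is difflib's fast upper-bound estimate; ratio is slower but exact):

    from difflib import SequenceMatcher

    def quick_ratio(a, b):
        return SequenceMatcher(None, a, b).quick_ratio()  # fast, rough

    def full_ratio(a, b):
        return SequenceMatcher(None, a, b).ratio()        # slow, exact

    def cascade_compare(a, b, comparators):
        # comparators: (function, acceptance rate) pairs, fastest first
        for compare, threshold in comparators:
            score = compare(a, b)
            if score >= threshold:
                return score   # confident enough; skip slower functions
        return 0.0             # no comparator accepted the pair

    comparators = [(quick_ratio, 0.95), (full_ratio, 0.70)]
    print(cascade_compare("This is origami", "It is origami", comparators))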
Some additional scoring algorithms, better suited for comparing language
translation quality, were also implemented: BLEU, native implementations
of TER and Character Edit Rate (CER) that use pycdec, and RIBES [120].
These algorithms were used to generate likelihood scores for two
sentences, to choose the best one in the alignment process. For this purpose,
the cdec and pycdec Python modules were used. The cdec module is a decoder,
aligner and learning framework for statistical machine translation and similarly-
structured prediction models. It provides translation and alignment modeling
based on finite-state transducers and synchronous context-free grammars,
as well as implementations of several parameter-learning algorithms [121,
122]. The pycdec module supports the cdec decoder. It enables Python coders
to use cdec’s fast C++ implementation of core finite-state and context-free
inference algorithms for decoding and alignment. Its high-level interface
allows developers to build integrated MT applications that take advantage
of the rich Python ecosystem without sacrificing computational performance.
The modular architecture of pycdec separates search space construction,
rescoring and inference.
The cdec module includes implementations of the basic evaluation
metrics (BLEU, TER and CER), exposed in Python via the cdec.score module.
For a given (reference, hypothesis) pair, sufficient statistics vectors
(SufficientStats) can be computed. These vectors are then summed for all
sentences in the corpus, and the final result is converted into a real-valued score.
Before aligning a large data file, it is important to determine the proper
comparators and acceptance rates for each comparison heuristic. Files of
1,000–10,000 lines result in the best performance. It is recommended to
first evaluate each comparison method separately, and then combine the best
ones in a specific scenario. For this purpose, using a binary search method to
determine the best threshold factor value is recommended.
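A sketch of that search, assuming alignment quality is roughly unimodal in the threshold; evaluate() is a hypothetical scorer run on a small held-out sample, and a ternary variant of the recommended bisection is shown:

    def best_threshold(evaluate, lo=0.0, hi=1.0, iters=20):
        # narrow [lo, hi] around the maximum of a unimodal quality curve
        for _ in range(iters):
            m1 = lo + (hi - lo) / 3
            m2 = hi - (hi - lo) / 3
            if evaluate(m1) < evaluate(m2):
                lo = m1   # the maximum lies to the right of m1
            else:
                hi = m2   # the maximum lies to the left of m2
        return (lo + hi) / 2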
Finally, the SMT systems were trained on the original and cleaned data
to show the data’s influence on the results. The results were compared using
the BLEU, NIST, METEOR and TER metrics. The results are shown in
Table 22. In these experiments, tuning was disabled because of the known
MERT instability [45]. Test and development sets were taken from the IWSLT
evaluation campaigns (from different years) and cleaned before usage with
the help of human translators.
The significance of the results of Table 24 was calculated. The baseline
was compared to human work and to the filtering tool. In addition, human
work was compared with the proposed filtering tool. The calculation was done
using the two-tailed Wilcoxon test. The results presented in Table 23 show
that the differences among the results were highly significant, that the filtering
tool performed better than the baseline method, and that humans performed
better than both.
Aligner           Score
Proposed Method   98.94
Bleualign         96.89
Hunalign          97.85
ABBYY Aligner     84.00
Wordfast Aligner  81.25
Unitex Aligner    80.65
25 https://www.wordfast.net/wiki/Wordfast_Aligner
A special metric was needed to properly evaluate sentences that are correctly
aligned but built from synonyms or with a different phrase order. The
following weights were determined by empirical research and can easily be
adjusted if needed. For an aligned sentence, 1 point is given. For a misaligned
sentence, a –0.2 point penalty is given. For web service translations,
0.4 points are given. For translations due to a disproportion between the
input files (when one of the two files includes more sentences), 1 point is
given. The score is normalized to fit between 1 and 100; a higher value is
better. A floor function rounds the score to an integer value.
The score S is defined as:

S = floor( 20 (5A − M + 2T + 5|D|) / L )   (77)
where A is the number of aligned sentences, M is the number of misaligned
sentences, T is the number of translated sentences, D is the number of lines
not found in both language files (one file can contain some sentences that do
not exist in the other one) and L is the total number of output lines.
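A worked sketch of equation (77) on hypothetical counts:

    from math import floor

    def alignment_score(A, M, T, D, L):
        # A aligned, M misaligned, T web-translated, D lines missing from
        # one side, L total output lines
        return floor(20 * (5 * A - M + 2 * T + 5 * abs(D)) / L)

    print(alignment_score(A=950, M=20, T=30, D=5, L=1005))  # -> 95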
Clearly, the first three aligners scored well. The proposed method is fully
automatic. It is important to note that Bleualign does not translate text and
requires that it be done manually.
As discussed earlier, it is important not to lose lines of text in the alignment
process. Table 26 shows the total lines resulting from the application of each
alignment method. Five TED transcriptions were randomly selected for this
part of the experiment.
Aligner            Lines
Human Translation  1,005
Proposed Method    1,005
Bleualign          974
Hunalign           982
ABBYY Aligner      866
Wordfast Aligner   843
Unitex Aligner     838
Almost all the aligners compared, other than the proposed method, lose
lines of text as compared to a reference human translation. The proposed
method lost no lines.
For the purpose of showing the output quality with an independent metric,
it was decided to compare results with BLEU, NIST, METEOR and TER (the
lower the better), in a comparison with human-aligned texts. Those results are
presented in Table 27.
In general, sentence alignment algorithms are very important in creating
parallel corpora. Most aligners are not fully automatic, but the one proposed
here is, which gives it a distinct advantage. It also enables creation of a corpus
when sentences exist in only a single language. The proposed approach is also
language-independent for languages with a structure similar to PL or EN.
The results show that the proposed method performed very well in terms
of the metric. It also lost no lines of text, unlike the other aligners. This is
critical to the end goal of obtaining a translated text. The proposed alignment
method also scored better when compared to typical machine translation
metrics, and will most likely improve MT system output quality.
[Figure: word-alignment matrices for the sentence pair “ja ogłaszam wznowienie posiedzenia parlamentu europejskiego” ⬄ “i declare resumed the session of the european parliament”.]
26 http://www.nlp.pwr.wroc.pl/en/tools-and-resources/narzedzia-przetwarzania-morfo-syntaktycznego
<tok>
  <orth>ludzi</orth>
  <lex disamb="1"><base>człowiek</base>
    <ctag>subst:pl:gen:m1</ctag></lex>
  <lex disamb="1"><base>ludzie</base>
    <ctag>subst:pl:gen:m1</ctag></lex>
</tok>
In this example, only one form (the first stem) is used for further
processing.
An XML extractor tool (in Python) was developed to generate three
different corpora for the Polish language data:
• Word stems
• Subject-Verb-Object (SVO) word order
• Both the word stem and the SVO word order form
This enables experiments with these preprocessing techniques.
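A minimal sketch (not the author's extractor) of pulling the first disambiguated base form out of the XML shown above:

    import xml.etree.ElementTree as ET

    snippet = """<tok>
      <orth>ludzi</orth>
      <lex disamb="1"><base>człowiek</base><ctag>subst:pl:gen:m1</ctag></lex>
      <lex disamb="1"><base>ludzie</base><ctag>subst:pl:gen:m1</ctag></lex>
    </tok>"""

    tok = ET.fromstring(snippet)
    # only the first disambiguated stem is used for further processing
    print(tok.find("lex[@disamb='1']/base").text)  # -> człowiek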
4.9.2.1.2 Lemmatization
Another approach to deal with disproportionate vocabulary sizes is using
lemmas instead of surface forms to reduce the Polish vocabulary size, using
PSI-TOOLKIT [59] to convert each Polish word into a lemma. This toolkit
provides a tool chain for automatic processing of the Polish language and,
to a lesser extent, other languages like English, German, French, Spanish
and Russian (with a focus on machine translation). The tool chain includes
segmentation, tokenization, lemmatization, shallow parsing, deep parsing,
rule-based machine translation, statistical machine translation, automatic
generation of inflected forms from lemma sequences and automatic post editing.
The toolkit was used as an additional information source for SMT system
preparation. It can also be used as a first step in implementing a factored SMT
system that, unlike a phrase-based system, includes morphological analysis,
translation of lemmas and features such as the generation of surface forms.
Incorporating additional linguistic information should improve translation
performance [60].
As previously mentioned, lemmas extracted from Polish words are used
instead of surface forms to overcome the problem of the huge difference in
vocabulary sizes. For Polish lemma extraction, a tool chain that included
tokenization and lemmatization from PSI-TOOLS was used [59].
These tools used in sequence provide a rich output that includes a lemma
form of the tokens, prefixes, suffixes and morphosyntactic tags. Unfortunately,
unknown words like names, abbreviations, or numbers, etc., are lost in the
process. In addition, capitalization, as well as punctuation, is removed. To
preserve this relevant information, a specialized tool was implemented based
on differences between the input and output of PSI-TOOLS [59]. This tool
restores most of the lost information.
corpus. The selected sentences were removed from the corpora. The testing
system was trained with the baseline settings. In addition, a system was trained
with extended data from the Wikipedia corpora. Lastly, Modified Moore-
Lewis filtering was used for domain adaptation of the Wikipedia corpora.
The monolingual part of the corpora was used as a language model and was
adapted for each corpus by using linear interpolation [82].
An evaluation was conducted using test sets built from 2,000 randomly-
selected bi-sentences taken from each domain. For scoring purposes, four
well-known metrics that show high correlation with human judgments were
used: BLEU, the NIST metric, METEOR and TER.
Starting from the baseline system tests in the PL-to-EN and EN-to-PL
directions, the effects of the following changes were investigated: Extending
the language model, interpolating it, supplementing the corpora with additional
data and filtering additional data with Modified Moore-Lewis filtering [82].
It should be noted that the language models were extended after MML filtration.
system was trained using baseline settings. The additional corpora were used
in the experiments by adding parallel data to the training set with Modified
Moore-Lewis filtering and by adding a monolingual language model with
linear interpolation.
To verify the results, an SMT system was trained using only data extracted
from comparable corpora (not using the original in-domain data). The mined
data was also used as a language model. The evaluation was conducted using
the same test sets as before.
27 www.nlp.pwr.wroc.pl
The TED data is a mix of various, unrelated topics. This is most likely the
reason why big improvements cannot be expected with this data, and why
the translation quality metrics score generally low.
There is almost no additional data available in Polish, in comparison to the
huge amount of data in, for example, French or German. Some of the data
were obtained from the OPUS [223] project page, some from other small
projects, and the rest was collected manually using web crawlers.
The new data were:
• A Polish-English dictionary (bilingual parallel)
• Additional (newer) TED Talks data sets not included in the original training
data (we crawled bilingual data and created a corpus from it) (bilingual
parallel)
• E-books (monolingual PL + monolingual EN)
• Proceedings of the UK House of Lords (monolingual EN)
• Subtitles for movies and TV series (monolingual PL)
• Parliament and senate proceedings (monolingual PL)
• Wikipedia Comparable Corpus (bilingual parallel)
• Euronews Comparable Corpus (bilingual parallel)
• Repository of PJIIT’s diplomas (monolingual PL)
• A large amount of PL monolingual data web-crawled from major web
portals: blogs, chip.pl, the Focus newspaper archive, interia.pl, wp.pl,
onet.pl, money.pl, Usenet, Termedia, Wordpress web pages, the Wprost
newspaper archive, the Wyborcza newspaper archive, the Newsweek
newspaper archive, etc.
“Other” in the table below stands for many very small models merged
together. EMEA are texts from the European Medicines Agency, KDE4
is a localization file of that GUI, ECB stands for European Central Bank
corpus, OpenSubtitles [12] are movies and TV series subtitles, EUNEWS
is a web crawl of the euronews.com web page and EUBOOKSHOP
comes from bookshop.europa.eu. Lastly, bilingual TEDDL is additional TED
data.
Data perplexity was examined by experiments with the TED lectures.
Perplexities for the dev2010 data sets are shown in Table 28. In the table,
“PPL” indicates perplexity values without smoothing of the data. “PPL + KN”
indicates perplexity values with Kneser-Ney smoothing of the data.
5.1.1.2 Lemmatization
During the lemmatization experiments, the lemmatized version of the Polish
training data was reduced to 36,065 unique words, and the Polish language
model was reduced from 156,970 to 32,873 unique words. The results of the
experiments are presented in Tables 33 and 34. Each experiment was performed
only on the baseline data sets in the PL->EN and EN->PL directions. The year
(official test sets from the IWSLT campaigns28) column shows the test set that
was used in the experiment. If a year has the suffix “L,” it means that it is a
lemmatized version of the baseline system.
The experiments show that lemma translation to EN in each test set
decreased the evaluation scores, while translation from EN to lemma for each
set increased the translation quality. Such a solution also requires training of a
system from lemma to PL to restore proper surface forms of the words. Such
a system was trained as well, evaluated on official test sets from 2010–2014,
and tuned on 2010 development data. The results for that system are presented
28 www.iwslt.org
in Table 35. Even though the scores are relatively high, the results do not seem
satisfactory enough to provide an overall improvement of the EN-LEMMA-
PL pipeline over direct translation from EN to PL.
using lower casing, changing maximum sentence length to 85 and setting the
maximum phrase length to 7 improves the BLEU score. The best results were
achieved using these settings and models: A language model order of 6, the
Witten-Bell discounting method, a lexicalized reordering method of tgttosrc
during training, system enrichment using the OSM, compound splitting,
punctuation normalization, tuning using MERT with the batch-mira feature,
an n-best list size of 150 and training using a hierarchical phrase-based
translation model. These settings and language models produced a BLEU
score of 19.81. All available parallel data was used. The data was adapted
using Modified Moore-Lewis filtering.
From the empirical experiments, it was concluded that the best results are
obtained when sampling about 150,000 bi-sentences from in-domain corpora
and by using filtering after word alignment. The ratio of data to be retained
was set to 0.9, producing the best score of 22.76. The results are presented in
Table 37.
In the future, the author intends to try clustering the training data into word
classes in order to obtain smoother distributions and better generalizations.
Using class-based models has been shown to be useful when translating into
morphologically-rich languages like Polish [79]. The author also wants to use
unsupervised transliteration models [80], which have proven quite useful in
MT for translating OOV words, for disambiguation and for translating closely-
related languages [80]. This feature would most likely help us overcome the
difference in vocabulary size, especially when translating into PL. Using a
fill-up combination technique (instead of interpolation) may be useful when
the relevance of the models is known a priori—typically, when one is trained
on in-domain data and the others on out-of-domain data [57].
29 http://www.statmt.org/moses/?n=Moses.SupportTools
30 http://www.statmt.org/moses/?n=FactoredTraining.EMS
        PL->EN                       EN->PL
System  BLEU   NIST  METEOR  TER    BLEU   NIST  METEOR  TER
O       53.21  7.57  66.40   46.01  51.87  7.04  62.15   47.66
OC      53.13  7.58  66.80   45.70  –      –     –       –
OT      52.63  7.58  67.02   45.01  50.57  6.91  61.24   48.43
OF      53.51  7.61  66.58   45.70  52.01  6.97  62.06   48.22
        PL->EN                        EN->PL
System  BLEU   NIST   METEOR  TER    BLEU   NIST   METEOR  TER
E       73.18  11.79  87.65   22.03  67.71  11.07  80.37   25.69
EL      80.60  12.44  91.07   12.44  –      –      –       –
ELC     80.68  12.46  90.91   16.78  67.69  11.06  80.43   25.68
ELT     78.09  12.41  90.75   17.09  64.50  10.99  79.85   26.28
ELF     80.42  12.44  90.52   17.24  69.02  11.15  81.83   24.79
ELI     70.45  11.49  86.21   23.54  70.73  11.44  83.44   22.50
ELS     61.51  10.65  81.75   31.71  49.69  9.38   69.05   40.51
ELH     82.48  12.63  91.17   15.73  –      –      –       –
the scores by a significant factor. Supposedly, the text was already properly
cased and punctuated. In Experiment 02, it was observed that, quite strangely,
OSM decreased some metric results, whereas it usually increases translation
quality. Similar results can be seen in the EN->PL experiments. Here, the
BLEU score increased, but other metrics decreased.
Optimizer                Exp. No.  BLEU   Phrase Table (GB)  Reordering Table (GB)
None                     2         35.39  6.4                2.3
Absolute 3               1         26.95  1.1                0.4
Absolute 2               7         30.53  2.5                0.9
Absolute 1               8         32.07  4.9                1.7
Relative 2.5             9         30.82  3.1                1.1
Relative 5               10        26.35  1.1                0.4
Relative 7.5             11        17.68  0.3                0.1
Pruning 30               3         32.36  1.9                0.7
Pruning 60               4         32.12  2.0                0.7
Pruning 90               5         32.11  2.0                0.75
Pruning 20               6         32.44  2.1                0.75
Absolute 1 + Pruning 20  12        30.29  0.85               0.3
have been doing laundry”, but “Zrobiłem pranie” as “I have done laundry”, or
“płakać-wypłakać” as “cry-cry out” [8].
The gender of a noun in English does not have any effect on the form
of a verb, but it does in Polish. For example, “Zrobił to. – He has done it”,
“Zrobiła to. – She has done it”, “lekarz/lekarka-doctor”, “uczeń/uczennica =
student”, etc. [8].
As a result of this complexity, progress in the development of SMT
systems for West-Slavic languages has been substantially slower than for
other languages. On the other hand, excellent translation systems have been
developed for many popular languages.
Sequence Model (OSM). We also used the Compound Splitting feature and
performed punctuation normalization. Tuning was done using the MERT tool
with the batch-mira feature, and the n-best list size was changed from 100 to
150. Training a hierarchical phrase-based translation model also improved
results in this translation scenario [18].
In the PL-CS experiments, it was necessary to adjust the parameters to suit
translation between two morphologically rich languages. We changed the
language model order to 8, which, in contrast with PL-EN translation,
produced positive results. The maximum sentence length was changed from
80 to 100 (in PL-EN, exceeding 85 did not raise the scores). The lexicalized
reordering method was changed to wbe-msd-bidirectional-fe, in order to use
word-based data extraction. In tuning, the n-best list size was raised to 200,
because there were many possible candidates to choose from in that
language pair.
5.1.5.5 Evaluation
Metrics are necessary in order to measure the quality of translations produced
by the SMT systems. For this, various automated metrics are available for
comparing SMT translations to high quality human translations. Since each
human translator produces a translation with different word choices and
orders, the best metrics measure SMT output against multiple reference human
translations. For scoring purposes we used four well-known metrics that show
high correlation with human judgments. Among the commonly used SMT
metrics are: Bilingual Evaluation Understudy (BLEU), the U.S. National
Institute of Standards & Technology (NIST) metric, the Metric for Evaluation
of Translation with Explicit Ordering (METEOR) and Translation Error Rate
(TER). According to Koehn, BLEU [16] uses textual phrases of varying length
to match SMT and reference translations. Scoring of this metric is determined
by the weighted averages of those matches [19]. To encourage infrequently
used word translation, the NIST [19] metric scores the translation of such
words higher and uses the arithmetic mean of the n-gram matches. Smaller
differences in phrase length incur a smaller brevity penalty. This metric has
shown advantages over the BLEU metric. The METEOR [19] metric also
changes the brevity penalty used by BLEU, uses the arithmetic mean like
NIST, and considers matches in word order through examination of higher
order n-grams. These changes increase score based on recall. It also considers
best matches against multiple reference translations when evaluating the SMT
output. TER [20] compares the SMT and reference translations to determine the
minimum number of edits a human would need to make for the translations to
be equivalent in both fluency and semantics. The closest match to a reference
translation is used in this metric. There are several types of edits considered:
Word deletion, word insertion, word order, word substitution and phrase order.
5.1.5.6 Results
The experiment results are gathered in Tables 45, 46 and 47. BASE stands
for the baseline system settings and BEST for the translation systems with
modified training settings (in accordance with Section 4.2). Table 45 contains
translation results for written texts, and Table 46 for spoken language. The
decision was made to use the EU Bookshop31 (EUB) document-based corpus
as an example of written language and the QCRI Educational Domain
Corpus32 (QED), an open multilingual collection of subtitles for educational
videos and lectures, as an example of spoken language. The corpora
specification is shown in Table 44.
31 http://bookshop.europa.eu
32 http://alt.qcri.org/resources/qedcorpus/
5.1.6.4 Experiments
Experiments were conducted to evaluate different SMT systems using various
data. In general, each corpus was tokenized, cleaned, factorized, converted to
lowercase, and split. A final cleaning was performed. For each experiment,
the SMT system was trained, a language model was applied and tuning was
performed. The experiments were then run. For OCR purposes, we used the
well-known, good-quality Tesseract engine [140]; in this way, we also
evaluated the impact of OCR mistakes on translation. It must also be noted
that the OCR system was not adapted to any specific types of images or texts,
which would most probably have improved its quality. This was not done
because it was not a goal of this research.
The Moses toolkit, its Experiment Management System and the KenLM
language modeling library [14] were used to train the SMT systems and
conduct the experiments. Training included use of a 5-gram language model
based on Kneser-Ney discounting. SyMGIZA++ [237], a multi-threaded and
symmetrized version of the popular GIZA++ tool [8], was used to apply a
symmetrizing method in order to ensure appropriate word alignment. Two-
way alignments were obtained and structured, leaving the alignment points
that appear in both alignments. Additional points of alignment that appear in
their union were then combined. The points of alignment between the words
in which at least one was unaligned were then combined (grow-diag-final).
This approach facilitates an optimal determination of points of alignment
[11]. The language model was binarized by applying the KenLM tool [234].
5.1.6.5 Discussion
AR systems would greatly benefit from the application of state-of-the-art
SMT systems. For example, translation and display of text in a smart phone
image would enable a traveler to read medical documents, signs and restaurant
menus in a foreign language. In this monograph, we have reviewed the state
However, some applications require much more than this. For example, the
beauty and correctness of writing may not be important in the medical field,
but the adequacy and precision of the translated message is very important. A
communication or translation error between a patient and a physician in regard
to a diagnosis may have serious consequences on the patient’s health. Progress
in SMT research has recently slowed down. As a result, new translation
methods are needed. Neural networks provide a promising approach for
translation [139] due to their rapid progress in terms of methodology and
computation power. They also bring the opportunity to overcome limits of
statistical-based methods that are not context-aware.
Machine translation has been applied to the medical domain due to the
recent growth in the interest in and success of language technologies. As an
example, a study was done on local and national public health websites in
the USA with an analysis of the feasibility of edited machine translations for
health promotional documents [173]. It was previously assumed that machine
translation was not able to deliver high-quality documents that can be used
for official purposes. However, language technologies have been steadily
advancing in quality. In the not-too-distant future, it is expected that machine
translation will be capable of translating any text in any domain at the required
quality.
The medical data field is a bit narrow, but very relevant and a promising
research area for language technologies. Medical records can be translated by
use of machine translation systems. Access to translations of a foreign patient’s
medical data might even save their life. Direct speech-to-speech translation
systems are also possible. An automated speech recognition (ASR) system
can be used to recognize a foreign patient’s speech. After it is recognized, the
speech could be translated into another language with speech synthesis in real
time. As an example, the EU-BRIDGE project intends to develop automatic
transcription and translation technology. The project aims to provide innovative
multimedia translation services for audiovisual materials between European
and non-European languages.33
Making medical information understandable is relevant to both
physicians and patients [238]. As an example, Healthcare Technologies for
the World Traveler emphasizes that a foreign patient may need a description
and explanation of their diagnosis, along with a related and comprehensive set
of information. In most countries, residents and immigrants communicate in
languages other than the official one [239].
Karliner et al. [67] talk of the necessity of human translators for obtaining
access to healthcare information and, in turn, improving its quality. However,
33 http://www.eu-bridge.eu
34 http://opus.lingfil.uu.se/EMEA.php
35 https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)
word and phrase alignment, the MGIZA++ tool [9], a multithreaded version
of the GIZA++ tool, was employed. KenLM [10] was employed in order
to ensure high quality binaries of the language model. Lexical reordering
was set to the msd-bidirectional-fe model, and the phrase probabilities
were reordered according to their lexical values. This included three unique
reordering orientation types applied to source and target phrases: Monotone
(M), swap (S) and discontinuous (D). The bidirectional model’s reordering
includes the probabilities of positions in relation to the actual and subsequent
phrases. The probability distribution of English phrases is evaluated by “e,”
and foreign phrase distribution by “f”. For appropriate word alignment, a
method of symmetrizing the text was developed. At first, two-way alignments
from GIZA++ were structured, which resulted in leaving only the points
of alignments appearing in both. The next phase involved combination
of additional alignment points appearing in the union. Additional steps
contributed to potential point alignment of neighboring and unaligned words.
Neighboring points can be positioned to the left or right, top or bottom, with
an additional diagonal (grow-diag) position. In the final phase, a combination
of alignment points between words is considered, where some are unaligned.
The application of the grow-diag-final method will determine points of
alignment between two unaligned words [18].
The neural network was implemented using the Groundhog and Theano
tools. Most of the proposed neural machine translation models belong to the
encoder-decoder family [243], using an encoder and a decoder for every
language, or applying a language-specific encoder to each sentence, the
outputs of which are then compared [243]. A translation is the output the
decoder produces from the encoded vector. The entire encoder-decoder system,
consisting of an encoder and decoder for each language pair, is jointly trained
for maximization of the correct translation.
A potential downside to this approach is that a neural network will need
to have the capability of compressing all necessary information from a source
sentence into a vector of a fixed length. A challenge in dealing with long
sentences may arise. Cho et al. showed that an increase in length of an input
sentence will result in the deterioration of basic encoder performance [243].
That is the reason behind the encoder-decoder model, which learns to
jointly translate and align. After generating a word when translating, the
model searches for position sets in the source sentence that contains all the
required information. A target word is then predicted by the model based on
context vectors [78].
A significant and unique feature of this model approach is that it does not
attempt to encode an input sentence into a vector of fixed length. Instead, the
sentence is mapped to a vector sequence, and the model adaptively chooses
Results and Conclusions 155
a vector subset as it decodes the translation. This gets rid of the burden of
a neural translation model compressing all source sentence information,
regardless of length, into a fixed length vector [78].
Two types of neural models were trained. The first is an RNN Encoder–
Decoder [78], and the other is the model proposed in [78]. Each model was
trained with the sentences of length up to 50 words. The encoder and decoder
had 1000 hidden units. A multilayer network with a single maxout [244]
hidden layer was used to compute the conditional probability of each target
word [245]. In addition, a minibatch stochastic gradient descent (SGD) algorithm
together with Adadelta [246] was used to train each model. Each SGD update
direction was computed using a minibatch of 80 sentences. When models
were trained, a beam search algorithm was used in order to find a translation
that approximately maximizes the conditional probability [247, 248].
5.1.7.4 Results
The experiments were performed in order to evaluate the optimal translation
methods for English to Polish and vice versa. The experiments involved
the running of a number of tests with use of the developed language data.
Random selection was used for data collection, accumulating 1000 sentences
for each case. Sentences composed of 50 words or fewer were used, due to
hardware limits, with 500,000 training iterations and neural networks having
750 hidden units. The NIST, BLEU, TER and METEOR metrics were used
for evaluation of the results. The TER metric tool is considered the best one;
it shows a low value for high quality, while the other metrics use high scores
to indicate high quality. For comprehension and comparison, all metrics were
made to fit into the 0 to 100 range.
Scores lower than 15 BLEU points mean that the machine translation
engine is unable to provide satisfactory quality, as reported by Lavie [144]
and a commercial software manufacturer.36 A high level of post-editing will be
required in order to finalize output translations and reach publishable quality. A
system score greater than 30 means that translations should be understandable
without problems. Scores over 50 reflect good and fluent translations.
The computation time for the statistical model was 1 day, whereas each
neural model took about 4–5 days. The statistical model was computed using
a single Intel Core i7 CPU (8 threads). For the neural model calculations, the
power of GPU units was used (one GeForce 980 card).
The results presented in Table 51 are for Polish-to-English translation and
those in Table 52 for English-to-Polish. Statistical translation
36 http://www.kantanmt.com/
results are annotated as SMT in the tables. Translation results from the
most popular neural model are annotated as ENDEC, and SEARCH indicates
the neural network-trained systems. The results are visualized in
Figures 33 and 34.
Polish-to-English
System  BLEU   NIST   METEOR  TER
Google  18.11  36.89  66.14   62.76
SMT     36.73  55.81  60.01   60.94
ENDEC   21.43  35.23  47.10   47.17
SEARCH  24.32  42.15  56.23   51.78
English-to-Polish
System  BLEU   NIST   METEOR  TER
Google  12.54  24.74  55.73   67.14
SMT     25.74  43.68  58.08   53.42
ENDEC   15.96  31.70  62.10   42.14
SEARCH  17.50  36.03  64.36   48.46
[Figures 33 and 34: bar charts of the metric scores for the SMT, ENDEC and SEARCH systems.]
Classifier  Value                PL         EN
TED         Size in MB           41.0       41.2
            No. of sentences     357,931    357,931
            No. of words         5,677,504  6,372,017
            No. of unique words  812,370    741,463
BTEC        Size in MB           3.2        3.2
            No. of sentences     41,737     41,737
            No. of words         439,550    473,084
            No. of unique words  139,454    127,820
EMEA        Size in MB           0.15       0.14
            No. of sentences     1,507      1,507
            No. of words         18,301     21,616
            No. of unique words  7,162      5,352
EUP         Size in MB           8.0        8.1
            No. of sentences     74,295     74,295
            No. of words         1,118,167  1,203,307
            No. of unique words  257,338    242,899
OPEN        Size in MB           5.8        5.7
            No. of sentences     25,704     25,704
            No. of words         779,420    854,106
            No. of unique words  219,965    198,599
To be more specific, the BLEU, METEOR and TER results for the TED
corpus, found in Tables 45 and 46, were evaluated in order to determine if the
differences in methods were relevant. The variance due to the BASE and MML
set selection was measured. It was calculated using bootstrap resampling37 for
each test run. The variances were 0.5 for BLEU, 0.3 for METEOR, and 0.6 for
TER. The positive variances mean that there are significant (to some extent)
differences between the test sets. It also indicates that a difference of this
magnitude is likely to be generated again by some random translation process,
which would most likely lead to better translation results in general [83].
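For illustration, such a variance estimate can be sketched as follows (the per-sentence scores are hypothetical; the actual analysis relied on the multeval tool cited below):

    import random

    def bootstrap_variance(scores, resamples=1000, seed=7):
        # resample the test set with replacement and measure how much the
        # corpus-level mean score moves
        rng = random.Random(seed)
        n = len(scores)
        means = [sum(rng.choices(scores, k=n)) / n for _ in range(resamples)]
        mean = sum(means) / resamples
        return sum((m - mean) ** 2 for m in means) / (resamples - 1)

    sentence_bleu = [21.3, 18.7, 25.1, 19.9, 22.4, 20.8, 23.5, 17.6]
    print(bootstrap_variance(sentence_bleu))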
To verify this analysis, an SMT system using only data extracted from
comparable corpora (not using the original domain data) was trained. The
37 https://github.com/jhclark/multeval
mined data was also used as a language model. The evaluation was conducted
on the same test sets that were used in Tables 55 and 56. This was done
intentionally in order to determine how such a system would cope with the
translation of domain-specific text samples. Doing so provided an assessment
of the influence of additional data on translation quality and of the similarity
between mined data and in-domain data. The results are presented in
Tables 57 and 58. The BASE rows show the results for baseline systems
trained on original, in-domain data; the MONO rows indicate systems trained
only on mined data in one direction; and the BI rows show results for a system
trained on data mined in two directions with duplicate segments removed.
The results for the SMT systems based only on mined data are not very
surprising. First, they confirm the quality and high level of parallelism of the
corpora. This can be concluded from the translation quality, especially for
the TED data set. Only two BLEU scoring anomalies were observed when
comparing systems strictly trained on in-domain (TED) data and mined data for
EN-to-PL translation. It also seems reasonable that the best SMT scores were
obtained on TED data. This data set is the most similar to Wikipedia articles,
overlapping with it on many topics. In addition, the Yalign classifier trained on
the TED data set recognized most of the parallel sentences. The results show
that the METEOR metric, in some cases, increases when the other metrics
decrease. The most likely explanation for this is that other metrics suffer, in
comparison to METEOR, from the lack of a scoring mechanism for synonyms.
Wikipedia is a very wide domain, not only in terms of its topics, but also
its vocabulary. This leads to the conclusion that mined corpora are a good
source for extending sparse text domains. It is also the reason why test sets
originating from wide domains outscore those of narrow domains and, also,
why training on a larger mined data set sometimes slightly decreases the
results on very specific domains. Nonetheless, in many cases, after manual
analysis was conducted, the translations were good but the automatic metrics
were lower due to the usage of synonyms. The results also confirm that bi-
directional mining has a positive influence on the output corpora.
Bi-sentence extraction has become more and more popular in unsupervised
learning for numerous specific tasks. This method overcomes disparities
between English and Polish or any other West-Slavic language. It is a language-
independent method that can easily be adjusted to a new environment, and it
only requires parallel corpora for initial training. The experiments show that
the method performs well. The resulting corpora increased MT quality in wide
text domains. Decreased or very small score differences in narrow domains
are understandable, because a wide text domain such as Wikipedia most likely
adds unnecessary n-grams that do not exist in test sets from a very specific
domain. Nonetheless, it can be assumed that even small differences can make
a positive influence on real-life, rare translation scenarios.
In addition, it was proven that mining data using two classifiers trained
from a foreign to a native language and vice versa can significantly improve
data quantity, even though some repetition is possible. Such bi-directional
mining, as might be expected, found additional data mostly in wide domains.
In narrow text domains, the potential gain is small. From a practical point of
view, the method requires neither expensive training nor language-specific
grammatical resources, but it produces satisfying results. It is possible
to replicate such mining for any language pair or text domain, or for any
reasonably comparable input data.
Tables 61 and 62 show the results of the data quality experiments using
the following methods: The native Yalign implementation (Yalign), the multi-
threaded implementation (M Yalign), Yalign with the Needleman-Wunsch
algorithm (NW Yalign), and Yalign with a GPU-accelerated Needleman-
Wunsch algorithm (GNW Yalign). In the tables, BASE represents the baseline
system; MONO, the system enhanced with a mono-directional classifier; BI, a
system with bi-directional mining; NW, a system mined bi-directionally using
the Needleman-Wunsch algorithm; and TNW, a system with additionally
tuned parameters.
The results indicate that multi-threading significantly improved speed, which is very important for large-scale mining. As anticipated, the Needleman-Wunsch algorithm decreases speed (which is most likely why the authors of Yalign did not use it in the first place). However, GPU acceleration makes it possible to
obtain performance almost as fast as that of the multi-threaded A* version. It
must be noted that the mining time may significantly differ when the alignment
matrix is large (i.e., the text is long). The experiments were conducted on a hyper-threaded Intel Core i7 CPU and a GeForce GTX 660 GPU.

Table 61. Results of SMT-enhanced comparable corpora for PL-to-EN translation.

The quality of the
data obtained with the NW algorithm version, as well as the TNW version,
seems promising. Slight improvements in translation quality were observed,
but more importantly, much more parallel data was obtained.
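To make the alignment step concrete, the following is a minimal Python sketch of the Needleman-Wunsch recurrence over two sentence sequences; the similarity function is a hypothetical stand-in for the classifier score used in the actual mining pipeline, so this illustrates the technique rather than the evaluated implementation.

```python
# A minimal sketch of Needleman-Wunsch global alignment over two
# sequences of sentences. similarity() is a hypothetical stand-in for
# the classifier score used in the real pipeline.

def needleman_wunsch(src, tgt, similarity, gap_penalty=-1.0):
    n, m = len(src), len(tgt)
    # score[i][j] holds the best alignment score of src[:i] and tgt[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        score[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + similarity(src[i - 1], tgt[j - 1]),  # pair
                score[i - 1][j] + gap_penalty,  # skip a source sentence
                score[i][j - 1] + gap_penalty,  # skip a target sentence
            )
    return score[n][m]
```

Unlike A* search, this dynamic program examines the full alignment matrix, which is why GPU acceleration of the inner loops matters for long texts.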
Statistical significance tests were conducted in order to evaluate how the improvements differ from each other. For each text domain, the significance of the differences between the obtained results was tested.
Changes with low significance were marked with “*”, significant
changes were marked with “**”, and very significant with “***” in Tables 64
and 65. The significance test results in Table 63 showed visible differences
in most row-to-row comparisons. In Table 64, all results were determined to
be significant. Tables 65 and 66 show the results for an SMT system trained
with only data extracted from comparable corpora (not the original, in-domain
data). In the tables, BASE indicates the results for the baseline system trained
on the original in-domain data; MONO, a system trained only on mined data
in one direction; and BI, a system trained on data mined in two directions with duplicate segments removed.
Table 63. Significance test results (p-values).

DOMAIN  SYSTEM  P-VALUE
TED     MONO    0.0622*
TED     BI      0.0496**
TED     NW      0.0454**
TED     TNW     0.3450
EUP     MONO    0.3744
EUP     BI      0.0302**
EUP     NW      0.0032***
EUP     TNW     0.0030***
EMEA    MONO    0.4193
EMEA    BI      0.0496**
EMEA    NW      0.0429**
EMEA    TNW     0.0186**
OPEN    MONO    0.0346**
OPEN    BI      0.0722*
OPEN    NW      0.0350**
OPEN    TNW     0.0259**
Table 64. Significance test results (p-values).

DOMAIN  SYSTEM  P-VALUE
TED     MONO    0.0305**
TED     BI      0.0296**
TED     NW      0.0099***
TED     TNW     0.0084***
EUP     MONO    0.0404**
EUP     BI      0.0560*
EUP     NW      0.0428**
EUP     TNW     0.0623*
EMEA    MONO    0.0081***
EMEA    BI      0.0211**
EMEA    NW      0.0195**
EMEA    TNW     0.0075***
OPEN    MONO    0.0346**
OPEN    BI      0.0722*
OPEN    NW      0.0350**
OPEN    TNW     0.0259**
Table 65. SMT results using only comparable corpora for PL-to-EN translation.
Table 66. SMT results using only comparable corpora for EN-to-PL translation.
DOMAIN  SYSTEM  P-VALUE
TED     MONO    0.0167**
TED     BI      0.0209**
TED     NW      0.0177**
TED     TNW     0.0027***
EUP     MONO    0.000***
EUP     BI      0.000***
EUP     NW      0.000***
EUP     TNW     0.000***
EMEA    MONO    0.000***
EMEA    BI      0.000***
EMEA    NW      0.000***
EMEA    TNW     0.000***
OPEN    MONO    0.000***
OPEN    BI      0.000***
OPEN    NW      0.000***
OPEN    TNW     0.000***
DOMAIN  SYSTEM  P-VALUE
TED     MONO    0.0215**
TED     BI      0.0263**
TED     NW      0.0265**
TED     TNW     0.0072***
EUP     MONO    0.000***
EUP     BI      0.000***
EUP     NW      0.000***
EUP     TNW     0.000***
EMEA    MONO    0.0186**
EMEA    BI      0.0134**
EMEA    NW      0.0224**
EMEA    TNW     0.0224**
OPEN    MONO    0.000***
OPEN    BI      0.000***
OPEN    NW      0.000***
OPEN    TNW     0.000***
classifier was only trained on text samples. Regardless, the text-domain tuning algorithm consistently improved translation quality.
With the help of the techniques explained earlier, we were able to create comparable corpora for many PL-* language pairs and, later, to probe them for parallel phrases. We paired Polish (PL) with Arabic (AR), Czech
(CS), German (DE), Greek (EL), English (EN), Spanish (ES), Persian (FA),
French (FR), Hebrew (HE), Hungarian (HU), Italian (IT), Dutch (NL),
Portuguese (PT), Romanian (RO), Russian (RU), Slovenian (SL), Turkish
(TR) and Vietnamese (VI). Statistics of the resulting corpora are presented
in Table 69.
In order to assess the quality and usefulness of the corpora, we trained baseline SMT systems utilizing the WIT3 data (BASE). We also augmented them with the mined corpora, both as parallel data and as language models (EXT). The additional corpora were domain-adapted through linear interpolation and Modified Moore-Lewis filtering [250]. The systems were not tuned during these experiments because of the volatility of MERT [21]; however, using MERT would generally have a positive impact on the MT system [21]. The results are shown in Table 70.
38 https://wit3.fbk.eu/mt.php?release=2013-01
The assessment was based on the official test sets from the IWSLT 2013 conference. The Bilingual Evaluation Understudy (BLEU) metric was used to score the progress. As expected, the supplementary data sets enhanced the overall quality of translation for each and every language.
In order to verify the importance of our results, we conducted significance tests for four diverse language pairs. The decision was made to use the Wilcoxon test. The Wilcoxon test (also known as the signed-rank test or the matched-pairs test) is one of the most popular alternatives to the Student's t-test for dependent samples. It belongs to the group of non-parametric tests. It is used to compare two (and only two) dependent groups, that is,
two measurement variables. The significance tests were conducted in order
to evaluate how the improvements differ from each other. Changes with low
significance were marked with *, significant changes were marked with **
and very significant changes with *** in Tables 71 and 72.
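For illustration, such a paired comparison can be run with SciPy's wilcoxon function; the per-segment scores below are hypothetical, and the starring thresholds simply mirror the convention used in the tables.

```python
# A minimal sketch of the Wilcoxon signed-rank test on paired,
# per-segment scores of two systems; the values are hypothetical.
from scipy.stats import wilcoxon

baseline_scores = [17.2, 18.1, 15.9, 20.3, 16.8, 19.4]
enhanced_scores = [18.0, 18.4, 16.5, 21.1, 17.0, 20.2]

stat, p_value = wilcoxon(baseline_scores, enhanced_scores)

# Mark significance as in the tables: * low, ** significant, *** very significant
if p_value < 0.01:
    marker = "***"
elif p_value < 0.05:
    marker = "**"
elif p_value < 0.1:
    marker = "*"
else:
    marker = ""
print(f"p = {p_value:.4f}{marker}")
```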
39 iwslt.org
Table 70. Results of MT experiments.
LANGUAGE PAIR   SYSTEM   BLEU →PL   BLEU PL→
PL-AR           BASE     19.67      20.98
PL-AR           EXT      21.78      23.12
PL-CS           BASE     12.21      13.44
PL-CS           EXT      12.98      14.21
PL-DE           BASE     23.68      26.61
PL-DE           EXT      24.91      26.87
PL-EL           BASE     14.27      17.22
PL-EL           EXT      14.67      17.28
PL-EN           BASE     15.91      17.09
PL-EN           EXT      17.01      18.43
PL-ES           BASE     16.35      18.34
PL-ES           EXT      17.92      18.65
PL-FA           BASE     14.21      16.87
PL-FA           EXT      14.32      17.03
PL-FR           BASE     19.07      21.13
PL-FR           EXT      20.01      21.56
PL-HE           BASE     17.03      18.18
PL-HE           EXT      17.65      18.54
PL-HU           BASE     14.62      17.18
PL-HU           EXT      15.23      17.81
PL-IT           BASE     18.83      21.19
PL-IT           EXT      19.87      21.34
PL-NL           BASE     18.29      20.79
PL-NL           EXT      20.13      21.45
PL-PT           BASE     27.07      30.11
PL-PT           EXT      29.14      31.33
PL-RO           BASE     22.16      25.01
PL-RO           EXT      22.26      25.67
PL-RU           BASE     12.36      13.58
PL-RU           EXT      13.51      14.32
PL-SL           BASE     12.11      14.26
PL-SL           EXT      12.57      14.61
PL-TR           BASE     11.59      13.07
PL-TR           EXT      12.68      13.44
PL-VI           BASE     12.66      14.11
PL-VI           EXT      14.12      15.17
(→PL denotes translation into Polish; PL→ denotes translation from Polish.)
40 http://opus.lingfil.uu.se/Wikipedia.php
41 https://github.com/krzwolk/yalign
In the "Vocab Count" column, the number of distinct words and their forms is presented; in "Sentences," the number of recognized sentences in each language; and in "Human Aligned," the number of sentence pairs aligned by a human.
The same articles were processed with the described pipeline.
Table 75. Results on TED corpus trained with additional analogy-based corpus.
The most likely reason for such results is that the analogy method is
designed to extend existing parallel corpora from available non-parallel data.
However, in order to establish a meaningful baseline, it was decided to test a
noisy-parallel corpus independently mined using this method. Therefore, the
results are less favorable than those obtained using a Yalign-based method.
It seems that the problem is that the analogy-based method, in contrast with
Yalign-based methods, does not mine domain-specific data.
Additionally, after manual data analysis, it was noticed that the analogy-
based method suffers from duplicates and a relatively large amount of noisy
data, or low-order phrases. As a solution to this problem, it was decided to apply two different methods of filtering. The first is simple, based on the length of the sentences in the corpus. In addition, all duplicates and very short
(less than 10 characters) sentences were removed. As a result, it was possible
to retain 58,590 sentences in the final corpus. The results are denoted in
Table 76 as FL1. Second, the filtration method described in Section 4.7.2 was
used (FL2). The number of unique EN tokens before filtration was 137,262,
and there were 139,408 unique PL tokens. After filtration, there were 28,054
and 22,084 unique EN and PL tokens, respectively.
42 https://www.ted.com/
In this part, methodologies that obtain parallel corpora from data sources that are not sentence-aligned and are very non-parallel, such as quasi-comparable corpora, are presented, along with the results of initial experiments on text samples obtained from Internet-crawled data. The quality of the approach used was measured by the improvement in MT system translations.
For the data mining experiments, the TED corpora prepared for the IWSLT 2014 evaluation campaign by the FBK43 were chosen. This domain is very wide and covers many unrelated subject areas. The data contain almost 2.5 M untokenized words [15]. The experiments were conducted on PL-EN (Polish-English) corpora.
The solution can be divided into three main steps. First, the quasi-comparable data is collected; then it is aligned based on keywords; and finally, the aligned results are mined for parallel sentences. The last two steps are not trivial, because there are great disparities between Polish-English documents. Text samples in the English corpus are mostly misaligned, with translation lines whose placement does not correspond to any text lines in the source language. Moreover, most sentences have no corresponding translations in the corpus at all. The corpus might also contain poor or indirect translations, making alignment difficult. Thus, alignment is crucial for accuracy. Sentence alignment must also be computationally feasible to be of practical use in various applications.
Before a mining tool processes the data, the data must be prepared. First, all the data is downloaded and saved in a database. In order to obtain the data, a web crawler was developed. As input, our tool requires a bilingual dictionary or a phrase table with translation probabilities. Such input can be obtained using parallel corpora and tools like GIZA++ [148]. Based on such bilingual translation equivalents, it is possible to query the Google search engine. Second, we filtered out unlikely translations and limited the number of crawled search results. In this research, we used only a 1-gram dictionary of words that were translations of each other with at least 70% probability. For each keyword, we crawled only 5 pages. Such strict limits were necessary because of the time required for web crawling; however, much more data, with better precision and domain adaptation, would be obtained by crawling higher-order n-grams. In summary, 43,899 pairs were used from the dictionary, which produced almost 3.4 GB of data containing 45,035,931 lines in English texts and 16,492,246 lines in Polish texts. The average length of the EN articles was equal to 3,724,007 tokens, and that of the PL articles to 4,855,009.
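The pre-filtering step can be sketched as follows; the dictionary format, the threshold and the query form are illustrative assumptions, since the actual crawler is not reproduced here.

```python
# A minimal sketch of the crawler's pre-filtering step, assuming the
# bilingual dictionary is a list of (source_word, target_word, probability)
# entries produced by GIZA++-style alignment; names are illustrative.

MIN_TRANSLATION_PROB = 0.7   # keep only likely translation pairs
PAGES_PER_KEYWORD = 5        # crawl limit per keyword pair

def build_queries(dictionary):
    queries = []
    for src_word, tgt_word, prob in dictionary:
        if prob < MIN_TRANSLATION_PROB:
            continue  # filter unlikely translations
        # Each surviving pair becomes one search-engine query; the crawler
        # later fetches PAGES_PER_KEYWORD results for it.
        queries.append(f'"{src_word}" "{tgt_word}"')
    return queries

dictionary = [("kot", "cat", 0.92), ("zamek", "lock", 0.41)]
print(build_queries(dictionary))  # -> ['"kot" "cat"']
```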
Second, our tool aligns article pairs and unifies the encoding of articles that are not in UTF-8. These keyword-aligned articles are filtered to
43 http://www.fbk.eu/
remove any HTML tags, XML tags or noisy data (tables, references, figures,
etc.). Finally, bilingual documents are tagged with a unique ID as a quasi-
comparable corpus. To extract the parallel sentence pairs, a decision was made to try a strategy designed to automate the parallel text mining process by finding sentences that are close translation matches in the quasi-comparable corpus.
This presents opportunities for harvesting parallel corpora from sources, like
translated documents and the web, that are not limited to a particular language
pair. However, alignment models for two selected languages must first
be created. For this, the same methodology as for the comparable corpora
was used.
As already mentioned, some methods for improving the performance of
the native classifier were developed. First, speed improvements were made
by introducing multi-threading to the algorithm, using a database instead
of plain text files or Internet links, and using GPU acceleration in sequence
comparison. More importantly, two improvements were made to the quality
and quantity of the mined data. The A* search algorithm was modified to use
Needleman-Wunsch, and a tuning script of mining parameters was developed.
In this section, the PL-EN TED corpus will be used to demonstrate the impact
of the improvements (it was the only classifier used in the mining phase).
The data mining approaches used were: Directional (PL->EN classifier)
mining (MONO), bi-directional (additional EN->PL classifier) mining (BI),
bi-directional mining using a GPU-accelerated version of the Needleman-
Wunsch algorithm (NW), and mining using the NW version of the classifier
that was tuned (NWT). The results of such mining are shown in Table 77.
The Bilingual Evaluation Understudy (BLEU) metric was used. The results of the experiments are
shown in Table 80. BASE in Table 80 stands for baseline system and EXT for
enriched systems.
As anticipated, additional data sets improved overall translation quality
for each language and in both translation directions. The gain in quality was
observed mostly in the English to foreign language direction.
5.4.1.4 Results
Backward linear regression was used in the analysis for the following reasons:
• The data is linear (correlation analysis shows a linear relation)
• The data is ratio-level data and is, therefore, suitable for linear regression
• The regression analysis provides more than one regression result; thus, the best variables can be found.
Reliable variables are extracted by excluding irrelevant variables stage by stage, from the first to the last, leaving the most reliable variables; a sketch of this procedure follows.
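As an illustration of this stage-by-stage exclusion, the sketch below performs backward elimination with the statsmodels package; the DataFrame layout and column names are assumptions made for the example.

```python
# A minimal sketch of backward elimination, assuming `df` is a pandas
# DataFrame with a "NER" column and one column per candidate metric.
import statsmodels.api as sm

def backward_eliminate(df, target="NER", alpha=0.05):
    predictors = [c for c in df.columns if c != target]
    model = None
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[target], X).fit()
        p_values = model.pvalues.drop("const")  # ignore the intercept
        worst = p_values.idxmax()               # least significant predictor
        if p_values[worst] <= alpha:
            break  # all remaining predictors are significant
        predictors.remove(worst)  # drop it and fit the next model
    return model, predictors
```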
For this analysis, the standardized coefficients indicate the strength of the relationships derived from the unstandardized coefficients. The significance (p-value) was judged against the alpha level (0.05) in order to identify the most significant variables that explain the NER metric. The adjusted R-square (R²) indicates how much of the NER variance the variables explain.
Table 82 presents the regression summary for NER with the HMEANT and REDUCTION metrics. Here, at first, the model has both HMEANT and REDUCTION as predictors.

Table 82. Regression result summary for NER and HMEANT, REDUCTION metrics.
From the two models, it has been found that HMEANT is a significant
predictor of NER (which means it correlates with human judgments
correctly). The regression equation that computes the value of NER based on the HMEANT metric is:
NER = 81.82 + 0.162 * HMEANT
Moving forward, regression residual plots from the above regression for the significant metrics are presented in Figure 35. The histogram and normality plots show the distribution of the residuals, and the scatter plot shows the relation between the dependent metric and the regression residuals. The closer the dots in the plot are to the regression line, the better the R-square value and the better the relationship.
Figure 35 shows the regression residual analysis for HMEANT, and it is clear from the histogram and normality plot that the residuals are normally distributed.
Lastly, we analysed the inter-annotator agreement (IAA), dividing it into annotation and alignment steps. The agreement in annotation was quite high (0.87), but in the alignment step the agreement was 0.63, most likely because of subjective human judgments about meaning. In addition, we measured that the average
Figure 35. Regression residual analysis for HMEANT: histogram of regression standardized residuals, normal P-P plot (observed cumulative probability) and scatter plot against NER (R-square linear = 0.4).
Table 83. Evaluation of human made text transcription and original text (RED—reduction rate).
SPKR  BLEU  NIST  TER  METEOR  METEOR-PL  EBLEU  RIBES  NER  RED
1 56.20 6.90 29.10 79.62 67.07 63.32 85.59 92.38 10.15
2 56.58 6.42 30.96 78.38 67.44 58.82 86.13 94.86 17.77
3 71.56 7.86 18.27 88.28 76.19 79.58 92.48 94.71 12.01
4 76.64 8.27 13.03 90.34 79.29 87.38 93.07 93.1 3.72
5 34.95 5.32 44.50 61.86 47.74 37.06 71.60 91.41 17.03
6 61.73 7.53 20.47 83.10 72.55 69.11 92.43 92.33 4.89
7 61.74 6.93 28.26 78.26 69.78 63.99 78.32 95.3 10.29
8 33.52 4.28 46.02 63.06 47.61 36.75 77.55 93.95 26.81
9 68.97 7.46 22.50 83.15 76.56 71.83 88.78 94.73 4.05
10 70.02 7.80 18.78 86.12 78.71 75.16 88.15 95.23 6.41
11 47.07 5.56 33.84 76.10 62.02 47.86 85.05 93.61 22.77
12 53.49 6.65 30.63 77.93 65.27 55.90 86.20 94.09 14.33
13 75.71 7.95 16.07 89.72 83.13 77.74 91.62 94.78 9.31
14 66.46 7.60 18.44 84.34 76.20 66.53 90.76 94.82 6.09
15 25.77 1.85 54.65 58.62 40.82 31.08 68.90 85.26 32.83
16 88.82 8.66 5.75 96.15 90.12 95.20 95.76 95.96 2.71
17 63.26 7.25 25.72 81.95 73.16 65.83 88.99 94.77 10.15
18 60.69 7.18 26.23 79.41 70.86 66.77 87.56 95.15 5.75
19 59.13 7.2 25.04 80.00 70.32 62.77 89.17 95.78 4.74
20 86.24 8.43 7.11 94.60 90.07 92.88 95.39 95.58 1.52
21 20.61 2.08 65.14 47.56 33.31 27.42 55.63 91.62 36.89
22 64.40 7.43 22.84 82.69 72.82 66.93 89.44 93.46 10.15
23 27.30 2.69 52.62 58.96 43.19 35.78 68.06 90.07 31.81
24 82.43 8.33 10.15 92.40 86.04 85.61 94.63 97.18 12.02
25 82.22 8.44 9.48 93.75 87.00 91.59 95.12 97.01 3.89
26 76.17 8.25 13.54 91.33 81.54 82.35 91.95 95.83 8.63
27 35.01 4.39 49.41 62.64 47.81 46.64 72.79 87.56 25.21
28 29.50 3.53 53.13 60.73 42.98 40.87 74.25 85.76 28.09
29 70.04 7.78 17.26 87.22 80.63 73.58 91.56 93.98 8.63
30 56.75 6.89 26.06 79.60 70.52 58.43 90.78 95.79 11.68
31 63.18 6.90 26.57 83.47 71.80 67.80 84.51 94.29 17.77
32 31.74 5.14 43.15 63.58 49.37 33.07 83.05 94.77 19.12
...Table 83 cont.
SPKR  BLEU  NIST  TER  METEOR  METEOR-PL  EBLEU  RIBES  NER  RED
33 89.09 8.54 5.41 95.69 91.62 92.62 95.94 97.41 0.34
34 81.04 8.42 8.29 93.60 88.51 89.68 95.18 93.76 5.75
35 73.72 8.11 15.40 88.62 78.32 83.17 92.48 93.89 4.40
36 69.73 7.90 15.06 87.90 81.65 78.10 93.91 93.97 7.45
37 57.00 7.24 26.40 81.28 68.54 65.83 86.63 92.74 8.46
38 35.26 3.68 46.70 66.04 47.84 40.73 74.61 89.15 27.07
39 46.76 4.92 39.59 72.95 57.66 54.28 72.77 88.07 25.21
40 16.79 1.74 65.48 44.61 29.04 19.62 58.77 86.68 32.66
41 19.68 0.63 61.42 56.13 39.15 41.59 49.67 84.36 52.12
42 39.19 5.04 41.62 68.38 50.85 42.41 79.79 91.23 23.35
43 67.13 7.61 19.12 86.53 75.63 68.60 93.51 95.01 11.84
44 49.85 6.26 33.84 75.31 64.34 59.44 79.28 92.23 11.84
45 37.38 4.2 43.49 68.66 54.30 44.51 75.36 93.11 27.58
46 79.11 8.25 12.01 91.48 85.72 81.66 94.84 95.36 6.09
47 40.73 4.93 41.96 68.61 54.34 42.96 83.09 89.97 21.15
48 29.03 2.65 51.27 61.24 47.21 39.53 67.21 86.07 31.98
49 68.75 7.78 18.78 86.24 77.18 72.40 90.65 94.95 5.41
50 75.24 7.97 16.07 88.88 81.51 81.27 90.88 95.75 7.45
51 78.71 8.24 11.51 91.33 83.99 86.08 94.77 98.69 4.23
52 37.60 4.31 44.84 66.32 51.59 42.22 78.73 89.37 -6.6
53 73.20 8.07 14.38 88.78 79.55 77.39 93.93 93.73 2.2
54 67.43 7.67 20.30 85.64 75.90 70.28 87.57 94.91 12.07
55 70.06 7.90 18.44 87.29 76.67 78.93 90.79 93.49 7.11
56 71.88 7.83 17.77 88.24 78.07 77.51 89.21 97.74 8.46
57 80.05 8.3 11.00 91.81 85.72 83.94 95.03 96.12 2.88
In the 4th model, EBLEU, BLEU and NIST are significant, and the level of significance for EBLEU and BLEU has increased, as have their beta values; nonetheless, the level of significance for NIST remains the same. However, at this stage of the model, METEOR becomes the most insignificant metric; therefore, it was removed for the final (5th) model. In comparison with the 2nd and 3rd models, in the 4th model BLEU becomes more significant, as its p-value (p = 0.005) is less than the alpha (0.05). Finally, in the last stage (5th model), the remaining metrics are all significant, as all of them have p-values less than alpha; therefore, no further models were fitted.
SPKR  BLEU  NIST  TER  METEOR  METEOR-PL  EBLEU  RIBES  NER  RED
1 41.89 6.05 44.33 66.10 54.05 44.77 78.94 92.38 10.15
2 48.94 5.94 37.39 71.14 60.24 49.79 81.29 94.86 17.77
3 57.38 7.11 27.24 78.41 67.08 62.87 89.42 94.71 12.01
4 59.15 7.07 27.24 77.21 67.94 65.31 87.71 93.1 3.72
5 26.08 4.57 55.33 52.33 39.14 26.89 69.22 91.41 17.03
6 44.17 6.32 36.38 69.16 60.15 47.97 86.04 92.33 4.89
7 51.79 6.39 34.86 71.42 65.47 52.31 79.19 95.3 10.29
8 22.03 3.17 61.93 45.27 33.90 22.14 59.93 93.95 26.81
9 52.35 6.09 39.93 68.02 63.07 53.87 78.53 94.73 4.05
10 54.44 6.65 33.50 73.16 65.42 57.28 82.11 95.23 6.41
11 65.95 7.57 19.63 84.68 76.30 72.76 92.45 97.01 3.89
12 59.12 7.26 24.53 81.63 69.57 61.59 89.13 95.83 8.63
13 17.08 2.96 68.19 42.14 30.55 21.97 59.69 85.76 28.09
14 49.78 6.53 32.32 72.88 64.10 51.98 86.56 93.98 8.63
15 46.01 6.3 34.69 71.10 61.70 46.11 87.96 95.79 11.68
16 35.50 5.03 44.33 65.64 50.58 36.53 79.16 93.61 22.77
17 34.42 4.51 56.01 52.80 41.15 34.17 63.30 94.09 14.33
18 58.58 6.95 28.93 77.96 69.47 59.22 85.73 94.78 9.31
19 49.06 6.50 31.64 72.94 63.35 47.49 85.64 94.82 6.09
20 19.86 2.58 65.48 46.48 31.29 21.38 60.96 85.26 32.83
Here, the first model explains 75.9% of the variance of NER, the second model explains 76.3%, the 3rd model 76.6%, the 4th model 76.0% and the final model 76.1%. All of these are statistically acceptable R-square values.
From the regression, it was found that BLEU is the most significant predictor of NER; after BLEU, NIST is significant; and, finally, EBLEU is also a significant metric that can predict NER. Thus, these metrics can serve as alternatives to the NER metric. The regression equation that computes the value of NER based on these three statistical metrics is:
NER = 86.55 + 0.254 * BLEU + 0.924 * NIST – 0.221 * EBLEU
Table 85. Regression result summary for NER and the six metrics.
(Regression plots: BLEU against NER, R-square linear = 0.132; NIST against NER, R-square linear = 0.09; EBLEU against NER, R-square linear = 0.176.)
Moving forward, the regression plots for the significant metrics, shown above, illustrate the relation between the dependent metric and each significant metric. The closer the dots in a plot are to the regression line, the better the R-square value and the better the relationship.
It is worth noting that the METEOR metric should be fine-tuned in order
to work properly. This would, however, require more data; more research into
this is planned for the future. Finally, the results presented in this monograph
are derived from a very small data sample and may not be very representative
of the task in general. The process of acquiring more data is still ongoing, so
these experiments are going to be repeated once more data becomes available.
44 LSTM is a sequence learning technique that uses memory cells to preserve states over a long period of time, which allows distributed representations of sentences through distributed representations of words.
45 www.nihpromis.org
(Figure: the translation workflow — source text; employment of external translators; (1) two translations into the native language by Translators 1 and 2; (2) reconciliation of a single translation version by a medical translator (our human resources, Translator 3); (3) reverse translation by an external translator (Translator 4); (4) review of the reverse translation by the manager; (5) experts' review by Experts 1–3; (7) final phase by the language coordinator; (9) text formatting by Correctors 1 and 2; prepared document.)
The process can be divided into 11 steps. First, items in English are treated as
source texts and translated simultaneously by two independent translators.
It must be noted that the translators are native target language speakers. In
the second step, a third independent translator (target language speaker as
well) tries to generate a single translation based on the translations in the first
step. This involves unifying translations into a hybrid version; in rare cases,
the translator can recommend his/her own version. This translator also needs
to outline reasons for the judgment and explain why the chosen translation
conveys the correct meaning in the best possible way. Third, the reconciled
version is backwardly translated into English by a native English translator.
This person does not have knowledge of the original source items in English.
In the fourth step the backward translation is reviewed. In this process, any
possible discrepancy is identified and the intent of the translation is clarified.
This is the initial step of the harmonization of languages. In the fifth step,
three independent experts (native speakers of the target language) examine all
previously taken actions in order to select the most appropriate translation for
each item or provide their own proposals. The experts must be either linguists
or healthcare professionals (preferably a mixed group). In the sixth step, the
translation project manager evaluates the merit of expert comments and, if
necessary, identifies problems. Using this information, the manager formulates
guidance for the target language coordinator. This coordinator in the seventh
step determines the final translation by examining all information gathered
in the previous steps. He/She is required to explain and justify the decisions,
especially if the final version is different from the reconciled one. In order to
ensure cross-language harmonization and quality assurance, the translation
project manager in the eighth step performs a preliminary assessment of the
accuracy and equivalence of the final translation with respect to the original.
A quality review is also conducted by the PROMIS Statistical Center in order
to verify consistency with other languages, if necessary. In the ninth step, the
items are formatted and proofread independently by two native speakers and
the results are reconciled. The target language version is pre-tested within
the group of target language native speakers in the tenth step. Each is
debriefed and interviewed in order to check if his/her intended meaning was
correctly understood. In the final step, comments following pre-testing are
analyzed and any possible issues summarized. If necessary, new solutions are
proposed [263].
The data was provided as pure text encoded in UTF-8, and the transcripts were prepared by the FBK team. They also separated the transcripts into sentences (one per line) and aligned the language pairs. It should be emphasized that both the
automatic and manual pre-processing of these training data were required. The
extraction of transcription data from the XML files ensured an equal number of
lines for English and Polish. However, some discrepancies in text parallelism
could not be avoided. These discrepancies were mainly repetitions of
the Polish text not included in the English text. Another problem was that
TED data contained many errors. We first considered spelling errors that
artificially increased dictionary size and worsened the statistics of the
translation phrase table. This problem was solved using the tool proposed
in [18] and manual proofreading. The final TED corpus consisted of 92,135
unique Polish words.
The quality of machine translation is highly dependent on that of the input
data and their similarity with the domain of the given topic. We conducted
domain adaptation on the TED data using Modified Moore-Lewis filtering
and linear interpolation in order to adapt it for the assessment of translation
in PROMIS [264].
The problem of data sparsity, if not addressed, can lead to low translation
accuracy, false errors in evaluation and low classifier scores. The quality of
domain adaptation depends heavily on the training data used to optimize
the language and translation models in an SMT system. The selection and
extraction of domain-specific training data from a large, general corpus
addresses this issue [113]. This process uses a parallel, general domain corpus
and a general domain monolingual corpus in the target language. The result
is a pseudo-in-domain sub-corpus. As described by Wang et al. [265], there
are in general three processing stages in data selection for domain adaptation.
First, sentence pairs from the parallel, general domain corpus are scored for
relevance to the target domain. Second, resampling is performed to select
the best-scoring sentence pairs to retain in the pseudo-in-domain sub-corpus.
These two steps can also be applied to the general-domain monolingual
corpus in order to select sentences for use in a language model. Third, after
collecting a substantial number of sentence pairs (for the translation model) or
sentences (for the language model), the models are trained on the sub-corpus
that represents the target domain [265]. Similarity measurement is required
in order to select sentences for the pseudo-in-domain sub-corpus. Three
state-of-the-art approaches were used for similarity measurement. The cosine
tf-idf criterion looks for word overlap in order to determine similarity. This
technique is specifically helpful in reducing the number of out-of-vocabulary
(OOV) words, but is sensitive to noise in the data. A perplexity-based criterion
considers the n-gram word order in addition to collocation. Lastly, edit distance was used as the third criterion.
46 https://github.com/machinalis/yalign/issues/3
5.4.2.6 Results
In order to verify the predictive power of each metric, we evaluated each one (using the official English version) on two candidate translations and a third, reconciled version. We also analyzed the coverage of the results of the metrics
given the final translation following the entire PROMIS evaluation process.
The assessment was conducted on 150 official PROMIS questions concerning
physical capabilities, sleep disturbance and sleep-related impairments.
The weighted index method [271] was used to rank the different pools of
the translation process in comparison with human judgment. Pools were
compared with human judgment in terms of how the translations were judged
by humans and assigned a weight for each case. The weighted index value
based on the relative weights of the chosen scale was computed using the
following formula:
PIx = ∑ (Wi fi)/N (Formula 1)
where,
PIx = pool index value
fi = frequency of cases of match or mismatch
Wi = weight of the rating from lowest to highest
N = summation of the frequency of cases (N = 150)
An appropriate weight was assigned to the different attributes given by
the translation; each pool had three columns, and was matched with the three
columns representing human judgment. The weight was assigned based on the
level of matching, and is summarized in Table 86. Note that double matches
occurred when the reconciled version was identical to the proposed one.
Table 86. Weights assigned based on the level of match with human judgment.

CATEGORY                               WEIGHT
Double match with human judgment       4
One match with human judgment          3
One mismatch with human judgment       2
Double mismatch with human judgment    1
Based on the level of match, the frequency of each pool's matches with human judgment was found for double matches, single matches and mismatches. The frequency of each translation pool is presented in
Table 87.
Table 87. Frequency of matches with human judgment for different translations.
Using the frequency and the formula, the weighted index for each pool
was estimated; for example, for BLEU, the weighted index was:
PIBLEU = [(3 × 1) + (27 × 2) + (112 × 3) + (8 × 4)]/150 = 2.42
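A minimal sketch of Formula 1 as code is given below; the frequencies in the example call are illustrative, not values taken from the experiments.

```python
# A minimal sketch of Formula 1: PI_x = sum(W_i * f_i) / N.

def pool_index(frequencies, weights=(1, 2, 3, 4), n=150):
    # frequencies[i] is the number of cases rated with weights[i]
    return sum(w * f for w, f in zip(weights, frequencies)) / n

# Illustrative frequencies for the four weight levels (they sum to N = 150)
print(round(pool_index([5, 20, 100, 25]), 2))  # -> 2.97
```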
Using the same process, the index values for each pool were calculated
and are shown in Figure 40.
We know that HMEANT best matches human judgment and, therefore,
was the best evaluation pool. However, we explored the translation pools
that yielded the best prediction of HMEANT. In order to find this relation,
backward stepwise linear regression [272] was applied to matched values of
each pool. The results are presented in Table 88.
As shown in Table 88, in the first model, NIST was the most insignificant
metric in predicting HMEANT, and was, therefore, removed from the second
model. In the second model, SVM was the most insignificant and was removed
in the next model. In the third model, neural was the most insignificant metric;
finally, in the fourth model, METEOR was found to be insignificant for the
prediction of HMEANT; therefore, in the fifth model, only BLEU and TER
were retained, and had p-values of less than 0.05, indicating that they were significant for the prediction of HMEANT.

Figure 40. Weighted index values for each pool: HMEANT 2.83, Neural 2.66, SVM 2.50, BLEU 2.42, METEOR 2.37, NIST 2.366, TER 2.36.
According to the index calculation, the closer the value to 4 (the highest
weight), the better the match and, therefore, the better the translation quality.
As shown in Figure 40, the highest index value was observed for HMEANT,
which confirmed that it is the best of all translation pools and closest to human
judgment. After HMEANT, neural had the second-highest score, whereas
SVM had the third-highest. This implied that after HMEANT, neural best
matched human judgment (the second pool). In addition, SVM was the third-
best match with human judgment. In this case, BLEU and METEOR had
moderate levels of matches, and NIST and TER had the lowest match with
human judgment.
From the stepwise linear regression model, we found that both BLEU
and TER were significant predictors of HMEANT. Between BLEU and TER,
the former had a stronger positive relation with HMEANT with a beta value
(B BLEU = 0.209). TER also showed a strong relation with the beta value
(B TER = –0.385). This shows that with increasing TER values, the HMEANT
values decreased. In this case, the values were computed based on the
matching scores in relation to human judgment. The regression equation
to compute the value of HMEANT based on TER and BLEU statistical
metrics is:
HMEANT = 3.235 + 0.209 * BLEU – 0.385 * TER (Equation 1)
In summary, we showed that automatic and semi-automatic evaluation metrics can properly predict human judgments. HMEANT is a very good predictor of human judgments. The slight differences between the metric and human judgments arose because of human translation habits and knowledge-dependent decisions, since some of the choices made by translators were controversial. Even though the SVM and the neural network yielded much
lower accuracy, it should be noted that both methods are dependent on the
reference training corpus, which was not well-suited for such an evaluation
(but was the best we could obtain). Training such metrics on larger amounts
of in-domain data can significantly alter the situation.
Unfortunately, predicting HMEANT is only semi-automatic, and is not
time or cost effective. While it provides reliable and human habit-independent
results, it still requires a lot of annotation work. This is why the prediction
of this metric, along with the BLEU- and TER-dependent equation, was an
interesting finding. These automatic metrics can provide robust and accurate
results or, at least, initial results that can assist human reviewers.
Table 89. Top languages by population: Asterisks mark the 2010 estimates for the top dozen
languages.
47 http://www.statmt.org/wmt16/translation-task.html
5.4.3.2 Generating virtual parallel data
To generate new data, we trained three SMT systems based on TED, QED and
News Commentary corpora. The Experiment Management System [8] from the
open source Moses SMT toolkit was utilized to carry out the experimentation.
A 6-gram language model was trained using the SRI Language Modeling
toolkit (SRILM) [9]. Word and phrase alignment was performed using the
SyMGIZA++ symmetric word alignment tool [266] instead of GIZA++.
Out-of-vocabulary (OOV) words were monitored using the Unsupervised
Transliteration Model [79]. Working with the Czech (CS) and English (EN)
language pair, the first SMT system was trained on TED [15], the second on
the Qatar Computing Research Institute’s Educational Domain Corpus (QED)
[193], and the third using the News Commentary corpora provided for the
WMT16 translation task. Official WMT16 test sets were used for system
evaluation. Translation engine performance was measured by the BLEU
metric [27]. The performance of the engines is shown in Table 90.
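As a side note, a corpus-level BLEU score of this kind can be reproduced with the sacrebleu package; this is only an illustrative sketch, since the experiments themselves were scored within the Moses/EMS tooling.

```python
# A minimal sketch of corpus-level BLEU scoring with sacrebleu;
# hypotheses and references are line-aligned lists of sentences.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = ["the cat sat on a mat"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```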
All engines worked in accordance with Figure 41, and the Levenshtein distance was used to measure the compatibility between translation results. The Levenshtein distance measures the diversity between two strings. Moreover, it also indicates the edit distance and is closely linked to the paired arrangement of strings [289]. Mathematically, the Levenshtein distance between two strings $a, b$ (of length $|a|$ and $|b|$, respectively) is given by $\mathrm{lev}_{a,b}(|a|, |b|)$, where:

$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$
48 http://www.statmt.org/wmt16/

In this equation, $1_{(a_i \neq b_j)}$ is the indicator function, equal to 0 when $a_i = b_j$ and equal to 1 otherwise, and $\mathrm{lev}_{a,b}(i, j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.
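The recurrence can be computed bottom-up with a dynamic-programming table, as in the following minimal sketch.

```python
# A minimal sketch of the lev_{a,b} recurrence defined above,
# computed bottom-up with a dynamic-programming table.

def levenshtein(a, b):
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # deleting i characters of a
    for j in range(cols):
        dist[0][j] = j  # inserting j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1  # 1[a_i != b_j]
            dist[i][j] = min(
                dist[i - 1][j] + 1,       # deletion
                dist[i][j - 1] + 1,       # insertion
                dist[i - 1][j - 1] + substitution,
            )
    return dist[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # -> 3
```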
Using the combined methodology and monolingual data, parallel corpora were built.
Statistical information on the data is provided in Table 91.
Figure 41. English language knowledge (percentage of the population, by European country).
The purpose of this research was to create synthetic parallel data to train
a machine translation system by translating monolingual texts with multiple
machine translation systems and various filtering steps. This objective is
not new; synthetic data have been created in the past. However, the novel
aspect of the present monograph is its use of three MT systems, application
incorporates the best of the first two, is the strictest. Moreover, Wang et al.
observed that a combination of these approaches provides the best possible
solution for domain adaptation for Chinese-English corpora [265]. Thus,
inspired by Wang et al.’s approach, we utilized a combination of these models.
Similarly, the three measurements were combined for domain adaptation.
Wang et al. found that the performance of this process yields approximately
20 percent of the domain analogous data.
5.4.3.5 Evaluation
Numerous human languages are used around the world, and many translation systems have been introduced for the possible language pairs.
However, these translation systems struggle with high quality performance,
largely due to the limited availability of language resources, such as parallel
data.
In this study, we have attempted to supplement these limited resources.
Additional parallel corpora can be utilized to improve the quality and
performance of linguistic resources, as well as individual NLP systems. In
the MT application (Table 92), our data generation approach has increased
translation performance. Although the results appear very promising, there
remains a great deal of room for improvement. Performance improvements
can be attained by applying more sophisticated algorithms in order to quantify
the comparison among different MT engines. In Table 94, we present the
baseline (BASE) outcomes for the MT systems we obtained for three diverse
domains (news, IT, and biomedical—using official WMT16 test sets). Second,
we generated a virtual corpus and adapted it to the domain (FINAL). The
generated corpora demonstrate improvements in SMT quality and utility as
NLP resources. From Table 91, it can be concluded that a generated virtual
corpus is morphologically rich, which makes it acceptable as a linguistic
resource. In addition, by retraining with a virtual corpus SMT system and
repeating all the steps, it is possible to obtain more virtual data of higher
quality. Statistically significant results in accordance with the Wilcoxon test
are marked with * and those that are very significant with **.
Next, in Table 96, we replicate the same quality experiment but using
generated data without the back-translation step. As shown in Table 92, more
data can be obtained in such a manner. However, the SMT results are not as
good as those obtained using back-translation. This means that the generated
data must be noisy and most likely contain incomplete sentences that are
removed after back-translation.
Next, in Table 97, we replicate the same quality experiment but using
generated data from Table 93. As shown in Table 97, augmenting virtual
corpora with semantic information makes a positive impact on not only the
data volume but also data quality. Semantic relations improve the MT quality
even more.
Finally, in Table 98, we replicate the same quality experiment but using
generated data from Table 94 (LSA). As shown in Table 98, augmenting
virtual corpora with semantic information by facilitating LSA makes an even
more positive impact on data quality. LSA-based semantic relations improve
the MT quality even more. It is worth mentioning that LSA provided us with
Table 97. Evaluation of semantically generated corpora without the back-translation step.
less data but we believe that it was more accurate and more domain-specific
than the data generated using Wordnet.
Summing up, in this study, we successfully built parallel corpora of satisfactory quality from monolingual resources. This method is very time- and cost-effective and can be applied to any bilingual pair. In addition, it might prove very useful for rare and under-resourced languages. However, there is still room for improvement, for example, by using better alignment models, neural machine translation, or adding more machine translation engines to our pipeline.
Phrase tables often grow enormously large, because they contain a lot of
noisy data and store many unlikely hypotheses. This poses difficulties when
implementing a real-time translation system that has to be loaded into memory.
It might also result in the loss of quality. Quirk and Menezes [123] argue
that extracting only minimal phrases (the smallest phrase alignments, each of
which map to an entire sentence pair) will negatively affect translation quality.
This idea is also the basis of the n-gram translation model [124, 125]. On
the other hand, the authors of [126] postulate that discarding unlikely phrase
pairs based on significance tests drastically reduces the phrase table size and
might even result in performance improvement. Wu and Wang [127] propose
a method for filtering the noise in the phrase translation table based on a
logarithmic likelihood ratio. Kutsumi et al. [128] use a support vector machine
for cleaning phrase tables. Eck et al. [129] suggest pruning of the translation
table based on how often a phrase pair occurred during the decoding step and
how often it was used in the best translation. In this study, we implemented the
Moses-based method introduced by Johnson et al. in [126].
In our experiments, absolute and relative filtration was performed based
on in-domain dictionary filtration rules. Different rules were used for the sake
of comparison. For example, one rule (absolute) was to retain a sentence if
at least a minimum number of words from the dictionary appeared in it. A
second rule (relative) was to retain a sentence if at least a minimum percentage
of words from the dictionary appeared in it. A third rule was to retain only
a set number of the most probable translations. Finally, the experiment was
concluded using a combination of pruning and absolute pre-filtering.
In Table 102, results are provided for the experiment that implemented
the Moses-based pruning method, along with data pre-processing performed
by filtering irrelevant sentences. The absolute and relative filtration results are
indicated using the following terminology.
Absolute N—Retain a sentence if at least N words from the dictionary
appear in it.
Relative N—Retain a sentence if at least N% of the sentence is built from
the dictionary.
Pruning N—Remove only the N least probable translations.
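The absolute and relative rules can be sketched as follows; the dictionary contents and thresholds in the example are illustrative.

```python
# A minimal sketch of the absolute and relative filtration rules,
# assuming `dictionary` is a set of in-domain words; names are illustrative.

def keep_absolute(sentence, dictionary, n):
    # Absolute N: keep if at least N dictionary words occur in the sentence
    words = sentence.lower().split()
    return sum(w in dictionary for w in words) >= n

def keep_relative(sentence, dictionary, percent):
    # Relative N: keep if at least N% of the sentence comes from the dictionary
    words = sentence.lower().split()
    if not words:
        return False
    return 100.0 * sum(w in dictionary for w in words) / len(words) >= percent

dictionary = {"patient", "dose", "tablet"}
print(keep_absolute("The patient received one tablet", dictionary, 2))   # True
print(keep_relative("The patient received one tablet", dictionary, 50))  # False (40%)
```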
Table 102 also contains the results of the significance tests for the pruning experiments.
again. This process was repeated iteratively 10 times. After each iteration,
the SMT system was trained on the resulting WIKI-F corpus. As seen in
Table 103, each iteration slightly improved the results by reducing the data
loss. The first few iterations showed more rapid improvement. The TED
corpus was used as a baseline system.
phrase extraction. The predominant criterion for data selection is cosine tf-idf, which originated in the area of information retrieval (IR). Hildebrand et al. [102] utilized this IR
technique to choose the most similar sentence—albeit with a lower quantity—
for translation model (TM) and LM adaptation. The results strengthen the
significance of the methodology for enhancing translation quality, particularly
for LM adaptation.
In a study much closer to the present research, Lü et al. [105] suggested
reorganizing the methodology for offline, as well as online, TM optimization.
The results are much closer to those of a realistic SMT system. Moreover,
their conclusions revealed that repetitive sentences in the data can affect the
translation quality. By utilizing approximately 60% of the complete data, they
increased the BLEU score by almost one point.
The second technique in the literature is a perplexity approach, which is
common in language modeling. This approach was used by Lin et al. [299]
and Gao et al. [300]. In that research, perplexity was utilized as a standard
in testing parts of the text in accordance with an in-domain LM approach.
Other researchers, such as Moore and Lewis [269], derived the unique
approach of a cross-entropy difference metric from a simpler version of the
Bayes rule. This methodology was further examined by Axelrod et al. [113],
particularly for SMT adaptation, and they additionally introduced an exclusive
unique bilingual methodology and compared its results with contemporary
approaches. Results of their experiments revealed that, if the system was kept
simple yet sufficiently fast, it discarded as much as 99% of the general corpus,
which resulted in an improvement of almost 1.8 BLEU points.
Early works discuss separately applying the methodology to either a TM
[113] or an LM [105]; however, in [105], Lü suggests that a combination
of LM and TM adaptation will actually enhance the overall performance.
Therefore, in the present study, TM and LM optimization was investigated
through a combined data selection method.
where $tf_{ij}$ is the term frequency (TF) of the $j$-th word in the vocabulary in document $D_i$, and $idf_j$ is the inverse document frequency (IDF) of the $j$-th
word. The similarity between the two texts is the cosine of the angle between
the two vectors. This formula is applied in accordance with Lü et al. [105] and
Hildebrand et al. [102]. The approach supposes that M is the size of the query
set and N is the number of similar sentences from the general corpus for each
query. Thus, the size of the cosine tf-idf-based quasi-in-domain sub-corpus is
defined as:
$$\mathrm{Size}_{\mathrm{Cos\text{-}IR}} = M \times N$$
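For illustration, this selection can be sketched with scikit-learn's tf-idf vectorizer; the function below is an assumption-laden sketch, not the exact implementation used in the experiments.

```python
# A minimal sketch of cosine tf-idf selection, assuming `in_domain` and
# `general` are lists of sentences; the top-N matches per query form the
# quasi-in-domain sub-corpus (of size at most M x N).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_cosine_tfidf(in_domain, general, top_n):
    vectorizer = TfidfVectorizer()
    general_vecs = vectorizer.fit_transform(general)
    query_vecs = vectorizer.transform(in_domain)
    sims = cosine_similarity(query_vecs, general_vecs)
    selected = set()
    for row in sims:
        # take the N most similar general-domain sentences per query
        for idx in row.argsort()[::-1][:top_n]:
            selected.add(idx)
    return [general[i] for i in sorted(selected)]
```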
5.4.5.1.2 Perplexity
Perplexity focuses on cross-entropy [301], which is the average of the negative
logarithm of word probabilities. Consider:
$$H(p, q) = -\sum_{i=1}^{n} p(w_i) \log q(w_i) = -\frac{1}{N} \sum_{i=1}^{n} \log q(w_i)$$
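A minimal numerical sketch of this quantity is shown below, with a toy unigram model standing in for the actual LM; perplexity is then 2 raised to the cross-entropy.

```python
# A minimal sketch of per-word cross-entropy under a language model q;
# the unigram dict below is an illustrative stand-in for an SRILM model.
import math

def cross_entropy(words, q):
    # H = -(1/N) * sum(log2 q(w_i)); perplexity is then 2 ** H
    return -sum(math.log2(q[w]) for w in words) / len(words)

q = {"the": 0.5, "cat": 0.25, "sat": 0.25}
h = cross_entropy(["the", "cat", "sat"], q)
print(h, 2 ** h)  # -> 1.666..., ~3.17
```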
For instance, candidates with lower scores [298, 27] have a higher relevance to the specific target domain. The sizes of the perplexity-based quasi-in-domain subsets must be equal. In practice, we work with the SRI Language Modeling (SRILM) toolkit to train 5-gram LMs with interpolated modified Kneser–Ney discounting [9, 32]. This selection criterion considers only sentences in the source language. Moreover, Axelrod et al. [113] proposed a metric that adds cross-entropy differences to both sides:

$$[H_{I\text{-}src}(p, q) - H_{G\text{-}src}(p, q)] + [H_{I\text{-}tgt}(p, q) - H_{G\text{-}tgt}(p, q)]$$

5.4.5.1.3 Levenshtein distance

In information theory and computer science, the Levenshtein distance is regarded as a string metric for the measurement of dissimilarity between two sequences. The Levenshtein distance between points or words is the minimum possible number of unique edits to the data (e.g., insertions or deletions) that are required to replace one word with another. The Levenshtein distance can additionally be applied to a wider range of subjects as a distance metric. Moreover, it has a close association with pairwise string arrangement. Mathematically, the Levenshtein distance between two strings $a, b$ (of length $|a|$ and $|b|$, respectively) is given by $\mathrm{lev}_{a,b}(|a|, |b|)$, where

$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$
49 https://github.com/krzwolk/Text-Corpora-Adaptation-Tool
50 https://github.com/krzwolk/Text-Corpora-Adaptation-Tool
CORPORA   PL          EN         PAIRS
TED       218,426     104,117    151,228
OPEN      1,236,088   749,300    33,570,553
In Table 105, the corpora statistics are presented for the average sentence
lengths for each language and corpus. Both tables expose large disparities
between the text domains.
CORPORA   PL   EN
TED       13   17
OPEN      6    7
Additional data were used for training both the bilingual translation
phrase tables and language models. The Moses SMT system was used for
tokenization, cleaning, factorization, conversion to lower case, splitting,
and final cleaning of corpora after splitting. Training of a 6-gram language
model was accomplished using the KenLM Modeling Toolkit [9]. Word
and phrase alignment was performed using the SyMGIZA++ tool [266].
SYSTEM   BLEU PL->EN   BLEU EN->PL
BASE     17.43         10.70
NONE     17.89*        10.63*
MML      18.21**       11.13*
TF-IDF   17.92*        10.71
PP       18.13**       10.88*
LEV      17.66*        10.63*
COMB     18.97**       11.84**
As shown in Table 107, ignoring the adaptation step only slightly improves PL->EN translation and degrades EN->PL translation. As anticipated, other
adaptation methods have a rather positive impact on translation quality;
however, in some cases, the enhancement is only minor.
The most significant improvement in translation quality was obtained
using the proposed method combining all three metrics. It should be noted,
however, that the proposed method was not computationally feasible in some
cases, even though it produced satisfactory results. In the best-case scenario,
fast comparison metrics, such as perplexity, will filter most irrelevant data;
however, in the worst-case scenario, most data would be processed by slow
metrics.
Summing up, we successfully introduced a new combined approach for the in-domain data adaptation task. In the general case, it provides better adaptation results, in a reasonable amount of time, than state-of-the-art methods applied separately.
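A rough sketch of such a combined cascade is shown below; the scoring callables and all thresholds are illustrative assumptions, with the cheap perplexity filter placed first so that the slow similarity comparisons only see the survivors.

```python
# A minimal sketch of the combined selection cascade; all scoring
# functions and thresholds are illustrative stand-ins.

def combined_select(sentences, perplexity, tfidf_sim, lev_sim,
                    ppl_max=500.0, tfidf_min=0.2, lev_min=0.4):
    # 1. Fast perplexity filter removes most irrelevant data first
    kept = [s for s in sentences if perplexity(s) <= ppl_max]
    # 2. Cosine tf-idf keeps sentences with sufficient word overlap
    kept = [s for s in kept if tfidf_sim(s) >= tfidf_min]
    # 3. Slow, normalized Levenshtein similarity runs on the remainder
    return [s for s in kept if lev_sim(s) >= lev_min]
```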
51 iwslt.org
the ASR results have no punctuation or proper casing. Therefore, the translation would be rather poor if the ASR output were translated directly by MT systems.
Automatic sentence segmentation of speech is important in making speech
recognition (ASR) output more readable and easier for downstream language
processing modules. Various techniques have been studied for automatic
sentence boundary detection in speech, including hidden Markov models
(HMMs), maximum entropy, neural networks and Gaussian mixture models,
utilizing both textual and prosodic information.
This part of the research addresses the problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition systems. These differ from the sorts of texts normally used in NLP in a number of ways: The text is generally in single case, unpunctuated, and may contain transcription errors. In addition, such text is usually normalized, which makes it unnatural to read. A lot of important linguistic information is lost as well
(e.g., for machine translation or text-to-speech). Table 108 compares a short
text in the format that would be produced by an ASR system with a fully
punctuated version that includes casing information and de-normalization.
There are many possible situations in which an NLP system may be required to process ASR text. The most obvious examples are text-to-speech systems, machine translation and real-time speech-to-speech translation systems. Dictation software programs do not punctuate or capitalize their output. However, if this information could be added to ASR text, the results would be far more useful.
One of the most important pieces of information that is not available in
ASR output is sentence boundary information. The knowledge of sentence
boundaries is required by many NLP technologies. Part-of-speech taggers typically require input in the format of a single sentence per line, and parsers generally aim to produce a tree spanning each sentence. Sentence boundaries are very important in order not to lose the context, or even the meaning, of a sentence. Only the most trivial linguistic analysis can be carried out on text that is not split into sentences.
It is worth mentioning that not all transcribed speech can be sensibly
divided into sentences. It has been argued by Gotoh and Renals [302] that the
main unit in spoken language is the phrase, rather than the sentence. However,
there are situations in which it is appropriate to consider spoken language
to be composed of sentences. The need for such tools is also well-known in
other NLP branches. For example, the spoken portion of the British National
Corpus [303] contains 10 million words and was manually marked with
sentence boundaries. A technology that identifies sentence boundaries could
be used to speed up such processes.
Table 108. A short text in ASR output format (before) and its fully punctuated, cased and de-normalized version (after).

Before: ile kobiet mających czterdzieści cztery lata wygląda tak jak amerykańska gwiazda na okładce swej nowej płyty jennifer lopez nie bierze jednak udziału w konkursie piękności czy więc rownież mocno jak do rzeźbienia swego ciała przyłożyła się do stworzenia dobrych piosenek ostatnie lata nie były dobre dla latynoskiej piosenkarki między innymi rozwody i rozstania zdewastowały jej poczucie kobiecej wartości a klapy ostatnich albumow o mały włos nie zrujnowały piosenkarskiej kariery

After: Ile kobiet mających 44 lata wygląda tak, jak amerykańska gwiazda na okładce swej nowej płyty Jennifer Lopez? Nie bierze jednak udziału w konkursie piękności. Czy więc rownież mocno, jak do rzeźbienia swego ciała, przyłożyła się do stworzenia dobrych piosenek? Ostatnie lata nie były dobre dla latynoskiej piosenkarki, m.in. rozwody i rozstania zdewastowały jej poczucie kobiecej wartości, a klapy ostatnich albumow o mały włos nie zrujnowały piosenkarskiej kariery.
13. Wu, Dekai and Fung, Pascale. 2005. Inversion transduction grammar constraints for
mining parallel sentences from quasi-comparable corpora. In: Natural Language
Processing–IJCNLP 2005. Springer Berlin Heidelberg, pp. 257–268.
14. Tiedemann, Jörg. 2012. Parallel data, tools and interfaces in OPUS. In: LREC,
pp. 2214–2218.
15. Cettolo, Mauro, Girardi, Christian and Federico, Marcello. 2012. Wit3: Web inventory
of transcribed and translated talks. In: Proceedings of the 16th Conference of the
European Association for Machine Translation (EAMT), pp. 261–268.
16. Tiedemann, Jörg. 2009. News from OPUS-A collection of multilingual parallel corpora
with tools and interfaces. In: Recent Advances in Natural Language Processing,
pp. 237–248.
17. Marasek, Krzysztof. 2012. TED Polish-to-English translation system for the IWSLT
2012. In: IWSLT, pp. 126–129.
18. Wołk, Krzysztof and Marasek, Krzysztof. 2013. Polish-English speech statistical
machine translation systems for the IWSLT 2013. In: Proceedings of the 10th
International Workshop on Spoken Language Translation, Heidelberg, Germany,
pp. 113–119.
19. Berrotarán, Gonzalo Garcia, Carrascosa, Rafael and Vine, Andrew. 2015. Yalign
documentation, http://yalign.readthedocs.org/en/latest/, retrieved on June 17, 2015.
20. Musso, Gabe. 2015. Sequence alignment (Needleman-Wunsch, Smith-Waterman).
Available: http://www.cs.utoronto.ca/~brudno/bcb410/lec2notes.pdf, retrieved on June
17, 2015.
21. Joachims, Thorsten. 1998. Text categorization with support vector machines: Learning
with many relevant features. Springer Berlin Heidelberg, pp. 137–142.
22. Wołk, Krzysztof and Marasek, Krzysztof. 2014. Real-time statistical speech translation.
In: New Perspectives in Information Systems and Technologies, Volume 1. Springer
International Publishing, pp. 107–113.
23. Varga, Dániel et al. 2007. Parallel corpora for medium density languages. Amsterdam
Studies in the Theory and History of Linguistic Science, Series 4, 292: 247.
24. Wołk, Krzysztof and Marasek, Krzysztof. 2014. A sentence meaning based alignment
method for parallel text corpora preparation. In: New Perspectives in Information
Systems and Technologies, Volume 1. Springer International Publishing, pp. 229–237.
25. Hovy, Eduard. 1999. Toward finely differentiated evaluation metrics for machine
translation. In: Proceedings of the EAGLES Workshop on Standards and Evaluation.
Pisa, Italy.
26. Reeder, F. 2001. Additional mt-eval references. International Standards for Language
Engineering, Evaluation Working Group.
27. Papineni, Kishore et al. 2002. BLEU: A method for automatic evaluation of machine
translation. In: Proceedings of the 40th annual meeting on association for computational
linguistics. Association for Computational Linguistics, pp. 311–318.
28. Axelrod, Amittai. 2006. Factored language models for statistical machine translation.
M.Sc. thesis, University of Edinburgh.
29. Doddington, George. 2002. Automatic evaluation of machine translation quality using
n-gram co-occurrence statistics. In: Proceedings of the second international conference
on Human Language Technology Research. Morgan Kaufmann Publishers Inc.,
pp. 138–145.
30. Olive, Joseph. 2005. Global autonomous language exploitation (GALE). DARPA/IPTO
Proposer Information Pamphlet.
31. Banerjee, Satanjeev and Lavie, Alon. 2005. Meteor: An automatic metric for MT
evaluation with improved correlation with human judgments. In: Proceedings of the
ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation
and/or Summarization, pp. 65–72.
32. Goodman, Joshua and Chen, Stanley. 1998. An empirical study of smoothing techniques
for language modeling. Computer Science Group, Harvard University.
33. Perplexity [Online]. Hidden Markov Model Toolkit website. Cambridge University
Engineering Dept. Available: http://www1.icsi.berkeley.edu/Speech/docs/HTKBook3.2/
node188_mn.html, retrieved on November 29, 2015.
34. Isozaki, Hideki et al. 2010. Automatic evaluation of translation quality for distant
language pairs. In: Proceedings of the 2010 Conference on Empirical Methods in
Natural Language Processing. Association for Computational Linguistics, pp. 944–952.
35. European Medicines Agency (EMEA) [Online]. Available: http://opus.lingfil.uu.se/
EMEA.php, retrieved on August 7, 2013.
36. European Parliament Proceedings (Europarl) [Online]. Available: http://www.statmt.
org/europarl/, retrieved on August 7, 2013.
37. Kim, Jinseok. 2004. Lambda N. Gamma [Online]. Available: http://www.utexas.edu/
courses/schwab/sw318_spring_2004/SolvingProblems/Class11_LambdaNGamma.ppt,
retrieved on August 17, 2014.
38. Gauthier, Thomas D. 2001. Detecting trends using Spearman's rank correlation
coefficient. Environmental Forensics 2: 359–362.
39. Deng, Yonggang, Kumar, Shankar and Byrne, William. 2007. Segmentation and
alignment of parallel text for statistical machine translation. Natural Language
Engineering 13: 235–260.
40. Braune, Fabienn and Fraser, Alexander. 2010. Improved unsupervised sentence
alignment for symmetrical and asymmetrical parallel corpora. In: Proceedings of the
23rd International Conference on Computational Linguistics: Posters. Association for
Computational Linguistics, pp. 81–89.
41. Santos, André. 2011. A survey on parallel corpora alignment. MI-STAR, pp. 117–128.
42. Gale, William A. and Church, Kenneth Ward. 1991. Identifying word correspondences
in parallel texts. In: HLT, pp. 152–157.
43. Brown, Peter F., Lai, Jennifer C. and Mercer, Robert L. 1991. Aligning sentences
in parallel corpora. In: Proceedings of the 29th annual meeting on Association for
Computational Linguistics. Association for Computational Linguistics, pp. 169–176.
44. Filtering and aligning tool. Available: http://wolk.pl/tool-for-parallel-corpora-filtering-
and-aligning/, retrieved on June 10, 2015.
45. Cettolo, Mauro, Bertoldi, Nicola and Federico, Marcello. 2011. Methods for smoothing
the optimizer instability in SMT. MT Summit XIII: The Thirteenth Machine Translation
Summit, pp. 32–39.
46. Snover, Matthew et al. 2006. A study of translation edit rate with targeted human
annotation. In: Proceedings of Association for Machine Translation in the Americas,
pp. 223–231.
47. Costa-Jussà, Marta R. and Fonollosa, José A.R. 2010. Using linear interpolation and
weighted reordering hypotheses in the moses system. In: LREC.
48. International Workshop on Spoken Language Translation (IWSLT). Available: http://
www.iwslt2013.org/, retrieved on August 7, 2013.
49. Abbyy Aligner. Available: http://www.abbyy.com/aligner/, retrieved on August 7, 2013.
50. Unitex/Gramlab, Available: http://www-igm.univ-mlv.fr/~unitex, retrieved on
August 7, 2013.
51. Hunalign–sentence aligner, Available: http://mokk.bme.hu/resources/hunalign/,
retrieved on August 8, 2013.
52. Bleualign, https://github.com/rsennrich/Bleualign, retrieved on August 8, 2013.
53. Paumier, Sébastien, Nakamura, Takuya and Voyatzi, Stavroula. 2009. UNITEX, a Corpus
Processing System with Multi-Lingual Linguistic Resources. eLEX2009, pp. 173.
54. Bonhomme, Patrice and Romary, Laurent. 1995. The lingua parallel concordancing
project: Managing multilingual texts for educational purpose. Proceedings of Language
Engineering, 95: 26–30.
55. Hsu, Bo-June and Glass, James. 2008. Iterative language model estimation: Efficient
data structure & algorithms. In: Proceedings of Interspeech, pp. 1–4.
56. Proceedings of the International Workshop on Spoken Language Translation IWSLT
2014, Lake Tahoe, USA, 2014. Available online: http://workshop2014.iwslt.org, retrieved
on August 8, 2013.
57. Bisazza, Arianna et al. 2011. Fill-up versus interpolation methods for phrase-based SMT
adaptation. In: IWSLT, pp. 136–143.
58. Bojar, Ondrej. 2011. Rich morphology and what can we expect from hybrid approaches
to MT. Invited talk at International Workshop on Using Linguistic Information for
Hybrid Machine Translation LIHMT-2011.
59. Graliński, Filip, Jassem, Krzysztof and Junczys-Dowmunt, Marcin. 2013. PSI-toolkit:
A natural language processing pipeline. In: Computational Linguistics. Springer Berlin
Heidelberg, pp. 27–39.
60. Wołk, Krzysztof and Marasek, Krzysztof. 2015. Polish-English statistical machine
translation of medical texts. In: New Research in Multimedia and Internet Systems.
Springer International Publishing, pp. 169–179.
61. Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics,
33.2: 201–228.
62. Durrani, Nadir, Schmid, Helmut and Fraser, Alexander. 2011. A joint sequence translation
model with integrated reordering. In: Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies-Volume 1.
Association for Computational Linguistics, pp. 1045–1054.
63. Goeuriot, Lorraine et al. 2012. Report on and prototype of the translation support.
Khresmoi Public Deliverable, p. 3.
64. Pletneva, Natalia and Vargas, Alejandro. 2011. D8.1.1. Requirements for the general
public health search. Khresmoi Project public deliverable.
65. Gschwandtner, M., Kritz, M. and Boyer, C. 2011. Requirements of the health professional
research. Technical Report.
66. GGH Benefits. Medical phrases and terms translation demo. 2014, retrieved on August
7, 2013.
67. Karliner, Leah S. et al. 2007. Do professional interpreters improve clinical care for
patients with limited English proficiency? A systematic review of the literature. Health
Services Research 42.2: 727–754.
68. Randhawa, Gurdeeshpal et al. 2013. Using machine translation in clinical practice.
Canadian Family Physician 59: 382–383.
69. Deschenes, S. 5 benefits of healthcare translation technology [Online]. Healthcare
Finance News. October 16, 2012. Available: http://www.healthcarefinancenews.com/
news/5-benefits-healthcare-translation-technology, retrieved on June 10, 2015.
70. Zadon, Cruuz. Man vs. machine: The benefits of medical translation services [Online].
Ezine Articles: Healthcare Systems, July 25, 2013. Available: http://ezinearticles.
com/?Man-Vs-Machine:-The-Benefits-of-Medical-Translation-Services&id=7890538,
retrieved on August 7, 2013.
71. Durrani, Nadir et al. 2013. Munich-Edinburgh-Stuttgart submissions of OSM systems
at WMT13. In: Proceedings of the Eighth Workshop on Statistical Machine Translation,
pp. 120–125.
72. Koehn, Philipp and Hoang, Hieu. 2007. Factored translation models. Proceedings of
the 2007 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning EMNLP-CoNLL, pp. 868–876.
73. Bikel, Daniel M. 2004. Intricacies of Collins’ parsing model. Computational Linguistics
30: 479–511.
74. Dyer, Chris, Chahuneau, Victor and Smith, Noah A. 2013. A simple, fast, and effective
reparameterization of IBM Model 2. Association for Computational Linguistics.
75. Bojar, Ondrej et al. 2014. Findings of the 2014 workshop on statistical machine
translation. In: Proceedings of the Ninth Workshop on Statistical Machine Translation.
Association for Computational Linguistics, Baltimore, MD, USA, pp. 12–58.
76. Hasan, A., Islam, Saria and Rahman, M. 2012. A comparative study of Witten Bell
and Kneser-Ney smoothing methods for statistical machine translation. Journal of
Information Technology 1: 1–6.
77. Radziszewski, Adam and Sniatowski, Tomasz. 2011. Maca-a configurable tool to
integrate Polish morphological data [Online]. In: Proceedings of the Second International
Workshop on Free/Open-Source Rule-Based Machine Translation (2011: Barcelona).
Available: http://hdl. handle. net/10609/5645.
78. Bahdanau, Dzmitry, Cho, Kyunghyun and Bengio, Yoshua. 2014. Neural Machine
Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
79. Durrani, Nadir et al. 2014. Investigating the usefulness of generalized word representations
in SMT. In: Proceedings of the 25th Annual Conference on Computational Linguistics
(COLING), Dublin, Ireland, pp. 421–432.
80. Durrani, Nadir et al. 2014. Integrating an unsupervised transliteration model into
statistical machine translation. Proceedings of the 14th Conference of the European
Chapter of the Association for Computational Linguistics, EACL, pp. 148–153.
81. Wołk, Krzysztof and Marasek, Krzysztof. 2013. Alignment of the Polish-English parallel
text for a statistical machine translation. Computer Technology and Application 4. David
Publishing, ISSN: 1934-7332 (Print), ISSN: 1934-7340 (Online), pp. 575–583.
82. Koehn, Philipp and Haddow, Barry. 2012. Towards effective use of training data in
statistical machine translation. In: Proceedings of the Seventh Workshop on Statistical
Machine Translation. Association for Computational Linguistics, pp. 317–321.
83. Clark, Jonathan H. et al. 2011. Better hypothesis testing for statistical machine
translation: Controlling for optimizer instability. In: Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human Language
Technologies: Short Papers-Volume 2. Association for Computational Linguistics,
pp. 176–181.
84. Durrani, Nadir et al. 2013. Edinburgh’s machine translation systems for European
language pairs. In: Proceedings of the Eighth Workshop on Statistical Machine
Translation, pp. 114–121.
85. Koehn, Philipp. 2009. Statistical machine translation. Cambridge University Press.
86. Askarieh, Sona. 2014. Cohesion and Comprehensibility in Swedish-English Machine
Translated Texts.
87. Wojak, Aleksandra and Gralinski, Filip. 2010. Matura evaluation experiment based on
human evaluation of machine translation. In: IMCSIT, pp. 547–551.
88. Lopez, Adam. 2008. Statistical machine translation. ACM Computing Surveys (CSUR)
40: 8.
89. Hutchins, John. 2003. Has machine translation improved? Some historical comparisons.
In: Proceedings of the 9th MT Summit, pp. 181–188.
90. Hutchins, W. John. 2003. Machine translation: half a century of research and use. Proc.
UNED Summer School, pp. 1–24.
91. Agarwal, Abhaya and Lavie, Alon. 2008. Meteor, M-BLEU and M-TER: Evaluation
metrics for high-correlation with human rankings of machine translation output.
In: Proceedings of the Third Workshop on Statistical Machine Translation. Association
for Computational Linguistics, pp. 115–118.
92. Dugast, Loïc, Senellart, Jean and Koehn, Philipp. 2007. Statistical post-editing
on systran’s rule-based translation system. In: Proceedings of the Second Workshop
on Statistical Machine Translation. Association for Computational Linguistics,
pp. 220–223.
111. Wu, Hua, Wang, Haifeng and Zong, Chengqing. 2008. Domain adaptation for statistical
machine translation with domain dictionary and monolingual corpora. In: Proceedings of
the 22nd International Conference on Computational Linguistics-Volume 1. Association
for Computational Linguistics, pp. 993–1000.
112. Cheng, Pu-Jen, et al. 2004. Creating multilingual translation lexicons with regional
variations using web corpora. In: Proceedings of the 42nd Annual Meeting on
Association for Computational Linguistics. Association for Computational Linguistics,
p. 534.
113. Axelrod, Amittai, He, Xiaodong and Gao, Jianfeng. 2011. Domain adaptation via pseudo
in-domain data selection. In: Proceedings of the Conference on Empirical Methods in
Natural Language Processing. Association for Computational Linguistics, pp. 355–362.
114. Tillmann, Christoph. 2004. A unigram orientation model for statistical machine
translation. In: Proceedings of HLT-NAACL 2004: Short Papers. Association for
Computational Linguistics, pp. 101–104.
115. Galley, Michel and Manning, Christopher D. 2008. A simple and effective hierarchical
phrase reordering model. In: Proceedings of the Conference on Empirical Methods in
Natural Language Processing. Association for Computational Linguistics, pp. 848–856.
116. Och, Franz Josef and Ney, Hermann. 2003. A systematic comparison of various statistical
alignment models. Computational Linguistics 29: 19–51.
117. Och, Franz Josef. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit
Verfahren der kombinatorischen Optimierung. Studienarbeit, Friedrich-Alexander-
Universität, Erlangen-Nürnberg, Germany.
118. Och, Franz Josef. 1999. An efficient method for determining bilingual word classes.
In: Proceedings of the Ninth Conference on European Chapter of the Association for
Computational Linguistics. Association for Computational Linguistics, pp. 71–76.
119. KantanMT—a sophisticated and powerful Machine Translation solution in an easy-to-
use package [http://www.kantanmt.com].
120. Linguistic Intelligence Research Group, NTT Communication Science Laboratories.
RIBES: Rank-based Intuitive Bilingual Evaluation Score [Online]. Available: http://
www.kecl.ntt.co.jp/icl/lirg/ribes/, retrieved on August 7, 2013.
121. Chahuneau, Victor, Smith, Noah and Dyer, Chris. 2012. Pycdec: A python interface to
cdec. The Prague Bulletin of Mathematical Linguistics 98: 51–61.
122. Dyer, Chris et al. 2010. cdec: A decoder, alignment, and learning framework for finite-
state and context-free translation models. In: Proceedings of the ACL 2010 System
Demonstrations. Association for Computational Linguistics, pp. 7–12.
123. Quirk, Chris and Menezes, Arul. 2006. Do we need phrases? Challenging the conventional
wisdom in statistical machine translation. In: Proceedings of the main Conference
on Human Language Technology Conference of the North American Chapter of the
Association of Computational Linguistics. Association for Computational Linguistics,
pp. 9–16.
124. Marino, José B. et al. 2006. N-gram-based machine translation. Computational
Linguistics 32: 527–549.
125. Costa-jussà, Marta R. et al. 2007. Analysis and system combination of phrase-and n-gram-
based statistical machine translation systems. In: Human Language Technologies 2007:
The Conference of the North American Chapter of the Association for Computational
Linguistics; Companion Volume, Short Papers. Association for Computational
Linguistics, pp. 137–140.
126. Johnson, John Howard et al. 2007. Improving translation quality by discarding most
of the phrasetable. Proceedings of the 2007 Joint Conference on Empirical Methods in
Natural Language Processing and Computational Natural Language Learning (EMNLP-
CoNLL).
127. Wu, Hua and Wang, Haifeng. 2007. Comparative study of word alignment heuristics and
phrase-based SMT. Proceedings of the MT Summit XI.
128. Kutsumi, Takeshi et al. 2005. Selection of entries for a bilingual dictionary from aligned
translation equivalents using support vector machines. In: Proceedings of Pacific
Association for Computational Linguistics.
129. Eck, Matthias, Vogel, Stephan and Waibel, Alex. 2007. Translation model pruning via
usage statistics for statistical machine translation. In: Human Language Technologies
2007: The Conference of the North American Chapter of the Association for
Computational Linguistics; Companion Volume, Short Papers. Association for
Computational Linguistics, pp. 21–24.
130. Eck, Matthias, Vogel, Stephan and Waibel, Alex. 2007. Estimating phrase pair relevance
for translation model pruning. MT Summit XI.
131. Durrani, Nadir and Koehn, Philipp. 2014. Improving machine translation via
triangulation and transliteration. In: Proceedings of the 17th Annual Conference of the
European Association for Machine Translation (EAMT), Dubrovnik, Croatia.
132. Durrani, Nadir et al. 2014. Investigating the usefulness of generalized word
representations in SMT. In: Proceedings of the 25th Annual Conference on Computational
Linguistics (COLING), Dublin, Ireland, pp. 421–432.
133. Aiken, Milam and Balan, Shilpa. 2011. An analysis of Google Translate accuracy.
Translation Journal 16: 1–3.
134. Wesley-Tanaskovic, Ines, Tocatlian, Jacques and Roberts, Kenneth H. 1994. Expanding
access to science and technology: The role of information technologies. Proceedings of
the Second International Symposium on the Frontiers of Science and Technology Held
in Kyoto, Japan, 12–14 May 1992. United Nations University Press.
135. Helft, Miguel. 2010. Google’s computing power refines translation tool. The New York
Times.
136. Lehrberger, John and Bourbeau, Laurent. 1988. Machine translation: Linguistic
characteristics of MT systems and general methodology of evaluation. John Benjamins
Publishing.
137. Costa-Jussa, Marta R. et al. 2012. Study and comparison of rule-based and statistical
Catalan-Spanish machine translation systems. Computing and Informatics 31: 245–270.
138. Weiss, Sandra. 2011. Cohesion and Comprehensibility in Polish-English Machine
Translated Texts.
139. Costa-Jussà, Marta R., Farrús, Mireia and Pons, Jordi Serrano. 2012. Machine translation
in medicine. Proceedings in ARSA-Advanced Research in Scientific Areas, p. 1.
140. Smith, Ray. 2007. An overview of the Tesseract OCR engine. In: Proceedings of the
Ninth International Conference on Document Analysis and Recognition (ICDAR). IEEE,
pp. 629–633.
141. Jurafsky, Daniel and Martin, James H. 2000. Speech and language processing: An
introduction to natural language processing. Computational Linguistics, and Speech
Recognition, Pearson Education India.
142. Maccartney, Bill. 2005. NLP lunch tutorial: Smoothing.
143. Tian, Liang, Wong, Fai and Chao, Sam. 2011. Word alignment using GIZA++ on
Windows. Machine Translation.
144. Lavie, Alon. 2010. Evaluating the output of machine translation systems. AMTA
Tutorial.
145. Jassem, Krzysztof. 1997. Poleng–A machine translation system based on an electronic
dictionary. Speech and Language Technology 1: 161–194.
146. Jassem, Krzysztof, Graliński, Filip and Krynicki, Grzegorz. 2000. Poleng-adjusting
a rule-based polish-english machine translation system by means of corpus analysis.
In: Proceedings of the 5th European Association for Machine Translation Conference,
pp. 75–82.
167. Steinberger, Ralf et al. 2013. Dgt-tm: A freely available translation memory in 22
languages. arXiv preprint arXiv:1309.5226.
168. Jassem, Krzysztof. 2004. Applying Oxford-PWN english-polish dictionary to machine
translation. In: Proceedings of the 9th European Association for Machine Translation
Workshop on Broadening Horizons of Machine Translation and its Applications.
169. Besacier, Laurent et al. 2014. Word confidence estimation for speech translation.
In: International Workshop on Spoken Language Translation.
170. Wołk, Krzysztof and Marasek, Krzysztof. 2014. Building subject-aligned
comparable corpora and mining it for truly parallel sentence pairs. Procedia Technology
18: 126–132.
171. Moore, Robert C. and Quirk, Chris. 2009. Improved smoothing for N-gram language
models based on ordinary counts. In: Proceedings of the ACL-IJCNLP 2009 Conference
Short Papers. Association for Computational Linguistics, pp. 349–352.
172. Hsu, Bo-June and Glass, James. 2008. Iterative language model estimation: Efficient
data structure & algorithms. In: Proceedings of Interspeech, pp. 1–4.
173. Kirchhoff, Katrin et al. 2011. Application of statistical machine translation to public
health information: A feasibility study. Journal of the American Medical Informatics
Association 18.4: 473–478.
174. Malouf, Robert. 2002. A comparison of algorithms for maximum entropy parameter
estimation. In: Proceedings of the 6th conference on Natural language learning-Volume
20. Association for Computational Linguistics, pp. 1–7.
175. Rottmann, Kay and Vogel, Stephan. 2007. Word reordering in statistical machine
translation with a POS-based distortion model. Proc. of TMI, pp. 171–180.
176. Borman, Sean. 2004. The expectation maximization algorithm-a short tutorial.
Submitted for Publication, pp. 1–9.
177. Collins, Michael. 2011. Statistical machine translation: IBM models 1 and 2. Columbia:
Columbia University.
178. Och, Franz Josef and Ney, Hermann. 2003. A systematic comparison of various statistical
alignment models. Computational Linguistics 29: 19–51.
179. Vulić, I., Term Alignment: State of the Art Overview [Online]. Katholieke Universiteit
Leuven, 2010, Available: http://people.cs.kuleuven.be/~ivan.vulic/Files/TASOA.pdf,
retrieved on August 7, 2013.
180. Schoenemann, Thomas. 2010. Computing optimal alignments for the IBM-3 translation
model. In: Proceedings of the Fourteenth Conference on Computational Natural
Language Learning. Association for Computational Linguistics, pp. 98–106.
181. Fernández, Pablo Malvar. 2008. Improving word-to-word alignments using
morphological information. PhD thesis, San Diego State University.
182. Brown, Peter F. et al. 1993. The mathematics of statistical machine translation: Parameter
estimation. Computational Linguistics 19: 263–311.
183. Knight, Kevin. 1999. A statistical MT tutorial workbook. Manuscript prepared for the
1999 JHU Summer Workshop.
184. Koehn, Philipp. 2010. Moses, statistical machine translation system, user manual and
code guide.
185. Zhao, Shaojun and Gildea, Daniel. 2010. A fast fertility hidden Markov model for
word alignment using MCMC. In: Proceedings of the 2010 Conference on Empirical
Methods in Natural Language Processing. Association for Computational Linguistics,
pp. 596–605.
186. Ma, Yanjun. 2009. Out of GIZA—Efficient Word Alignment Models for SMT. NCLT
Seminar Series.
187. Federico, Marcello, Bertoldi, Nicola and Cettolo, Mauro. 2008. IRSTLM: An open
source toolkit for handling large scale language models. In: Interspeech, pp. 1618–1621.
188. Snover, Matthew et al. 2006. A study of translation edit rate with targeted human
annotation. In: Proceedings of association for machine translation in the Americas,
pp. 223–231.
189. Denkowski, Michael and Lavie, Alon. 2011. Meteor 1.3: Automatic metric for reliable
optimization and evaluation of machine translation systems. In: Proceedings of the
Sixth Workshop on Statistical Machine Translation. Association for Computational
Linguistics, pp. 85–91.
190. Jurafsky, Dan. 2015. Language modeling: Introduction to n-grams [Online]. Stanford
University. Available: https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf,
retrieved on November 29, 2015.
191. Borisov, Alexey and Galinskaya, Irina. 2014. Yandex School of Data Analysis Russian-
English machine translation system for WMT14. ACL 2014, p. 66.
192. Michie, Donald, Spiegelhalter, David J. and Taylor, Charles C. 1994. Machine learning,
neural and statistical classification.
193. Abdelali, Ahmed et al. 2014. The AMARA corpus: Building parallel language
resources for the educational domain. In: Proceedings of the 9th International Conference
on Language Resources and Evaluation, Reykjavik, Iceland. May.
194. Li, Da and Becchi, Michela. 2012. Multiple pairwise sequence alignments with the
needleman-wunsch algorithm on GPU. In: High Performance Computing, Networking,
Storage and Analysis (SCC), 2012 SC Companion. IEEE, pp. 1471–1472.
195. Yang, Wei and Lepage, Yves. 2014. Inflating a training corpus for SMT by using
unrelated unaligned monolingual data. In: Advances in Natural Language Processing.
Springer International Publishing, pp. 236–248.
196. Bergroth, Lasse, Hakonen, Harri and Raita, Timo. 2000. A survey of longest common
subsequence algorithms. In: String Processing and Information Retrieval, 2000. SPIRE
2000. Proceedings. Seventh International Symposium on. IEEE, pp. 39–48.
197. Sharp, Bernadette, Carl, Michael, Zock, Michael and Jakobsen Arnt Lykke (eds.). 2011.
Human-Machine Interaction in Translation: Proceedings of the 8th International NLPCS
Workshop. Samfundslitteratur.
198. Cherry, Colin and Foster, George. 2012. Batch tuning strategies for statistical machine
translation. In: Proceedings of the 2012 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies.
Association for Computational Linguistics, pp. 427–436.
199. Sakti, Sakriani et al. 2013. A-STAR: Toward translating Asian spoken languages.
Computer Speech & Language 27: 509–527.
200. Sakti, Sakriani et al. 2008. Development of Indonesian large vocabulary continuous
speech recognition system within A-STAR project. In: IJCNLP, pp. 19–24.
201. Gale, William A. and Church, Kenneth W. 1993. A program for aligning sentences in
bilingual corpora. Computational Linguistics 19: 75–102.
202. Oyeka, Ikewelugo Cyprian Anaene and Ebuh, Godday Uwawunkonye. 2012. Modified
Wilcoxon signed-rank test. Open Journal of Statistics 2: 172.
203. Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation.
Proceedings of the 41st Annual Meeting on Association for Computational Linguistics,
Volume 1. Association for Computational Linguistics.
204. Nakamura, S. et al. 2007. A-star: Asia speech translation consortium. In: Proc. ASJ
Autumn Meeting, page to appear, Yamanashi, Japan.
205. Choong, Careemah. 2014. The difference between written and spoken English.
Assignment Unit 1 A in fulfillment of Graduate Diploma in English.
206. Koehn, Philipp, Och, Franz Josef and Marcu, Daniel. 2003. Statistical phrase-based
translation. In: Proceedings of the 2003 Conference of the North American Chapter
of the Association for Computational Linguistics on Human Language Technology-
Volume 1. Association for Computational Linguistics, pp. 48–54.
207. Jelinek, Frederick. 1997. Statistical methods for speech recognition. MIT Press.
208. Zespół Przetwarzania Sygnałów AGH (AGH Signal Processing Group). Ngram [Online].
Available: http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:ngram, retrieved on
August 7, 2015.
209. Hasan, A., Islam, Saria and Rahman, M. 2012. A comparative study of Witten Bell
and Kneser-Ney smoothing methods for statistical machine translation. Journal of
Information Technology 1: 1–6.
210. Bertoldi, Nicola, Haddow, Barry and Fouet, Jean-Baptiste. 2009. Improved minimum
error rate training in Moses. The Prague Bulletin of Mathematical Linguistics 91: 7–16.
211. Collins, Michael. 1999. Head-driven statistical models for natural language parsing.
PhD thesis, University of Pennsylvania.
212. Shannon, C.E. 1948. A Mathematical theory of communication. Bell System Technical
Journal 27(3): 379–423.
213. Gerber, Laurie and Yang, Jin. 1997. Systran mt dictionary development. Machine
Translation: Past, Present and Future. In: Proceedings of Machine Translation Summit
VI. October.
214. Birch, Alexandra, Osborne, Miles and Koehn, Philipp. 2008. Predicting success in
machine translation. Proceedings of the Conference on Empirical Methods in Natural
Language Processing. Association for Computational Linguistics.
215. Grefenstette, Gregory and Tapanainen, Pasi. 1994. What is a word, what is a sentence?
Problems of tokenisation. Rank Xerox Research Centre.
216. Trim, Craig. The Art Of Tokenization [Online]. Jan, 23, 2013. Available: https://www.
ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en, retrieved
on November 30, 2015.
217. Wieczorek, M. 2011. Linguistic aspects of text-to-speech synthesis. Uniwersytet
im. Adama Mickiewicza, Poznań.
218. Wołk, Krzysztof, Rejmund, Emilia and Marasek, Krzysztof. 2015. Harvesting
comparable corpora and mining them for equivalent bilingual sentences using statistical
classification and analogy-based heuristics. In: Foundations of Intelligent Systems.
Springer International Publishing, pp. 433–441.
219. Langa, Natalia and Wojak, Aleksandra. 2011. Ewaluacja systemów tłumaczenia
automatycznego [Evaluation of machine translation systems]. Uniwersytet im. Adama
Mickiewicza, Poznań.
220. Lo, Chi-Kiu and Wu, Dekai. 2011. MEANT: An inexpensive, high-accuracy, semi-
automatic metric for evaluating translation utility via semantic frames. In: Proceedings
of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies-Volume 1. Association for Computational Linguistics,
pp. 220–229.
221. Bojar, Ondřej and Wu, Dekai. 2012. Towards a predicate-argument evaluation for MT.
In: Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical
Translation. Association for Computational Linguistics, pp. 30–38.
222. Pradhan, Sameer S. et al. 2004. Shallow semantic parsing using support vector machines.
In: HLT-NAACL, pp. 233–240.
223. Tiedemann, Jörg. 2009. News from OPUS-A collection of multilingual parallel corpora
with tools and interfaces. In: Recent Advances in Natural Language Processing,
pp. 237–248.
224. Carmigniani, Julie. 2011. Augmented reality methods and algorithms for hearing
augmentation. Florida Atlantic University.
225. NeWo LLC. “EnWo—English cam translator.” Last modified January 12, 2014. https://
play.google.com/store/apps/details?id=com.newo.enwo.
226. Cui, Lei et al. 2013. Multi-domain adaptation for SMT using multi-task learning. In:
EMNLP, pp. 1055–1065.
227. Google. “Translate API.” Last modified April 6, 2015. https://cloud.google.com/
translate/v2/pricing.
228. Martedi, Sandy, Uchiyama, Hideaki and Saito, Hideo. 2010. Clickable augmented
documents. In: 2010 IEEE International Workshop on Multimedia Signal Processing
(MMSP). IEEE, pp. 162–166. https://ieeexplore.ieee.org/xpl/mostRecentIssue.
jsp?punumber=5656074
229. Abbyy. “TextGrabber + Translator.” Last modified February 16, 2015. http://abbyy-
textgrabber.android.informer.com/.
230. Du, Jun et al. 2011. Snap and translate using windows phone. In: 2011 International
Conference on Document Analysis and Recognition (ICDAR). IEEE, pp. 809–813.
https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6065245
231. The Economist. Word Lens: This changes everything. Last modified December 18,
2010. http://www.economist.com/blogs/gulliver/2010/12/instant_translation.
232. Datamark, Inc. “OCR as a Real-Time Language Translator.” Last modified July 30,
2012. https://www.datamark.net/blog/ocr-as-a-real-time-language-translator.
233. Khan, Tabish et al. 2014. Augmented reality based word translator. International
Journal of Innovative Research in Computer Science & Technology (IJIRCST) 2(2).
234. Mahbub-Uz-Zaman, S. and Islam, Tanjina. 2012. Application of augmented reality:
Mobile camera based Bangla text detection and translation. PhD thesis, BRAC
University.
235. Emmanuel, Ashish S. and Nithyanandam, S. 2014. An optimal text recognition
and translation system for smart phones using genetic programming and cloud.
International Journal of Engineering Science and Innovative Technology (IJESIT)
3(2): 437–443.
236. Toyama, Takumi et al. 2014. A mixed reality head-mounted text translation system using
eye gaze input. In: Proceedings of the 19th international conference on Intelligent User
Interfaces. ACM, pp. 329–334.
237. Information Systems Laboratory, Adam Mickiewicz University. “SyMGIZA++.” Last
modified April 27, 2011. http://psi.amu.edu.pl/en/index.php?title=SyMGIZA.
238. Dušek, Ondrej et al. 2014. Machine translation of medical texts in the Khresmoi project.
In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 221–228.
239. Worldwide, H. 1998. Medical Phrases and Terms Translation Demo.
240. Rojas, Raúl. 2013. Neural networks: A systematic introduction. Springer Science &
Business Media.
241. Jesan, John Peter and Lauro, Donald M. 2003. Human brain and neural network
behavior: A comparison. Ubiquity, 2003, November: 2–2.
242. Tiedemann, Jörg. 2009. News from OPUS-A collection of multilingual parallel corpora
with tools and interfaces. In: Recent Advances in Natural Language Processing,
pp. 237–248.
243. Cho, Kyunghyun et al. 2014. Learning phrase representations using RNN encoder-
decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
244. Goodfellow, Ian J. et al. 2013. Maxout networks. arXiv preprint arXiv:1302.4389.
245. Pascanu, Razvan et al. 2013. How to construct deep recurrent neural networks. arXiv
preprint arXiv:1312.6026.
246. Zeiler, Matthew D. 2012. Adadelta: An adaptive learning rate method. arXiv preprint
arXiv:1212.5701.
247. Graves, Alex. 2012. Sequence transduction with recurrent neural networks. arXiv
preprint arXiv:1211.3711.
248. Boulanger-Lewandowski, Nicolas, Bengio, Yoshua and Vincent, Pascal. 2013. Audio
chord recognition with recurrent neural networks. In: ISMIR, pp. 335–340.
249. Mikolov, Tomas et al. 2011. Rnnlm-recurrent neural network language modeling toolkit.
In: Proc. of the 2011 ASRU Workshop, pp. 196–201.
250. Roessler, Ross. 2010. A GPU implementation of needleman-wunsch, specifically for use
in the program pyronoise 2. Computer Science & Engineering.
251. European Federation of Hard of Hearing People. 2011. State of subtitling access in EU.
2011 Report. http://ec.europa.eu/internal_market/consultations/2011/audiovisual/non-
registered-organisations/european-federation-of-hard-of-hearing-people-efhoh-en.pdf.
[Online; accessed 30 Jan. 2016].
252. Romero-Fresco, Pablo and Pérez, Juan Martínez. 2015. Accuracy rate in live subtitling:
The NER model. In: Audiovisual Translation in a Global Context. Palgrave Macmillan,
London, pp. 28–50.
253. Dutka L., Szarkowska A., Chmiel A., Lijewska A., Krejtz K., Marasek K. and Brocki
L. 2015. Are interpreters better respeakers? An exploratory study on respeaking
competences. 5th International Symposium on respeaking, live subtitling and
accessibility, Rome, 12 June. https://www.unint.eu/it/ricerca/gruppi-di-ricerca/8-
pagina/494-respeaking-live-subtitling-and-accessibility.html
254. Hovy, Eduard H. 1999. Toward finely differentiated evaluation metrics for machine
translation. In: Proceedings of the EAGLES Workshop on Standards and Evaluation.
Pisa, Italy.
255. Wołk, Krzysztof and Koržinek, Danijel. 2016. Comparison and adaptation of
automatic evaluation metrics for quality assessment of re-speaking. arXiv preprint
arXiv:1601.02789.
256. Birch, Alexandra et al. 2013. The feasibility of HMEANT as a human MT evaluation
metric. In: Proceedings of the Eighth Workshop on Statistical Machine Translation,
pp. 52–61.
257. Furey, Terrence S. et al. 2000. Support vector machine classification and validation of
cancer tissue samples using microarray expression data. Bioinformatics 16.10: 906–914.
258. Graves, Alex and Schmidhuber, Jürgen. 2005. Framewise phoneme classification
with bidirectional LSTM and other neural network architectures. Neural Networks
18.5-6: 602–610.
259. Tai, Kai Sheng, Socher, Richard and Manning, Christopher D. 2015. Improved semantic
representations from tree-structured long short-term memory networks. arXiv preprint
arXiv:1503.00075.
260. US Department of Health and Human Services. 2012. PROMIS: Instrument Development
and Psychometric Evaluation Scientific Standards.
261. Wołk, Krzysztof and Marasek, Krzysztof. 2015. Unsupervised comparable corpora
preparation and exploration for bi-lingual translation equivalents. arXiv preprint
arXiv:1512.01641.
262. Bonomi, A.E. et al. 1996. Multilingual translation of the Functional Assessment of
Cancer Therapy (FACT) quality of life measurement system. Quality of Life Research
5.3: 309–320.
263. Wild, Diane et al. 2009. Multinational trials—recommendations on the translations
required, approaches to using the same language in different countries, and the
approaches to support pooling the data: The ISPOR patient-reported outcomes
translation and linguistic validation good research practices task force report. Value in
Health 12.4: 430–440.
264. Wołk, Krzysztof and Marasek, Krzysztof. 2015. Polish-English speech statistical
machine translation systems for the IWSLT 2013. arXiv preprint arXiv:1509.09097.
265. Wang, Longyue et al. 2014. A systematic comparison of data selection criteria for smt
domain adaptation. The Scientific World Journal, 2014.
266. Junczys-Dowmunt, Marcin and Szał, Arkadiusz. 2012. SyMGIZA++: Symmetrized
word alignment models for statistical machine translation. In: Security and Intelligent
Information Systems. Springer, Berlin, Heidelberg, pp. 379–390.
267. Moses statistical machine translation, “OOVs.” 2015. http://www.statmt.org/
moses/?n=Advanced.OOVs#ntoc2. Accessed 27 September 2015.
268. Moses statistical machine translation, "Build reordering model." 2013. http://www.
statmt.org/moses/?n=FactoredTraining.BuildReorderingModel. Accessed 10 October
2015.
269. Moore, Robert C. and Lewis, William. 2010. Intelligent selection of language model
training data. In: Proceedings of the ACL 2010 conference short papers. Association for
Computational Linguistics, pp. 220–224.
270. Dieny, Romain et al. 2011. Bioinformatics inspired algorithm for stereo correspondence.
International Conference on Computer Vision Theory and Application, 465–473.
271. Robertson, Michael J. et al. 1989. The weighted index method: A new technique
for analyzing planar optical waveguides. Journal of Lightwave Technology
7.12: 2105–2111.
272. Derksen, Shelley and Keselman, Harvey J. 1992. Backward, forward and stepwise
automated subset selection algorithms: Frequency of obtaining authentic and noise
variables. British Journal of Mathematical and Statistical Psychology 45.2: 265–282.
273. Wołk, Krzysztof, Marasek, Krzysztof and Wołk, Agnieszka. 2016. Exploration for
Polish-* bi-lingual translation equivalents from comparable and quasi-comparable
corpora. In: Computer Science and Information Systems (FedCSIS), 2016 Federated
Conference on. IEEE, pp. 517–525.
274. Anderson, Stephen R. 2004. How many languages are there in the world. Linguistic
Society of America.
275. List of languages by number of native speakers. 2016. Wikipedia https://en.wikipedia.
org/wiki/List_of_languages_by_number_of_native_speakers. Accessed 16.02.2016.
276. Paolillo, John C. and Das, Anupam. 2006. Evaluating language statistics: The ethnologue
and beyond. Contract report for UNESCO Institute for Statistics.
277. English language in Europe. 2016. Wikipedia. https://en.wikipedia.org/wiki/English_
language_in_Europe. Accessed 16 February 2017.
278. Munteanu, Dragos Stefan, Fraser, Alexander and Marcu, Daniel. 2004. Improved machine
translation performance via parallel sentence extraction from comparable corpora. In:
Proceedings of the Human Language Technology Conference of the North American
Chapter of the Association for Computational Linguistics: HLT-NAACL, 2004.
279. Callison-Burch, Chris and Osborne, Miles. 2002. Co-training for statistical machine
translation. Master's thesis, School of Informatics, University of Edinburgh.
280. Ueffing, N., Haffari, G. and Sarkar, A. 2009. Semisupervised learning for machine
translation. In: Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster
(eds.). Learning Machine Translation. MIT Press, pp. 237–256.
281. Mann, Gideon S. and Yarowsky, David. 2001. Multipath translation lexicon induction
via bridge languages. In: Proceedings of the second meeting of the North American
Chapter of the Association for Computational Linguistics on Language technologies.
Association for Computational Linguistics, pp. 1–8.
282. Kumar, Shankar, Och, Franz J. and Macherey, Wolfgang. 2007. Improving word
alignment with bridge languages. In: Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural
Language Learning (EMNLP-CoNLL).
283. Wu, Hua and Wang, Haifeng. 2007. Pivot language approach for phrase-based statistical
machine translation. Machine Translation 21.3: 165–181.
284. Habash, Nizar and Hu, Jun. 2009. Improving Arabic-Chinese statistical machine
translation using English as pivot language. In: Proceedings of the Fourth Workshop
on Statistical Machine Translation. Association for Computational Linguistics,
pp. 173–181.
285. Eisele, Andreas et al. 2008. Hybrid machine translation architectures within and beyond
the EuroMatrix project. In: Proceedings of the 12th Annual Conference of the European
Association for Machine Translation (EAMT 2008), pp. 27–34.
286. Cohn, Trevor and Lapata, Mirella. 2007. Machine translation by triangulation: Making
effective use of multi-parallel corpora. In: Proceedings of the 45th Annual Meeting of
the Association of Computational Linguistics, pp. 728–735.
287. Leusch, Gregor et al. 2010. Multi-pivot translation by system combination. In:
International Workshop on Spoken Language Translation (IWSLT 2010).
288. Bertoldi, Nicola et al. 2008. Phrase-based statistical machine translation with pivot
languages. In: International Workshop on Spoken Language Translation (IWSLT).
289. Yujian, Li and Bo, Liu. 2007. A normalized Levenshtein distance metric. IEEE
Transactions on Pattern Analysis and Machine Intelligence 29.6: 1091–1095.
290. Cao, G., Nie, J. and Bai, J. 2005. Integrating term relationships into language models. In:
Proceedings of the 28th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, Salvador, pp. 298–305.
291. Bellegarda, J. 2000. Data-driven semantic language modeling. In: Institute for
Mathematics and its Applications Workshop.
292. Thomo, A. 2009. Latent semantic analysis (LSA) tutorial. http://webhome.cs.uvic.
ca/~thomo/svd.pdf. Accessed 16 February 2017.
293. Verspoor, Karin et al. 2008. A semantics-enhanced language model for unsupervised
word sense disambiguation. In: International Conference on Intelligent Text Processing
and Computational Linguistics. Springer, Berlin, Heidelberg, pp. 287–298.
294. Lison, Pierre and Tiedemann, Jörg. 2016. OpenSubtitles2016: Extracting large parallel
corpora from movie and TV subtitles.
295. Fujita, Atsushi and Isabelle, Pierre. 2015. Expanding paraphrase lexicons by exploiting
lexical variants. In: Proceedings of the 2015 Conference of the North American Chapter
of the Association for Computational Linguistics. Human Language Technologies,
pp. 630–640.
296. Shen, Libin, Sarkar, Anoop and Och, Franz Josef. 2004. Discriminative reranking for
machine translation. In: Proceedings of the Human Language Technology Conference
of the North American Chapter of the Association for Computational Linguistics. HLT-
NAACL.
297. Devlin, Jacob et al. 2014. Fast and robust neural network joint models for statistical
machine translation. In: Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp. 1370–1380.
298. Daumé III, Hal and Jagarlamudi, Jagadeesh. 2011. Domain adaptation for machine
translation by mining unseen words. In: Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics. Human Language Technologies: Short
papers-Volume 2. Association for Computational Linguistics, pp. 407–412.
299. Lin, Sung-Chien et al. 1997. Chinese language model adaptation based on document
classification and multiple domain-specific language models. In: Fifth European
Conference on Speech Communication and Technology.
300. Gao, Jianfeng et al. 2002. Toward a unified approach to statistical language modeling
for Chinese. ACM Transactions on Asian Language Information Processing (TALIP)
1.1: 3–33.
301. Koehn, Philipp. 2004. Pharaoh: A beam search decoder for phrase-based statistical
machine translation models. In: Conference of the Association for Machine Translation
in the Americas. Springer, Berlin, Heidelberg, pp. 115–124.
302. Gotoh, Yoshihiko and Renals, Steve. 2000. Sentence boundary detection in broadcast
speech transcripts. In: ASR2000-Automatic Speech Recognition: Challenges for the
New Millennium, ISCA Tutorial and Research Workshop (ITRW).
303. Burnard, Lou. 1995. Users Reference Guide British National Corpus Version 1.0.
Index

A
A* algorithm 65, 66, 76, 78
A* search 39, 65, 67, 116, 179
A* search algorithm 65, 67, 116, 179
ABBYY Aligner 71, 104–106
Absolute 30, 32, 34, 57, 115, 130, 131, 147, 182, 222, 223
absolute alignment 32
acquisition 215, 219
adaptation 1, 46, 47, 51, 63, 72, 83, 113, 116–118, 125, 132, 143, 144, 151, 177, 178, 193, 197, 198, 209, 215, 216, 219, 220, 224–227, 229–232
Adaptation of parallel corpora 47
Alignment 4, 10, 11, 17, 19, 22, 28, 32–38, 41–45, 47, 51, 52, 54–56, 60, 62–65, 68, 69, 71, 72, 74–76, 78–80, 82–84, 95, 98, 100, 101, 104–109, 114, 115, 117, 118, 125, 126, 129, 136, 145, 154, 159, 166, 175, 176, 178–180, 186, 198–200, 209, 210, 215, 218–222, 225, 231
alignment methods 37, 51, 62, 69, 72, 76, 84, 105, 106
Analogy-based method 84, 117, 176, 177
AR 9, 140, 141, 143, 147–149, 171–174
AR systems 140, 141, 143, 147–149
ASR 1, 8, 150, 181, 183, 184, 187, 190, 213, 232–234, 236, 237
Assessment 53, 93, 163, 172, 181–183, 196, 197, 201
A-STAR 6, 8
Asymmetric lambda 92
Augmenting 142, 205, 216, 217
augmented reality 140
automatic speech recognition 1, 181, 213, 233
automatically-aligned 80

B
back-off 26, 29, 30
Batch-MIRA 20, 113, 125, 126, 136
Bayes rule 39, 227
Bayes' theorem 11
Berkeley Aligner 109
bidirectional 42, 107, 113, 115, 125, 136, 137, 152, 154, 159, 163, 199, 200, 215
Bidirectional reordering 42, 199
bilingual 10, 17, 19, 22, 48, 62, 64, 70, 73, 74, 79, 80, 83, 84, 109, 117–119, 126, 137, 143, 158, 159, 172, 176, 179, 180, 209, 218, 220, 226, 227, 230, 231
Bilingual Evaluation Understudy (BLEU) 48, 70, 137, 172, 226
Bing 98, 142
Bleualign 69, 70, 104–106
Bootstrapping 62, 219, 223, 225
brevity penalty 48, 50–52, 88, 137
BTEC 73, 80, 159–162, 164, 220, 221, 223

C
C++ 101
Character Edit Rate 101
Coefficient 52, 53, 93, 94, 182–184, 186, 191, 204
Cohesion 4
Combined 21, 32, 37, 53, 72, 125, 135, 145, 159, 183, 198, 209, 211, 214, 216, 226, 227, 229, 230, 232
Comma Rules 15
Comma Splice 14, 15
comparable articles 116, 180
Comparable corpora 1, 10, 17, 19, 59, 61, 63, 72, 83, 110, 115, 117, 118, 159, 162, 166,

M
MML 2, 116, 161, 162, 224, 231, 232
Model 4 32, 34–37, 220
Modified Moore-Lewis 2, 47, 113, 116, 117, 126, 161, 171, 226
Modified Moore-Lewis filtering 2, 113, 116, 117, 126, 161, 171, 226
MongoDB 74, 79
monolingual 17, 21, 25, 46, 85, 109, 116–119, 126, 158, 197, 208, 209, 211–215, 218, 230
monolingual language model 21, 117
monotone 41–43, 154, 199, 215
Moore 2, 47, 113, 116, 117, 126, 161, 171, 180, 193, 197, 198, 224, 226–228, 231
Moore-Levis 180, 224
Moore-Lewis 2, 47, 113, 116, 117, 126, 161, 171, 226, 228, 231
Moore-Lewis filtering 2, 113, 116, 117, 126, 161, 171, 226, 231
Moses 2, 10, 18–23, 25, 26, 30, 41, 106, 111, 115, 127, 130, 132, 136, 144, 145, 153, 198, 199, 210, 215, 221, 222, 223, 226, 231
Moses decoder 22, 26
Moses Statistical Machine Translation Toolkit 2
Moses tool 10, 18–20, 23, 111, 127, 145, 153, 199, 215
msd-bidirectional-fe 107, 113, 136, 137, 159, 199
MT evaluation 48, 50, 51, 53, 86, 89, 141, 193
Multeval 162

N
native speaker 12, 132, 196, 205–207
Native Yalign 59, 63, 115, 159, 161, 166
n-best 8, 113, 125, 126, 136, 137, 220, 221
Needleman-Wunsch 2, 64, 65, 67, 69, 75, 77, 80, 116, 165, 166, 168, 179, 180
Needleman-Wunsch algorithm 64, 67, 75, 80, 116, 165, 166, 168, 179, 180
NER 54, 181–192
NER accuracy 183
NER model 182, 183
Neural 149–157, 193, 194, 196, 200–204, 218, 225, 233, 239
Neural machine translation 149, 151, 154, 157, 200, 218, 239
neural network 149–157, 193, 194, 196, 200, 201, 203, 225, 233, 239
n-grams 11, 12, 22, 26–31, 48, 50–52, 87, 115, 137, 163, 178, 197, 214, 222, 226, 230, 235
N-gram language models 27, 29, 226
N-gram smoothing 28
NIST 48, 50–53, 89–92, 94, 102, 103, 106, 114, 116, 120, 121, 123–129, 137–139, 146, 155–157, 161, 162, 164, 166, 167, 169, 170, 176, 181, 187–192, 194, 196, 198, 202–204, 221–224, 236
NKT 53, 183
NLP 110, 111, 118, 208, 216, 233, 236
NLTK 100
Noisy 17, 31, 38, 39, 62, 63, 72, 74, 83, 95, 96, 103, 104, 115, 117, 147, 158, 174, 176, 177, 179, 216, 219, 220, 222, 225
Noisy channel model 31, 39
noisy-parallel 63, 117, 176, 177, 219
non-lexical expressions 23
non-terminals 40, 114
Noun phrases 13, 15
NP-complete complexity 39
NSR 53, 183
NW 65, 76, 78, 165–171, 179

O
OCR 9, 140, 142, 143, 145–148, 151
OOV 27, 28, 127, 131, 146, 158, 197–199, 210, 215, 232
OOV words 28, 127, 131, 146, 158, 197–199, 210, 215, 232
open-source Moses toolkit 19
OpenSubtitles 73, 110, 119, 120, 127, 220, 230, 236
operation sequence model (OSM) 22, 113, 114, 125, 126, 129, 136, 137, 158
OPUS 84, 109, 119, 152, 174, 235
orientation model 43
Out of vocabulary words 16, 27, 226, 230

P
parallel corpora 1, 4, 11, 17, 38, 47, 61–63, 72, 79, 81, 83, 86, 95, 96, 104, 106, 110, 111, 117, 131, 144, 159, 163, 177–179, 205, 211, 216, 218–220
parallel corpus 7, 10, 17, 18, 46, 73, 84–86, 89, 95, 103, 152, 177, 180, 196, 208, 219, 220, 229
Parallel data mining 2, 62, 65, 72, 83, 117, 175
parallel text 7, 10, 11, 17, 62, 63, 70, 81, 103, 131, 149, 179, 220, 225
parallel text corpora 103, 225
parallel threads 75, 77