Module 5 Notes
MODULE-5
CHAPTER-11
MACHINE TRANSLATION
11.1 INTRODUCTION
This chapter introduces Machine Translation (MT): the use of computers to translate from one
language to another.
Human translation (like translating literature or poetry) is extremely creative and complex.
Machine Translation, however, is mostly focused on practical tasks where full human creativity isn't
necessary.
The most common use of MT today is for information access.
For example:
Translating web instructions (like recipes, furniture steps)
Reading foreign news articles or government pages
Tools like Google Translate handle hundreds of billions of words every day across over 100
languages.
Another important use of MT is to assist human translators:
MT systems produce a rough draft, and then human translators fix it; this phase is called post-editing.
This process is often called Computer-Aided Translation (CAT).
CAT is commonly used for localization — adapting content (like apps, websites, products) for
different languages and cultures.
A newer application of MT is real-time communication:
Example: Translating speech on-the-fly during conversations (before a sentence is even finished).
Image-based translation (like using a phone camera to translate a menu or street sign) also falls here.
The standard model for MT is the encoder-decoder network (also called a sequence-to-sequence
network).
These networks can be built using RNNs (Recurrent Neural Networks) or Transformers.
Such architectures are used for tasks where the mapping between input and output sequences is complex, such as machine translation. For example, Japanese word order differs substantially from English: the verb comes at the end of the sentence, and postpositions appear where English uses prepositions.
• Some aspects of human language are universals (true for every language) or statistical
universals (true for most languages).
• These arise because language is used for communication.
• Examples:
• Every language has words for people, eating, drinking, being polite, etc.
• Structural universals: Most languages have nouns, verbs, ways to ask questions, give
commands, or show agreement/disagreement.
Despite universal features, languages differ a lot.
Translation divergences — differences between languages — are important to understand to
build better MT systems.
There are two types of differences:
Idiosyncratic differences:
Unique to each language.
Example: The word for "dog" is completely different in English, Spanish, Japanese, etc.
Systematic differences:
Patterns we can model across many languages.
Example: Some languages place the verb before the object, others place it after.
The study of these systematic similarities and differences between languages is called
linguistic typology.
Linguistic typology helps understand language structures and improve machine translation.
The World Atlas of Language Structures (WALS) is a resource that lists many such facts
about languages.
Languages differ in the basic word order of Subject, Verb, and Object: for example, English is an SVO (subject-verb-object) language, Japanese is SOV (the verb comes last), and Classical Arabic is VSO (the verb comes first).
Notice how in Japanese and Arabic, verb placement and preposition/postposition usage differ from English: Japanese has postpositions where English and Arabic have prepositions.
Other kinds of ordering preferences vary idiosyncratically from language to language. In some
SVO languages (like English and Mandarin) adjectives tend to appear before nouns, while in
other languages (like Spanish and Modern Hebrew) adjectives appear after the noun:
(11.4) Spanish: bruja verde (literally "witch green")
       English: green witch
We only have to make one slight change to turn this language model with autoregressive generation into a
translation model that can translate from a source text in one language to a target text in a second: add a
sentence separation marker at the end of the source text, and then simply concatenate the target text.
Fig. 11.4 shows an English source text (“the green witch arrived”), a sentence separator token (<s>),
and a Spanish target text (“llegó la bruja verde”).
To translate a source text, we run it through the network performing forward inference to generate
hidden states until we get to the end of the source.
Then we begin autoregressive generation, asking for a word in the context of the hidden layer from the
end of the source input as well as the end-of-sentence marker. Subsequent words are conditioned on
the previous hidden state and the embedding for the last word generated.
While our simplified figure shows only a single network layer for the encoder, stacked architectures
are the norm, where the output states from the top layer of the stack are taken as the final
representation.
A widely used encoder design makes use of stacked biLSTMs where the hidden states from top
layers from the forward and backward passes are concatenated as described to provide the
contextualized representations for each time step.
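Below is a minimal sketch of this encoder-decoder setup in PyTorch. It is an illustration under assumptions, not the notes' reference implementation: the class name Seq2Seq, the single-layer unidirectional LSTMs (rather than the stacked biLSTMs described above), and all sizes and token ids are made up for the example.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, emb=64, hid=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb)
            self.encoder = nn.LSTM(emb, hid, batch_first=True)
            self.decoder = nn.LSTM(emb, hid, batch_first=True)
            self.out = nn.Linear(hid, tgt_vocab)

        def forward(self, src, tgt_in):
            # Forward inference over the source; keep only the final (h, c) state.
            _, state = self.encoder(self.src_emb(src))
            # The decoder is initialized with the encoder's final state.
            dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
            return self.out(dec_out)   # logits over the target vocabulary

    def greedy_translate(model, src, sep_id, eos_id, max_len=20):
        # Autoregressive generation: each step is conditioned on the previous
        # hidden state and the embedding of the last generated word.
        with torch.no_grad():
            _, state = model.encoder(model.src_emb(src))
            word = torch.tensor([[sep_id]])   # begin at the separator token <s>
            result = []
            for _ in range(max_len):
                dec_out, state = model.decoder(model.tgt_emb(word), state)
                word = model.out(dec_out).argmax(-1)   # most probable next word
                if word.item() == eos_id:
                    break
                result.append(word.item())
            return result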
Encoder-Decoder Architecture
Encoder
Takes the input sentence: “the green witch arrived”
Each word is converted to embeddings and passed through RNN layers to produce hidden
states.
Decoder
Generates the translated sentence: “llegó la bruja verde”
Teacher Forcing:
During training, instead of feeding the decoder's own predicted output at each time step, we
force the decoder to use the correct (gold) word from the training data as input for the
next step.
This helps the model learn faster and prevents it from compounding errors.
We compare this predicted distribution with the true word using cross-entropy loss; at each time step the loss is the negative log probability the model assigns to the gold next word:
L_CE = −log p(w_{t+1} | w_1, ..., w_t, source)
To summarize, at each decoder step:
The true previous word (not the model’s guess) is fed in (teacher forcing).
The model predicts a distribution over possible next words.
Cross-entropy loss is computed from the softmax outputs against the gold target.
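The sketch below shows one teacher-forced training step for the Seq2Seq model sketched earlier; the function train_step and the batch-tensor layout are illustrative assumptions.

    import torch
    import torch.nn as nn

    loss_fn = nn.CrossEntropyLoss()   # softmax + negative log likelihood

    def train_step(model, optimizer, src, gold, sep_id):
        # Decoder input is <s> followed by the GOLD words (teacher forcing),
        # never the model's own previous predictions.
        sep = torch.full((gold.size(0), 1), sep_id, dtype=torch.long)
        tgt_in = torch.cat([sep, gold[:, :-1]], dim=1)
        logits = model(src, tgt_in)   # [batch, time, vocab]
        # Cross-entropy compares each predicted distribution with the true word.
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), gold.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()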
ATTENTION
The encoder processes the entire input sentence and condenses all its information into a
single context vector (final hidden state).
The decoder relies only on this context to generate all words in the output.
Limitation: This single vector can be a bottleneck, especially for long or complex sentences.
The attention mechanism allows the decoder to look at all encoder hidden states while
generating each word.
Instead of using only the encoder’s final state, the decoder computes a fresh context vector at each output step: it scores the similarity (for example, a dot product) between its current hidden state and every encoder hidden state, turns those scores into weights with a softmax, and takes the weighted sum of the encoder states as the context.
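A minimal numpy sketch of one such dot-product attention step follows; the function name and shapes are illustrative.

    import numpy as np

    def attention_context(dec_state, enc_states):
        # dec_state: (d,) current decoder hidden state
        # enc_states: (T, d) all encoder hidden states
        scores = enc_states @ dec_state            # dot-product similarities, (T,)
        weights = np.exp(scores - scores.max())    # numerically stable softmax
        weights /= weights.sum()
        return weights @ enc_states, weights       # context vector (d,), weights

    # Example: 4 source positions, hidden size 3; weights sum to 1.0
    ctx, alpha = attention_context(np.random.randn(3), np.random.randn(4, 3))
    print(alpha.sum())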
MT models are trained on parallel corpora (bitexts), but for many language pairs and domains such data is scarce. We briefly introduce two commonly used approaches for dealing with this data sparsity: back translation, which is a special case of the general statistical technique called data augmentation, and multilingual models; we also discuss some socio-technical issues.
DATA AUGMENTATION
Data augmentation is a statistical technique for dealing with insufficient training data, by adding
new synthetic data that is generated from the current natural data.
The most common data augmentation technique for machine translation is called back translation.
Back translation relies on the intuition that while parallel corpora may be limited for particular
languages or domains, we can often find a large (or at least larger) monolingual corpus, to add to
the smaller parallel corpora that are available. The algorithm makes use of monolingual corpora in
the target language to generate synthetic bitexts.
In back translation, our goal is to improve source-to-target MT, given a small parallel text (a bitext)
in the source/target languages, and some monolingual data in the target language. We first use the
bitext to train an MT system in the reverse direction: a target-to-source MT system. We then use it to
translate the monolingual target data to the source language. Now we can add this synthetic bitext
(natural target sentences, aligned with MT-produced source sentences) to our training data, and
retrain our source-to-target MT model. For example, suppose we want to translate from Navajo to
English but only have a small Navajo-English bitext, although of course we can find lots of
monolingual English data. We use the small bitext to build an MT engine going the other way (from
English to Navajo). Once we translate the monolingual English text to Navajo, we can add this
synthetic Navajo/English bitext to our training data.
Back translation has various parameters. One is how we generate the back translations; for example,
we can decode in greedy inference, or use beam search. Another parameter is the ratio of back
translated data to natural bitext data; we can oversample the bitext data (include multiple copies of
each sentence).
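A high-level sketch of this recipe follows. train_mt and translate are hypothetical stand-ins for a real MT toolkit; only the data flow is the point.

    def back_translation(bitext, mono_target, oversample=2):
        # bitext: list of (source, target) sentence pairs (small)
        # mono_target: list of target-language sentences (large)

        # 1. Train a REVERSE (target -> source) system on the small bitext.
        reverse_pairs = [(tgt, src) for (src, tgt) in bitext]
        reverse_mt = train_mt(reverse_pairs)

        # 2. Translate the monolingual target data into the source language
        #    (greedy inference here; beam search is the other common choice).
        synthetic = [(translate(reverse_mt, tgt), tgt) for tgt in mono_target]

        # 3. Retrain source -> target on natural plus synthetic data,
        #    oversampling the natural bitext so it is not swamped.
        return train_mt(bitext * oversample + synthetic)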
Multilingual Models
Instead of training a separate model for each language pair, we can train a single multilingual model on many language pairs at once, marking each training example with a special token indicating which target language to produce. Low-resource language pairs can then benefit from parameters shared with high-resource pairs.
MT EVALUATION
Translations are evaluated along two dimensions:
1. Adequacy:
How well the translation captures the exact meaning of the source sentence.
Sometimes called faithfulness or fidelity.
2. Fluency:
How fluent the translation is in the target language (is it grammatical, clear, readable,
natural).
Using humans to evaluate is most accurate, but automatic metrics are also used for convenience.
The most accurate evaluations use human raters, such as online crowdworkers, to evaluate each
translation along the two dimensions. For example, along the dimension of fluency, we can ask how
intelligible, how clear, how readable, or how natural the MT output (the target text) is. We can give
the raters a scale, for example a 5-point scale from 1 (totally unintelligible) to 5 (completely fluent).
We can do the same thing to judge the second dimension, adequacy, using raters to assign
scores on a scale. If we have bilingual raters, we can give them the source sentence and a
proposed target sentence, and rate, on a 5-point or 100-point scale, how much of the
information in the source was preserved in the target.
If we only have monolingual raters but we have a good human translation of the source text,
we can give the monolingual raters the human reference translation and a target machine
translation and again rate how much information is preserved. If we use a fine-grained enough
scale, we can normalize raters by subtracting the mean from their scores and dividing by the
variance.
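As a tiny illustration of that per-rater normalization (here z-scoring: subtracting the rater's mean and dividing by the standard deviation, the usual way to make scores comparable):

    import numpy as np

    scores = np.array([3.0, 4.0, 5.0, 2.0, 4.0])   # one rater's adequacy ratings
    normalized = (scores - scores.mean()) / scores.std()
    print(normalized)   # now comparable across raters with different habits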
An alternative is to do ranking: give the raters a pair of candidate translations, and ask them
which one they prefer.
Automatic Evaluation
While humans produce the best evaluations of machine translation output, running a human
evaluation can be time consuming and expensive. For this reason automatic metrics are often used as
temporary proxies. Automatic metrics are less accurate than human evaluation, but can help test
potential system improvements, and even be used as an automatic loss function for training. In this
section we introduce two families of such metrics, those based on character- or word-overlap and
those based on embedding similarity.
chrF (character F-score). Suppose we are given a human reference translation and the MT hypothesis
translation we’d like to evaluate. The chrF metric ranks each MT target sentence by a function of the
number of character n-gram overlaps with the human translation.
Given the hypothesis and the reference, chrF is given a parameter k indicating the length of character
n-grams to be considered, and computes the average of the k precisions (unigram precision, bigram,
and so on) and the average of the k recalls (unigram recall, bigram recall, etc.).
Intuition for chrP (precision): how precisely did the system generate correct content?
“Out of what I generated, how much was right?”
Intuition for chrR (recall): how much of the correct answer did the system actually generate?
“Out of the true answer, how much did I get?”
The metric then computes an F-score by combining chrP and chrR using a weighting parameter β.
It is common to set β = 2, thus weighing recall twice as much as precision:
chrFβ = (1 + β²) · (chrP · chrR) / (β² · chrP + chrR)
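A simplified sketch of this computation follows (real chrF implementations, such as the one in sacrebleu, additionally handle whitespace and smoothing; this shows only the core idea):

    from collections import Counter

    def ngrams(text, n):
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def chrf(hyp, ref, k=4, beta=2.0):
        precisions, recalls = [], []
        for n in range(1, k + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            overlap = sum((h & r).values())          # matching n-gram counts
            precisions.append(overlap / max(sum(h.values()), 1))
            recalls.append(overlap / max(sum(r.values()), 1))
        chrP, chrR = sum(precisions) / k, sum(recalls) / k
        if chrP == 0 and chrR == 0:
            return 0.0
        # beta = 2 weighs recall twice as much as precision
        return (1 + beta**2) * chrP * chrR / (beta**2 * chrP + chrR)

    print(chrf("llego la bruja verde", "llegó la bruja verde"))  # high, not 1.0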
BLEU works by comparing the output of the machine (the candidate translation) with a human
reference translation.
BLEU is word-based: it checks how many n-grams (word sequences like "the cat", "cat sat",
etc.) from the system’s translation are found in the reference.
It calculates precision: out of what the system generated, how much matched?
Uses up to 4-gram precision.
Has a brevity penalty: penalizes translations that are too short.
It requires careful tokenization (splitting text into words correctly), otherwise it gives
unreliable results.
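A simplified sketch of the BLEU idea follows (geometric mean of 1- to 4-gram word precisions times a brevity penalty; real BLEU clips counts against multiple references and applies smoothing, so use a library like sacrebleu in practice):

    import math
    from collections import Counter

    def word_ngrams(words, n):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def bleu(hyp, ref, max_n=4):
        hyp, ref = hyp.split(), ref.split()
        log_prec = 0.0
        for n in range(1, max_n + 1):
            h, r = word_ngrams(hyp, n), word_ngrams(ref, n)
            overlap = sum((h & r).values())              # matched n-grams
            log_prec += math.log(max(overlap, 1e-9) / max(sum(h.values()), 1))
        # Brevity penalty: punish hypotheses shorter than the reference.
        bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
        return bp * math.exp(log_prec / max_n)

    print(bleu("the green witch arrived", "the green witch arrived"))  # 1.0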
Imagine system A gives chrF = 0.67 and system B gives chrF = 0.66.
Is A really better than B, or is the difference just random variation? A standard way to check (not named in these notes) is a paired bootstrap test over the test sentences, sketched below.
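A minimal sketch, assuming we already have per-sentence metric scores (e.g., chrF) for both systems on the same test set:

    import random

    def paired_bootstrap(scores_a, scores_b, trials=1000):
        # scores_a / scores_b: per-sentence metric scores for systems A and B
        n, wins = len(scores_a), 0
        for _ in range(trials):
            idx = [random.randrange(n) for _ in range(n)]   # resample sentences
            if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
                wins += 1
        return wins / trials   # fraction of resamples where A beats B

    # A value close to 1.0 suggests A's advantage is real rather than noise.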
🔹 chrF is “local”:
It works at the character level, so small changes in word order or long phrases being moved
may not affect the score, even if the meaning changes.
But what if the translation uses a synonym or paraphrase? That might still be a good translation,
but chrF will penalize it.
🔹 Step-by-step:
1. Let:
o x = (x₁, ..., xₙ) be the reference translation
o x̃ = (x̃₁, ..., x̃ₘ) be the candidate (machine) translation
o r be a human quality rating for how close x̃ is to x
2. Metrics like COMET and BLEURT:
o Train a model (often based on BERT) to predict human ratings.
o They pass both x and x̃ through BERT, use the embeddings, and fine-tune on human-judged examples.
o A final layer predicts how good the translation is.
3. Metrics like BERTScore:
o No human labels required.
o They compute semantic similarity between words or tokens using cosine similarity
of BERT embeddings.
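A BERTScore-style sketch with stand-in embeddings follows: each candidate token is greedily matched to its most similar reference token by cosine similarity. Real BERTScore uses contextual BERT embeddings (the bert-score package); the random vectors here are placeholders.

    import numpy as np

    def bertscore_precision(cand_emb, ref_emb):
        # cand_emb: (m, d) candidate token embeddings; ref_emb: (n, d)
        c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
        r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
        sim = c @ r.T                      # (m, n) cosine similarities
        return sim.max(axis=1).mean()      # best reference match per token

    # Toy example with random stand-in embeddings:
    print(bertscore_precision(np.random.randn(5, 8), np.random.randn(6, 8)))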
Some languages (Hungarian, for example) have gender-neutral pronouns. But when translating into
English, which requires a gendered pronoun (he/she), MT systems must choose a gender.
Female gender is assigned for nurturing roles (e.g. nurse, wedding organizer).
Male gender is assigned for high-status roles (e.g. CEO, scientist, engineer).
This shows stereotypical gender mapping in translations — nurse = she, CEO = he — even
when no gender is specified in the source language.
Evidence of Amplification
MT systems perform worse when the gender roles in the source don’t match stereotypes; rather than merely reflecting biases in the training data, they amplify them.
Ethical Takeaway: MT systems can encode and amplify societal biases from their training data, so measuring and mitigating such biases is an important part of building and deploying translation systems.
********END OF MODULE-5******