Module 5 Notes

This chapter discusses Machine Translation (MT), focusing on its practical applications such as information access and assisting human translators through Computer-Aided Translation (CAT). It covers the encoder-decoder model used in MT, highlighting the importance of understanding language divergences and typology for effective translation. Additionally, it addresses challenges in low-resource situations and introduces techniques like back translation and multilingual models to enhance MT performance.


NATURAL LANGUAGE PROCESSING BAI601/BAD613B

MODULE-5
CHAPTER-11
MACHINE TRANSLATION
11.1 INTRODUCTION

This chapter introduces Machine Translation (MT): the use of computers to translate from one
language to another.
Human translation (like translating literature or poetry) is extremely creative and complex.
Machine Translation, however, is mostly focused on practical tasks where full human creativity isn't
necessary.
The most common use of MT today is for information access.
For example:
 Translating web instructions (like recipes, furniture steps)
 Reading foreign news articles or government pages
 Tools like Google Translate handle hundreds of billions of words every day across over 100
languages.
Another important use of MT is to assist human translators:

MT systems produce a rough draft, and then human translators fix it — this phase is called post-
editing.
This process is often called Computer-Aided Translation (CAT).

CAT is commonly used for localization — adapting content (like apps, websites, products) for
different languages and cultures.
A newer application of MT is real-time communication:
Example: Translating speech on-the-fly during conversations (before a sentence is even finished).
Image-based translation (like using a phone camera to translate a menu or street sign) also falls here.

The standard model for MT is the encoder-decoder network (also called a sequence-to-sequence
network).
These networks can be built using RNNs (Recurrent Neural Networks) or Transformers.
Such architectures are used for tasks like:

 Classification (e.g., predicting sentiment as positive/negative)


 Sequence labeling (e.g., tagging each word with its part of speech)
In sequence labeling, each input word xi is associated with an output tag yi (e.g., a part of speech
like noun, verb, or adjective).
In machine translation, you can’t just map words one-by-one between languages because word order
and sentence structure can be very different across languages.

 Example 1: English to Japanese


NATURAL LANGUGAE PROCESSING BAI601/BAD613B

English: He wrote a letter to a friend


Japanese: tomodachi ni tegami-o kaita (friend to letter wrote)

➔ In Japanese:

The verb ("wrote") comes at the end, not the middle.


Subjects like "he" can be dropped (no explicit "he" in Japanese).
This shows how MT must learn to rearrange and adjust when translating between languages.

 Example 2: English to Chinese (more complex)


Given a real United Nations sentence translated from Chinese to English, you can see:

 The order of words is very different.


 Chinese combines ideas that English splits into separate phrases.
 No grammatical plural ("-s") in Chinese; so a special word (like 各项/various) is used.
 Dates are written differently (year/month/day).
 Articles like "the" are needed in English but not in Chinese.
 English adds connecting words ("in which", "it") that are unnecessary in Chinese.
 Capitalization differs.
 Machine translation needs to handle these structural differences, not just word-by-word
mapping.
11.2 LANGUAGE DIVERGENCES AND TYPOLOGY

• Some aspects of all human languages are universal (true for every language) or statistical
universals (true for most languages).
• These arise because language is used for communication.
• Examples:
• Every language has words for people, eating, drinking, being polite, etc.
• Structural universals: Most languages have nouns, verbs, ways to ask questions, give
commands, or show agreement/disagreement.
 Despite universal features, languages differ a lot.
 Translation divergences — differences between languages — are important to understand to
build better MT systems.
 There are two types of differences:
 Idiosyncratic differences:
 Unique to each language.
 Example: The word for "dog" is completely different in English, Spanish, Japanese, etc.
 Systematic differences:
 Patterns we can model across many languages.
 Example: Some languages place the verb before the object, others place it after.
 The study of these systematic similarities and differences between languages is called
linguistic typology.
 Linguistic typology helps understand language structures and improve machine translation.
 The World Atlas of Language Structures (WALS) is a resource that lists many such facts
about languages.

11.2.1 WORD ORDER TYPOLOGY

Languages differ in the basic word order of Subject (S), Verb (V), and Object (O):

 English, French, and Mandarin are SVO languages (verb between subject and object); Hindi and
Japanese are SOV (verb at the end); Irish and Arabic are VSO (verb first).
 Notice how in Japanese and Arabic, verb placement and preposition/postposition usage differ
from English: VO languages like English tend to use prepositions, while OV languages like
Japanese tend to use postpositions.
 Other kinds of ordering preferences vary idiosyncratically from language to language. In some
SVO languages (like English and Mandarin) adjectives tend to appear before nouns, while in
other languages like Spanish and Modern Hebrew, adjectives appear after the noun:

(11.4) Spanish: bruja verde    English: green witch

11.2.2 LEXICAL DIVERGENCES


 Lexical divergence refers to how different languages represent the same concepts using different
words or structures, depending on context, culture, or grammar.
 Translation between languages is not always straightforward due to these differences.
 Contextual word differences: words like “bass” in English can refer to either a fish or a musical
instrument.
 In Spanish, these are translated as lubina (fish) and bajo (instrument).
 Similarly, “wall” in English can be Wand in German (interior wall) or Mauer in German
(exterior wall).
 English uses “brother” for all male siblings, but: Mandarin distinguishes between gege (older
brother) and didi (younger brother).
 These examples highlight the need for disambiguation, making Word Sense Disambiguation
important in translation and NLP (Natural Language Processing).

 Grammatical constraints: English marks nouns as singular or plural; Mandarin does not.
 French and Spanish require gender agreement on adjectives, which English does not.
 This grammatical divergence adds complexity to accurate translation.
 Languages divide conceptual spaces differently.
 Example: translating “leg” from English to French. “Leg” in English could be:
o Jambe (leg of a person/animal)
o Étape (stage/leg of a journey)
o Pied (leg/foot of a chair)
o Patte (animal’s leg/paw)
 Thus, the same word in English could map to multiple, more specific words in French.
 This is shown in Figure 11.2 as overlapping ovals for words like leg, paw, foot and their French
equivalents.

11.2.3 MORPHOLOGICAL TYPOLOGY


Morphological typology classifies languages based on how they use morphemes (the smallest units
of meaning) in words.
It focuses on two dimensions:
1. Number of morphemes per word:
 Isolating languages (e.g., Vietnamese, Cantonese): Each word usually consists of a
single morpheme.
 Polysynthetic languages (e.g., Siberian Yupik “Eskimo”): Words may contain many
morphemes, equating to full sentences in English.
Example: a meaning English expresses as a whole sentence, such as “The bottle floated out,”
may be packed into a single polysynthetic word.
2. Morpheme segmentation:
 Agglutinative languages (e.g., Turkish):
Morphemes are clearly separable with distinct boundaries.
 Fusion languages (e.g., Russian):

A single affix may express multiple grammatical categories.


Example: "-om" in the Russian word stolom fuses instrumental case, singular number, and first
declension.
Referential Density
Referential density refers to how languages handle the expression of pronouns and references:
English requires explicit pronouns to refer back to known subjects.
Pro-drop languages (e.g., Spanish, Chinese, Japanese):
Often omit pronouns when the context is clear.
Example (Spanish):
[El jefe] dio con un libro. Ø Mostró a un descifrador ambulante.
[The boss] came upon a book. [He] showed it to a wandering decoder.
In Spanish, “Ø” marks the omitted subject pronoun (e.g., he).

Hot vs. Cold Languages:


Hot languages (e.g., English): Use explicit cues → easier for the listener.
Cold languages (e.g., Chinese, Japanese): Leave out cues → listener does more inference to
understand.
This classification was inspired by McLuhan's media theory:
Hot media = fully detailed (like movies).
Cold media = minimal detail (like comics), require more interpretation.

11.3 ENCODER-DECODER MODEL

Encoder-decoder networks, or sequence-to-sequence networks, are models capable of generating


contextually appropriate, arbitrary length, output sequences.
Encoder-decoder networks have been applied to a very wide range of applications including machine
translation, summarization, question answering, and dialogue.
The key idea underlying these networks is the use of an encoder network that takes an input sequence
and creates a contextualized representation of it, often called the context.
This representation is then passed to a decoder which generates a task specific output sequence.

Encoder-decoder networks consist of three components:


1. An encoder that accepts an input sequence, x1, ..., xn, and generates a corresponding sequence of
contextualized representations, h1, ..., hn. LSTMs, GRUs, convolutional networks, and Transformers
can all be employed as encoders.
2. A context vector, c, which is a function of h1, ..., hn, and conveys the essence of the input to the
decoder.
3. A decoder, which accepts c as input and generates an arbitrary length sequence of hidden states
h1, ..., hm, from which a corresponding sequence of output states y1, ..., ym can be obtained. Just as with
encoders, decoders can be realized by any kind of sequence architecture.
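As a concrete (if simplified) illustration, here is a minimal PyTorch sketch of these three components. The LSTM choice, the layer sizes, and using the encoder's final state directly as the context c are assumptions of this sketch; attention (covered later) improves on it:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: LSTM encoder, LSTM decoder, softmax output."""
    def __init__(self, src_vocab, tgt_vocab, d_emb=256, d_hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.decoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, tgt_vocab)   # scores over target vocabulary

    def forward(self, src, tgt_in):
        # 1. Encoder: produce contextualized representations h1..hn;
        #    its final (hidden, cell) state serves as the context c.
        _, context = self.encoder(self.src_emb(src))
        # 2./3. Decoder: initialized with c, unrolled over the target prefix,
        #    yielding hidden states h1..hm and output logits for y1..ym.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), context)
        return self.out(dec_states)              # (batch, tgt_len, tgt_vocab)
```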

11.3.1 The Encoder-Decoder with RNN


• Like any language model, we can break down the probability as follows:
• p(y) = p(y1) p(y2|y1) p(y3|y1, y2) ... p(ym|y1, ..., ym−1)
• At a particular time t, we pass the prefix of t − 1 tokens through the language model, using
forward inference to produce a sequence of hidden states, ending with the hidden state
corresponding to the last word of the prefix. We then use the final hidden state of the prefix
as our starting point to generate the next token.

We only have to make one slight change to turn this language model with autoregressive generation into a
translation model that can translate from a source text in one language to a target text in a second: add a
sentence separation marker at the end of the source text, and then simply concatenate the target text.

Fig. 11.4 shows an English source text (“the green witch arrived”), a sentence separator token (<s>),
and a Spanish target text (“llegó la bruja verde”).
To translate a source text, we run it through the network performing forward inference to generate
hidden states until we get to the end of the source.
Then we begin autoregressive generation, asking for a word in the context of the hidden layer from the
end of the source input as well as the end-of-sentence marker. Subsequent words are conditioned on
the previous hidden state and the embedding for the last word generated.

While our simplified figure shows only a single network layer for the encoder, stacked architectures
are the norm, where the output states from the top layer of the stack are taken as the final
representation.
A widely used encoder design makes use of stacked biLSTMs, where the hidden states from the
top layers of the forward and backward passes are concatenated to provide the contextualized
representations for each time step.

11.3.2 Training the Encoder-Decoder Model

1. Encoder-Decoder Architecture

 Encoder
Takes the input sentence: “the green witch arrived”
Each word is converted to embeddings and passed through RNN layers to produce hidden
states.
 Decoder
Generates the translated sentence: “llegó la bruja verde”

2. Training with Teacher Forcing

 Teacher Forcing:
During training, instead of feeding the decoder's own predicted output at each time step, we
force the decoder to use the correct (gold) word from the training data as input for the
next step.

This helps the model learn faster and prevents it from compounding errors.

3. Softmax & Loss Calculation

 At each decoder time step:

o The model outputs a distribution over the vocabulary using a softmax layer.

o We compare this predicted distribution with the true word using cross-entropy
loss, i.e. the negative log of the probability the model assigns to the gold word yt:

LCE(t) = −log ŷt[yt]

These losses are then averaged across all m time steps of the target sentence:

L = (1/m) Σ LCE(t), summed over t = 1, ..., m

 Input sentence goes into the encoder.


 The encoder outputs hidden states.
 Decoder receives <s> (start token) and predicts the first word.
 At each time step:

 The true previous word (not the model’s guess) is fed (teacher forcing).
 Model predicts a distribution over possible next words.
 Cross-entropy loss is computed using softmax outputs vs. gold target.

 Average loss over all words is minimized during training.
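To make this concrete, here is a minimal PyTorch sketch of a single training step with teacher forcing, reusing the Seq2Seq sketch above; the tensor layout and the shifted tgt_in/tgt_out split are assumptions of the sketch:

```python
import torch.nn.functional as F

def train_step(model, src, tgt_in, tgt_out, optimizer):
    # src:     (batch, src_len)  source token ids
    # tgt_in:  (batch, tgt_len)  gold target shifted right, starting with <s>
    #                            (teacher forcing: decoder sees gold words)
    # tgt_out: (batch, tgt_len)  gold words the decoder must predict
    optimizer.zero_grad()
    logits = model(src, tgt_in)                   # (batch, tgt_len, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_out.reshape(-1))   # mean of -log p(gold word)
    loss.backward()
    optimizer.step()
    return loss.item()
```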

ATTENTION

In a basic encoder-decoder model:

 The encoder processes the entire input sentence and condenses all its information into a
single context vector (final hidden state).
 The decoder relies only on this context to generate all words in the output.

Limitation: This single vector can be a bottleneck, especially for long or complex sentences.

What Attention Does:

The attention mechanism allows the decoder to look at all encoder hidden states while
generating each word.
Instead of a single fixed context vector taken from the end of the source, the decoder computes a
new context vector ci at each output time step i: a weighted sum of all the encoder hidden states,
with weights that reflect how relevant each source position is to the word being generated right now.
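Below is a minimal sketch of this computation, assuming simple dot-product scoring between the current decoder state and each encoder state (real systems also use learned bilinear or MLP scoring):

```python
import torch
import torch.nn.functional as F

def dot_product_attention(dec_state, enc_states):
    # dec_state:  (d,)    current decoder hidden state
    # enc_states: (n, d)  all encoder hidden states h1..hn
    scores = enc_states @ dec_state        # relevance of each source position
    alphas = F.softmax(scores, dim=0)      # attention weights (sum to 1)
    context = alphas @ enc_states          # c_i: weighted sum of encoder states
    return context, alphas
```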

11.4 TRANSLATING IN LOW-RESOURCE SITUATIONS

For many language pairs and domains, large parallel training corpora are simply not available. We
briefly introduce two commonly used approaches for dealing with this data sparsity: back translation,
which is a special case of the general statistical technique called data augmentation, and multilingual
models. We also discuss some socio-technical issues.

DATA AUGMENTATION

Data augmentation is a statistical technique for dealing with insufficient training data, by adding
new synthetic data that is generated from the current natural data.

The most common data augmentation technique for machine translation is called back translation.
Back translation relies on the intuition that while parallel corpora may be limited for particular
languages or domains, we can often find a large (or at least larger) monolingual corpus, to add to
the smaller parallel corpora that are available. The algorithm makes use of monolingual corpora in
the target language to generate synthetic bitexts.

In back translation, our goal is to improve source-to-target MT, given a small parallel text (a bitext)
in the source/target languages, and some monolingual data in the target language. We first use the
bitext to train a MT system in the reverse direction: a target-to-source MT system. We then use it to
translate the monolingual target data to the source language. Now we can add this synthetic bitext
(natural target sentences, aligned with MT-produced source sentences) to our training data, and
retrain our source-to-target MT model. For example, suppose we want to translate from Navajo to
English but only have a small Navajo-English bitext, although of course we can find lots of
monolingual English data. We use the small bitext to build an MT engine going the other way (from
English to Navajo). Once we translate the monolingual English text to Navajo, we can add this
synthetic Navajo/English bitext to our training data.

Back translation has various parameters. One is how we generate the back translations; for example,
we can decode in greedy inference, or use beam search. Another parameter is the ratio of back
translated data to natural bitext data; we can oversample the bitext data (include multiple copies of
each sentence).
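
The recipe fits in a few lines of schematic Python. The train() and translate() helpers here are hypothetical stand-ins for an actual MT toolkit, and the 3x oversampling is just one possible setting:

```python
def back_translation(bitext, mono_target, oversample=3):
    # bitext:      list of (source, target) sentence pairs (small)
    # mono_target: list of target-language sentences (large, monolingual)
    reverse_mt = train([(tgt, src) for (src, tgt) in bitext])  # target -> source
    synthetic = [(reverse_mt.translate(tgt), tgt) for tgt in mono_target]
    # Oversample the natural bitext so the synthetic data does not swamp it.
    return train(bitext * oversample + synthetic)              # source -> target
```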

Multilingual Models

The other common way to supplement small bitexts is to train one multilingual model on data from
many language pairs at once, instead of a separate model per pair. Low-resource pairs then benefit
from transfer: parameters learned from high-resource languages help translate related low-resource
languages. A common design marks each training pair with a special token naming the target
language, so a single model can translate into many languages.
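A minimal sketch of how training data can be formatted for one such model, using target-language tokens in the style of Johnson et al. (2017); the token names and sentence pairs here are illustrative:

```python
# One shared model, many directions: a special token tells the model
# which target language to produce.
train_pairs = [
    ("<2es> the green witch arrived", "llegó la bruja verde"),
    ("<2fr> the green witch arrived", "la sorcière verte est arrivée"),
]
# Pairs for all language directions are mixed into one training set, so
# low-resource pairs share parameters with high-resource ones.
```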

MT EVALUATION
Translations are evaluated along two dimensions:

1. Adequacy:
How well the translation captures the exact meaning of the source sentence.
Sometimes called faithfulness or fidelity.
2. Fluency:
How fluent the translation is in the target language (is it grammatical, clear, readable,
natural).

Using humans to evaluate is most accurate, but automatic metrics are also used for convenience.

Using Human Raters to Evaluate MT

The most accurate evaluations use human raters, such as online crowdworkers, to evaluate each
translation along the two dimensions. For example, along the dimension of fluency, we can ask how
intelligible, how clear, how readable, or how natural the MT output (the target text) is. We can give
the raters a scale, for example:

 from 1 (totally unintelligible) to 5 (totally intelligible),


 or 1 to 100,

and ask them to rate each sentence or paragraph of the MT output.

 We can do the same thing to judge the second dimension, adequacy, using raters to assign
scores on a scale. If we have bilingual raters, we can give them the source sentence and a
proposed target sentence, and rate, on a 5-point or 100-point scale, how much of the
information in the source was preserved in the target.
 If we only have monolingual raters but we have a good human translation of the source text,
we can give the monolingual raters the human reference translation and a target machine
translation and again rate how much information is preserved. If we use a fine-grained enough
scale, we can normalize raters by subtracting the mean from their scores and dividing by the
standard deviation.
 An alternative is to do ranking: give the raters a pair of candidate translations, and ask them
which one they prefer.

Automatic Evaluation

While humans produce the best evaluations of machine translation output, running a human
evaluation can be time consuming and expensive. For this reason automatic metrics are often used as
temporary proxies. Automatic metrics are less accurate than human evaluation, but can help test
potential system improvements, and even be used as an automatic loss function for training. In this
section we introduce two families of such metrics, those based on character- or word-overlap and
those based on embedding similarity.

Automatic Evaluation by Character Overlap: chrF


The simplest and most robust metric for MT evaluation is called chrF, which stands for
character F-score (Popović, 2015). chrF (along with many other earlier related metrics like BLEU,
METEOR, TER, and others) is based on a simple intuition derived from the pioneering work of
Miller and Beebe-Center (1956): a good machine translation will tend to contain characters and
words that occur in a human translation of the same sentence. Consider a test set from a parallel
corpus, in which each source sentence has both a gold human target translation and a candidate MT
translation we’d like to evaluate. The chrF metric ranks each MT target sentence by a function of the
number of character n-gram overlaps with the human translation.
Given the hypothesis and the reference, chrF is given a parameter k indicating the length of character
n-grams to be considered, and computes the average of the k precisions (unigram precision, bigram,
and so on) and the average of the k recalls (unigram recall, bigram recall, etc.).

chrP: Character-level Precision


 Definition: Measures how many character n-grams (like unigrams and bigrams) in the
hypothesis are correct (i.e. found in the reference).
 Formula: the percentage of character n-grams in the hypothesis that also occur in the
reference, averaged over n = 1, ..., k:

chrP = (1/k) [P1 + P2 + ... + Pk], where Pn = (matching character n-grams) / (total
character n-grams in the hypothesis)
 Intuition:
How precisely did the system generate correct content?
“Out of what I generated, how much was right?”

chrR: Character-level Recall


 Definition: Measures how many character n-grams in the reference are correctly predicted
by the hypothesis.
 Formula: the percentage of character n-grams in the reference that also occur in the
hypothesis, averaged over n = 1, ..., k:

chrR = (1/k) [R1 + R2 + ... + Rk], where Rn = (matching character n-grams) / (total
character n-grams in the reference)
Intuition:
How much of the correct answer did the system actually generate?
“Out of the true answer, how much did I get?”

The metric then computes an F-score by combining chrP and chrR using a weighting parameter β.
It is common to set β = 2, thus weighing recall twice as much as precision:

chrFβ = (1 + β²) · chrP · chrR / (β² · chrP + chrR)
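A self-contained sketch of these formulas; the official chrF implementation adds options this sketch omits (e.g., whitespace handling, and the word n-grams of chrF++):

```python
from collections import Counter

def char_ngrams(text, n):
    s = text.replace(" ", "")                 # chrF ignores spaces
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(reference, hypothesis, k=3, beta=2.0):
    precisions, recalls = [], []
    for n in range(1, k + 1):
        ref, hyp = char_ngrams(reference, n), char_ngrams(hypothesis, n)
        overlap = sum((ref & hyp).values())   # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    chrP, chrR = sum(precisions) / k, sum(recalls) / k
    if chrP + chrR == 0:
        return 0.0
    return (1 + beta**2) * chrP * chrR / (beta**2 * chrP + chrR)

print(chrf("llegó la bruja verde", "llegó la bruja verde"))  # 1.0
```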

Alternative overlap metric: BLEU


Before chrF was invented, BLEU was the most common metric for machine translation (MT)
evaluation. BLEU stands for:

BiLingual Evaluation Understudy

It works by comparing the output of the machine (candidate translation) with a human reference
translation.

🔹 Key points about BLEU:

 BLEU is word-based: it checks how many n-grams (word sequences like "the cat", "cat sat",
etc.) from the system’s translation are found in the reference.
 It calculates precision: out of what the system generated, how much matched?
 Uses up to 4-gram precision.
 Has a brevity penalty: penalizes translations that are too short.
 It requires careful tokenization (splitting text into words correctly), otherwise it gives
unreliable results.
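
A minimal single-reference sketch with clipped n-gram counts, up-to-4-gram precision, and the brevity penalty; production implementations such as sacreBLEU also handle multiple references and standardized tokenization:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        matches = sum((ngrams(hyp, n) & ngrams(ref, n)).values())  # clipped
        total = max(len(hyp) - n + 1, 1)
        if matches == 0:
            return 0.0       # one zero precision zeroes the geometric mean
        log_prec += math.log(matches / total) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)
```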

🔷 Section 2: Statistical Significance Testing for MT Evaluation


How to measure whether the performance difference between two MT systems is statistically
real (and not just due to chance).

🔹 Why do we need it?

Imagine system A gives chrF = 0.67 and system B gives chrF = 0.66.
Is A really better than B? Or is that just random?

To know for sure, we do significance testing:

🔹 Bootstrap Testing (step-by-step):

1. Take your test set (set of sentences).


2. Create many pseudo-test-sets by randomly resampling sentences from it, with replacement.
3. Calculate chrF scores on each pseudo-set.
4. Drop the top & bottom 2.5% of the score differences (A − B) → this gives a 95% confidence interval.
5. If this interval excludes zero (i.e., A > B in essentially all resamples), A is statistically better.

This helps avoid false claims about which system is better.
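
A sketch of the paired bootstrap, under the simplifying assumption that the corpus score is the mean of per-sentence scores (for metrics that are not sentence-decomposable, the statistic must be recomputed on each pseudo-test-set):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    # scores_a[i], scores_b[i]: metric scores for systems A and B on sentence i
    rng, n = random.Random(seed), len(scores_a)
    deltas = []
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        deltas.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    deltas.sort()
    ci = (deltas[int(0.025 * n_samples)], deltas[int(0.975 * n_samples)])
    p_a_wins = sum(d > 0 for d in deltas) / n_samples
    return p_a_wins, ci   # how often A beat B, and the 95% CI of the gap
```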

🔷 Section 3: chrF: Limitations


Even though chrF is useful, it has some drawbacks:

🔹 chrF is “local”:

 It works at the character level, so small changes in word order or long phrases being moved
may not affect the score, even if the meaning changes.

🔹 chrF doesn’t understand:

 Discourse structure (how ideas connect across a paragraph or document)


 Contextual coherence
 Long-distance dependencies (e.g., pronouns referring to earlier nouns)

🔹 chrF is not good for:

 Comparing very different types of systems (like human-in-the-loop vs fully machine).


 Evaluating overall translation quality at the document or paragraph level.

✅ When is chrF best?

 When comparing small improvements in a single system.


 Works well for morphologically complex languages (like Tamil, Turkish, Finnish).

Automatic Evaluation: Embedding-Based Metrics


Traditional metrics like chrF are too strict: they only give credit when the characters exactly
match.

But what if the translation uses a synonym or paraphrase? That might still be a good translation,
but chrF will penalize it.

 Embedding-based methods (like BERTScore, COMET, and BLEURT) solve this by


using semantic similarity instead of exact word/character matching.

How These Metrics Work

🔹 Step-by-step:

1. Let:
o x = (x₁, ..., xₙ) be the reference translation
o x̃ = (x̃₁, ..., x̃ₘ) be the candidate (machine) translation
o r be a human quality rating for how close x̃ is to x
2. Metrics like COMET and BLEURT:
o Train a model (often based on BERT) to predict human ratings.
o They pass both x and x̃ through BERT, use the embeddings, and fine-tune on
human-judged examples.
o A final layer predicts how good the translation is.
3. Metrics like BERTScore:
o No human labels required.
o They compute semantic similarity between words or tokens using cosine similarity
of BERT embeddings.
o Recall: how well reference words are found in the candidate.


o Precision: how well candidate words are found in the reference.
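
A minimal sketch of this greedy-matching computation, assuming ref_emb and hyp_emb are contextual token embeddings (e.g., from BERT) computed beforehand; the real BERTScore additionally applies IDF weighting and baseline rescaling:

```python
import torch
import torch.nn.functional as F

def bertscore(ref_emb, hyp_emb):
    # ref_emb: (n, d), hyp_emb: (m, d) contextual token embeddings
    sim = F.normalize(ref_emb, dim=-1) @ F.normalize(hyp_emb, dim=-1).T  # cosines
    recall = sim.max(dim=1).values.mean()     # each reference token -> best match
    precision = sim.max(dim=0).values.mean()  # each candidate token -> best match
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()
```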

 Advantages of Embedding-Based Metrics

 Can match synonyms or paraphrases (e.g., "big" and "large").


 Are more flexible and human-like in evaluating meaning.
 Capture semantic similarity, not just surface overlap.

BIAS AND ETHICAL ISSUES


Machine translation systems are not just technical tools—they also reflect, reinforce, or amplify
biases present in the data they’re trained on. This section highlights how gender bias is one major
ethical concern in MT systems.

🔹 Problem of Gender Bias in Translation

Some languages, like:

 Hungarian (which uses the gender-neutral pronoun “ő”), and


 Spanish (which often drops pronouns),

...do not always specify the gender of a person.



But when translating into English, which requires a gendered pronoun (he/she), MT systems
must choose a gender.

🔹 Example from the text:

Hungarian: “ő egy ápoló” (gender-neutral)


MT Output: “she is a nurse”

Hungarian: “ő egy vezérigazgató”


MT Output: “he is a CEO”

This demonstrates that:

 Female gender is assigned for nurturing roles (e.g. nurse, wedding organizer).
 Male gender is assigned for high-status roles (e.g. CEO, scientist, engineer).

This shows stereotypical gender mapping in translations — nurse = she, CEO = he — even
when no gender is specified in the source language.

This highlights how occupational gender stereotypes influence translation output.

 Evidence of Amplification

 These stereotypes aren’t fully explained by labor market data.


 Studies show that MT amplifies bias beyond actual real-world gender distributions.
 It maps roles to cultural stereotypes with higher probability than justified.

 WinoMT Dataset (Stanovsky et al., 2019)

A dataset designed to test whether MT systems misgender people in non-stereotypical roles:

“The doctor asked the nurse to help in the operation.”


⚠️ Systems often guess based on stereotypes:


→ doctor = he, nurse = she, regardless of context.

This shows they perform worse when gender roles don’t match stereotypes.

 Ethical Takeaway:

 Machine translation systems, especially in health, legal, or urgent communication, must


be accurate and bias-free.
 Translating a gender-neutral sentence with wrong gender can lead to serious ethical
consequences.

 Limitations and Open Problems

 We don’t always know the internal reasoning of MT systems.


 The principle of “do no harm” becomes difficult when MT systems:
o Add assumptions (e.g., gender).
o Translate with biases learned from data.
 Need for careful design, evaluation, and ethics-aware AI models.

********END OF MODULE-5******
