Machine Translation
By
Dr. Pankaj Dadure
Assistant Professor
SoCS, UPES Dehradun
Machine Translation (MT)
• Machine translation is the process of using artificial intelligence to automatically
translate text from one language to another without human involvement.
• Modern machine translation goes beyond simple word-to-word translation to
communicate the full meaning of the original language text in the target
language.
How does machine translation work?
1. First, the input text or speech is prepared via filtering, cleaning and organizing.
2. Then, the machine translation system is trained using examples of texts in multiple
languages and their respective translations.
3. The system learns and analyzes examples to understand patterns and probabilities of
how words or phrases are translated.
4. When a new text to translate is entered, the system uses what it has learned to
generate the translated version.
5. After generating the translation, additional adjustments may be applied to refine
the results.
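The five steps above can be sketched as a toy word-level pipeline. All function bodies are illustrative stand-ins, not a real MT engine:

```python
# A minimal sketch of the five-step MT workflow described above.
# Every function here is a toy stand-in for a real MT component.

def preprocess(text):
    """Step 1: filter, clean, and organize the input."""
    return text.strip().lower().split()

def train(pairs):
    """Steps 2-3: learn word-level translation patterns from example pairs."""
    table = {}
    for src, tgt in pairs:
        for s, t in zip(preprocess(src), preprocess(tgt)):
            table[s] = t
    return table

def translate(table, text):
    """Step 4: apply the learned patterns to new input."""
    return [table.get(tok, tok) for tok in preprocess(text)]

def postprocess(tokens):
    """Step 5: refine the raw output (here: just re-join and capitalize)."""
    return " ".join(tokens).capitalize()

model = train([("the cat", "el gato"), ("the house", "la casa")])
print(postprocess(translate(model, "the house")))  # La casa
```

Real systems learn phrase- or sentence-level mappings rather than a one-word-to-one-word table, but the stage boundaries are the same.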
Basic terminology
Preprocessing in MT
Common steps include tokenization, named entity recognition, and stemming.
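Two of these preprocessing steps can be illustrated without any external libraries. The suffix-stripping stemmer below is deliberately naive; production systems use algorithms such as Porter stemming:

```python
import re

def tokenize(text):
    # Split text into lowercase word tokens; dependency-free regex approach.
    return re.findall(r"[A-Za-z]+", text.lower())

def naive_stem(token):
    # Crude suffix stripping for illustration only; real stemmers
    # (e.g., the Porter algorithm) handle many more cases.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The translators were translating documents.")
print([naive_stem(t) for t in tokens])
# ['the', 'translator', 'were', 'translat', 'document']
```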
Post-Processing in MT
It is the process of proofreading text translated by a machine engine, with the aim of
bringing the output up to the quality a human translator would produce.
Parallel Corpus
A parallel corpus is essentially a set of sentences in a language L1 and the corresponding sentences
in another language L2. A parallel text translation corpus is a large and structured set of translated
texts between two languages.
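In code, a parallel corpus is simply a sentence-aligned collection: sentence i in L1 corresponds to sentence i in L2. A minimal sketch:

```python
# A parallel corpus stored as aligned sentence pairs: index i in the
# L1 list corresponds to index i in the L2 list.
corpus_l1 = ["Good morning.", "Where is the station?"]
corpus_l2 = ["Buenos días.", "¿Dónde está la estación?"]

assert len(corpus_l1) == len(corpus_l2)  # alignment invariant

parallel = list(zip(corpus_l1, corpus_l2))
for src, tgt in parallel:
    print(f"{src} -> {tgt}")
```

Real corpora (e.g., Europarl) hold millions of such pairs, usually as two line-aligned text files.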
Types of MT
• Rule-based machine translation
Language experts develop built-in linguistic rules and bilingual dictionaries for
specific industries or topics. Rule-based machine translation uses these
dictionaries to translate specific content accurately. The steps in the process
are:
1. The machine translation software parses the input text and creates a
transitional representation
2. It converts the representation into target language using the grammar rules
and dictionaries as a reference
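The two steps can be sketched as a toy rule-based pipeline: a hand-written parser builds a transitional representation, and a generator applies one grammar rule (adjectives follow nouns in Spanish) plus a bilingual dictionary. The roles and dictionary entries are hypothetical examples:

```python
# Toy rule-based pipeline: parse -> transitional representation -> target text.
DICTIONARY = {"the": "el", "red": "rojo", "car": "coche"}

def parse(text):
    # Step 1: build a crude transitional representation (token + role).
    roles = {"the": "DET", "red": "ADJ", "car": "NOUN"}
    return [(tok, roles.get(tok, "UNK")) for tok in text.lower().split()]

def generate(rep):
    # Step 2: apply a grammar rule (in Spanish, the adjective follows
    # the noun) and look each word up in the bilingual dictionary.
    out, i = [], 0
    while i < len(rep):
        if i + 1 < len(rep) and rep[i][1] == "ADJ" and rep[i + 1][1] == "NOUN":
            out += [DICTIONARY[rep[i + 1][0]], DICTIONARY[rep[i][0]]]
            i += 2
        else:
            out.append(DICTIONARY.get(rep[i][0], rep[i][0]))
            i += 1
    return " ".join(out)

print(generate(parse("the red car")))  # el coche rojo
```

Production RBMT systems such as Apertium encode thousands of such rules per language pair.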
Types of MT
• Statistical machine translation
Instead of relying on linguistic rules, statistical machine translation uses
machine learning to translate text. The machine learning algorithms analyze
large amounts of human translations that already exist and look for statistical
patterns. The software then makes an intelligent guess when asked to translate
a new source text. It makes predictions on the basis of the statistical likelihood
that a specific word or phrase in the source language corresponds to a particular
word or phrase in the target language.
• Pros and cons
Statistical methods require training on millions of words for every language
pair. However, with sufficient data the machine translations are accurate.
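The core idea — counting co-occurrences in existing translations and choosing the most probable target word — can be sketched in a few lines. The position-aligned zip below is a crude stand-in for the alignment models (e.g., IBM models) that real SMT systems use:

```python
from collections import Counter, defaultdict

# Toy "phrase table": co-occurrence counts from position-aligned pairs.
# Real SMT learns alignments statistically rather than assuming them.
pairs = [
    ("the house", "la casa"),
    ("the car", "el coche"),
    ("the house", "la casa"),
]

counts = defaultdict(Counter)
for src, tgt in pairs:
    for s, t in zip(src.split(), tgt.split()):
        counts[s][t] += 1

def most_likely(word):
    # Choose the target word with the highest relative frequency.
    options = counts[word]
    best, n = options.most_common(1)[0]
    return best, n / sum(options.values())

print(most_likely("the"))  # ('la', 0.666...): "la" seen 2 of 3 times
```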
Types of MT
• Neural machine translation
Neural machine translation uses deep learning to learn how to translate and
continuously improves that knowledge using a specific machine learning method
called artificial neural networks.
• The fundamental idea behind NMT is to model the entire translation process using neural
networks, allowing the system to learn complex patterns and dependencies in language
data.
Neural machine translation
1. Input and Output: NMT takes a sentence in one language (the source language) as input and
produces a translated sentence in another language (the target language) as output.
2. Encoder and Decoder: NMT uses an "encoder-decoder" architecture. The encoder reads the
input sentence and converts it into a fixed-size vector representation. The decoder then takes
this representation and generates the translated sentence in the target language.
3. Learning from Data: To make accurate translations, NMT needs to be trained on large datasets
containing pairs of sentences in both source and target languages. During training, the model
learns to associate input sentences with their corresponding translations, adjusting its
parameters to minimize errors.
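The encoder-decoder interface from points 1-3 can be sketched without any deep learning library. Everything below is a toy: the "embeddings" are fixed sine values rather than learned parameters, and the decoder is a stub that only demonstrates the data flow (variable-length sentence in, fixed-size vector, token sequence out):

```python
import math

VOCAB_SRC = ["<s>", "the", "cat", "sat"]
VOCAB_TGT = ["<s>", "le", "chat", "assis", "</s>"]

def embed(token, vocab, dim=4):
    # Toy deterministic "embedding" -- not learned, purely illustrative.
    idx = vocab.index(token)
    return [math.sin(idx * (j + 1)) for j in range(dim)]

def encode(tokens):
    # Encoder: compress a variable-length sentence into ONE
    # fixed-size vector (here: the mean of token embeddings).
    vecs = [embed(t, VOCAB_SRC) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def decode(context, max_len=3):
    # Decoder: emit target tokens conditioned on the context vector.
    # A real NMT decoder is a trained network; this stub only shows
    # the interface.
    out = []
    for step in range(max_len):
        score = sum(c * (step + 1) for c in context)
        out.append(VOCAB_TGT[int(abs(score) * 10) % len(VOCAB_TGT)])
    return out

ctx = encode(["the", "cat", "sat"])
print(len(ctx), decode(ctx))
```

Note the bottleneck this exposes: the whole source sentence must pass through one fixed-size vector, which is exactly the limitation that attention mechanisms were later introduced to relax.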
Types of MT
• Hybrid machine translation
Hybrid machine translation tools use two or more machine translation models
within a single system. The hybrid approach can improve the effectiveness of
any single translation model on its own.
This machine translation process commonly uses rule-based and statistical
machine translation subsystems. The final translation output is the
combination of the output of all subsystems.
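A hybrid system can be sketched as two subsystem translators plus a selector that keeps the preferred output. Both subsystems and the fluency score below are trivial hypothetical stand-ins:

```python
# Hybrid sketch: run two subsystem translators and keep the candidate
# the scorer prefers. Both subsystems are toy stand-ins.

def rule_based(text):
    table = {"hello": "hola", "friend": "amigo"}
    return " ".join(table.get(w, w) for w in text.lower().split())

def statistical(text):
    # Pretend this lookup came from a trained statistical model.
    guesses = {"hello friend": "hola amigo"}
    return guesses.get(text.lower(), "")

def score(candidate):
    # Naive fluency proxy: prefer non-empty, longer candidates.
    # Real hybrids combine model confidences, not word counts.
    return len(candidate.split())

def hybrid_translate(text):
    candidates = [rule_based(text), statistical(text)]
    return max(candidates, key=score)

print(hybrid_translate("hello friend"))  # hola amigo
```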
Rule-based MT vs Statistical MT

Approach
  RBMT: Uses predefined linguistic rules, grammar, and dictionaries.
  SMT: Uses statistical models based on probabilities derived from bilingual corpora.

Data Dependency
  RBMT: Requires extensive linguistic knowledge and manually defined rules.
  SMT: Requires large parallel corpora for training models.

Accuracy & Fluency
  RBMT: Produces grammatically structured but less natural translations.
  SMT: Generates more fluent translations but may lack grammatical accuracy.

Computational Requirements
  RBMT: Requires human effort for rule creation but is computationally less intensive during translation.
  SMT: Needs high computational power for model training but translates faster once trained.

Adaptability
  RBMT: Difficult to scale to new languages, as new rules must be manually created.
  SMT: Easier to scale if a large parallel corpus is available.
Rule-based MT vs Statistical MT

Flexibility
  RBMT: Works well for structured and grammatically defined texts but struggles with informal language.
  SMT: Adapts better to idioms, slang, and new words but may produce errors.

Examples
  RBMT: Systran, Apertium
  SMT: Moses, Google Translate (before switching to Neural MT)
Challenges with SMT
• Data Dependency: SMT requires large parallel corpora to train effective models. High-quality
bilingual datasets are scarce for low-resource languages, leading to poor translations.
• Word Alignment Errors: SMT relies on statistical alignment of words between source and target
languages. Misalignment issues arise when dealing with complex sentence structures or idiomatic
expressions that do not have direct word-to-word mappings.
• Reordering Issues: Different languages follow different syntactic structures (e.g., English follows
Subject-Verb-Object (SVO), while Japanese follows Subject-Object-Verb (SOV)). SMT systems
often fail to reorder phrases correctly across such structural differences.
• Handling of Morphologically Rich Languages: Some languages (e.g., Turkish, Finnish, Hindi) have
complex morphology (words change form based on tense, gender, etc.). SMT does not effectively
handle such variations, resulting in incorrect translations.
• Contextual Limitations: SMT operates at the phrase level, often ignoring long-range dependencies in
a sentence.
• Lack of Generalization: SMT models are trained on specific datasets and struggle with unseen words
or domain-specific terms (e.g., medical or legal jargon).
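The reordering challenge above can be made concrete with a small example. Here the part-of-speech roles are supplied by hand; the whole difficulty for SMT is that in practice this reordering must be learned statistically rather than hard-coded:

```python
# Illustrating the reordering challenge: English SVO order must become
# SOV order for a Japanese-style target. Roles are hand-annotated here;
# a real system needs alignment and reordering models to infer them.

def reorder_svo_to_sov(tagged):
    subj = [w for w, t in tagged if t == "S"]
    verb = [w for w, t in tagged if t == "V"]
    obj = [w for w, t in tagged if t == "O"]
    return subj + obj + verb

sentence = [("she", "S"), ("reads", "V"), ("books", "O")]
print(reorder_svo_to_sov(sentence))  # ['she', 'books', 'reads']
```

A word-for-word SMT output that keeps SVO order would be ungrammatical in an SOV target language, which is exactly the failure mode described in the bullet above.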
Challenges with SMT
• Computational Cost: Training SMT requires significant computational resources, especially
for large-scale bilingual corpora.
• Difficulty in Low-Resource Language Pairs: SMT performs poorly for languages with
limited parallel corpora. Underrepresented dialects and indigenous languages suffer from
poor translations due to insufficient training data.