Machine Translation, Auto Encoders and Decoders
Machine Translation
Machine translation is a sub-field of computational linguistics that focuses on developing systems
capable of automatically translating text or speech from one language to another. In Natural Language
Processing (NLP), the goal of machine translation is to produce translations that are not only
grammatically correct but also convey the meaning of the original content accurately.
(Translation example: "Hello" in Korean is "annyeong haseyo".)
Need for Machine Translation
Machine translation in Natural Language Processing (NLP) has several benefits, including:
1. Improved communication: Machine translation makes it easier for people who speak different
languages to communicate with each other, breaking down language barriers and facilitating
international cooperation.
2. Cost savings: Machine translation is typically faster and less expensive than human translation,
making it a cost-effective solution for businesses and organizations that need to translate large
amounts of text.
3. Increased accessibility: Machine translation can make digital content more accessible to users who
speak different languages, improving the user experience and expanding the reach of digital products
and services.
4. Improved efficiency: Machine translation can streamline the translation process, allowing businesses
and organizations to quickly translate large amounts of text and improving overall efficiency.
5. Language learning: Machine translation can be a valuable tool for language learners, helping them to
understand the meaning of unfamiliar words and phrases and improving their language skills.
Application of Machine Translation
Machine translation has many applications, including:
1. Cross-border communication: Machine translation allows people from different countries to
communicate with each other more easily, breaking down language barriers and facilitating
international cooperation.
2. Localization: Machine translation can be used to quickly and efficiently translate websites, software,
and other digital content into different languages, making them more accessible to users around the
world.
3. Business: Machine translation can be used by businesses to translate documents, contracts, and other
important materials, enabling them to work with partners and customers from around the world.
4. Education: Machine translation can be used in education to help students learn new languages and
improve their language skills.
5. Government: Machine translation can be used by governments to translate official documents and
communications, improving accessibility and transparency.
Process of Machine Translation
The basic requirement in the complex cognitive process of machine translation is to understand the meaning of a text in the original (source) language and then re-express that meaning in the target language.
• We need to decode the meaning of the source text in its entirety, which requires interpreting and analyzing all the features of the text available in the corpus.
• We also require an in-depth knowledge of the grammar, semantics, syntax, idioms, etc. of the source
language for this process.
• We then need to re-encode this meaning in the target language, which also needs the same in-depth
knowledge as the source language to replicate the meaning in the target language.
Types of Machine Translation in NLP
The main approaches are Rule-based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT), described in turn below.
Rule-based Machine Translation or RBMT
Also called knowledge-based machine translation, these are the earliest set of classical methods used for
machine translation.
• These translation systems are mainly based on linguistic information about the source and target languages, derived from dictionaries and grammars covering the characteristic elements of each language separately.
• Given input sentences in a source language, RBMT systems generate the translation in the output language based on the morphological, syntactic, and semantic analysis of both the source and the target languages involved in the translation task.
• The sub-approaches under RBMT systems are the direct machine translation approach, the interlingual
machine translation approach, and the transfer-based machine translation approach.
Statistical Machine Translation or SMT
Statistical machine translation (SMT) models are generated with the use of statistical models whose
parameters are derived from the analysis of bilingual text corpora. The ideas for SMT come from
information theory.
● The core principle of SMT-based models is that they rely on Bayes' theorem to perform the
analysis of parallel texts.
○ The central idea is that every sentence in one language is a possible translation of any
sentence in the other language, but the most appropriate is the translation that is assigned the
highest probability by the system.
● The SMT methods need millions of sentences in both the source and target languages for analysis
to collect the relevant statistics for each word.
○ In the early set of models, the proceedings of the European Parliament and the United Nations
Security Council, which were available in the languages of all member countries,
were used to obtain the required probabilities.
● SMT-based methods are further classified into Word-based, Phrase-based, and Syntax-based SMT approaches.
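To make the Bayes-theorem view concrete, SMT is usually described with the noisy-channel formulation (standard in the SMT literature, summarized here for reference): the best target sentence e for a source sentence f is

e* = argmax_e P(e | f) = argmax_e P(f | e) · P(e)

where P(f | e) is the translation model estimated from the parallel corpus and P(e) is a language model of the target language.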
Neural Machine Translation or NMT
● Neural Machine Translation methods in NLP employ artificial intelligence techniques and, in particular, rely upon neural network models, originally inspired by the human brain, to build machine learning models whose end goal is translation, while continually learning languages and improving the performance of machine translation systems.
● Neural Machine Translation systems do not have the specific requirements that are typical of other machine translation systems, such as statistical methods.
○ Neural Machine Translation works by incorporating training data, which can be generic or
custom depending on the user’s needs.
○ Generic Data: Includes all the data learned from translations performed over time by the machine translation engines employed in the entire realm of translation. This data also serves as a generalized translation tool for various applications, including text, voice, and documents.
○ Custom or Specialized Data: These are the sets of training data fed to machine translation
tasks to build specialization in a subject matter.
○ Subjects include engineering, design, programming, or any discipline with its specialized
glossaries and dictionaries.
Benefits of Machine Translation
● Fast Translation Speed: Machine translation using NLP has the potential to translate millions of
words for high-volume translations happening in modern day-to-day processes.
○ MT reduces human translator effort, which is typically limited to post-editing.
○ We can use machine translation to tag and organize high-volume content so that content in translated languages can be searched and retrieved quickly.
● Excellent Language Selection: Most machine translation tools these days can translate more than 100 languages.
○ MT programs are powerful enough to translate multiple languages at once, so they can be rolled out to a global user base across products and support documentation updates on the fly.
○ MT is well-suited to many language pairs these days (e.g., English to most European languages) with very high accuracy.
● Reduced Costs: As human translation does not scale well in terms of cost or speed, we can use machine translation with NLP to cut translation delivery times and costs.
○ We can also produce basic translations at little compute cost, which human translators can then refine and edit.
Sequence to Sequence Model
With the advent of recurrent network models for a variety of complex tasks where order (like the sequence of words in a sentence) is important, Sequence to Sequence models have become very popular for machine translation tasks in NLP.
Sequence to Sequence models primarily consist of an encoder-decoder architecture and utilize a combination of recurrent units in their layers.
• The encoder and the decoder are two distinct components working hand in hand to produce state-of-the-art results while performing different kinds of computations.
• The input sequences of words are taken as the input by the encoder, and the decoder component
typically has the relevant architecture to produce the associated target sequences.
• In the machine translation task, the source sentence is passed to the encoder as a sequence of vectors and run through a series of recurrent units; the final hidden state is stored as the encoder's state representation.
• This hidden state is used as the initial input for the decoder, which again passes it through a series of recurrent units in a loop, generating the output tokens one at a time so that each produced output has the highest probability.
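As a concrete illustration, below is a minimal sketch of such an encoder-decoder in PyTorch; the GRU cells, layer sizes, and the greedy decoding loop are illustrative assumptions rather than a prescribed implementation.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, hidden = self.rnn(self.embed(src))
        return hidden                            # last hidden state = encoder representation

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden):            # one decoding step
        output, hidden = self.rnn(self.embed(token), hidden)
        return self.out(output), hidden

def translate(encoder, decoder, src, sos_id, eos_id, max_len=30):
    # Greedy decoding: feed the most probable token back in at each step.
    hidden = encoder(src)
    token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1)            # (batch, 1)
        outputs.append(token)
        if (token == eos_id).all():
            break
    return torch.cat(outputs, dim=1)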
Attention Model
One of the potential issues with the encoder-decoder architecture of seq2seq models is that the neural network compresses all the necessary information of a source sentence into a single fixed-length vector, which makes it hard to keep track of exactly which words are important.
• This makes it difficult for recurrent layers to cope with long sentences, especially those that are longer than the sentences in the training corpus.
• Attention is proposed as a novel solution to the limitation of this particular encoder-decoder model of
decoding long sentences by selectively focusing on sub-parts of the sentence during translation.
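As a sketch of the idea (following the attention formulation introduced by Bahdanau et al. for neural machine translation, summarized here rather than taken from this text), the decoder builds a separate context vector for each output step t as a weighted sum over all encoder hidden states h_i:

c_t = Σ_i α_(t,i) · h_i,   with α_(t,i) = softmax_i( score(s_(t-1), h_i) )

where s_(t-1) is the previous decoder state and score(·, ·) is a small learned scoring function, so the decoder can selectively focus on different source words at each step.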
Intuition for Attention Models
Attention allows neural network models to approximate the visual attention mechanism humans use in
processing different tasks.
• The model initially studies a certain point of an image or text with intense, high-resolution focus, much like how humans process a new scene in an image or a passage of text.
• While focusing on the current high-resolution context, it also perceives the surrounding areas in low resolution simultaneously.
• The last step is to adjust the focal point as the network begins to understand the overall picture.
Transformers in Machine Learning
The Transformer is a neural network architecture used for machine learning tasks, particularly in natural language processing (NLP) and computer vision. In 2017, Vaswani et al. published the paper "Attention Is All You Need", in which the Transformer architecture was introduced. This section explores the architecture, workings, and applications of transformers.
The Transformer architecture uses self-attention to process a whole sentence at once and transform one sequence into another. This is useful because older models work step by step, and it helps overcome the challenges seen in models like RNNs and LSTMs. Traditional models like RNNs (Recurrent Neural Networks) suffer from the vanishing gradient problem, which leads to long-term memory loss. RNNs process text sequentially, meaning they analyze words one at a time.
For example, in the sentence: “XYZ went to France in 2019 when there were no cases of COVID
and there he met the president of that country” the word “that country” refers to “France”.
However, an RNN would struggle to link "that country" back to "France", since it processes each word in sequence and loses context over long sentences. This limitation prevents RNNs from understanding the full meaning of the sentence.
While adding memory cells in LSTMs (Long Short-Term Memory networks) helped address the vanishing gradient issue, they still process words one by one. This sequential processing means LSTMs can't analyze an entire sentence at once.
For instance, the word "point" has different meanings in these two sentences:
• "The needle has a sharp point." (Point = Tip)
• "She made a good point during the meeting." (Point = Idea)
Traditional models struggle with this context dependence, whereas the Transformer model, through its self-attention mechanism, processes the entire sentence in parallel, addressing these issues and making it significantly more effective at understanding context.
Architecture and Working of Transformers
1. Positional Encoding
Unlike RNNs, transformers lack an inherent understanding of word order since they process data in parallel. To solve this, positional encodings are added to the token embeddings, providing information about the position of each token within the sequence.
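In the original Transformer paper these are fixed sinusoidal encodings (stated here for reference; learned positional embeddings are a common alternative):

PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

where pos is the token position and i indexes the embedding dimension.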
2. Feed-Forward Networks
The feed-forward networks consist of two linear transformations with a ReLU activation in between, applied independently to each position in the sequence.
Mathematically:
FFN(x) = max(0, xW1 + b1)W2 + b2
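A minimal sketch of this position-wise feed-forward block in PyTorch; d_model = 512 and d_ff = 2048 follow the original paper and are illustrative defaults here.

import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),    # xW1 + b1
    nn.ReLU(),                   # max(0, ·)
    nn.Linear(d_ff, d_model),    # (·)W2 + b2
)
# Applied independently to each position: input shape (batch, seq_len, d_model).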
3. Attention Mechanism
The attention mechanism allows transformers to determine which words in a sentence are most relevant to
each other. This is done using a scaled dot-product attention approach:
1. Each word in a sequence is mapped to three vectors:
•Query (Q)
•Key (K)
•Value (V)
2. Attention scores are computed as:
Attention(Q, K, V) = softmax( QK^T / sqrt(d_k) ) V
where d_k is the dimensionality of the key vectors. These scores determine how much attention each word should pay to the others.
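A minimal sketch of scaled dot-product attention in PyTorch; the tensor shapes and toy inputs are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                 # how much each word attends to the others
    return weights @ V, weights

# Example: one sentence of 4 tokens with d_k = 8
Q = K = V = torch.randn(1, 4, 8)
context, weights = scaled_dot_product_attention(Q, K, V)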
Multi-Head Attention
Instead of using a single attention mechanism, transformers apply multi-head attention, where multiple attention layers run in parallel. This enables the model to capture different types of relationships within the input.
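For example, PyTorch provides a ready-made multi-head attention layer; the embedding size and number of heads below are illustrative choices.

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 10, 512)          # (batch, seq_len, embed_dim)
out, attn_weights = mha(x, x, x)     # self-attention: query = key = value
# out: (1, 10, 512); attn_weights: (1, 10, 10), averaged over heads by default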
4. Encoder-Decoder Architecture
The encoder-decoder structure is key to transformer models. The encoder processes the input sequence into a sequence of contextual representations, while the decoder converts these representations into the output sequence. Each encoder and decoder layer includes self-attention and feed-forward layers. In the decoder, an additional encoder-decoder attention layer is added to focus on relevant parts of the input.
For example, a French sentence “Je suis étudiant” is translated into “I am a student” in English.
The encoder consists of multiple layers (typically 6 layers). Each layer has two main components:
•Self-Attention Mechanism – Helps the model understand word relationships.
•Feed-Forward Neural Network – Further transforms the representation.
The decoder also consists of 6 layers, but with an additional encoder-decoder attention mechanism. This allows the
decoder to focus on relevant parts of the input sentence while generating output.
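As a rough sketch, PyTorch's built-in nn.Transformer wires these pieces together; the dimensions below follow the original paper, and the random tensors stand in for already-embedded source and target tokens (embedding and positional-encoding layers are assumed to exist separately).

import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 12, 512)   # embedded source sentence, e.g. "Je suis étudiant" plus padding
tgt = torch.randn(1, 10, 512)   # embedded target tokens generated so far
out = model(src, tgt)           # (1, 10, 512) decoder representations, projected to the vocabulary elsewhere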
Transformer Model
Transformer models also fall under sequence-to-sequence models for many computational tasks, but they rely mostly on the attention mechanism to draw the dependencies between input and output.
• When compared with the previous neural network architectures, the Transformer based neural
attention models achieved the highest BLEU scores (Bilingual Evaluation Understudy) on machine
translation tasks while also requiring significantly less compute and training time.
• The primary idea in the transformer is to apply a self-attention mechanism and model the relationships between all words in a sentence directly, regardless of their respective positions, unlike in recurrent networks.
• The main difference from existing Seq2Seq models is that transformers do not employ any recurrent layers and instead handle the entire input sequence at once rather than iterating word by word.
• The encoder is a combination of a Multi-Head Self-Attention Mechanism and a series of Fully-
Connected Feed-Forward Networks.
• The decoder is a combination of a Masked Multi-Head Self-Attention mechanism, a Multi-Head Attention mechanism over the encoder output (encoder-decoder attention), and Fully-Connected Feed-Forward Networks.
Autoencoders
Autoencoders are a specialized class of algorithms that can learn efficient representations of input data without the need for labels. They are a class of artificial neural networks designed for unsupervised learning.
Learning to compress and effectively represent input data without specific labels is the essential principle of an autoencoder. This is accomplished using a two-fold structure that consists of an encoder and a
decoder. The encoder transforms the input data into a reduced-dimensional representation, which is often
referred to as “latent space” or “encoding”. From that representation, a decoder rebuilds the initial input.
This process of encoding and decoding enables the network to learn meaningful patterns and essential features in the data.
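A minimal sketch of such an encoder-decoder pair in PyTorch, assuming flattened 784-dimensional inputs (e.g. 28x28 images) and a 32-dimensional latent space; both sizes are illustrative.

import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: progressively reduce dimensionality down to the bottleneck.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),          # latent space / encoding
        )
        # Decoder: expand the encoding back to the original dimensionality.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)        # compressed representation
        return self.decoder(z)     # reconstruction of the input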
Architecture of Autoencoder in Deep Learning
1. Encoder
• The input layer takes the raw input data.
• The hidden layers progressively reduce the dimensionality of the input, capturing important features and patterns. These layers compose the encoder.
• The bottleneck layer (latent space) is the final hidden layer, where the dimensionality is
significantly reduced. This layer represents the compressed encoding of the input data.
2. Decoder
• The decoder takes the encoded representation from the bottleneck layer and expands it back to the dimensionality of the original input.
• The hidden layers progressively increase the dimensionality and aim to reconstruct the original
input.
• The output layer produces the reconstructed output, which ideally should be as close as possible to
the input data.
3. The loss function used during training is typically a reconstruction loss, measuring the difference between the input and the reconstructed output. Common choices include mean squared error (MSE) for continuous data or binary cross-entropy for binary data.
4. During training, the autoencoder learns to minimize the reconstruction loss, forcing the network to capture the most important features of the input data in the bottleneck layer.
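A sketch of this training objective with an MSE reconstruction loss; the optimizer, learning rate, and data_loader are assumptions for illustration.

import torch
import torch.nn as nn

model = Autoencoder()                            # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                           # reconstruction loss

for batch in data_loader:                        # assumed to yield (batch_size, 784) tensors
    x_hat = model(batch)                         # reconstructed output
    loss = loss_fn(x_hat, batch)                 # difference between input and reconstruction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()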
Types of Autoencoders
• Denoising Autoencoder
• Sparse Autoencoder
• Variational Autoencoder
• Convolutional Autoencoder
Variational Autoencoder
The Variational Autoencoder makes strong assumptions about the distribution of the latent variables and uses the Stochastic Gradient Variational Bayes estimator in the training process. It assumes that the data is generated by a directed graphical model p_theta(x | z) and tries to learn an approximation q_phi(z | x) to the true posterior distribution p_theta(z | x), where phi and theta are the parameters of the encoder and the decoder, respectively.
Advantages
• Variational Autoencoders can generate new data points that resemble the original training data; these samples are drawn from the learned latent space.
• The Variational Autoencoder is a probabilistic framework used to learn a compressed representation of the data that captures its underlying structure and variations, so it is useful for detecting anomalies and for data exploration.
Disadvantages
• Variational Autoencoders use approximations to estimate the true distribution of the latent variables. This approximation introduces some level of error, which can affect the quality of generated samples.
• The generated samples may only cover a limited subset of the true data distribution. This can result in a
lack of diversity in generated samples.
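A minimal sketch of the VAE-specific pieces in PyTorch: mean and log-variance heads for q_phi(z|x), the reparameterization trick, and a loss combining reconstruction error with a KL-divergence term; layer sizes are illustrative, and inputs are assumed to be scaled to [0, 1].

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of q_phi(z|x)
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance of q_phi(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')     # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q_phi(z|x) || N(0, I))
    return recon + kl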
Thank You