
VIETNAM NATIONAL UNIVERSITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF COMPUTER SCIENCE AND ENGINEERING

PROJECT 7

1. Phạm Huy Thiên Phúc, Student ID: 2053346

2. Student ID:

3. Student ID:

4. Student ID:
Final assignment: Introduction to Computing

Contents of the project


1. What is Deep Learning?
2. The history and the development of Deep Learning
3. Discuss some prominent architectures of Deep Learning
   3.1. Supervised Learning
      3.1.1. Convolutional neural networks
      3.1.2. Recurrent neural networks
      3.1.3. LSTM networks
      3.1.4. GRU networks
   3.2. Unsupervised Learning
      3.2.1. Self-organizing maps
      3.2.2. Autoencoders
      3.2.3. Restricted Boltzmann Machines
4. Introduce some recent outstanding achievements of Deep Learning
   4.1. Video-to-video synthesis
      4.1.1. Semantic Labels → Cityscapes Street Views
      4.1.2. Face → Edge → Face
      4.1.3. Body → Pose → Body
      4.1.4. Frame Prediction
   4.2. Language models: Google's BERT representation
      4.2.1. Masked Language Model (MLM)
      4.2.2. Next Sentence Prediction (NSP)

References

1. What is Deep Learning?


Deep learning is a subset of machine learning (Figure 1), which is essentially a neural network with three or more layers. These neural networks attempt to simulate the behavior of the human brain (albeit far from matching its ability), allowing it to "learn" from large amounts of data that is unstructured or unlabeled. While a neural network with a single layer can still make approximate predictions, additional hidden layers help to optimize and refine the model for accuracy.

Figure 1: AI, ML and DL

Deep learning drives many artificial intelligence (AI) applications and services
that improve automation, performing analytical and physical tasks without human
intervention. Deep learning technology lies behind everyday products and services
(such as digital assistants, voice-enabled TV remotes, and credit card fraud
detection) as well as emerging technologies (such as self-driving cars).

2. The history and the development of Deep Learning:


ANNs started with the work of McCulloch and Pitts, who showed that sets of simple units (artificial neurons) could perform all possible logic operations and thus be capable of universal computation. This work was contemporaneous with that of von Neumann and Turing, who first dealt with the statistical aspects of the brain's information processing and with how to build a machine capable of reproducing them. Frank Rosenblatt then invented the perceptron machine to perform simple pattern classification. However, this new learning machine was incapable of solving certain simple problems, such as the logical XOR. In 1969, Minsky and Papert showed that perceptrons had intrinsic limitations that could not be transcended, leading to a fading of enthusiasm for ANNs. In 1982, John Hopfield proposed a special type of ANN (the Hopfield network) and proved that it had powerful pattern-completion and memory properties. The backpropagation algorithm was first described by Linnainmaa (1970) as a representation of the cumulative rounding error of an algorithm (as a Taylor expansion of the local rounding errors), without reference to neural networks. In 1986, Rumelhart, Hinton, and Williams rediscovered this powerful learning rule, which allowed them to train ANNs with several hidden units, thus overcoming Minsky and Papert's criticism.

Figure 2: Brief history of Deep learning

Year | Contributor(s) | Contribution
1943 | Walter Pitts and Warren McCulloch | McCulloch-Pitts neuron
1957 | Frank Rosenblatt | Perceptron
1960 | Henry J. Kelley | The first backpropagation model
1962 | Stuart Dreyfus | Backpropagation with the chain rule
1965 | Alexey Grigoryevich Ivakhnenko and Valentin Grigorʹevich Lapa | Multilayer neural network
1969 | Marvin Minsky and Seymour Papert | XOR problem
1970 | Seppo Linnainmaa | Automatic differentiation for backpropagation; implements backpropagation in computer code
1971 | Alexey Grigoryevich Ivakhnenko | Deep neural network
1980 | Kunihiko Fukushima | Neocognitron, the first CNN architecture
1982 | John Hopfield | Hopfield network
1982 | Paul Werbos | Backpropagation in ANNs
1985 | David H. Ackley, Geoffrey Hinton and Terrence Sejnowski | Boltzmann machine
1986 | Terry Sejnowski | NETtalk, an ANN that learns speech
1986 | Geoffrey Hinton, David Rumelhart and Ronald Williams | Implementation of backpropagation
1991 | Sepp Hochreiter | Vanishing gradient problem
1997 | Sepp Hochreiter and Jürgen Schmidhuber | The milestone of LSTM
2006 | Geoffrey Hinton, Ruslan Salakhutdinov, Osindero and Teh | Deep Belief Network
2008 | Andrew Ng's group | GPUs for training deep neural networks
2011 | Yoshua Bengio, Antoine Bordes and Xavier Glorot | Vanishing gradient
2012 | Alex Krizhevsky | AlexNet
2014 | Ian Goodfellow | Generative Adversarial Network (GAN)
2016 | DeepMind | Deep reinforcement learning model beats the human champion in the complex game of Go
2019 | Yoshua Bengio, Geoffrey Hinton and Yann LeCun | Turing Award 2018 for their immense contributions to advances in deep learning

3. Discuss some prominent architectures of Deep Learning:

Figure 3: Deep Learning Architecture



3.1. Supervised Learning:

Figure 3.1: Supervised learning

Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

3.1.1. Convolutional neural networks:

A CNN is a multilayer neural network that was biologically inspired by the animal visual cortex. The architecture is particularly useful in image-processing applications. The first CNN was created by Yann LeCun; at the time, the architecture focused on handwritten character recognition, such as postal code interpretation. As a deep network, early layers recognize features (such as edges), and later layers recombine these features into higher-level attributes of the input.

The LeNet CNN architecture is made up of several layers that implement feature extraction and then classification (see Figure 3.1.1). The image is divided into receptive fields that feed into a convolutional layer, which then extracts features from the input image. The next step is pooling, which reduces the dimensionality of the extracted features (through down-sampling) while retaining the most important information (typically, through max pooling). Another convolution and pooling step is then performed that feeds into a fully connected multilayer perceptron. The final output layer of this network is a set of nodes that identify features of the image (in this case, a node per identified number). You train the network by using back-propagation.

Figure 3.1.1: CNN architecture for feature extraction and image classification
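To make the pipeline above concrete, the following is a minimal LeNet-style sketch in PyTorch. The layer sizes (one input channel, 6 and 16 convolution filters, 120/84 hidden units) are illustrative assumptions for 28x28 grayscale digit images, not the exact configuration LeCun used.

import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # convolution: extract low-level features
            nn.ReLU(),
            nn.MaxPool2d(2),                            # pooling: down-sample, keep strongest responses
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(                # fully connected multilayer perceptron
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),                 # one output node per class (e.g. per digit)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetLike()
dummy = torch.randn(1, 1, 28, 28)   # a single 28x28 grayscale image
print(model(dummy).shape)           # torch.Size([1, 10])

Training such a model with back-propagation then only requires a loss function (e.g. cross-entropy) and an optimizer applied to model.parameters().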

The use of deep layers of processing, convolutions, pooling, and a fully connected classification layer opened the door to various new applications of deep learning neural networks. In addition to image processing, the CNN has been successfully applied to video recognition and various tasks within natural language processing.

Example applications: Image recognition, video analysis, and natural language processing.

3.1.2. Recurrent neural networks:

The RNN is one of the foundational network architectures from which other
deep learning architectures are built. The primary difference between a typical
multilayer network and a recurrent network is that rather than completely feed-
forward connections, a recurrent network might have connections that feed back
into prior layers (or into the same layer). This feedback allows RNNs to maintain
memory of past inputs and model problems in time.

RNNs consist of a rich set of architectures (we’ll look at one popular topology
called LSTM next). The key differentiator is feedback within the network, which
could manifest itself from a hidden layer, the output layer, or some combination
thereof.

Figure 3.1.2: RNN architecture and connections between layers

RNNs can be unfolded in time and trained with standard back-propagation, or by using a variant of back-propagation called back-propagation through time (BPTT).
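As a rough illustration, the sketch below runs a single recurrent layer over a toy batch of sequences in PyTorch; the sizes and the sequence-classification head are assumptions for demonstration only.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # hidden state feeds back at every step
head = nn.Linear(16, 2)                                       # classify from the final hidden state

x = torch.randn(4, 10, 8)        # batch of 4 sequences, 10 time steps, 8 features each
outputs, h_n = rnn(x)            # outputs: hidden state at every step; h_n: hidden state at the last step
logits = head(h_n.squeeze(0))    # gradients flow back through all 10 steps when trained (BPTT)
print(logits.shape)              # torch.Size([4, 2])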

Example applications: Speech recognition and handwriting recognition.



3.1.3. LSTM networks:

The LSTM departed from typical neuron-based neural network architectures and instead introduced the concept of a memory cell. The memory cell can retain its value for a short or long time as a function of its inputs, which allows the cell to remember what's important and not just its last computed value.

The LSTM memory cell contains three gates that control how information
flows into or out of the cell. The input gate controls when new information can
flow into the memory. The forget gate controls when an existing piece of
information is forgotten, allowing the cell to remember new data. Finally, the
output gate controls when the information that is contained in the cell is used in
the output from the cell. The cell also contains weights, which control each gate.
The training algorithm, commonly BPTT, optimizes these weights based on the
resulting network output error.
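The sketch below steps an LSTM cell through a toy sequence in PyTorch; the input and hidden sizes are illustrative. The input, forget, and output gates described above are implemented inside nn.LSTMCell rather than written out explicitly.

import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=8, hidden_size=16)
h = torch.zeros(4, 16)            # hidden state (what the cell exposes as output)
c = torch.zeros(4, 16)            # cell state, carried across time steps by the gates
sequence = torch.randn(10, 4, 8)  # 10 time steps, batch of 4, 8 features per step

for x_t in sequence:              # at each step the gates decide what to write, keep, and expose
    h, c = cell(x_t, (h, c))
print(h.shape)                    # torch.Size([4, 16])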

Figure 3.1.3: LSTM memory cell

Recent applications of CNNs and LSTMs produced image and video captioning
systems in which an image or video is captioned in natural language. The CNN
implements the image or video processing, and the LSTM is trained to convert
the CNN output into natural language.

Example applications: Image and video captioning systems.



3.1.4. GRU networks:

In 2014, a simplification of the LSTM was introduced called the gated recurrent unit (GRU). This model has two gates, getting rid of the output gate present in the LSTM model. These gates are an update gate and a reset gate. The update gate indicates how much of the previous cell contents to maintain. The reset gate defines how to incorporate the new input with the previous cell contents. A GRU can model a standard RNN simply by setting the reset gate to 1 and the update gate to 0.
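To make the two gates explicit, here is a hand-written single GRU step using PyTorch tensors, following the convention used in the text (the update gate measures how much of the previous contents to keep). All sizes and weight matrices are illustrative assumptions; in practice nn.GRU implements this computation.

import torch

torch.manual_seed(0)
dim_x, dim_h = 8, 16
W_z, U_z = torch.randn(dim_h, dim_x), torch.randn(dim_h, dim_h)
W_r, U_r = torch.randn(dim_h, dim_x), torch.randn(dim_h, dim_h)
W_h, U_h = torch.randn(dim_h, dim_x), torch.randn(dim_h, dim_h)

def gru_step(x, h_prev):
    z = torch.sigmoid(W_z @ x + U_z @ h_prev)           # update gate: how much of the old state to keep
    r = torch.sigmoid(W_r @ x + U_r @ h_prev)           # reset gate: how much old state enters the candidate
    h_tilde = torch.tanh(W_h @ x + U_h @ (r * h_prev))  # candidate new contents
    return (1 - z) * h_tilde + z * h_prev               # with z = 0 and r = 1 this reduces to a plain RNN step

h = gru_step(torch.randn(dim_x), torch.zeros(dim_h))
print(h.shape)   # torch.Size([16])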

Figure 3.1.4: GRU cell

The GRU is simpler than the LSTM, can be trained more quickly, and can be
more efficient in its execution. However, the LSTM can be more expressive and
with more data can lead to better results.

Example applications: Natural language text compression, handwriting


recognition, speech recognition, gesture recognition, image captioning.

3.2. Unsupervised Learning:

Figure 3.2: Unsupervised learning

Unsupervised learning (UL) is a type of algorithm that learns patterns from untagged data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world and can then generate imaginative content. In contrast to supervised learning (SL), where data is tagged by a human, e.g. as "car" or "fish", UL exhibits self-organization that captures patterns as neuronal predilections or probability densities. The other levels in the supervision spectrum are reinforcement learning, where the machine is given only a numerical performance score as its guidance, and semi-supervised learning, where a smaller portion of the data is tagged. Two broad methods in UL are neural networks and probabilistic methods.

3.2.1. Self-organizing maps

The self-organizing map (SOM), also popularly known as the Kohonen map, is an unsupervised neural network that creates clusters of the input data set by reducing the dimensionality of the input. SOMs vary from the traditional artificial neural network in quite a few ways.

Figure 3.2.1: Self-organizing map

The first significant variation is that weights serve as a characteristic of the node. After the inputs are normalized, random weights close to zero are initialized for each feature of the input record, so each output node holds its own weight vector. A random input is then chosen, and the Euclidean distance between each output node's weights and that input is calculated. The node with the least distance is declared the most accurate representation of the input and is marked as the best matching unit, or BMU. With the BMU as the center point, the weights of the nodes within a surrounding radius are also updated and pulled toward the input, with the adjustment decreasing with distance from the BMU; as training proceeds, this radius is gradually shrunk.

Next, in an SOM, no activation function is applied, and because there are no target labels to compare against, there is no concept of calculating error and back-propagation.
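A minimal NumPy sketch of a single SOM training step is shown below: it finds the best matching unit for one input and pulls the weights of nearby grid nodes toward that input. The grid size, learning rate, neighborhood radius, and Gaussian decay are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3
weights = rng.random((grid_h, grid_w, dim))     # random weights per output node
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

def train_step(x, weights, lr=0.5, radius=3.0):
    # Euclidean distance from the input to every node's weight vector
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best matching unit
    # Gaussian neighborhood around the BMU on the 2-D grid
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    # Pull nearby nodes' weights toward the input, scaled by proximity to the BMU
    weights += lr * influence[..., None] * (x - weights)
    return bmu

bmu = train_step(rng.random(dim), weights)
print("BMU grid position:", bmu)

In a full training loop, both lr and radius would be decayed over time, which is the "radius is shrunk" behavior described above.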

Example applications: Dimensionality reduction, clustering high-dimensional inputs to a 2-dimensional output, radiant grade result, and cluster visualization.

3.2.2. Autoencoders

Though the history of when autoencoders were invented is hazy, the best-known variant of the autoencoder is composed of three layers: an input layer, a hidden layer, and an output layer.

Figure 3.2.2: Autoencoders layers

First, the input layer is encoded into the hidden layer using an appropriate
encoding function. The number of nodes in the hidden layer is much less than the
number of nodes in the input layer. This hidden layer contains the compressed
representation of the original input. The output layer aims to reconstruct the input
layer by using a decoder function.

During the training phase, the difference between the input and the output layer is calculated using an error function, and the weights are adjusted to minimize the error. Unlike traditional unsupervised learning techniques, where there is no data to compare the outputs against, autoencoders learn continuously using backward propagation. For this reason, autoencoders are classified as self-supervised algorithms.
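A minimal PyTorch sketch of this encode-decode-reconstruct loop follows; the 784-to-32 layer sizes and the dummy data are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),    # encoder: compress 784 inputs to a 32-unit hidden layer
    nn.Linear(32, 784), nn.Sigmoid()  # decoder: reconstruct the original 784 values
)
loss_fn = nn.MSELoss()                # error between the input and its reconstruction
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 784)               # a dummy batch standing in for flattened images
for _ in range(5):                    # a few gradient steps to illustrate training
    reconstruction = model(x)
    loss = loss_fn(reconstruction, x)
    optimizer.zero_grad()
    loss.backward()                   # weights adjusted via backpropagation of the reconstruction error
    optimizer.step()
print(loss.item())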

Example applications: Dimensionality reduction, data interpolation, and data compression/decompression.

3.2.3. Restricted Boltzmann Machines

An RBM is a 2-layered neural network. The layers are input and hidden layers.
As shown in the following figure, in RBMs every node in a hidden layer is
connected to every node in a visible layer. In a traditional Boltzmann Machine,
nodes within the input and hidden layer are also connected. Due to computational
complexity, nodes within a layer are not connected in a Restricted Boltzmann
Machine.

Figure 3.2.3: Restricted Boltzmann Machine

During the training phase, RBMs calculate the probability distribution of the training set using a stochastic approach. When the training begins, each neuron gets activated at random. The model also contains respective hidden and visible biases: while the hidden bias is used in the forward pass to build the activation, the visible bias helps in reconstructing the input.
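The sketch below illustrates one training update in the spirit described above, using a single contrastive-divergence (CD-1) step written in NumPy. The layer sizes, learning rate, and sampling details are illustrative assumptions rather than a faithful reproduction of any particular RBM implementation.

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)           # visible bias: helps reconstruct the input
b_h = np.zeros(n_hidden)            # hidden bias: used in the forward pass to build the activation

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    # Forward pass: stochastic hidden activations given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # Reconstruction: sample visible units back from the hidden sample
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(n_visible) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # Gradient approximation: data statistics minus reconstruction statistics
    return lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1)), v1

dW, reconstruction = cd1_step(np.array([1., 0., 1., 1., 0., 0.]))
W += dW
print("reconstruction:", reconstruction)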

Because in an RBM the reconstructed input is always different from the original
input, they are also known as generative models.

Also, because of the built-in randomness, the same input can produce different outputs across passes. In fact, this is the most significant difference from an autoencoder, which is a deterministic model.

Example applications: Dimensionality reduction and collaborative filtering

4. Introduce some recent outstanding achievements of Deep Learning

4.1. Video-to-video synthesis:

In 2018, Ting-Chun Wang and others announced a new video-to-video synthesis approach. In this approach, they aim to turn a segmented input source video into an output photorealistic video that precisely depicts the content of the source video. The result is high-resolution, photorealistic, temporally coherent video output on a diverse set of input formats, including segmentation masks, sketches, and poses. They achieve this by using a neural generator network to create the images, with one discriminator network checking whether the images look good one by one and another discriminator looking over the sequence of images to check whether it would pass as a video.

Figure 4.1: Network architecture of video-to-video synthesis



The generator has a coarse-to-fine architecture: a residual network G1 is first trained on lower-resolution images; then another network G2 is appended to G1, and the two networks are trained jointly on high-resolution images. Specifically, the input to the residual blocks in G2 is the element-wise sum of the feature map from G2 and the last feature map from G1.
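The following PyTorch sketch illustrates the coarse-to-fine idea only: a small G1 processes the downsampled input, and G2's residual path receives the element-wise sum of its own feature map and G1's last feature map. The channel counts, block structure, and layer choices are simplified assumptions and not the architecture used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class G1(nn.Module):                     # global generator working at low resolution
    def __init__(self, ch=64):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x_low):
        feat = self.body(self.head(x_low))            # last feature map of G1
        return self.to_rgb(feat), feat

class G2(nn.Module):                     # local enhancer working at full resolution
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU())
        self.res = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, x_full, g1_feat):
        fused = self.down(x_full) + g1_feat           # element-wise sum of the two feature maps
        return self.up(self.res(fused))

x_full = torch.randn(1, 3, 128, 128)                  # a dummy full-resolution semantic input
x_low = F.interpolate(x_full, scale_factor=0.5)
g1, g2 = G1(), G2()
_, g1_feat = g1(x_low)
out = g2(x_full, g1_feat)
print(out.shape)                                      # torch.Size([1, 3, 128, 128])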

4.1.1 Semantic Labels → Cityscapes Street Views

Figure 4.1.1: a) Semantic Labels results

Starting from a video in some source domain, they synthesize a new video in a target domain using the learned network. Semantic labels allow them to edit or create content in a convenient input domain and generate a video in an output domain that is harder to edit or create directly.

Figure 4.1.1: b) Semantic Labels results

The network can synthesize multiple results given the same input, or it can be manipulated to generate the desired output video. In the semantic label map, each color corresponds to an object class, and we can change the meaning of a label. Some examples of this are transforming trees into buildings or vice versa and changing the styles of buildings or roads.

4.1.2. Face → Edge → Face:

They train a sketch-to-face video synthesis model by using the real face videos in the FaceForensics dataset. The network learns to transfer an edge-map video to a video of a human face. It can also generate different faces from the same input edge map. On the other hand, the model can change the facial appearance of the original face videos. The resulting video is temporally consistent from frame to frame.

Figure 4.1.2: Face → Edge → Face Results


4.1.3. Body → Pose → Body:

Wang's model can synthesize videos of humans moving given pose information, outputting high-resolution photorealistic dance videos that contain unseen body shapes and motions. The method can change the clothing for the same dancer or transfer poses from one person to another with consistent shadows.
Figure 4.1.3: Body → Pose → Body Results
4.1.4. Frame Prediction:

Figure 4.1.4: Frame Prediction Results

To predict the future video given a few observed frames, the team decomposed the task into two sub-tasks:

- Synthesizing future semantic segmentation masks using the observed frames.
- Converting the synthesized segmentation masks into videos.

In practice, after extracting the segmentation masks from the observed frames, they trained a generator to predict future semantic masks. They then use the proposed video-to-video synthesis approach to convert the predicted segmentation masks into a future video.

4.2. Language models: Google’s BERT representation

In Natural Language Processing (NLP), a language model is a model that can estimate the probability distribution of a set of linguistic units, typically a sequence of words. These are interesting models since they can be built at little cost and have significantly improved several NLP tasks such as machine translation, speech recognition, and parsing.
Historically, one of the best-known approaches is based on Markov models and
n-grams. With the emergence of deep learning, more powerful models generally
based on long short-term memory networks (LSTM) appeared. Although highly
effective, existing models are usually unidirectional, meaning that only the left (or
right) context of a word ends up being considered.
In October 2018, the Google AI Language team published a paper that caused a stir in the community. BERT (Bidirectional Encoder Representations from Transformers) is a new bidirectional language model that has achieved state-of-the-art results on 11 complex NLP tasks, including sentiment analysis, question answering, and paraphrase detection.

Figure 4.2: a) Comparative results for the GLUE Benchmark.


The strategy for pre-training BERT differs from the traditional left-to-right or
right-to-left options. The novelty consists of:
- Masking some percentage of the input tokens at random and then predicting only those masked tokens; this keeps the words, in a multi-layered context, from indirectly "seeing themselves".
- Building a binary classification task to predict whether sentence B follows immediately after sentence A, which allows the model to learn the relationship between sentences, a phenomenon not directly captured by classical language modeling.
4.2.1. Masked Language Model (MLM):

Some of the words in the input are masked out, and the model then conditions on the surrounding words in both directions to predict them. Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
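A toy sketch of the masking step, in plain Python, is shown below; the 15% masking rate follows the description above, while the example sentence and the simple whitespace tokenization are illustrative assumptions (BERT itself uses WordPiece tokens).

import random

random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
num_to_mask = max(1, round(0.15 * len(tokens)))   # roughly 15% of the tokens
positions = random.sample(range(len(tokens)), k=num_to_mask)

masked = list(tokens)
targets = {}
for i in positions:
    targets[i] = masked[i]      # the model is trained to predict this original token
    masked[i] = "[MASK]"

print(masked)    # input sequence with [MASK] tokens inserted
print(targets)   # positions and original words the model must recover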

Figure 4.2.1: Masked Language Model

4.2.2. Next Sentence Prediction (NSP):

In Next Sentence Prediction (NSP), BERT learns to model relationships between sentences. In the training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. Consider two sentences A and B: is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:

Figure 4.2.2: Next Sentence Prediction
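As a rough illustration of how such sentence pairs could be constructed for training, consider the following plain-Python sketch; the example sentences and the 50/50 split are illustrative assumptions.

import random

random.seed(0)
sentences = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live mostly in the Southern Hemisphere.",
]
pairs = []
for i in range(len(sentences) - 1):
    if random.random() < 0.5:
        # Positive example: sentence B really is the next sentence (label 1)
        pairs.append((sentences[i], sentences[i + 1], 1))
    else:
        # Negative example: sentence B is a random sentence from elsewhere (label 0)
        other = random.choice([s for j, s in enumerate(sentences) if j != i + 1])
        pairs.append((sentences[i], other, 0))
print(pairs)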



When training the BERT model, both techniques are trained together, minimizing the combined loss function of the two strategies.

Figure 4.2: b) SQuAD 2.0 Leaderboard

On SQuAD v2.0, BERT achieves an 89.474% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 87.147%, which is itself greater than human performance by 0.316%.

BERT is undoubtedly a milestone in the use of Deep Learning for Natural Language Processing.

References:

Websites:

[1] ibm.com
https://www.ibm.com/cloud/learn/deep-learning
https://developer.ibm.com/technologies/artificial-intelligence/articles/cc-machine-learning-deep-learning-architectures/

[2] machinelearningknowledge.ai
https://machinelearningknowledge.ai/brief-history-of-deep-learning/

[3] nvlabs.github.io
https://nvlabs.github.io/few-shot-vid2vid/

[4] tryolabs.com
https://tryolabs.com/blog/2018/12/19/major-advancements-deep-learning-2018/

[5] towardsdatascience.com
https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad

[6] web.stanford.edu
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15812785.pdf

Science journals:

[7] Sanskruti Patel and Atul Patel, "Deep Learning Architectures and its Applications: A Survey", 2018.

[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro, "Video-to-Video Synthesis", 2018.

Books:

[9] Nikhil Buduma, with contributions by Nicholas Locascio, Fundamentals of Deep Learning (pages 85-109).

[10] Armando Vieira and Bernardete Ribeiro, Introduction to Deep Learning Business Applications for Developers: From Conversational Bots in Customer Service to Medical Image Processing (pages 38-40).
