
VIETNAM NATIONAL UNIVERSITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF COMPUTER SCIENCE AND ENGINEERING

PROJECT 7

1. Phạm Huy Thiên Phúc, Student ID: 2053346

2. Student ID:

3. Student ID:

4. Student ID:
Final assignment: Introduction to Computing

Contents of the project


1. What is Deep Learning?
2. The history and the development of Deep Learning
3. Discuss some prominent architectures of Deep Learning
   3.1. Supervised Learning
      3.1.1. Convolutional neural networks
      3.1.2. Recurrent neural networks
      3.1.3. LSTM networks
      3.1.4. GRU networks
   3.2. Unsupervised Learning
      3.2.1. Self-organizing maps
      3.2.2. Autoencoders
      3.2.3. Restricted Boltzmann Machines
4. Introduce some recent outstanding achievements of Deep Learning
   4.1. Video-to-video synthesis
      4.1.1. Semantic Labels → Cityscapes Street Views
      4.1.2. Face → Edge → Face
      4.1.3. Body → Pose → Body
      4.1.4. Frame Prediction
   4.2. Language models: Google's BERT representation
      4.2.1. Masked Language Model (MLM)
      4.2.2. Next Sentence Prediction (NSP)

References

1. What is Deep Learning?


Deep learning is a subset of machine learning (Figure 1), which is essentially a neural network with three or more layers. These neural networks attempt to simulate the behavior of the human brain (albeit far from matching its ability), allowing it to "learn" from large amounts of data that is unstructured or unlabeled. While a neural network with a single layer can still make approximate predictions, additional hidden layers help to optimize and refine the model for accuracy.

Figure 1: AI, ML and DL

Deep learning drives many artificial intelligence (AI) applications and services
that improve automation, performing analytical and physical tasks without human
intervention. Deep learning technology lies behind everyday products and services
(such as digital assistants, voice-enabled TV remotes, and credit card fraud
detection) as well as emerging technologies (such as self-driving cars).

2. The history and the development of Deep Learning:


ANNs started with the work of McCulloch and Pitts, who showed that sets of simple units (artificial neurons) could perform all possible logic operations and thus be capable of universal computation. This work was contemporaneous with that of von Neumann and Turing, who first dealt with the statistical aspects of the brain's information processing and with how to build a machine capable of reproducing them. Frank Rosenblatt then invented the perceptron machine to perform simple pattern classification. However, this new learning machine was incapable of solving certain simple problems, such as the logical XOR. In 1969, Minsky and Papert showed that perceptrons had intrinsic limitations that could not be transcended, leading to a fading of enthusiasm for ANNs. In 1982, John Hopfield proposed a special type of ANN (the Hopfield network) and proved that it had powerful pattern-completion and memory properties. The backpropagation algorithm was first described by Linnainmaa (1970) as a representation of the cumulative rounding error of an algorithm (as a Taylor expansion of the local rounding errors), without reference to neural networks. In 1986, Rumelhart, Hinton, and Williams rediscovered this powerful learning rule, which allowed them to train ANNs with several hidden units, thus overcoming Minsky and Papert's criticism.

Figure 2: Brief history of Deep learning

Year | Contributor(s) | Contribution
1943 | Walter Pitts and Warren McCulloch | McCulloch-Pitts neuron
1957 | Frank Rosenblatt | Perceptron
1960 | Henry J. Kelley | The first backpropagation model
1962 | Stuart Dreyfus | Backpropagation with the chain rule
1965 | Alexey Grigoryevich Ivakhnenko and Valentin Grigorʹevich Lapa | Multilayer neural network
1969 | Marvin Minsky and Seymour Papert | XOR problem
1970 | Seppo Linnainmaa | Automatic differentiation for backpropagation; implements backpropagation in computer code
1971 | Alexey Grigoryevich Ivakhnenko | Deep neural network
1980 | Kunihiko Fukushima | Neocognitron, the first CNN architecture
1982 | John Hopfield | Hopfield network
1982 | Paul Werbos | Backpropagation in ANNs
1985 | David H. Ackley, Geoffrey Hinton and Terrence Sejnowski | Boltzmann machine
1986 | Terry Sejnowski | NETtalk, an ANN that learns speech
1986 | Geoffrey Hinton, David Rumelhart and Ronald Williams | Implementation of backpropagation
1991 | Sepp Hochreiter | Vanishing gradient problem
1997 | Sepp Hochreiter and Jürgen Schmidhuber | The milestone of LSTM
2006 | Geoffrey Hinton, Ruslan Salakhutdinov, Osindero and Teh | Deep Belief Network
2008 | Andrew Ng's group | GPUs for training deep neural networks
2011 | Yoshua Bengio, Antoine Bordes and Xavier Glorot | Vanishing gradient
2012 | Alex Krizhevsky | AlexNet
2014 | Ian Goodfellow | Generative Adversarial Network (GAN)
2016 | DeepMind | Deep reinforcement learning model beats the human champion in the complex game of Go
2019 | Yoshua Bengio, Geoffrey Hinton and Yann LeCun | Turing Award 2018 for their immense contributions to advances in deep learning

3. Discuss some prominent architectures of Deep Learning:

Figure 3: Deep Learning Architecture



3.1. Supervised Learning:

Figure 3.1: Supervised learning

Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

3.1.1. Convolutional neural networks:

A CNN is a multilayer neural network that was biologically inspired by the animal visual cortex. The architecture is particularly useful in image-processing applications. The first CNN was created by Yann LeCun; at the time, the architecture focused on handwritten character recognition, such as postal code interpretation. As a deep network, early layers recognize features (such as edges), and later layers recombine these features into higher-level attributes of the input.

The LeNet CNN architecture is made up of several layers that implement feature extraction and then classification (see Figure 3.1.1). The image is divided into receptive fields that feed into a convolutional layer, which then extracts features from the input image. The next step is pooling, which reduces the dimensionality of the extracted features (through down-sampling) while retaining the most important information (typically, through max pooling). Another convolution and pooling step is then performed that feeds into a fully connected multilayer perceptron. The final output layer of this network is a set of nodes that identify features of the image (in this case, a node per identified number). You train the network by using back-propagation.

Figure 3.1.1: CNN architecture for feature extraction and image classification
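To make the pipeline above concrete, the following is a minimal LeNet-style sketch in PyTorch. The layer sizes (one input channel, 6 and 16 convolution filters, 120/84 hidden units) are illustrative assumptions for 28x28 grayscale digit images, not the exact configuration LeCun used.

import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # convolution: extract low-level features
            nn.ReLU(),
            nn.MaxPool2d(2),                            # pooling: down-sample, keep strongest responses
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(                # fully connected multilayer perceptron
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),                 # one output node per class (e.g. per digit)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetLike()
dummy = torch.randn(1, 1, 28, 28)   # a single 28x28 grayscale image
print(model(dummy).shape)           # torch.Size([1, 10])

Training such a model with back-propagation then only requires a loss function (e.g. cross-entropy) and an optimizer applied to model.parameters().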

The use of deep layers of processing, convolutions, pooling, and a fully connected classification layer opened the door to various new applications of deep learning neural networks. In addition to image processing, the CNN has been successfully applied to video recognition and various tasks within natural language processing.

Example applications: Image recognition, video analysis, and natural language processing.

3.1.2. Recurrent neural networks:

The RNN is one of the foundational network architectures from which other
deep learning architectures are built. The primary difference between a typical
multilayer network and a recurrent network is that rather than completely feed-
forward connections, a recurrent network might have connections that feed back
into prior layers (or into the same layer). This feedback allows RNNs to maintain
memory of past inputs and model problems in time.

RNNs consist of a rich set of architectures (we’ll look at one popular topology
called LSTM next). The key differentiator is feedback within the network, which
could manifest itself from a hidden layer, the output layer, or some combination
thereof.

Figure 3.1.2: RNN architecture and connections between layers

RNNs can be unfolded in time and trained with standard back-propagation, or by using a variant of back-propagation called back-propagation through time (BPTT).
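As a rough illustration, the sketch below runs a single recurrent layer over a toy batch of sequences in PyTorch; the sizes and the sequence-classification head are assumptions for demonstration only.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # hidden state feeds back at every step
head = nn.Linear(16, 2)                                       # classify from the final hidden state

x = torch.randn(4, 10, 8)        # batch of 4 sequences, 10 time steps, 8 features each
outputs, h_n = rnn(x)            # outputs: hidden state at every step; h_n: hidden state at the last step
logits = head(h_n.squeeze(0))    # gradients flow back through all 10 steps when trained (BPTT)
print(logits.shape)              # torch.Size([4, 2])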

Example applications: Speech recognition and handwriting recognition.



3.1.3. LSTM networks:

The LSTM departed from typical neuron-based neural network architectures and instead introduced the concept of a memory cell. The memory cell can retain its value for a short or long time as a function of its inputs, which allows the cell to remember what's important and not just its last computed value.

The LSTM memory cell contains three gates that control how information
flows into or out of the cell. The input gate controls when new information can
flow into the memory. The forget gate controls when an existing piece of
information is forgotten, allowing the cell to remember new data. Finally, the
output gate controls when the information that is contained in the cell is used in
the output from the cell. The cell also contains weights, which control each gate.
The training algorithm, commonly BPTT, optimizes these weights based on the
resulting network output error.
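The sketch below steps an LSTM cell through a toy sequence in PyTorch; the input and hidden sizes are illustrative. The input, forget, and output gates described above are implemented inside nn.LSTMCell rather than written out explicitly.

import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=8, hidden_size=16)
h = torch.zeros(4, 16)            # hidden state (what the cell exposes as output)
c = torch.zeros(4, 16)            # cell state, carried across time steps by the gates
sequence = torch.randn(10, 4, 8)  # 10 time steps, batch of 4, 8 features per step

for x_t in sequence:              # at each step the gates decide what to write, keep, and expose
    h, c = cell(x_t, (h, c))
print(h.shape)                    # torch.Size([4, 16])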

Figure 3.1.3: LSTM memory cell

Recent applications of CNNs and LSTMs produced image and video captioning
systems in which an image or video is captioned in natural language. The CNN
implements the image or video processing, and the LSTM is trained to convert
the CNN output into natural language.

Example applications: Image and video captioning systems.



3.1.4. GRU networks:

In 2014, a simplification of the LSTM was introduced called the gated recurrent unit (GRU). This model has two gates, getting rid of the output gate present in the LSTM model. These gates are an update gate and a reset gate. The update gate indicates how much of the previous cell contents to maintain. The reset gate defines how to incorporate the new input with the previous cell contents. A GRU can model a standard RNN simply by setting the reset gate to 1 and the update gate to 0.
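To make the two gates explicit, here is a hand-written single GRU step using PyTorch tensors, following the convention used in the text (the update gate measures how much of the previous contents to keep). All sizes and weight matrices are illustrative assumptions; in practice nn.GRU implements this computation.

import torch

torch.manual_seed(0)
dim_x, dim_h = 8, 16
W_z, U_z = torch.randn(dim_h, dim_x), torch.randn(dim_h, dim_h)
W_r, U_r = torch.randn(dim_h, dim_x), torch.randn(dim_h, dim_h)
W_h, U_h = torch.randn(dim_h, dim_x), torch.randn(dim_h, dim_h)

def gru_step(x, h_prev):
    z = torch.sigmoid(W_z @ x + U_z @ h_prev)           # update gate: how much of the old state to keep
    r = torch.sigmoid(W_r @ x + U_r @ h_prev)           # reset gate: how much old state enters the candidate
    h_tilde = torch.tanh(W_h @ x + U_h @ (r * h_prev))  # candidate new contents
    return (1 - z) * h_tilde + z * h_prev               # with z = 0 and r = 1 this reduces to a plain RNN step

h = gru_step(torch.randn(dim_x), torch.zeros(dim_h))
print(h.shape)   # torch.Size([16])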

Figure 3.1.4: GRU cell

The GRU is simpler than the LSTM, can be trained more quickly, and can be
more efficient in its execution. However, the LSTM can be more expressive and
with more data can lead to better results.

Example applications: Natural language text compression, handwriting


recognition, speech recognition, gesture recognition, image captioning.

3.2. Unsupervised Learning:

Figure 3.2: Unsupervised learning

Unsupervised learning (UL) is a type of algorithm that learns patterns from untagged data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world and can then generate imaginative content. In contrast to supervised learning (SL), where data is tagged by a human, e.g. as "car" or "fish", UL exhibits self-organization that captures patterns as neuronal predilections or probability densities. The other levels in the supervision spectrum are reinforcement learning, where the machine is given only a numerical performance score as its guidance, and semi-supervised learning, where a smaller portion of the data is tagged. Two broad methods in UL are neural networks and probabilistic methods.

3.2.1. Self-organizing maps

The self-organizing map (SOM), also popularly known as the Kohonen map, is an unsupervised neural network that creates clusters of the input data set by reducing the dimensionality of the input. SOMs vary from the traditional artificial neural network in quite a few ways.

Figure 3.2.1: Self-organizing map

The first significant variation is that weights serve as a characteristic of the node. After the inputs are normalized, random weights close to zero are initialized for each feature of the input record, so each output node holds its own weight vector. A random input is then chosen, and the Euclidean distance between each output node's weights and that input is calculated. The node with the least distance is declared the most accurate representation of the input and is marked as the best matching unit, or BMU. With the BMU as the center point, the weights of the nodes within a surrounding radius are also updated and pulled toward the input, with the adjustment decreasing with distance from the BMU; as training proceeds, this radius is gradually shrunk.

Next, in an SOM, no activation function is applied, and because there are no target labels to compare against, there is no concept of calculating error and back-propagation.
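A minimal NumPy sketch of a single SOM training step is shown below: it finds the best matching unit for one input and pulls the weights of nearby grid nodes toward that input. The grid size, learning rate, neighborhood radius, and Gaussian decay are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3
weights = rng.random((grid_h, grid_w, dim))     # random weights per output node
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

def train_step(x, weights, lr=0.5, radius=3.0):
    # Euclidean distance from the input to every node's weight vector
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best matching unit
    # Gaussian neighborhood around the BMU on the 2-D grid
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    # Pull nearby nodes' weights toward the input, scaled by proximity to the BMU
    weights += lr * influence[..., None] * (x - weights)
    return bmu

bmu = train_step(rng.random(dim), weights)
print("BMU grid position:", bmu)

In a full training loop, both lr and radius would be decayed over time, which is the "radius is shrunk" behavior described above.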

Example applications: Dimensionality reduction, clustering high-dimensional inputs to a 2-dimensional output, radiant grade result, and cluster visualization.

3.2.2. Autoencoders

Though the history of when autoencoders were invented is hazy, the best-known variant of the autoencoder is composed of three layers: an input layer, a hidden layer, and an output layer.

Figure 3.2.2: Autoencoders layers

First, the input layer is encoded into the hidden layer using an appropriate
encoding function. The number of nodes in the hidden layer is much less than the
number of nodes in the input layer. This hidden layer contains the compressed
representation of the original input. The output layer aims to reconstruct the input
layer by using a decoder function.

During the training phase, the difference between the input and the output layer is calculated using an error function, and the weights are adjusted to minimize the error. Unlike traditional unsupervised learning techniques, where there is no data to compare the outputs against, autoencoders learn continuously using backward propagation. For this reason, autoencoders are classified as self-supervised algorithms.
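A minimal PyTorch sketch of this encode-decode-reconstruct loop follows; the 784-to-32 layer sizes and the dummy data are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),    # encoder: compress 784 inputs to a 32-unit hidden layer
    nn.Linear(32, 784), nn.Sigmoid()  # decoder: reconstruct the original 784 values
)
loss_fn = nn.MSELoss()                # error between the input and its reconstruction
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 784)               # a dummy batch standing in for flattened images
for _ in range(5):                    # a few gradient steps to illustrate training
    reconstruction = model(x)
    loss = loss_fn(reconstruction, x)
    optimizer.zero_grad()
    loss.backward()                   # weights adjusted via backpropagation of the reconstruction error
    optimizer.step()
print(loss.item())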

Example applications: Dimensionality reduction, data interpolation, and data compression/decompression.

3.2.3. Restricted Boltzmann Machines

An RBM is a 2-layered neural network. The layers are input and hidden layers.
As shown in the following figure, in RBMs every node in a hidden layer is
connected to every node in a visible layer. In a traditional Boltzmann Machine,
nodes within the input and hidden layer are also connected. Due to computational
complexity, nodes within a layer are not connected in a Restricted Boltzmann
Machine.

Figure 3.2.3: Restricted Boltzmann Machine

During the training phase, RBMs calculate the probability distribution of the training set using a stochastic approach. When the training begins, each neuron gets activated at random. The model also contains respective hidden and visible biases: while the hidden bias is used in the forward pass to build the activation, the visible bias helps in reconstructing the input.
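The sketch below illustrates one training update in the spirit described above, using a single contrastive-divergence (CD-1) step written in NumPy. The layer sizes, learning rate, and sampling details are illustrative assumptions rather than a faithful reproduction of any particular RBM implementation.

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)           # visible bias: helps reconstruct the input
b_h = np.zeros(n_hidden)            # hidden bias: used in the forward pass to build the activation

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    # Forward pass: stochastic hidden activations given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # Reconstruction: sample visible units back from the hidden sample
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(n_visible) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # Gradient approximation: data statistics minus reconstruction statistics
    return lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1)), v1

dW, reconstruction = cd1_step(np.array([1., 0., 1., 1., 0., 0.]))
W += dW
print("reconstruction:", reconstruction)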

Because in an RBM the reconstructed input is always different from the original
input, they are also known as generative models.

Also, because of the built-in randomness, the same input can produce different outputs across passes. In fact, this is the most significant difference from an autoencoder, which is a deterministic model.

Example applications: Dimensionality reduction and collaborative filtering

4. Introduce some recent outstanding achievements of Deep Learning

4.1. Video-to-video synthesis:

In 2018, Ting-Chun Wang and others announced a new video-to-video synthesis approach. In this approach, they aim to turn a segmented input source video into an output photorealistic video that precisely depicts the content of the source video. The result is high-resolution, photorealistic, temporally coherent video output on a diverse set of input formats, including segmentation masks, sketches, and poses. They achieve this by using a neural generator network to create the images, with one discriminator network checking whether the images look good one by one and another discriminator looking over the sequence of images to check whether it would pass as a video.

Figure 4.1: Network architecture of video-to-video synthesis



The generator has a coarse-to-fine architecture: a residual network G1 is first trained on lower-resolution images; then another network G2 is appended to G1, and the two networks are trained jointly on high-resolution images. Specifically, the input to the residual blocks in G2 is the element-wise sum of the feature map from G2 and the last feature map from G1.
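The following PyTorch sketch illustrates the coarse-to-fine idea only: a small G1 processes the downsampled input, and G2's residual path receives the element-wise sum of its own feature map and G1's last feature map. The channel counts, block structure, and layer choices are simplified assumptions and not the architecture used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class G1(nn.Module):                     # global generator working at low resolution
    def __init__(self, ch=64):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x_low):
        feat = self.body(self.head(x_low))            # last feature map of G1
        return self.to_rgb(feat), feat

class G2(nn.Module):                     # local enhancer working at full resolution
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU())
        self.res = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, x_full, g1_feat):
        fused = self.down(x_full) + g1_feat           # element-wise sum of the two feature maps
        return self.up(self.res(fused))

x_full = torch.randn(1, 3, 128, 128)                  # a dummy full-resolution semantic input
x_low = F.interpolate(x_full, scale_factor=0.5)
g1, g2 = G1(), G2()
_, g1_feat = g1(x_low)
out = g2(x_full, g1_feat)
print(out.shape)                                      # torch.Size([1, 3, 128, 128])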

4.1.1 Semantic Labels → Cityscapes Street Views

Figure 4.1.1: a) Semantic Labels results

Starting from a video in some source domain, they synthesize a new video in a target domain using the learned network. Semantic labels allow them to edit or create content in a convenient input domain and generate a video in an output domain that is harder to edit or create directly.

Figure 4.1.1: b) Semantic Labels results

The network can synthesize multiple results given the same input, or it can be manipulated to generate the desired output video. In the semantic label map, each color corresponds to an object class, and we can change the meaning of a label. Some examples of this are transforming trees into buildings or vice versa and changing the styles of buildings or roads.

4.1.2. Face → Edge → Face:

They train a sketch-to-face video synthesis model by using the real face videos in the FaceForensics dataset. The network learns to transfer an edge-map video to a video of a human face. It can also generate different faces from the same input edge map. On the other hand, the model can change the facial appearance of the original face videos. The resulting video is temporally consistent from frame to frame.

Figure 4.1.2: Face → Edge → Face Results


4.1.3. Body → Pose → Body:

Wang's model can synthesize videos of humans moving given pose information, outputting high-resolution photorealistic dance videos that contain unseen body shapes and motions. The method can change the clothing for the same dancer or transfer poses from one person to another with consistent shadows.
Figure 4.1.3: Body → Pose → Body Results
4.1.4. Frame Prediction:

Figure 4.1.4: Frame Prediction Results

To predict the future video given a few observed frames, the team decomposed the task into two sub-tasks:

- Synthesizing future semantic segmentation masks using the observed frames.
- Converting the synthesized segmentation masks into videos.

In practice, after extracting the segmentation masks from the observed frames, they trained a generator to predict future semantic masks. They then use the proposed video-to-video synthesis approach to convert the predicted segmentation masks into a future video.

4.2. Language models: Google’s BERT representation

In Natural Language Processing (NLP), a language model is a model that can estimate the probability distribution of a set of linguistic units, typically a sequence of words. These are interesting models since they can be built at little cost and have significantly improved several NLP tasks such as machine translation, speech recognition, and parsing.
Historically, one of the best-known approaches is based on Markov models and
n-grams. With the emergence of deep learning, more powerful models generally
based on long short-term memory networks (LSTM) appeared. Although highly
effective, existing models are usually unidirectional, meaning that only the left (or
right) context of a word ends up being considered.
In October 2018, the Google AI Language team published a paper that caused a stir in the community. BERT (Bidirectional Encoder Representations from Transformers) is a new bidirectional language model that has achieved state-of-the-art results on 11 complex NLP tasks, including sentiment analysis, question answering, and paraphrase detection.

Figure 4.2: a) Comparative results for the GLUE Benchmark.


The strategy for pre-training BERT differs from the traditional left-to-right or
right-to-left options. The novelty consists of:
- Masking some percentage of the input tokens at random and then predicting only those masked tokens; this keeps the words, in a multi-layered context, from indirectly "seeing themselves".
- Building a binary classification task to predict whether sentence B follows immediately after sentence A, which allows the model to learn the relationship between sentences, a phenomenon not directly captured by classical language modeling.
4.2.1. Masked Language Model (MLM):

Some of the words in the input are masked out, and the model then conditions on the surrounding words in both directions to predict them. Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
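A toy sketch of the masking step, in plain Python, is shown below; the 15% masking rate follows the description above, while the example sentence and the simple whitespace tokenization are illustrative assumptions (BERT itself uses WordPiece tokens).

import random

random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
num_to_mask = max(1, round(0.15 * len(tokens)))   # roughly 15% of the tokens
positions = random.sample(range(len(tokens)), k=num_to_mask)

masked = list(tokens)
targets = {}
for i in positions:
    targets[i] = masked[i]      # the model is trained to predict this original token
    masked[i] = "[MASK]"

print(masked)    # input sequence with [MASK] tokens inserted
print(targets)   # positions and original words the model must recover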

Figure 4.2.1: Masked Language Model

4.2.2. Next Sentence Prediction (NSP):

In Next Sentence Prediction (NSP), BERT learns to model relationships between sentences. In the training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. Consider two sentences A and B: is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:

Figure 4.2.2: Next Sentence Prediction
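As a rough illustration of how such sentence pairs could be constructed for training, consider the following plain-Python sketch; the example sentences and the 50/50 split are illustrative assumptions.

import random

random.seed(0)
sentences = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live mostly in the Southern Hemisphere.",
]
pairs = []
for i in range(len(sentences) - 1):
    if random.random() < 0.5:
        # Positive example: sentence B really is the next sentence (label 1)
        pairs.append((sentences[i], sentences[i + 1], 1))
    else:
        # Negative example: sentence B is a random sentence from elsewhere (label 0)
        other = random.choice([s for j, s in enumerate(sentences) if j != i + 1])
        pairs.append((sentences[i], other, 0))
print(pairs)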



When training the BERT model, both techniques are trained together, minimizing the combined loss function of the two strategies.

Figure 4.2: b) SQuAD 2.0 Leaderboard

On SQuAD v2.0, BERT achieves an 89.474% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 87.147%, which is itself greater than human performance by 0.316%.

BERT is undoubtedly a milestone in the use of Deep Learning for Natural Language Processing.

References:

Websites:

[1] ibm.com
https://www.ibm.com/cloud/learn/deep-learning
https://developer.ibm.com/technologies/artificial-intelligence/articles/cc-machine-learning-deep-learning-architectures/

[2] machinelearningknowledge.ai
https://machinelearningknowledge.ai/brief-history-of-deep-learning/

[3] nvlabs.github.io
https://nvlabs.github.io/few-shot-vid2vid/

[4] tryolabs.com
https://tryolabs.com/blog/2018/12/19/major-advancements-deep-learning-2018/

[5] towardsdatascience.com
https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad

[6] web.stanford.edu
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15812785.pdf

Science journals:

[7] Sanskruti Patel and Atul Patel, "Deep Learning Architectures and its Applications: A Survey", 2018.

[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro, "Video-to-Video Synthesis", 2018.

Books:

[9] Nikhil Buduma, with contributions by Nicholas Locascio, Fundamentals of Deep Learning (pages 85-109).

[10] Armando Vieira and Bernardete Ribeiro, Introduction to Deep Learning Business Applications for Developers: From Conversational Bots in Customer Service to Medical Image Processing (pages 38-40).
