
Deep Learning and Applications

Pham The Bao


ptbao@sgu.edu.vn
Contents
• Introduction
• CNN
• LSTM
• CNN with TensorFlow
• RNN (LSTM) with TensorFlow

2
What is the Issue?
• How do we understand language?
• By looking at a picture, how can we tell what it
is?
• How can a computer “look” at a speech
waveform and tell what is being said?
– Waveform => phonemes => words =>
understanding
• How can we train computers to represent
high-level concepts from low-level raw data?

3
Data! More Data!!
• Assuming that we have a sufficiently powerful
learning algorithm, one of the most reliable
ways to get better performance is to give the
algorithm more data.
• This has led to the saying, "sometimes it's not
who has the best algorithm that wins; it's who
has the most data."

4
Unlabeled Data
• Getting more labeled data is expensive
– Hand-labeling is expensive
• Manual feature extraction is hard
– Hand-engineering features is tedious
• Can we let our algorithm learn from
unlabeled data?
– We can download tons of unlabeled data from the
Internet

5
Theoretical Perspective
• How to learn multi-layer generative models of
unlabeled data by learning one layer of
features at a time
• How to use generative models to make
discriminative training methods work better
for classification and regression
• How to perform non-linear dimensionality
reduction on very large training sets

6
Two Types of ML Tasks
1. Statistical Methods
• Low dimensional data (say, less than 100 dimensions)
• Lots of noise in the data
• Not much structure
• The main problem is distinguishing structure from noise

2. AI Methods
• High dimensional data
• The noise is not enough to drown out the structure
• Huge amount of structure
• The main problem is figuring out the structure in the data

7
Historical Background:
First generation NN
• Perceptron (1960s)
– Used a layer of hand-coded features and tried to
recognize objects by learning how to weight these
features
• There was a neat learning algorithm for adjusting the
weights
• But perceptrons are fundamentally limited in what they
can learn to do

8
Second Generation NN (mid-1980s)
• The back-propagation algorithm for learning multiple
layers of non-linear features was invented several
times in the 1970s and 1980s
• Back-prop showed great promise
– Intuitively there are advantages to multiple layers
– But in practice there was rarely a benefit from more than
one hidden layer
• By the 1990s people had largely given up on it:
– BP could not make good use of multiple hidden
layers (except in “time-delay” and convolutional
nets).
– It did not work well in recurrent networks.

9
What is Wrong with BP?
• It requires labeled training data.
– Almost all data is unlabeled.
• The learning time does not scale well
– It is very slow in networks with multiple
hidden layers.
• It can get stuck in poor local optima.
• BP Is OK, but for deep nets it is far from
optimal.
10
Then Came the SVM
• Vapnik and co-workers developed a very clever type of
Perceptron called the SVM
– Instead of hand-coding the layer of non-adaptive features,
each training example is used to create a new feature
using a fixed recipe
• The feature computes how similar a test example is to that
training example
– Then a clever optimization technique is used to select the
best subset of the features and decide how to weight
these features in classifying a test case
• But it is just a Perceptron with all its limitations
• Nevertheless, by the 1990s people had abandoned multi-layer
NNs and opted for SVMs

11
Overcoming Limitations of BP
• Keep the efficiency and simplicity of using a
gradient method for adjusting the weights, but
use it for modeling the structure of the sensory
inputs
– Adjust weights to maximize the probability that a
generative model would have produced the sensory
input
– Learn p(image) not p(label|image)
• What kind of generative model should we learn?

12
Inspiration from Visual Cortex
• Look at the picture
• It shows how different regions of the brain
process information to make sense of what we see

13
The Visual Cortex
• V1: Identifies the edges
• V2: Identifies combinations of edges as
mouth, nose, eyes, chin, etc.
• V4: Identifies the face

14
Learning Feature Hierarchies
pixels => edges => object parts (combinations of edges) => object models
15
Google Brain Project
• Inside Google’s secretive X laboratory, known for
inventing self-driving cars and augmented reality
glasses, a group of researchers began working
several years ago on a simulation of the human
brain.
• There, scientists created one of the largest neural
networks for machine learning by connecting
16,000 processors, and they turned it loose on
the Internet to learn on its own.

16
What Did it Learn?
• Presented with 10 million digital images found
in YouTube videos, what did Google’s Brain
do?
• Same as what millions of humans do with
YouTube: looked for cats!
• The neural network taught itself to recognize
cats!

17
How Many Computers Does It Take to Identify a Cat?
16,000

18
The Central Idea of Deep Nets
• Multiple layers of representation
• Each layer could be a
– Neural net
– Boltzmann machine
– DAG like a Bayesian network
• Distributed representation
– Units in a layer are not necessarily mutually
exclusive in capturing features

19
Deep Learning Overview
• Train networks with many layers
• Multiple layers work to build an improved feature space
– First layer learns 1st order features (e.g. edges…)
– 2nd layer learns higher order features (combinations of first
layer features, combinations of edges, etc.)
– Early layers usually learn in an unsupervised mode and
discover general features of the input space – serving
multiple tasks related to, say, image recognition, etc.
– Then final layer features are fed into supervised layer(s)
• And entire network is often subsequently tuned using supervised
training of the entire net, using the initial weightings learned in the
unsupervised phase
– Could also do fully supervised versions, etc. (e.g. some early
BP-type nets)

20
Deep Learning
• Usually best when input space is locally
structured
– spatial or temporal:
– images, language, etc. vs arbitrary input features
• Images Example: early vision layer

21
Why Deep Learning
• Biological Plausibility – e.g. Visual Cortex
• Håstad’s proof - problems that can be represented
with a polynomial number of nodes with k layers may
require an exponential number of nodes with k-1 layers
(e.g. parity)
• Highly varying functions can be efficiently represented
with deep architectures
– Fewer weights/parameters to update than in a less efficient
shallow representation
• Sub-features created in deep architecture can
potentially be shared between multiple tasks
– Type of Transfer/Multi-task learning

22
Early Work
• Fukushima (1980) – Neocognitron
• LeCun (1998) – Convolutional Neural Networks
– Similarities to the Neocognitron
• Many layered feed-forward network with
backpropagation
– Tried early but without much success
• Very slow
• Diffusion of gradient
– Very recent work has shown significant accuracy
improvements by "patiently" training deeper layers
with BP using fast machines (GPUs)

23
Training Deep Networks
• Difficulties of supervised training of deep networks
– Early layers do not get trained well
• Diffusion of Gradient – error attenuates as it propagates back
• Top couple layers can usually learn any task "pretty well" and the
error to earlier layers drops quickly
• The top layers "mostly" solve the task, so lower layers never get the
opportunity to use their capacity to improve results
• Need a way for early layers to do effective work
• Leads to very slow training
– Often not enough labeled data available while there may be
lots of unlabeled data
• Can we use unsupervised/semi-supervised approaches to take
advantage of the unlabeled data?
– Deep networks tend to have more local minima problems
than shallow networks during supervised training
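As a rough illustration of the gradient-diffusion point above, here is a toy NumPy sketch (the depth, the random pre-activations, and the use of plain sigmoid units are made-up choices for illustration) showing how the error signal shrinks as it is multiplied by a sigmoid derivative at every layer on the way back:

```python
import numpy as np

# Toy illustration of gradient diffusion: with sigmoid units, the
# backpropagated error picks up a factor sigmoid'(z) <= 0.25 per layer,
# so it attenuates quickly before reaching the early layers.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
grad = 1.0                       # error signal at the output layer
for depth in range(10):          # 10 hidden layers, hypothetical
    z = np.random.randn()        # pre-activation of some unit
    grad *= sigmoid(z) * (1 - sigmoid(z))   # derivative factor <= 0.25
    print(f"{depth + 1} layers back: |gradient| ~ {abs(grad):.2e}")
```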

24
Greedy Layer-Wise Training - 1
1. Train first layer using unlabeled data
– Use unsupervised or semi-supervised training and take advantage of the larger amount of
unlabeled data. Can also use labeled data but simply leave out the labels.
2. Then freeze the first layer parameters and start training the
second layer using the output of the first layer as input to the
second layer
3. Repeat this for as many layers as desired
– This builds our set of robust features
4. Use the outputs of the final layer as inputs to a supervised
layer/model and train the last supervised layer(s) (freeze
early weights)
5. Unfreeze all weights and fine-tune the full network by
training with a supervised approach, starting from the
pre-trained weight settings (see the sketch below)
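A minimal tf.keras sketch of these five steps, using stacked autoencoders for the unsupervised layers (all data shapes, layer sizes, epoch counts, and the random data are made-up placeholders, not the lecture's actual setup):

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: many unlabeled examples, few labeled ones.
X_unlabeled = np.random.rand(5000, 784).astype("float32")
X_labeled = np.random.rand(500, 784).astype("float32")
y_labeled = np.random.randint(0, 10, size=500)

layer_sizes = [784, 256, 64]          # input dim followed by two hidden layers
encoders = []
inputs = X_unlabeled

# Steps 1-3: train one autoencoder per layer on unlabeled data,
# keep only the encoder half, freeze it, feed its output to the next layer.
for n_in, n_hidden in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = tf.keras.layers.Dense(n_hidden, activation="sigmoid")
    dec = tf.keras.layers.Dense(n_in, activation="sigmoid")
    ae = tf.keras.Sequential([tf.keras.Input(shape=(n_in,)), enc, dec])
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(inputs, inputs, epochs=5, batch_size=128, verbose=0)
    enc.trainable = False             # step 2: freeze the trained layer
    encoders.append(enc)
    inputs = enc(inputs).numpy()      # its output is the next layer's input

# Step 4: stack the frozen encoders and train a supervised softmax layer on top.
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(layer_sizes[0],))] + encoders +
    [tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X_labeled, y_labeled, epochs=5, verbose=0)

# Step 5: unfreeze everything and fine-tune the whole network.
for enc in encoders:
    enc.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy")
model.fit(X_labeled, y_labeled, epochs=5, verbose=0)
```

Freezing with layer.trainable = False (and recompiling) gives each layer its turn as the only "top" layer; the final recompile with a smaller learning rate is the fine-tuning pass.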

25
Greedy Layer-Wise Training - 2
• Greedy layer-wise training avoids many of the
problems of trying to train a deep net in a
supervised fashion
– Each layer gets full learning focus in its turn since it is
the only current "top" layer
– Can take advantage of the unlabeled data
– Once you finally start the supervised training portion,
the network weights have already been adjusted so that you
are in a good error basin and only need to
fine-tune. This helps with the problems of
• Ineffective early layer learning
• Deep network local minima

26


Digression on Neural Nets

27
Equations

28
An Autoencoder
• One of the features of BP is its ability to
discover useful intermediate representations
at the hidden layer
• Consider an autoencoder trained with BP
– It has, say, 8 input and 8 output units.
– 3 units at the hidden layer
– Input is reproduced at the output
– The 3 hidden units are forced to represent the
input in some coded fashion
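A small tf.keras sketch of such an 8-3-8 autoencoder (the use of one-hot patterns as inputs and the hyperparameters are assumptions for illustration):

```python
import numpy as np
import tensorflow as tf

# A toy 8-3-8 autoencoder: 8 inputs, 3 hidden units, 8 outputs,
# trained with backprop to reproduce its input at the output.
X = np.eye(8, dtype="float32")          # 8 one-hot input patterns (assumed)

hidden = tf.keras.layers.Dense(3, activation="sigmoid")   # the coded representation
output = tf.keras.layers.Dense(8, activation="sigmoid")   # the reconstruction

autoencoder = tf.keras.Sequential([tf.keras.Input(shape=(8,)), hidden, output])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=3000, verbose=0)              # target = input

# Reuse the trained hidden layer as an encoder: its 3 activations
# are the code the network invented for each of the 8 patterns.
encoder = tf.keras.Sequential([tf.keras.Input(shape=(8,)), hidden])
print(np.round(encoder.predict(X, verbose=0), 2))
```

With 8 distinct patterns and only 3 hidden units, the network is forced to invent a compact, roughly binary code for the inputs, which is the point of the slide.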

29
Autoencoder

30
Learning Features
• An autoencoder neural network is an
unsupervised learning algorithm that applies
back propagation, setting the target values to
be equal to the inputs.
• It learns an approximation to the identity
function, so “output = input”
• It captures features in input data as
activations of hidden units
• See figure on next slide
31
Feature Learning

32
Trained Network
• Having trained the parameters of this model,
given any new input, we can now compute
the corresponding vector of activations of the
hidden units.
• This often gives a better representation of the
input than the original raw input. We can also
visualize the algorithm for computing the
features/activations as the following neural
network:

33
Hidden Unit Activations

34
Auto-Encoders
• A type of unsupervised learning which tries to discover generic
features of the data
– Learn identity function by learning important sub-features (not by just
passing through data)
– Compression, etc.
– Can use just new features in the new training set or concatenate both

35
Stacked Auto-Encoders
• Bengio (2007) – After Deep Belief Networks (2006)
• Stack many (sparse) auto-encoders in succession and train them
using greedy layer-wise training
• Drop the decode layer each time

36


Visualizing Trained Encoder
• Suppose we apply 10x10 pixel images as
inputs
• Suppose there are 100 hidden units
• What features are the hidden units
looking for?
• We can visualize the functions computed by the
hidden units, plot them, and see
something like the following:
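One common way to do this (the weight matrix below is a random stand-in for the trained encoder weights) is to display, for each hidden unit, the norm-constrained input that maximally activates it, which is just that unit's incoming weight vector rescaled:

```python
import numpy as np

# For an encoder with weights W of shape (100 hidden units, 100 pixels),
# the norm-constrained input that maximally activates hidden unit i is
#   x_j = W_ij / sqrt(sum_j W_ij^2)
W = np.random.randn(100, 100)             # stand-in for learned weights

def max_activating_image(W, i):
    w = W[i]                               # incoming weights of hidden unit i
    return (w / np.linalg.norm(w)).reshape(10, 10)   # view as a 10x10 image

# e.g. show max_activating_image(W, 0) with matplotlib.pyplot.imshow
```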

37
Visualizing Hidden Unit Activation

38
Unsupervised Feature Learning
• Give our algorithms a large amount of
unlabeled data with which to learn a good
feature representation of the input.
• If we are trying to solve a specific classification
task, then we take this learned feature
representation and combine it with whatever
labeled data we may have, and apply
supervised learning on that labeled data to
solve the classification task.

39
Self-Taught Learning
• In the previous slides, we used an autoencoder to
learn features. These features were learned using
only unlabeled data.
• We can now feed these features as input to a
classifier such as Softmax or logistic regression
classifier.
• In self-taught learning, we first train a sparse
autoencoder on the unlabeled data. Then, given a
new example, we use the hidden layer to
extract its features. This is illustrated in the
following diagram:
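A minimal tf.keras sketch of this two-stage pipeline (the dimensions, the random data, and the l1 activity penalty used as a crude stand-in for the sparsity term are all assumptions):

```python
import numpy as np
import tensorflow as tf

# Stage 1: train a (roughly) sparse autoencoder on unlabeled data.
X_unlabeled = np.random.rand(10000, 100).astype("float32")
X_train = np.random.rand(1000, 100).astype("float32")
y_train = np.random.randint(0, 5, size=1000)

encoder = tf.keras.layers.Dense(
    25, activation="sigmoid",
    activity_regularizer=tf.keras.regularizers.l1(1e-4))  # crude sparsity penalty
decoder = tf.keras.layers.Dense(100, activation="sigmoid")

autoencoder = tf.keras.Sequential([tf.keras.Input(shape=(100,)), encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_unlabeled, X_unlabeled, epochs=10, batch_size=256, verbose=0)

# Stage 2: extract hidden-layer features for the labeled examples
# and train a softmax (multinomial logistic regression) classifier on them.
features = encoder(X_train).numpy()
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(25,)),
    tf.keras.layers.Dense(5, activation="softmax"),
])
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
classifier.fit(features, y_train, epochs=20, verbose=0)
```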

40
Self-taught Learning:
Logistic Regression

41
Deep Neural Nets
• So far we constructed a 3-layer neural network comprising an
input, hidden and output layer.
• While fairly effective, this 3-layer model is a fairly shallow
network;
– The features (hidden layer activations a(2)) are computed using only "one
layer" of computation (the hidden layer).
• Consider deep networks, i.e., networks with multiple hidden layers;
– this will allow us to compute much more complex features of the input.
– Because each hidden layer computes a non-linear transformation of the
previous layer, a deep network can have significantly greater representational
power (i.e., can learn more complex functions) than a shallow one.
• While training a deep network, it is important to use a non-
linear activation function in each hidden layer, because
multiple layers of linear functions would themselves compute only
a linear function of the input
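To see why, compose two purely linear layers with generic weights W1, W2 and biases b1, b2:

y = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)

which is again a single linear map of x, so stacking linear layers adds no representational power.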

42
Advantages of Deep Networks
• By using a deep network, in the case of images,
one can also start to learn part-whole
decompositions.
• For example,
– the first layer might learn to group together pixels in
an image in order to detect edges (as seen in the
earlier exercises).
– The second layer might then group together edges to
detect longer contours, or perhaps detect simple
"parts of objects."
– An even deeper layer might then group together these
contours or detect even more complex features.

43
