Supervised Deep Learning

Uploaded by

neeharika.sssvv
Copyright
© © All Rights Reserved

SUPERVISED DEEP LEARNING

Without an activation function, we are restricted to only linear output.
z = net input
b = bias term
f = activation function
a = output of one layer, passed as input to the next layer.

Why neural network?


A single neuron (like logistic regression) only permits a linear decision boundary. Most
real-world problems are considerably more complicated and so we need stacked neural
networks rather than having only a single neuron.
Moving in the forward direction, i.e., from input to output → feedforward networks.
hidden_layer_sizes = (5, 2) → two hidden layers, one with 5 units and the other with
2 units. len((5, 2)) is the number of hidden layers we have.
Every layer between the input layer and the output layer is a hidden layer. Weights are
represented by matrices.

In practice, we do these computations for many datapoints at the same time, by
stacking the rows into a matrix. But the equations look the same.
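As a minimal sketch of the batched forward pass described above, the snippet below pushes four data points through a network with hidden layers of 5 and 2 units. The sigmoid activation, the random weights, and the single output unit are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # 4 data points stacked as rows, 3 input features

# Input of size 3, hidden layers of 5 and 2 units, 1 output unit (assumed sizes).
layer_sizes = [3, 5, 2, 1]
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

a = X
for W, b in zip(weights, biases):
    z = a @ W + b   # net input for every data point at once
    a = sigmoid(z)  # output of this layer, passed as input to the next

print(a.shape)  # (4, 1): one output per data point
```

The same two lines inside the loop would apply to a single row or to a whole matrix of rows, which is why the equations look the same in both cases.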
Optimization and Gradient Descent:
Update the parameters by going through one observation at a time – Stochastic
gradient descent.
Update the parameters using only a subset of the observations in our data – Mini-batch
gradient descent.

Gradient Descent:
Too large alpha (learning rate) → the update overshoots the minimum.
Too small alpha → it takes more time to reach the minimum, i.e., more time to optimize
the model.
Each point can be iteratively calculated from the previous one.
This process is repeated and we eventually end up at a minimum (the global minimum when
the cost function is convex).
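The effect of the learning rate can be sketched on a toy cost function. The function J(w) = w**2, the starting point, and the three alpha values below are assumptions picked only to make the behaviour visible.

```python
def gradient_descent(alpha, steps=50, start=5.0):
    """Minimise the toy cost J(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w = w - alpha * 2 * w  # step in the direction opposite to the gradient
    return w

too_small = gradient_descent(alpha=0.001)  # barely moves away from the start
reasonable = gradient_descent(alpha=0.1)   # ends up very close to the minimum at 0
too_large = gradient_descent(alpha=1.1)    # overshoots further on every step

print(too_small, reasonable, too_large)
```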

Stochastic Gradient Descent:


It speeds up each update by using only a single data point to determine the gradient
and the cost function.
The path is less direct due to the noise in a single data point or observation, hence
“Stochastic”.
Mini-batch gradient descent:
Let n be a number between 1 and size of the entire dataset. Perform an update for
every n training examples.

Here, we can reduce the memory needed relative to the original (vanilla) gradient descent,
which uses the entire dataset for every update.
It is less noisy and reaches the optimal value much more smoothly than stochastic gradient
descent.

• Select the method or methods that best help you find the same results as using
matrix linear algebra to solve the equation $\theta = (X^\top X)^{-1} X^\top y$.
o Use stochastic gradient descent, use scikit-learn to build a linear
regression model, or train a neural network model.
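A rough sketch of why gradient-based training can reproduce the closed-form solution: the snippet fits the same linear model with the matrix equation and with mini-batch gradient descent. The synthetic data, learning rate, batch size, and epoch count are all assumptions; setting batch_size to 1 would give stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])  # intercept + 1 feature
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=n)  # assumed true model

# Closed form: solving (X^T X) theta = X^T y is equivalent to using the inverse.
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Mini-batch gradient descent on the mean squared error.
theta = np.zeros(2)
alpha, batch_size = 0.1, 20
for epoch in range(200):
    order = rng.permutation(n)  # shuffle the rows every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ theta - y[idx]
        grad = 2 * X[idx].T @ residual / len(idx)
        theta -= alpha * grad

print(theta_exact, theta)
```

With a constant learning rate the mini-batch estimate keeps jittering slightly around the closed-form answer rather than matching it exactly.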

Back-propagation:
Training a Neural Network

In a nutshell this is the process to train a neural network:

• Put in training inputs, get the output.
• Compare output to correct answers: look at the loss function J.
• Adjust and repeat.
• Backpropagation tells us how to make a single adjustment using calculus.

How have we trained before?

Ans: Using Gradient descent

1. Make prediction.
2. Calculate loss.
3. Calculate gradient of the loss function w.r.t. parameters.
4. Update the parameters by taking a step in the opposite direction.
5. Iterate.
Backpropagation:

We obtain the desired changes to the weights using calculus (the chain rule):

o Functions are chosen to have nice derivatives.
o Numerical issues have to be considered.
• Such as exploding and vanishing gradients.
Using this partial derivative, we will be able to update the weights in the correct
direction.

So, the idea of backpropagation is that we first run our neural network with our
initialized weights. Then, moving back through our layers, we take the derivative of our
loss function with respect to each of the weights in our final layer. We then use that to
get the partial derivatives with respect to our layer 2 weights, and finally our layer 1
weights. We use these to update our initialized values, then feed the updated weights
through our neural net again and repeat the process.
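The procedure can be sketched for a tiny network with one hidden layer. Everything specific here is an assumption for illustration: sigmoid activations, a squared-error loss, no bias terms, random data, and a learning rate of 0.5. The numerical check compares one backpropagated derivative with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    a1 = sigmoid(X @ W1)   # hidden layer output
    a2 = sigmoid(a1 @ W2)  # network output
    return a1, a2

def loss(X, y, W1, W2):
    _, a2 = forward(X, W1, W2)
    return 0.5 * np.mean((a2 - y) ** 2)

def gradients(X, y, W1, W2):
    a1, a2 = forward(X, W1, W2)
    # Final layer first: dJ/dz2, then dJ/dW2.
    delta2 = (a2 - y) * a2 * (1 - a2) / len(X)
    grad_W2 = a1.T @ delta2
    # Move back one layer with the chain rule: dJ/dz1, then dJ/dW1.
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    grad_W1 = X.T @ delta1
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 1))

# Finite-difference check of a single weight.
grad_W1, grad_W2 = gradients(X, y, W1, W2)
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(X, y, W1_plus, W2) - loss(X, y, W1_minus, W2)) / (2 * eps)

# Repeat: forward pass, backward pass, update.
initial_loss = loss(X, y, W1, W2)
for _ in range(200):
    g1, g2 = gradients(X, y, W1, W2)
    W1 -= 0.5 * g1
    W2 -= 0.5 * g2
final_loss = loss(X, y, W1, W2)
print(initial_loss, final_loss)
```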

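If there are more layers, the gradient gets very small at the early layers during
backpropagation; this is called the vanishing gradient problem. With a sigmoid
activation, 0 ≤ f(z) ≤ 1, so its derivative f(z)(1 − f(z)) is at most 0.25, and
backpropagation multiplies one such small factor per layer. For this reason, other
activation functions such as ReLU became more common.
Movement is in the backward direction.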
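A back-of-the-envelope sketch of the effect: the largest possible sigmoid slope is 0.25, so a chain of such slopes shrinks quickly with depth. Ignoring the weights, which also enter the real product, is a simplifying assumption made here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

max_slope = sigmoid_derivative(0.0)  # the derivative peaks at z = 0

# Best-case factor contributed by the activations alone after n_layers layers.
for n_layers in (2, 5, 10, 20):
    print(n_layers, max_slope ** n_layers)
```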
Sigmoid graph:

Tanh graph:

ReLU (Rectified Linear Unit) graph:


On the left side (negative inputs), rather than those tiny changes, there is zero change.
So, these values will actually zero out particular nodes. This zeroing out allows us to
ignore nodes that may not be providing much extra information. Thus, it may be more
efficient than the sigmoid or tanh functions, which always maintain at least some
information at each node. On the other hand, no learning happens at the nodes that are
being zeroed out, and perhaps you want to ensure some type of learning at all nodes. The
solution is Leaky ReLU.

Leaky ReLU graph:

Alpha is a small number here. Leaky ReLU is not necessarily better than ReLU all the time.
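The four activation functions whose graphs are referenced above can be written in a few lines. The default alpha = 0.01 for Leaky ReLU is an assumed, commonly used value, not one given in these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # output between 0 and 1

def tanh(z):
    return math.tanh(z)                # output between -1 and 1

def relu(z):
    return max(0.0, z)                 # zero for every negative input

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z   # small slope instead of zero

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```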

Regularization techniques for deep learning:


Dropout and early stopping are a few of the regularization techniques used in neural
networks. With more layers, we can learn more complex models. These models may fit our
training data perfectly, or may overfit and not generalize well to new data. So, to
prevent overfitting in neural networks, we use regularization techniques.
To prevent overfitting, we have several means to regularize neural networks:
• Adding a regularization penalty to the cost function, similar to Lasso and Ridge.
• Dropout, where we randomly drop certain neurons in our network during training to
ensure our model is not over-reliant on any particular neuron.
• Early stopping, which is the idea of stopping gradient descent short so that the model
does not perfectly fit the training set.
• Stochastic and mini-batch gradient descent (to some extent): they don't perfectly fit
the training data and therefore may generalize better than full-batch gradient
descent.
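A minimal sketch of the dropout idea from the list above. The rescaling by the keep probability (often called inverted dropout) is an assumption about the variant used; the notes only say that neurons are dropped at random.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero out activations while training; do nothing at prediction time."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # True where the neuron is kept
    # Assumed variant: rescale the kept activations so their expected value is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 10))
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(dropped.mean())  # stays close to 1.0 because of the rescaling
```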
Optimizers and Data shuffling:
Optimizers:
Optimizers are different methods of optimizing the weights.

Momentum: more momentum → more smoothing.
η is replaced by beta, and then alpha = 1 − beta, where alpha is the learning rate.
Momentum may cause the update to overshoot the optimum value.

AdaGrad (Adaptive Gradient):


The idea is to scale the update for each weight separately. η here is the learning rate.
• Update frequently updated weights less.
• Keep a running sum of previous updates.
• Divide new updates by a factor of the previous sum.
This leads to smaller updates each iteration. As we get closer to the optimal value, the
learning rate shrinks, and this avoids overshooting.
RMSProp (Root Mean Square Propagation):
Similar to AdaGrad.
Rather than using the sum of previous gradients, decay older gradients more than more
recent ones → more weight is given to more recent gradients.
It is more adaptive to recent gradients.
Adam (Adaptive moment estimation):
The idea is to use both first order and second-order change information and decay both
over time.
Momentum + RMSProp = Adam

Generally, beta1 = 0.9 and beta2 = 0.999.
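The update rules of the four optimizers can be sketched side by side on a one-dimensional toy problem. The formulas below are common textbook formulations and are an assumption here, since the original slides with the equations are not part of these notes; the cost J(w) = w**2, the step counts, and the learning rates are likewise illustrative.

```python
import math

def grad(w):
    return 2.0 * w  # gradient of the toy cost J(w) = w**2

def momentum(w, steps=500, alpha=0.05, eta=0.9):
    v = 0.0
    for _ in range(steps):
        v = eta * v - alpha * grad(w)  # running, smoothed step direction
        w = w + v
    return w

def adagrad(w, steps=500, eta=0.5, eps=1e-8):
    g_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        g_sum += g * g                 # running sum of squared gradients
        w -= eta * g / (math.sqrt(g_sum) + eps)
    return w

def rmsprop(w, steps=500, eta=0.05, beta=0.9, eps=1e-8):
    avg = 0.0
    for _ in range(steps):
        g = grad(w)
        avg = beta * avg + (1 - beta) * g * g  # older gradients decay away
        w -= eta * g / (math.sqrt(avg) + eps)
    return w

def adam(w, steps=500, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-order (momentum) term
        v = beta2 * v + (1 - beta2) * g * g    # second-order (RMSProp-like) term
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return w

for optimizer in (momentum, adagrad, rmsprop, adam):
    print(optimizer.__name__, optimizer(5.0))
```

All four should end near the minimum at 0 on this toy problem; how they compare on a real network depends on the problem.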

Which optimizer to choose?


RMSProp and Adam seem to be quite popular.
It can be difficult to predict in advance which will be best for a particular problem.
It is still an active area of inquiry.
• Adam speeds up the optimization process tremendously and usually does a fairly
good job at finding optimal solutions. There are going to be times when it does
have trouble converging.

Details of training a neural network:


Classical approach: Get derivative for entire dataset, then take a step in that direction.
Pro: Each step is informed by all data.
Con: Very slow, especially as data gets big.
Stochastic gradient descent: Get derivative for just one point, then take a step in that
direction. Here, steps are less informed, but you take more of them, which should balance
out the missteps.
So, take a smaller step size. This also helps with regularization, as the model does not
fit the training data perfectly.
Mini-batch gradient descent: Get derivative for a small set of points, then take a step
in that direction. Strikes a balance between extremes (full batch and stochastic gradient
descents).

An epoch refers to a single pass through all the training data.

• In full batch gradient descent, there would be one step taken per epoch.
• In SGD/online learning, there would be n steps taken per epoch (n = training set
size).
• In mini-batch, there would be (n / batch size) steps taken per epoch (n = number
of rows).

Training neural networks is sensitive to how the derivative of each weight is computed
and how convergence is reached. Important concepts involved at this step:

• Batching methods, which include techniques like full-batch, mini-batch, and
stochastic gradient descent, get the derivative for a set of points.
• Data shuffling, which aids convergence by making sure data is presented in a
different order every epoch.
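Batching and shuffling can be sketched with a small index generator. The helper name `minibatches` and the dataset size of 1000 rows are hypothetical; when n is not a multiple of the batch size, the last batch is smaller, so the step count is n / batch size rounded up.

```python
import math
import random

def minibatches(n_rows, batch_size, rng):
    """Yield lists of row indices that cover the data once, in a freshly shuffled order."""
    order = list(range(n_rows))
    rng.shuffle(order)  # a different order every epoch
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]

rng = random.Random(0)
n_rows = 1000
steps_per_epoch = {}
for name, batch_size in [("full batch", n_rows), ("mini-batch", 32), ("stochastic", 1)]:
    steps_per_epoch[name] = sum(1 for _ in minibatches(n_rows, batch_size, rng))
    print(name, steps_per_epoch[name])
```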
Keras library:
Some of the most common libraries:
• TensorFlow – Developed by Google. Used to build AI-related products.
o It has Keras built in.
• Theano – Grandfather of deep learning frameworks.
o No longer actively developed as of 2018.
• PyTorch – Research oriented. Developed by Facebook.
Keras is a very high-level library that can run on top of either TensorFlow or Theano.
Typical command structure:
o Build the structure of the network (how many layers we want).
o Compile the model, specifying the loss function, metrics, and optimizer (this includes
the learning rate).
o Fit the model to the training data (specifying batch size and number of
epochs).
o Predict on new data.
o Evaluate results.

Keras provides two approaches to building the structure of a model:

• Sequential model: allows a linear stack of layers – simpler and more convenient
if the model has this form.
• Functional API: more detailed and complex but allows more complicated
architectures.
In machine learning there is an approach called early stopping to avoid overfitting. In
that approach you plot the error rate on training and validation data. The horizontal axis
is the number of epochs and the vertical axis is the error rate. You should stop training
when the error rate on the validation data is at its minimum. If you keep increasing the
number of epochs beyond that point, you will end up with an over-fitted model.
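One way to turn this into a rule is sketched below. The `patience` parameter (how many epochs without improvement to tolerate before stopping) and the list of validation errors are assumptions introduced for the example.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return (epoch, error) of the best validation error seen before stopping."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch, best_error

# Hypothetical validation error per epoch: it falls, then rises as the model overfits.
val_errors = [0.40, 0.31, 0.25, 0.22, 0.23, 0.26, 0.30]
print(early_stopping_epoch(val_errors))
```

In practice the weights from the best epoch would be kept, since the model has already trained past that point when the stop is triggered.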

Convolutional Neural Network:

Convolutional layers have relatively few weights, and CNNs have more layers than other
architectures. In practice, data scientists add layers to CNNs to solve specific problems
using Transfer Learning.

A kernel is used to find edges, corners, etc. in our image. A kernel is a grid of weights
“overlaid” on the image, centered on one pixel.

o Each kernel weight is multiplied by the pixel it is overlaid on.
o The output over the centered pixel is $\sum_{p=1}^{P} w_p \cdot \text{pixel}_p$, where
$w_p$ are the P kernel weights. This is the convolution operation.

Kernels are local feature detectors. A kernel does not need to be square.

Primary ideas behind CNN:

• Let the neural network learn which kernels are most useful.
• Use the same set of kernels across the entire image (translation invariance).
• This reduces the number of parameters and the variance (from a bias-variance point of
view).

When we work with centered pixels and output one value per center, the edges and corners
of our image are overlooked. This can be solved by padding.

Padding:

• Pixels on the edge are not used as center pixels since there are not enough
surrounding pixels.
• Padding adds extra pixels around the frame, so the pixels from the original image
become center pixels as the kernel moves across the image.
• These added pixels typically have the value zero (zero-padding).

Striding:

• The stride is the step size with which the kernel moves across the image.
• When stride > 1, it scales down the output dimension.
• Stride = 2 → move 2 steps both horizontally and vertically. The stride can be different
for horizontal and vertical steps.
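Kernel, padding, and stride can be tied together in a short sketch. The function name `conv2d`, the 6x6 test image, and the vertical-edge kernel are assumptions for illustration; as in most deep learning libraries, the kernel is applied without flipping, and a single stride is used for both directions to keep the code short.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a kernel over a single-channel image and sum weight * pixel at each position."""
    if padding:
        image = np.pad(image, padding)  # zero-padding on every side
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Image that is dark on the left half and bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # responds to vertical edges

edges = conv2d(image, kernel)
print(edges[0])                                             # strongest where the edge is
print(conv2d(image, kernel, padding=1).shape)               # padding keeps the 6x6 size
print(conv2d(image, kernel, stride=2, padding=1).shape)     # stride 2 shrinks the output
```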

Depth:

In images, we often have multiple numbers associated with each pixel location. These
numbers are referred to as channels. Example: RGB – 3 channels, CMYK – 4 channels.

CMYK is Cyan, Magenta, Yellow and Black. The number of channels is the depth. So, the
kernel itself will have a depth equal to the number of input channels.
Example: a 5 * 5 kernel on an RGB image → 5 * 5 * 3 (RGB) = 75 weights.

The output from the layer will also have depth.

• The network typically trains many different kernels.
• Each kernel outputs a single number at each pixel location.
• So, if there are 10 kernels → 10 different patterns are detected, and the output has
depth 10.

Pooling:

The idea is to reduce the image size by mapping a patch of pixels to a single value.

• Shrinks the dimensions of an image.
• Does not have parameters, though there are different types of pooling operations
(like max, average).
o Max-pool – For each distinct patch, represent it by its maximum value.
o Average-pool – For each distinct patch, represent it by its average value.
• Pooling is a fixed operation and we need not learn any weights.
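A small sketch of max and average pooling over distinct 2x2 patches. The function name `pool2d` and the 4x4 example image are hypothetical; rows and columns that do not fill a whole patch are simply cropped in this version.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Map each distinct size x size patch of a single-channel image to one value."""
    h, w = image.shape
    h, w = h - h % size, w - w % size  # crop so the patches tile the image exactly
    patches = image[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 0, 5, 6],
                  [1, 2, 7, 8]], dtype=float)

print(pool2d(image, mode="max"))
print(pool2d(image, mode="average"))
```

There is nothing to learn in either variant, which matches the point that pooling has no parameters.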

Transfer Learning

The main idea of Transfer Learning consists of keeping the early layers of a pre-trained
network and re-training the later layers for a specific application.

The last layers in the network capture features that are more particular to the specific
data you are trying to classify.

Later layers are easier to train, as adjusting their weights has a more immediate impact
on the final result.

Guiding Principles for Fine Tuning

While there are no hard rules, these are some guiding principles to keep in mind:

• The more similar your data and problem are to the source data of the pre-trained
network, the less intensive fine-tuning will be.
• If your data is substantially different in nature from the data the source model
was trained on, Transfer Learning may be of little value.

CNN Architectures

LeNet-5

• Created by Yann LeCun in the 1990s.
• Used on the MNIST data set, black and white images.
• Novel idea: use convolutions to efficiently learn features on the data set.
AlexNet

• Considered the “flash point” for modern deep learning.
• Created in 2012 for the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC).
• Task: predict the correct label from among 1000 classes.
• Dataset: around 1.2 million images.
AlexNet developers performed data augmentation for training.

• Cropping, horizontal flipping, and other manipulations.
• This augmentation helps reduce overfitting.
Basic AlexNet Template:

• Convolutions with ReLUs.
• Sometimes add maxpool after a convolutional layer.
• Fully connected layers at the end before a softmax classifier.
VGG

Simplified network structure: VGG has the same concepts and ideas as LeNet but is
considerably deeper. The architecture is simpler, yet it is still able to find more
complex features.

This architecture avoids manual choices of convolution size and is a very deep
network built from 3x3 convolutions.

Stacked together, these small convolutions act like larger convolutions.

This was one of the first architectures to experiment with many layers (More is better!).
It can use multiple 3x3 convolutions to simulate larger kernels with fewer parameters,
and it served as the “base model” for future works.

VGG reduces the number of weights that have to be learned: a single 3x3 kernel has 9
weights and a single 7x7 kernel has 49, whereas three stacked 3x3 kernels, which cover the
same 7x7 region, have only 27.
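The weight counts 9, 49 and 27 can be reproduced with a little arithmetic. The helper below is a sketch that assumes a single input channel, stride 1, and no bias terms.

```python
def stacked_conv(kernel_size, n_layers):
    """Weights and receptive field of n_layers stacked square kernels (stride 1)."""
    weights = n_layers * kernel_size * kernel_size
    receptive_field = 1 + n_layers * (kernel_size - 1)
    return weights, receptive_field

print(stacked_conv(3, 1))  # one 3x3 kernel
print(stacked_conv(7, 1))  # one 7x7 kernel
print(stacked_conv(3, 3))  # three stacked 3x3 kernels see the same 7x7 region
```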

Inception
Proposed by Szegedy et al. (2014), this architecture was built to turn each layer of the
neural network into further branches of convolutions. Each branch handles a smaller
portion of the workload. It combines different layers together in a single layer.

With Inception, the idea is that you may not know exactly what type of filter or what
type of layer you want at each step, so you may want to combine or try a bunch of them
together. But this can be computationally expensive, so we want to accomplish
this with some level of computational efficiency. We also want to ensure
that we can reduce the total number of activations needed to run through our
entire network.

The network concatenates different branches at the end. These networks use different
receptive fields and have sparse activations of groups of neurons.

Inception V3 is a relevant example of an Inception architecture.

ResNet

Researchers were building deeper and deeper networks but started finding these
issues:

In theory, very deep (56-layer) networks should fit the training data better (even if
they overfit), but that was not happening.

It seemed that the early layers were just not getting updated and the signal got lost (due
to vanishing-gradient-type issues).

These are the main reasons why adding layers does not always decrease training error:

• Early layers of deep networks are very slow to adjust.
• This is analogous to the “vanishing gradient” issue.
• In theory, the added layers should be able to just learn an “identity” transformation
that makes the deeper network behave like a shallow one.
In a nutshell, a ResNet:

• Has several layers such as convolutions.
• Enforces “best transformation” by adding “shortcut connections”.
• Adds the inputs from an earlier layer to the output of the current layer.
• Keeps passing both the initial unchanged information and the transformed
information to the next layer.
• Works with much deeper networks and still gets high accuracy.
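The shortcut connection can be sketched with plain matrix operations. Using two dense layers with a ReLU in place of convolutions is a simplifying assumption, and the weight names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """Add the unchanged input (the shortcut) to the transformed input."""
    transformed = relu(x @ W1) @ W2
    return x + transformed

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W1 = rng.normal(size=(4, 4))

# If the block's last weights are zero, the block reduces to the identity,
# so an extra block does not have to make the network worse.
identity_out = residual_block(x, W1, np.zeros((4, 4)))
print(np.array_equal(identity_out, x))

changed_out = residual_block(x, W1, rng.normal(size=(4, 4)))
print(changed_out.shape)
```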

CNN Applications (supervised):


1. Image recognition/classification (animals, digits, malignant/benign)
a. Automatic feature selection/extraction
2. Object detection in images
3. Coloring black and white images
4. Creating art images
5. Natural language processing
6. Speech recognition
7. Face detection
8. Recommender system
9. Image smoothing, blurring, noise filtering, edge detection.
Edge detection is the process of detecting primitive features of an image such as edges,
boundaries, and curves. This is done by using a kernel to convolve the image matrix.
Convolution types (1D):
1. Full (with 0 padding) → output length: len1 + len2 − 1
2. Same (with left-sided 0 padding) → output length: max(len1, len2)
3. Valid (without 0 padding) → output length: max(len1, len2) − min(len1, len2) + 1
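The three output lengths can be checked with NumPy's `numpy.convolve`, which accepts the same three mode names. The signal and kernel values are arbitrary examples.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # len1 = 5
kernel = np.array([1.0, 0.0, -1.0])           # len2 = 3

results = {mode: np.convolve(signal, kernel, mode=mode)
           for mode in ("full", "same", "valid")}

for mode, out in results.items():
    print(mode, len(out), out)
```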
Validation data is used to estimate model properties such as classification error, and
from this to tune model settings such as the optimal number of hidden units or
the stopping point for backpropagation.
Test data is used to evaluate the performance and accuracy of the model against “real
life situations”. There is no further optimization beyond this point.

Transfer learning:
It is difficult to train on large datasets, as fitting takes a long time and is
computationally expensive. However, the basic features (edges, shapes) learned in the
early layers of the network should generalize fairly well to other datasets with similar
problems. Also, the results of the training are just weights (numbers), which are easy to
store.
The idea is to keep the early layers of a pre-trained network and re-train the later
layers for a specific application. This is called transfer learning.
Remove the final layer, or several layers from the back, of the pre-trained model and
train replacements on the new data.
The additional training of a pre-trained network on a specific new dataset is referred to
as Fine Tuning.
There are different options in “how much” and “how far back” to fine-tune.
• Should I train only the last layer?
• Go back a few layers?
• Re-train the entire network (from the starting point of the existing network)?
A few guiding principles of fine tuning:
• The more similar the data and problem are to the source data of the pre-trained
network, the less fine tuning is necessary.
o Example: Using a network trained on ImageNet to distinguish “dogs” from
“cats” should need relatively little fine-tuning. ImageNet already
distinguishes different breeds of dogs and cats, so the network likely has all the
features you will need.
• The more data you have about your specific problem, the more the network will
benefit from longer and deeper fine tuning.
o Example: If you have 100 dogs and 100 cats in your training data, you
probably want to do very little fine tuning, for example re-training only the final
layer or two and reusing many of the features learned from ImageNet.
o On the other hand, if you have 100,000 dogs and 100,000 cats, you may
get more value from longer and deeper fine tuning: going back further, or
even retraining the full network, using the pre-trained network to initialize the weights.
• If your data is substantially different in nature from the data the source model was
trained on, Transfer Learning may be of little value.
o Example: A network that was trained to recognize typed Latin alphabet
characters would not be useful in distinguishing dogs from cats. But it
would likely be useful as a starting point for recognizing Cyrillic alphabet
characters.
Recurrent Neural Network (RNN):
Recurrent Neural Networks are a class of neural networks that allow previous outputs to
be used as inputs while having hidden states. They are mostly used in applications of
natural language processing and speech recognition.

One of the main motivations for RNNs is to derive insights from text and do better than
“bag of words” implementations. Ideally, each word is processed or understood in the
appropriate context.

Words should be handled differently depending on “context”. Also, each word should
update the context.

Under the notion of recurrence, words are input one by one. This way, we can handle
variable lengths of text. This means that the response to a word depends on the words
that preceded it.

These are the two main outputs of an RNN:

• Prediction: What the prediction would be if the sequence ended with that
word.
• State: Summary of everything that happened in the past.
Mathematical Details

Mathematically, there is a recurrent core followed by a dense layer:

current state = function1(old state, current input).

current output = function2(current state).

We learn function1 and function2 by training our network!

r = dimension of input vector

s = dimension of hidden state

t = dimension of output vector (after dense layer)

U is an s × r matrix (linear transformation applied to the input)

W is an s × s matrix (weights applied to the previous state)

V is a t × s matrix (maps the state to the output)

The weight matrices U, V, W are the same across all positions.

The kernel initializer is the weight initializer for the inputs, whereas the recurrent
initializer is the weight initializer for the states.
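The recurrence can be sketched with the matrices U, W and V defined above. The concrete choices for function1 (a tanh of U·input + W·state) and function2 (a softmax of V·state) are assumptions, as are the dimensions and random weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

r, s, t = 4, 3, 2             # input, hidden state, and output dimensions (assumed)
rng = np.random.default_rng(0)
U = rng.normal(size=(s, r))   # applied to the current input
W = rng.normal(size=(s, s))   # applied to the old state
V = rng.normal(size=(t, s))   # dense layer from state to output

def rnn_forward(inputs, U, W, V):
    state = np.zeros(W.shape[0])
    outputs = []
    for x in inputs:                        # items are fed in one by one
        state = np.tanh(U @ x + W @ state)  # function1(old state, current input)
        outputs.append(softmax(V @ state))  # function2(current state)
    return outputs, state

sequence = rng.normal(size=(5, r))          # a sequence of 5 input vectors
outputs, final_state = rnn_forward(sequence, U, W, V)
print(len(outputs), outputs[-1], final_state)
```

The same U, W and V are reused at every position, so the function works for sequences of any length.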

Practical Details

Often, we train on just the “final” output and ignore intermediate outputs.

A slight variation of backpropagation, called Backpropagation Through Time (BPTT), is
used to train RNNs.

RNNs are sensitive to the length of the sequence (due to the “vanishing/exploding
gradient” problem).

In practice, we still set a maximum length for our sequences. If the input is shorter than
the maximum, we “pad” it. If the input is longer than the maximum, we truncate it.
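Padding and truncation can be sketched in a few lines. The helper name, the pad value 0, and the choice to pad and truncate at the end of the sequence are assumptions; padding or truncating at the start is an equally valid design choice.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Force a list of token ids to exactly max_len entries."""
    if len(sequence) >= max_len:
        return sequence[:max_len]                                  # too long: truncate
    return sequence + [pad_value] * (max_len - len(sequence))      # too short: pad

print(pad_or_truncate([4, 9], max_len=4))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))
```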

RNN Applications

RNNs often focus on text applications, but are also commonly used for other sequential
data:

• Forecasting: Customer Sales, Loss Rates, Network Traffic.
• Speech Recognition: Call Center Automation, Voice Applications.
• Manufacturing Sensor Data
• Genome Sequences

Weakness of RNN:

The nature of the state transition means it is hard to keep information from the distant
past in current memory without reinforcement. Example: “I am from France, I speak ___.”
We expect the RNN to fill the blank with “French”, but a standard RNN cannot remember long
sequences. This is the weakness of RNNs. The solutions are LSTM and GRU. LSTMs have a more
complex mechanism for updating the state.
Structure of an RNN model: pad or truncate the text to the maximum length → Embedding
layer → RNN → Dense layer. Here the embedding layer maps words to vectors such that similar
words (fast, quickly) have similar embeddings before being passed into the network.

Standard RNNs have poor memory.

• The transition matrix necessarily weakens the signal.
o Solution: we need a structure that can leave some dimensions unchanged
over many steps.
o This problem is addressed by LSTM.

Long-Short Term Memory RNNs (LSTM)

LSTMs are a special kind of RNN (invented in 1997). The motivation for LSTM is to solve
one of the main weaknesses of RNNs: their transitional nature makes it hard to
keep information from the distant past in current memory without reinforcement. LSTMs
define a more complicated update mechanism for changing the internal state. By
default, LSTMs remember the information from the last step. On top of that, rather than
keeping just the most recent information, there is more flexibility in retaining or
forgetting large portions of information from earlier steps (remembering).

LSTMs have a more complex mechanism for updating the state.

Standard RNNs have poor memory because the transition matrix necessarily weakens the
signal.

This is the problem addressed by Long-Short Term Memory RNNs (LSTM).

To solve it, you need a structure that can leave some dimensions unchanged over many
steps.

• By default, LSTMs remember the information from the last step.
• Items are overwritten as an active choice.
The idea RNNs use for updating states is old, but the computing power needed to apply it
to sequence-to-sequence mapping, explicit memory units, and text generation tasks has
become available only relatively recently.

Augment RNNs with a few additional gate units:

• Gate units control whether, and for how long, events stay in memory.
• Input gate: when open (value close to 1), it causes items to be stored in memory.
• Forget gate: depending on its value, it causes items to be removed from memory (a value
close to 0 erases them).
• Output gate: when open (value close to 1), it causes the hidden unit to feed forward
(output) in the network.
The cell state gets updated in two stages. The × (multiplication) gate is the forget gate,
which decides what information from the prior cell state to forget, based on the previous
hidden state and the current input coming in. The + gate is the part that adds new
information; it tells us what new information is worth maintaining.

Forget gate: it takes the previous hidden state, concatenates it with the current input,
and multiplies the result by the weights (Wf) at the forget gate. This is then passed
through a sigmoid function, whose output is between 0 and 1.

Input gate: it has its own weights (Wi) and works the same way as the forget gate, except
that there is a tanh function as well, whose output is between -1 and 1. The idea is that
the tanh output is the actual information you are deciding whether or not to add. The
sigmoid output between zero and one then tells you what portion of that new information
to add. If it is close to one, we add all of the new information. If it is close to
zero, we add very little of it.

This is used to find the new cell state.
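One step of an LSTM cell can be sketched as follows. Wf and Wi are the weights named above; the candidate weights Wc, the output-gate weights Wo, the omission of bias terms, and the dimensions are assumptions based on the standard LSTM formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo):
    hx = np.concatenate([h_prev, x])  # previous hidden state joined with current input
    f = sigmoid(Wf @ hx)              # forget gate: share of the old cell state to keep
    i = sigmoid(Wi @ hx)              # input gate: share of the new information to add
    candidate = np.tanh(Wc @ hx)      # the new information itself, between -1 and 1
    c = f * c_prev + i * candidate    # two-stage update of the cell state
    o = sigmoid(Wo @ hx)              # output gate
    h = o * np.tanh(c)                # new hidden state passed forward
    return h, c

hidden_size, input_size = 3, 2
rng = np.random.default_rng(0)
Wf, Wi, Wc, Wo = (rng.normal(size=(hidden_size, hidden_size + input_size))
                  for _ in range(4))

h, c = lstm_step(rng.normal(size=input_size),
                 np.zeros(hidden_size), np.zeros(hidden_size),
                 Wf, Wi, Wc, Wo)
print(h, c)
```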


Gated Recurrent Units (GRUs)

GRUs are a gating mechanism for RNNs and an alternative to LSTM. They are based on
the principle of a removed cell state:

• There is no separate cell state; the hidden state is now used to transfer past
information.
• Think of it as a “simpler” and faster version of LSTM.
These are the gates of a GRU:

Reset gate: helps decide how much past information to forget.

Update gate: helps decide what information to throw away and what new information to
keep.

LSTM vs GRU

LSTMs are a bit more complex and may therefore be able to find more complicated
patterns.

Conversely, GRUs are a bit simpler and therefore are quicker to train.

GRUs will generally perform about as well as LSTMs with shorter training time,
especially for smaller datasets.
In Keras it is easy to switch from one to the other by specifying a layer type. It is
relatively quick to change one for the other.

Sequence-to-Sequence Models (Seq2Seq)

Thinking back to how any type of RNN interprets text, the model has a new hidden state
at each step of the sequence, containing information about all past words. Seq2Seq is
powerful for language translation and helps us understand how words or sequences that may
have different lengths, but are related to one another, are pieced together. Put simply,
it works like a language translator.

Seq2Seq models improve how necessary information is kept in the hidden state from one
sequence to the next.

This way, at the end of a sentence, the hidden state will have all the information
relating to past words. The size of the hidden state vector is the same no matter the
length of the sentence. In machine translation, the encoder's input is a corpus of
sentences in the original language.

In a nutshell, there is an encoder, a hidden state, and a decoder.

Currently the model produces only a single word at a time. This means it translates one
word at a time, and each word produced is conditional on the word produced before it. If
it produces one wrong word, we may end up throwing off the whole sequence of words. The
solution to this is beam search.

Beam Search

A solution to the above problem is to produce multiple different hypotheses, generating
words until <EOS>, and then see which full sentence is most likely.

Solution: The s(i, j) function will weigh the different embedding layer hidden states to
give us a better embedding for prediction of the next word. This better allows you to
translate between different languages when the ordering of words may be different.

Solution: Beam search is an attempt to solve greedy inference.

• Greedy inference means that the model produces one word at a time, which
implies that if it produces one wrong word, it might output an entirely wrong
sequence of words.
• Beam search tries to produce multiple different hypotheses, generating words
until <EOS>, and then sees which full sentence is most likely.
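The difference between greedy inference and beam search can be sketched with a toy model. The probability table stands in for a trained decoder and is entirely made up; a beam width of 1 corresponds to greedy inference.

```python
import math

# Hypothetical next-word probabilities given the words produced so far.
TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"bird": 0.9, "fish": 0.1},
}

def next_word_probs(prefix):
    # Any prefix not in the table ends the sentence in this toy model.
    return TABLE.get(tuple(prefix), {"<EOS>": 1.0})

def beam_search(beam_width, max_len=5):
    beams = [([], 0.0)]  # (words so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, p in next_word_probs(words).items():
                candidates.append((words + [word], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_width]:  # keep only the best hypotheses
            if words[-1] == "<EOS>":
                finished.append((words, score))
            else:
                beams.append((words, score))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])  # most likely full sentence

greedy_words, greedy_score = beam_search(beam_width=1)
beam_words, beam_score = beam_search(beam_width=2)
print(greedy_words, math.exp(greedy_score))
print(beam_words, math.exp(beam_score))
```

Greedy inference commits to the most likely first word and ends up with a less likely sentence overall, while keeping two hypotheses finds the better one.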
These are examples of common enterprise applications of LSTM models:

• Forecasting (LSTMs are among the most common Deep Learning models used in
forecasting).
• Speech recognition, speech to text
• Machine translation, sentiment analysis
• Image captioning
• Question answering (customer care, e.g., “say ‘yes’ if this is your request”).
• Anomaly detection
• Robotic control
• Sentence completion, to solve the problem of modelling sequential data.
If you want to build complex architectures such as Inception or ResNet, you
have to use the functional API instead of the sequential model. Inception concatenates
several different types of layers together, and ResNet carries the output of one layer
forward to later layers; neither fits a plain linear stack of layers.
