Supervised Deep Learning
Mini-batch gradient descent reduces memory use relative to original (vanilla) gradient descent, where we use the entire dataset. It is less noisy and approaches the optimal value more smoothly than stochastic gradient descent.
Select the method or methods that best help you find the same results as using
matrix linear algebra to solve the equation θ = (XᵀX)⁻¹Xᵀy:
o Use stochastic gradient descent, use scikit-learn to build a linear
regression model, or train a neural network model.
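A minimal sketch comparing these approaches (the synthetic data and hyperparameters below are illustrative assumptions, not from the notes): the closed-form normal equation, scikit-learn's LinearRegression, and SGDRegressor should all land on nearly the same coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.normal(size=200)

# Closed form: theta = (X^T X)^{-1} X^T y (with a bias column added)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

# scikit-learn ordinary least squares
lr = LinearRegression().fit(X, y)

# Stochastic gradient descent approximation of the same solution
sgd = SGDRegressor(max_iter=1000, tol=1e-6).fit(X, y)

print(theta[1:], lr.coef_, sgd.coef_)  # all three should be close
```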
Back-propagation:
Training a Neural Network
1. Make prediction.
2. Calculate loss.
3. Calculate gradient of the loss function w.r.t. parameters.
4. Update the parameters by taking a step in the opposite direction of the gradient.
5. Iterate.
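A minimal numpy sketch of these five steps for a linear model with squared-error loss (the data, learning rate, and iteration count are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=100)

w = np.zeros(2)          # initialized parameters
lr = 0.1                 # learning rate (step size)

for _ in range(200):                      # 5. iterate
    y_hat = X @ w                         # 1. make a prediction
    loss = np.mean((y_hat - y) ** 2)      # 2. calculate the loss (MSE)
    grad = 2 * X.T @ (y_hat - y) / len(y) # 3. gradient of the loss w.r.t. w
    w -= lr * grad                        # 4. step in the opposite direction

print(w)  # should approach [2.0, -1.0]
```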
Backpropagation:
The idea of backpropagation is that we first run our neural network forward with our initialized weights. Then, moving back through the layers, we take the derivative of the loss function with respect to each of the weights in the final layer, use that to get the partial derivatives with respect to the layer 2 weights, and finally the layer 1 weights. We use these gradients to update the initialized values, feed the updated weights through the neural net again, and repeat the process.
Tanh graph (figure omitted). Alpha here is a small number (as in leaky-ReLU-style activations); these variants are not necessarily better than ReLU all the time.
Training neural networks is sensitive to how the derivative of each weight is computed and how convergence is reached; several important concepts are involved at this step.
Convolutional Neural Networks (CNNs):
A kernel is used to find edges, corners, etc. in our image. A kernel is a grid of weights
“overlaid” on the image, centered on one pixel.
o Each kernel weight is multiplied by the pixel value it overlays.
o The output at the center pixel is the sum of these products: output = Σ_i Σ_j K(i, j) · I(x+i, y+j). This is the convolution operation.
Kernels are local feature detectors. A kernel does not need to be square.
Let the neural network learn which kernels are most useful.
Use the same set of kernels across the entire image (translation invariance).
This reduces the number of parameters and the variance (from a bias-variance point of view).
Because the kernel must be centered on a pixel, the edges and corners of the image are overlooked when computing the output. This can be solved by padding (see the sketch below).
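A small sketch of the convolution operation (the use of scipy and the specific edge-detector kernel are illustrative assumptions): with no padding, the 5x5 input shrinks to 3x3 because edge pixels never get to be center pixels.

```python
import numpy as np
from scipy.signal import convolve2d

image = np.arange(25, dtype=float).reshape(5, 5)      # toy 5x5 "image"
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)           # a vertical-edge detector

# 'valid' keeps only positions where the kernel fully overlaps the image,
# so the 5x5 input shrinks to 3x3.
print(convolve2d(image, kernel, mode="valid").shape)   # (3, 3)

# 'same' zero-pads the border so every original pixel can be a center pixel.
print(convolve2d(image, kernel, mode="same").shape)    # (5, 5)

# Note: deep learning libraries typically implement cross-correlation
# (no kernel flip), but the sliding-window idea is the same.
```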
Padding:
Pixels at the edge are not used as center pixels, since there are not enough
surrounding pixels.
Padding adds extra pixels around the frame, so the pixels from the original image
become center pixels as the kernel moves across the image.
These added pixels are typically of value zero (zero-padding).
Striding:
Striding is the step size the kernel moves across the image.
When stride>1, it scales down the output dimension.
Stride = 2 means the kernel moves 2 steps both horizontally and vertically; the stride can be different for the horizontal and vertical steps (see the sketch below).
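A small sketch of how padding and stride affect the output size, using the standard formula output = (n + 2·padding − kernel) / stride + 1 (the specific numbers are illustrative):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

# 7x7 input with a 3x3 kernel:
print(conv_output_size(7, 3))                       # 5  (stride 1, no padding)
print(conv_output_size(7, 3, stride=2))             # 3  (stride 2 scales the output down)
print(conv_output_size(7, 3, stride=1, padding=1))  # 7  (zero-padding keeps the size)
```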
Depth:
In images, we often have multiple numbers associated with each pixel location. These
numbers are referred to as channels. Example: RGB – 3 channels, CMYK – 4 channels.
CMYK is Cyan, Magenta, Yellow and Black. The number of channels is the depth, so the kernel itself will have a depth the same size as the number of input channels.
Example: a 5 × 5 kernel on an RGB image has 5 × 5 × 3 (RGB) = 75 weights.
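A minimal Keras sketch of this parameter count (the 32x32 input size is an illustrative assumption); the layer reports 76 parameters, i.e. the 75 kernel weights plus one bias per filter:

```python
import tensorflow as tf

# One 5x5 kernel over an RGB input: 5*5*3 = 75 weights, plus 1 bias per filter.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(filters=1, kernel_size=5),
])
model.summary()  # Conv2D layer: 76 parameters (75 weights + 1 bias)
```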
Pooling:
The idea is to reduce the image size by mapping a patch of pixels to a single value.
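For example, a 2x2 max pool maps each 2x2 patch of pixels to its maximum value, halving the height and width of the feature map (the tensor shape below is an illustrative assumption):

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 8))                  # one 32x32 feature map with 8 channels
pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(x)
print(pooled.shape)                                   # (1, 16, 16, 8)
```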
Transfer Learning
The main idea of Transfer Learning consists of keeping the early layers of a pre-trained
network and re-training the later layers for a specific application.
Last layers in the network capture features that are more particular to the specific data
you are trying to classify.
Later layers are easier to train as adjusting their weights has a more immediate impact
on the final result.
While there are no rules of thumb, these are some guiding principles to keep in mind:
The more similar your data and problem are to the source data of the pre-
trained network, the less intensive fine-tuning will be.
If your data is substantially different in nature than the data the source model
was trained on, Transfer Learning may be of little value.
CNN Architectures
LeNet-5
Later architectures simplified the network structure: they keep the same concepts and ideas as LeNet but are considerably deeper, with a simpler architecture that is still able to find more complex features.
This line of architectures avoids manual choices of convolution size and builds very deep networks out of 3x3 convolutions.
These were among the first architectures to experiment with many layers (more is better!). They can use multiple 3x3 convolutions to simulate larger kernels with fewer parameters (a worked example follows below), and they served as a “base model” for future work.
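As a worked example of the parameter saving (numbers for illustration only): two stacked 3x3 convolutions have an effective 5x5 receptive field but use only 2 × (3 × 3) = 18 weights per input-output channel pair, compared with 5 × 5 = 25 weights for a single 5x5 convolution, while also adding an extra non-linearity between the two layers.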
Inception
Proposed by Szegedy et al. (2014), this architecture was built to turn each layer of the
neural network into further branches of convolutions. Each branch handles a smaller
portion of the workload, and the different branches are combined together within a single layer.
With Inception, the idea is perhaps you don't know exactly what type of filter or what
type of layer you want at each step, so you may want to combine or try a bunch of them
together. But this can be computationally expensive. We probably want to accomplish
this with some level of computational efficiency. We are also going to want to ensure
that we can reduce the total number of activations that are needed to run through our
entire network.
The network concatenates different branches at the end. These networks use different
receptive fields and have sparse activations of groups of neurons.
ResNet
Researchers were building deeper and deeper networks but started finding these
issues:
In theory, the very deep (56-layer) networks should fit the training data better (even if
they overfit), but that was not happening.
It seemed that the early layers were just not getting updated and the signal got lost (due
to vanishing-gradient-type issues).
These are the main reasons why adding layers does not always decrease training error.
Transfer learning:
It is difficult to train on large datasets, as it takes more time to fit and is computationally
expensive. However, the basic features (edges, shapes) learned in the early layers of
the network should generalize fairly well to other datasets with similar problems.
The results of training are just weights (numbers), which are easy to store.
The idea is to keep the early layers of a pre-trained network and re-train the later layers
for a specific application. This is called transfer learning.
Remove the final layer, or any number of layers from the back, and continue training from the pre-trained model.
The additional training of a pre-trained network on a specific new dataset is referred to
as Fine Tuning.
There are different options in "how much” and “how far back” to fine-tune.
Should I train only the last layer?
Go back a few layers?
Re-train the entire network (starting from the weights of the existing network)?
Few guiding principles of fine tuning:
The more similar the data and problem are to the source data of the pre-trained
network, the less fine tuning is necessary.
o Example: Using a network trained on ImageNet to distinguish “dogs” from
“cats” should need relatively little fine-tuning. ImageNet already
distinguishes different breeds of dogs and cats, so it likely has all the
features you will need.
The more data you have about your specific problem, the more the network will
benefit from longer and deeper fine tuning.
o Example: If you have 100 dogs and 100 cats in your training data, you
probably want to do only a little fine tuning, such as removing the final layer
or two and re-using most of the features learned from ImageNet.
o On the other hand, if you have 100,000 dogs and 100,000 cats, you may
get more value from longer and deeper fine tuning: going back further, or
even retraining the full network using the pre-trained network to initialize the
weights (see the sketch after these examples).
If your data is substantially different in nature than the data the source model was
trained on, Transfer Learning may be of little value.
o Example: A network based on recognizing typed Latin alphabet
characters would not be useful for distinguishing dogs from cats, but it
would likely be useful as a starting point for recognizing Cyrillic alphabet
characters.
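A minimal Keras sketch of the dogs-vs-cats fine-tuning idea (the choice of MobileNetV2, the input size, and the new head are illustrative assumptions, not prescribed by the notes):

```python
import tensorflow as tf

# Keep the early layers of an ImageNet-pretrained network frozen and
# re-train only a small new head for a 2-class (dogs vs. cats) problem.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                     # keep the pre-trained early layers fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # new task-specific head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With a larger dataset, "deeper" fine tuning could unfreeze some later base layers
# (base.trainable = True, optionally re-freezing all but the last few blocks)
# and re-compile with a small learning rate.
```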
Recurrent Neural Network (RNN):
Recurrent Neural Networks are a class of neural networks that allow previous outputs to
be used as inputs while having hidden states. They are mostly used in applications of
natural language processing and speech recognition.
One of the main motivations for RNNs is to derive insights from text and do better than
“bag of words” implementations. Ideally, each word is processed or understood in the
appropriate context.
Words should be handled differently depending on “context”. Also, each word should
update the context.
Under the notion of recurrence, words are input one by one. This way, we can handle
variable lengths of text. This means that the response to a word depends on the words
that preceded it.
Prediction: What would be the prediction if the sequence ended with that
word.
State: Summary of everything that happened in the past.
Mathematical Details
At each position t, the hidden state and output are computed as h_t = f(U·x_t + W·h_{t−1}) and ŷ_t = g(V·h_t), in which the weight matrices U, V, W are the same across all positions.
The kernel initializer is the weight initializer for the inputs, whereas the recurrent initializer is the weight initializer for the states.
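A minimal Keras sketch showing where these initializers appear (the layer size is an illustrative assumption; glorot_uniform and orthogonal happen to be the Keras defaults):

```python
import tensorflow as tf

# A SimpleRNN layer: kernel_initializer sets the input weights and
# recurrent_initializer sets the state-to-state weights; the same
# matrices are reused at every position in the sequence.
layer = tf.keras.layers.SimpleRNN(
    units=16,
    kernel_initializer="glorot_uniform",      # weights applied to the inputs
    recurrent_initializer="orthogonal",       # weights applied to the previous state
    return_sequences=False,                   # return only the final output
)
x = tf.random.normal((4, 10, 8))              # batch of 4 sequences, length 10, 8 features
print(layer(x).shape)                         # (4, 16)
```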
Practical Details
Often, we train on just the “final” output and ignore intermediate outputs.
Slight variation called Backpropagation Through Time (BPTT) is used to train RNNs.
In practice, we still set a maximum length to our sequences. If the input is shorter than
maximum, we “pad” it. If the input is longer than maximum, we truncate it.
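A small sketch of padding and truncating with Keras' pad_sequences (the word indices and maximum length are illustrative):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sequences of word indices with different lengths.
seqs = [[3, 7, 12],
        [5, 1],
        [9, 4, 2, 8, 6, 11]]

# Fix a maximum length: shorter inputs are padded with zeros, longer ones truncated.
padded = pad_sequences(seqs, maxlen=4, padding="post", truncating="post")
print(padded)
# [[ 3  7 12  0]
#  [ 5  1  0  0]
#  [ 9  4  2  8]]
```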
RNN Applications
RNNs often focus on text applications but are also commonly used for other sequential data.
Weakness of RNN:
The nature of the state transition means it is hard to keep information from the distant past in
current memory without reinforcement. Example: “I am from France, I speak ___.” Here
we expect the RNN to fill in “French”, but an RNN cannot remember long sequences. This is the
weakness of RNNs. The solutions to this are LSTMs and GRUs, which have more complex
mechanisms for updating the internal state.
Structure of an RNN text model: pad or truncate inputs to the maximum length → Embedding layer →
RNN → Dense layer. The embedding layer maps words to vectors so that similar words (e.g. “fast”,
“quickly”) have similar embeddings when passed into the network (see the sketch below).
LSTMs are a special kind of RNN (invented in 1997). The motivation behind LSTMs is to solve one
of the main weaknesses of RNNs: their transitional nature makes it hard to
keep information from the distant past in current memory without reinforcement. LSTMs
define a more complicated update mechanism for changing the internal state. By
default, LSTMs remember the information from the last step. On top of that, rather than
keeping just past information, there is more flexibility in retaining or forgetting large
portions of the information from prior steps, beyond just the last step.
Standard RNNs have poor memory because the transition matrix necessarily weakens the
signal.
To solve it, you need a structure that can leave some dimensions unchanged over many
steps.
GRUs are a gating mechanism for RNNs that is an alternative to LSTMs. They are based on
the principle of a removed cell state:
Update gate: helps decide what information to throw away and what new information to
keep.
LSTM vs GRU
LSTMs are a bit more complex and may therefore be able to find more complicated
patterns.
Conversely, GRUs are a bit simpler and therefore are quicker to train.
GRUs will generally perform about as well as LSTMs with shorter training time,
especially for smaller datasets.
In Keras it is easy to switch from one to the other by specifying a layer type; it is
relatively quick to change one for the other (see the sketch below).
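A minimal sketch of how switching between the two is just a change of layer type in Keras (the surrounding architecture is the same illustrative one used above):

```python
import tensorflow as tf

def make_model(rnn_layer):
    # Identical architecture; only the recurrent layer type changes.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(50,)),
        tf.keras.layers.Embedding(input_dim=10_000, output_dim=32),
        rnn_layer,
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

lstm_model = make_model(tf.keras.layers.LSTM(64))  # more complex, can capture richer patterns
gru_model = make_model(tf.keras.layers.GRU(64))    # simpler, typically quicker to train
```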
Thinking back to how any type of RNN interprets text, the model has a new hidden state
at each step of the sequence containing information about all past words. This is powerful
for language translation and helps us handle sequences that may have different lengths
but are related to one another, much like a human language translator.
Seq2Seq models improve on this by keeping the necessary information in the hidden state from one
sequence to the next.
This way, at the end of a sentence, the hidden state contains all the information relating to
past words. The size of the hidden-state vector is the same no matter the size
of the sentence. In machine translation, the encoder operates on the corpus of sentences in the original
language.
Beam Search
With greedy inference, the model produces one word at a time, which implies that if it
produces one wrong word, it might output an entirely wrong sequence of words.
Beam search instead produces multiple different hypotheses, generating words
until <EOS>, and then sees which full sentence is most likely.
These are examples of common enterprise applications of LSTM models: