Deep Learning
Here is the course summary as it's given on the course link:
If you want to break into cutting-edge AI, this course will help you do so. Deep learning engineers are highly sought
after, and mastering deep learning will give you numerous new career opportunities. Deep learning is also a new
"superpower" that will let you build AI systems that just weren't possible a few years ago.
In this course, you will learn the foundations of deep learning. When you finish this class, you will:
This course also teaches you how Deep Learning actually works, rather than presenting only a cursory or surface-level
description. So after completing it, you will be able to apply deep learning to your own applications. If you are looking
for a job in AI, after this course you will also be able to answer basic interview questions.
Basically, a single neuron in a perceptron calculates a weighted sum of its input (W.T * X) and we set a threshold to predict the output. If the weighted sum of the input crosses the threshold, the perceptron fires; if not, it doesn't.
The disadvantage of the perceptron is that it only outputs binary values, and a small change in weight and bias can flip the output completely. We need a system that modifies the output only slightly in response to a small change in weight and bias. This is where the sigmoid function comes into the picture.
If we replace the perceptron's step function with a sigmoid, a small change in weight and bias produces only a slight change in the output.
e.g. with a perceptron the output might jump from 0 straight to 1 after a slight change in weight and bias, even though the value we want is 0.7; with a sigmoid, the same slight change moves the output smoothly from 0 towards 0.7.
If we apply the sigmoid activation function, a single neuron acts like logistic regression.
We can understand the difference between the perceptron and the sigmoid function by looking at the sigmoid function's graph.
Simple NN graph:
RELU stands for rectified linear unit; it is the most popular activation function right now and it makes deep NNs train faster.
Hidden layers predict connections between inputs automatically; that's what deep learning is good at.
Deep NN consists of more hidden layers (Deeper layers)
Each Input will be connected to the hidden layer and the NN will decide the connections.
Supervised learning means we have the (X,Y) and we need to get the function that maps X to Y.
i. Data:
Using this image we can conclude:
For small data, a NN can perform like linear regression or an SVM (support vector machine).
For big data, a small NN is better than an SVM.
For big data, a big NN is better than a medium NN, which is better than a small NN.
Hopefully we have a lot of data, because the world is using computers more and more:
Mobiles
IOT (Internet of things)
ii. Computation:
GPUs.
Powerful CPUs.
Distributed computing.
ASICs
iii. Algorithm:
a. Creative algorithms have appeared that changed the way NNs work.
For example using RELU function is so much better than using SIGMOID function in training a NN because
it helps with the vanishing gradient problem.
Neural Networks Basics
Learn to set up a machine learning problem with a neural network mindset. Learn to use vectorization to speed up your
models.
Binary classification
Mainly he is talking about how to do a logistic regression to make a binary classifier.
Logistic regression
An algorithm used for classification with 2 classes.
Equations:
Simple equation: y = wx + b
If x is a vector: y = w(transpose)x + b
If we need y to be in between 0 and 1 (probability): y = sigmoid(w(transpose)x + b)
In some notations this might be used: y = sigmoid(w(transpose)x)
Here b is treated as w0 of w and we add x0 = 1. But we won't use this notation in the course (Andrew said that the first notation is better).
In binary classification Y has to be between 0 and 1 .
In the last equation w is a vector of Nx and b is a real number
Gradient Descent
We want to find w and b that minimize the cost function.
First we initialize w and b to 0,0 or to random values on the convex cost function, and then try to improve the values to reach the minimum.
The gradient descent algorithm repeats: w = w - alpha * dw, where alpha is the learning rate and dw is the derivative of the cost with respect to w (the change applied to w). The derivative is also the slope of the cost in the w direction.
It looks like a greedy algorithm: the derivative gives us the direction in which to improve our parameters.
The actual equations we will implement:
w = w - alpha * d(J(w,b))/dw (how much the function slopes in the w direction)
b = b - alpha * d(J(w,b))/db (how much the function slopes in the b direction)
Derivatives
We will talk about some of the required calculus.
You don't need to be a calculus geek to master deep learning but you'll need some skills from it.
Derivative of a linear line is its slope.
ex. f(a) = 3a, so d(f(a))/d(a) = 3
if a = 2 then f(a) = 6
if we move a a little bit, a = 2.001, then f(a) = 6.003. In other words, we multiplied the derivative (slope) by the amount a moved and added it to the previous result.
To conclude, the derivative is the slope, and the slope is different at different points of the function; that's why the derivative is itself a function.
Computation graph
It's a graph that organizes the computation from left to right.
We compute the derivatives on the graph from right to left, which is a lot easier.
dvar means the derivative of the final output variable with respect to various intermediate quantities.
X1 Feature
X2 Feature
W1 Weight of the first feature.
W2 Weight of the second feature.
B Logistic Regression parameter.
M Number of training examples
Y(i) Expected output of i
So we have:
Then from right to left we will calculate derivations compared to the result:
From the above we can conclude the logistic regression pseudo code:
J = 0; dw1 = 0; dw2 = 0; db = 0     # accumulators
w1 = 0; w2 = 0; b = 0               # weights
for i = 1 to m
    # Forward pass
    z(i) = w1*x1(i) + w2*x2(i) + b
    a(i) = sigmoid(z(i))
    J += -(Y(i)*log(a(i)) + (1-Y(i))*log(1-a(i)))
    # Backward pass
    dz(i) = a(i) - Y(i)
    dw1 += dz(i) * x1(i)
    dw2 += dz(i) * x2(i)
    db += dz(i)
J /= m
dw1 /= m
dw2 /= m
db /= m
# Gradient descent
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db
The above code should run for some iterations to minimize error.
Vectorization is so important in deep learning to reduce explicit loops. In the last code we can compute the whole loop over the training examples in one step using vectorization, as sketched below!
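A minimal NumPy sketch of that vectorized step (illustrative variable names and toy data, not the course's exact code):

```python
import numpy as np

np.random.seed(0)
Nx, m, alpha = 3, 5, 0.01                  # toy sizes and learning rate (assumed values)
X = np.random.randn(Nx, m)                 # inputs, shape (Nx, m)
Y = (np.random.rand(1, m) > 0.5) * 1.0     # labels, shape (1, m)
W = np.zeros((Nx, 1))
b = 0.0

Z = np.dot(W.T, X) + b                     # forward pass for all m examples at once
A = 1 / (1 + np.exp(-Z))                   # sigmoid activation
dZ = A - Y                                 # backward pass
dW = (1 / m) * np.dot(X, dZ.T)
db = (1 / m) * np.sum(dZ)
W = W - alpha * dW                         # gradient descent update
b = b - alpha * db
```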
Vectorization
Deep learning shines when the datasets are big. However, for loops will make you wait a long time for a result. That's why we need vectorization to get rid of some of our for loops.
The NumPy dot function uses vectorization by default.
Vectorization can be done on CPU or GPU through SIMD operations, but it's faster on GPU.
Whenever possible avoid for loops.
Most of the NumPy library methods are vectorized.
As input we have a matrix X of shape [Nx, m] and a matrix Y of shape [Ny, m].
We then compute in one step [z1,z2...zm] = W' * X + [b,b,...b]. This can be written in Python as shown below:
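(A minimal fill for the missing snippet, reusing the variable names from the sketch above.)

```python
Z = np.dot(W.T, X) + b            # vectorization, then broadcasting of b; Z has shape (1, m)
A = 1 / (1 + np.exp(-Z))          # vectorized sigmoid; A has shape (1, m)
```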
In NumPy, obj.reshape(1,4) changes the shape of the array without changing its values (no broadcasting is involved).
Reshape is computationally cheap, so use it whenever you're not sure about the shapes in your calculations.
Broadcasting works when you do a matrix operation with matrices whose shapes don't match for the operation; in this case NumPy automatically makes the shapes compatible by copying (broadcasting) the values.
The general principle of broadcasting: if you have an (m,n) matrix and you add (+), subtract (-), multiply (*) or divide (/) it with a (1,n) matrix, the (1,n) matrix is copied m times into an (m,n) matrix. The same happens if you use those operations with an (m,1) matrix, which is copied n times into an (m,n) matrix. Then the addition, subtraction, multiplication or division is applied element-wise.
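A small NumPy example of that broadcasting rule (illustrative values, not from the course):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])     # shape (1, 3) - copied down the 2 rows
col = np.array([[100.0], [200.0]])       # shape (2, 1) - copied across the 3 columns

print(A + row)   # [[11. 22. 33.] [14. 25. 36.]]
print(A * col)   # [[100. 200. 300.] [800. 1000. 1200.]]
```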
If you don't specify the shape of a vector, it will take the shape (m,) (a rank-1 array) and the transpose operation won't behave as expected. You have to reshape it to (m, 1).
Try not to use rank-1 arrays in a NN.
Don't hesitate to use assert(a.shape == (5,1)) to check that your matrix/vector has the required shape.
If you find a rank-1 array, run reshape on it.
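A short sketch of the rank-1 pitfall and the suggested fixes (assert / reshape):

```python
import numpy as np

a = np.random.randn(5)          # rank-1 array, shape (5,) - avoid this
print(a.shape, a.T.shape)       # (5,) (5,) - the transpose does nothing useful

a = np.random.randn(5, 1)       # proper column vector, shape (5, 1)
assert a.shape == (5, 1)        # cheap sanity check, as recommended above
print(np.dot(a, a.T).shape)     # (5, 5) outer product, as expected

b = np.random.randn(5)          # if you end up with a rank-1 array anyway...
b = b.reshape(5, 1)             # ...reshape it into an explicit column vector
```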
Jupyter / IPython notebooks are a very useful Python tool that makes it easy to mix code and documentation in the same place. They run in the browser and don't need an IDE.
To open Jupyter Notebook, open the command line and run: jupyter-notebook (it has to be installed first).
s = sigmoid(x)
ds = s * (1 - s) # derivative using calculus
General Notes
The main steps for building a Neural Network are:
Define the model structure (such as number of input features and outputs)
Initialize the model's parameters.
Loop.
Calculate current loss (forward propagation)
Calculate current gradient (backward propagation)
Update parameters (gradient descent)
Preprocessing the dataset is important.
Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference to the algorithm.
kaggle.com is a good place for datasets and competitions.
Pieter Abbeel is one of the best in deep reinforcement learning.
X1 \
X2 ==> z = XW + B ==> a = Sigmoid(z) ==> l(a,Y)
X3 /
X1 \
X2 => z1 = XW1 + B1 => a1 = Sigmoid(z1) => z2 = a1W2 + B2 => a2 = Sigmoid(z2) => l(a2,Y)
X3 /
X is the input vector (X1, X2, X3) , and Y is the output variable (1x1)
We are talking about 2 layers NN. The input layer isn't counted.
Nx = 3
for i = 1 to m
z[1, i] = W1*x[i] + b1 # shape of z[1, i] is (noOfHiddenNeurons,1)
a[1, i] = sigmoid(z[1, i]) # shape of a[1, i] is (noOfHiddenNeurons,1)
z[2, i] = W2*a[1, i] + b2 # shape of z[2, i] is (1,1)
a[2, i] = sigmoid(z[2, i]) # shape of a[2, i] is (1,1)
In the last example we can call X = A0 . So the previous step can be rewritten as:
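(Filling the gap with the vectorized version, in the same pseudocode style as the loop above; the m examples are stacked as columns.)
Z1 = W1*A0 + b1 # shape of Z1 is (noOfHiddenNeurons, m)
A1 = sigmoid(Z1) # shape of A1 is (noOfHiddenNeurons, m)
Z2 = W2*A1 + b2 # shape of Z2 is (1, m)
A2 = sigmoid(Z2) # shape of A2 is (1, m)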
Activation functions
So far we are using sigmoid, but in some cases other functions can be a lot better.
Sigmoid can lead to a gradient descent problem where the updates become very small.
Sigmoid activation function range is [0,1] A = 1 / (1 + np.exp(-z)) # Where z is the input matrix
Tanh activation function range is [-1,1] (Shifted version of sigmoid function)
In NumPy we can implement Tanh using one of these methods: A = (np.exp(z) - np.exp(-z)) / (np.exp(z) +
np.exp(-z)) # Where z is the input matrix
It turns out that the tanh activation usually works better than sigmoid activation function for hidden units because the
mean of its output is closer to zero, and so it centers the data better for the next layer.
The disadvantage of sigmoid or tanh is that if the input is too small or too large, the slope will be near zero, which causes the slow gradient descent problem.
One of the popular activation functions that solved the slow gradient descent is the RELU function. RELU = max(0,z) # so if z is negative the slope is 0 and if z is positive the slope is 1.
So here is a basic rule for choosing activation functions: if your classification output is between 0 and 1, use sigmoid for the output layer and RELU for the other layers.
The leaky RELU activation function differs from RELU in that if the input is negative the slope is small but non-zero. It works like RELU, but most people use plain RELU. Leaky_RELU = max(0.01z,z) # the 0.01 can be a parameter of your algorithm.
In NN you will decide a lot of choices like:
No of hidden layers.
No of neurons in each hidden layer.
Learning rate. (The most important parameter)
Activation functions.
And others..
It turns out there are no strict guidelines for these choices. You should experiment, for example by trying different activation functions.
g(z) = 1 / (1 + np.exp(-z))
g'(z) = (1 / (1 + np.exp(-z))) * (1 - (1 / (1 + np.exp(-z))))
g'(z) = g(z) * (1 - g(z))
g(z) = np.maximum(0,z)
g'(z) = { 0 if z < 0
1 if z >= 0 }
g(z) = np.maximum(0.01 * z, z)
g'(z) = { 0.01 if z < 0
1 if z >= 0 }
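A minimal NumPy sketch of these activations and their derivatives (the function names are mine, not the course's):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)                  # g'(z) = g(z) * (1 - g(z))

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2          # derivative of tanh

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z >= 0).astype(float)       # 0 if z < 0, 1 if z >= 0

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)  # 0.01 if z < 0, 1 if z >= 0
```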
NN parameters:
n[0] = Nx
n[1] = NoOfHiddenNeurons
n[2] = NoOfOutputNeurons = 1
W1 shape is (n[1],n[0])
b1 shape is (n[1],1)
W2 shape is (n[2],n[1])
b2 shape is (n[2],1)
Repeat:
Compute predictions (y'[i], i = 0,...m)
Get derivatives: dW1, db1, dW2, db2
Update: W1 = W1 - LearningRate * dW1
b1 = b1 - LearningRate * db1
W2 = W2 - LearningRate * dW2
b2 = b2 - LearningRate * db2
Forward propagation:
Z1 = W1A0 + b1 # A0 is X
A1 = g1(Z1)
Z2 = W2A1 + b2
A2 = Sigmoid(Z2) # Sigmoid because the output is between 0 and 1
Backpropagation (derivations):
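(The derivation itself is not reproduced in these notes; below are the standard gradients for the 2-layer network above, written in the same vectorized notation. .* denotes an element-wise product and g1' is the derivative of the hidden-layer activation.)
dZ2 = A2 - Y
dW2 = (dZ2 * A1.T) / m
db2 = sum(dZ2) / m
dZ1 = (W2.T * dZ2) .* g1'(Z1)
dW1 = (dZ1 * A0.T) / m # A0 = X
db1 = sum(dZ1) / m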
Random Initialization
In logistic regression it wasn't important to initialize the weights randomly, while in NN we have to initialize them
randomly.
If we initialize all the weights with zeros in NN it won't work (initializing bias with zero is OK):
all hidden units will be completely identical (symmetric) - compute exactly the same function
on each gradient descent iteration all the hidden units will always update the same
We need small values because in sigmoid (or tanh), for example, if the weight is too large you are more likely to end up
even at the very start of training with very large values of Z. Which causes your tanh or your sigmoid activation function
to be saturated, thus slowing down learning. If you don't have any sigmoid or tanh activation functions throughout your
neural network, this is less of an issue.
Constant 0.01 is alright for 1 hidden layer networks, but if the NN is deep this number can be changed but it will always
be a small number.
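A minimal sketch of that random initialization for a 1-hidden-layer network (the example layer sizes are assumed):

```python
import numpy as np

n0, n1, n2 = 3, 4, 1                    # example layer sizes: input, hidden, output
W1 = np.random.randn(n1, n0) * 0.01     # small random values to avoid saturating tanh/sigmoid
b1 = np.zeros((n1, 1))                  # initializing the biases with zeros is OK
W2 = np.random.randn(n2, n1) * 0.01
b2 = np.zeros((n2, 1))
```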
n[0] denotes the number of neurons in the input layer. n[L] denotes the number of neurons in the output layer.
a[l] = g[l](z[l])
This is the notation we will use for deep neural networks.
So we have:
A vector n of shape (1, NoOfLayers+1)
A vector g of shape (1, NoOfLayers)
A list of different shapes w based on the number of neurons on the previous and the current layer.
A list of different shapes b based on the number of neurons on the current layer.
We can't compute the forward propagation of all the layers without a for loop, so it's OK to have a for loop over the layers here.
The dimensions of the matrices are very important; you need to get them right.
When starting on an application don't start directly by dozens of hidden layers. Try the simplest solutions (e.g. Logistic
Regression), then try the shallow neural network and so on.
Deep NN blocks:
Forward and Backward Propagation
Pseudo code for forward propagation for layer l:
Input A[l-1]
Z[l] = W[l]A[l-1] + b[l]
A[l] = g[l](Z[l])
Output A[l], cache(Z[l])
Parameters vs Hyperparameters
The main parameters of the NN are W and b.
Hyper parameters (parameters that control the algorithm) are like:
Learning rate.
Number of iterations.
Number of hidden layers L .
Number of hidden units n .
Choice of activation functions.
You have to try hyperparameter values yourself.
In the earlier days of DL and ML the learning rate was often called a parameter, but it really is (and now everybody calls it) a hyperparameter.
On the next course we will see how to optimize hyperparameters.
Course summary
Here is the course summary as it's given on the course link:
This course will teach you the "magic" of getting deep learning to work well. Rather than the deep learning process
being a black box, you will understand what drives performance, and be able to more systematically get good results.
You will also learn TensorFlow.
Bias / Variance
Bias / Variance techniques are Easy to learn, but difficult to master.
So here the explanation of Bias / Variance:
If your model is underfitting (logistic regression of non linear data) it has a "high bias"
If your model is overfitting then it has a "high variance"
Your model will be alright if you balance the Bias / Variance
For more:
Another idea to get the bias / variance if you don't have a 2D plotting mechanism:
High variance (overfitting) for example:
Training error: 1%
Dev error: 11%
high Bias (underfitting) for example:
Training error: 15%
Dev error: 14%
high Bias (underfitting) && High variance (overfitting) for example:
Training error: 15%
Test error: 30%
Best:
Training error: 0.5%
Test error: 1%
These assumptions rest on human (Bayes) error being close to 0%. If your problem isn't like that, you'll need to use the human error as the baseline instead.
Regularization
Adding regularization to NN will help it reduce variance (overfitting)
L1 matrix norm:
||W|| = Sum(|w[i,j]|) # sum of absolute values of all w
The L2 matrix norm, for arcane technical math reasons, is called the Frobenius norm:
||W||^2 = Sum(|w[i,j]|^2) # sum of all w squared
Equivalently, we stack the matrix into one vector of shape (mn,1) and apply sqrt(w1^2 + w2^2 + ...).
In practice this penalizes large weights and effectively limits the freedom in your model.
The new term (1 - (learning_rate*lambda)/m) * w[l] causes the weight to decay in proportion to its size.
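A hedged sketch of the regularized cost and the weight-decay factor (using lambda_ because `lambda` is a reserved word in Python; names are mine):

```python
import numpy as np

def l2_cost(cross_entropy_cost, weights, lambda_, m):
    """J = cross-entropy + (lambda / 2m) * sum of squared Frobenius norms of the W[l]."""
    l2_term = (lambda_ / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_term

# In back-propagation each dW[l] gets an extra (lambda_ / m) * W[l] term, so the update
# W[l] = W[l] - alpha * dW[l]  is equivalent to first shrinking W[l] by the factor
# (1 - alpha * lambda_ / m) and then applying the usual gradient step ("weight decay").
```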
Intuition 1:
If lambda is too large - a lot of w's will be close to zeros which will make the NN simpler (you can think of it as it
would behave closer to logistic regression).
If lambda is good enough it will just reduce some weights that makes the neural network overfit.
Intuition 2 (with tanh activation function):
If lambda is too large, w's will be small (close to zero) - will use the linear part of the tanh activation function, so we
will go from non linear activation to roughly linear which would make the NN a roughly linear classifier.
If lambda is good enough, it will just make some of the tanh activations roughly linear, which will prevent overfitting.
Implementation tip: if you implement gradient descent, one way to debug it is to plot the cost function J as a function of the number of gradient descent iterations; with regularization you want to see J decrease monotonically after every iteration. If you plot the old definition of J (without the regularization term) you might not see it decrease monotonically.
Dropout Regularization
In most cases Andrew Ng says that he uses L2 regularization.
The dropout regularization eliminates some neurons/weights on each iteration based on a probability.
Vector d[l] is used for forward and back propagation and is the same for them, but it is different for each iteration (pass)
or training example.
At test time we don't use dropout. If you implement dropout at test time - it would add noise to predictions.
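A minimal sketch of inverted dropout for one layer (layer 3 here, keep_prob = 0.8 as an example); dividing by keep_prob keeps the expected value of the activations unchanged:

```python
import numpy as np

keep_prob = 0.8                                             # probability of keeping a unit
a3 = np.random.randn(5, 10)                                 # activations of layer 3 (toy values)

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # dropout mask, new on every pass
a3 = np.multiply(a3, d3)                                    # shut off the dropped units
a3 = a3 / keep_prob                                         # "inverted" step: rescale the rest
```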
Understanding Dropout
In the previous video, the intuition was that dropout randomly knocks out units in your network. So it's as if on every
iteration you're working with a smaller NN, and so using a smaller NN seems like it should have a regularizing effect.
Another intuition: can't rely on any one feature, so have to spread out weights.
It's possible to show that dropout has a similar effect to L2 regularization.
Dropout can have different keep_prob per layer.
The input layer dropout has to be near 1 (or 1 - no dropout) because you don't want to eliminate a lot of features.
If you're more worried about some layers overfitting than others, you can set a lower keep_prob for some layers than
others. The downside is, this gives you even more hyperparameters to search for using cross-validation. One other
alternative might be to have some layers where you apply dropout and some layers where you don't apply dropout and
then just have one hyperparameter, which is a keep_prob for the layers for which you do apply dropouts.
A lot of researchers are using dropout with Computer Vision (CV) because they have a very big input size and almost
never have enough data, so overfitting is the usual problem. And dropout is a regularization technique to prevent
overfitting.
A downside of dropout is that the cost function J is not well defined and it will be hard to debug (plot J by iteration).
To solve that you'll need to turn off dropout, set all the keep_prob s to 1, and then run the code and check that it
monotonically decreases J and then turn on the dropouts again.
Andrew prefers to use L2 regularization instead of early stopping because this technique simultaneously tries to
minimize the cost function and not to overfit which contradicts the orthogonalization approach (will be discussed
further).
But its advantage is that you don't need to search a hyperparameter like in other regularization approaches (like
lambda in L2 regularization).
Model Ensembles:
Algorithm:
Train multiple independent models.
At test time average their results.
It can get you extra 2% performance.
It reduces the generalization error.
You can take some snapshots of your NN during training, ensemble them, and average their results.
Normalizing inputs
If you normalize your inputs this will speed up the training process a lot.
Normalization follows these steps:
i. Get the mean of the training set: mean = (1/m) * sum(x(i))
ii. Subtract the mean from each input: X = X - mean
This makes your inputs centered around 0.
iii. Get the variance of the training set: variance = (1/m) * sum(x(i)^2)
iv. Normalize the variance. X /= variance
These steps should be applied to training, dev, and testing sets (but using mean and variance of the train set).
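A minimal NumPy sketch of those steps (note that the train-set mean and variance are reused for the dev/test sets):

```python
import numpy as np

X_train = np.random.randn(2, 100) * 5 + 3   # toy data, shape (n_features, m)
X_test = np.random.randn(2, 20) * 5 + 3

mean = np.mean(X_train, axis=1, keepdims=True)            # (1/m) * sum(x(i))
X_train = X_train - mean                                   # center the inputs around 0
variance = np.mean(X_train ** 2, axis=1, keepdims=True)    # (1/m) * sum(x(i)^2) after centering
X_train = X_train / variance                               # normalize the variance

X_test = (X_test - mean) / variance                        # same mean/variance as the train set
```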
Why normalize?
If we don't normalize the inputs, our cost function will be elongated and inconsistent in shape, so optimizing it will take a long time.
But if we normalize it the opposite will occur. The shape of the cost function will be consistent (look more
symmetric like circle in 2D example) and we can use a larger learning rate alpha - the optimization will be faster.
Then:
Y' = W[L]W[L-1].....W[2]W[1]X

if W[l] = [1.5  0 ]
          [0   1.5]    (l != L because of different dimensions in the output layer)
then Y' = W[L] [1.5  0 ]^(L-1) X ≈ 1.5^L  # which will be very large
               [0   1.5]

if W[l] = [0.5  0 ]
          [0   0.5]
then Y' = W[L] [0.5  0 ]^(L-1) X ≈ 0.5^L  # which will be very small
               [0   0.5]
The last example explains that the activations (and similarly derivatives) will be decreased/increased exponentially as a
function of number of layers.
So If W > I (Identity matrix) the activation and gradients will explode.
And If W < I (Identity matrix) the activation and gradients will vanish.
Recently Microsoft trained a 152-layer network (ResNet), which is a really big number. With such a deep neural network, if the activations or gradients increase or decrease exponentially as a function of L, these values can get really big or really small. This makes training difficult, especially when the gradients are exponentially small in L: gradient descent then takes tiny steps and needs a long time to learn anything.
There is a partial solution that doesn't completely solve this problem but it helps a lot - careful choice of how you
initialize the weights (next video).
np.random.randn(shape) * np.sqrt(1/n[l-1]) # Xavier-style initialization (variance 1/n, often used with tanh)
np.random.randn(shape) * np.sqrt(2/n[l-1]) # He initialization (variance 2/n, often used with ReLU)
The number 1 or 2 in the numerator can also be treated as a hyperparameter to tune (but not the first one to start with).
This is one of the best partial solutions to vanishing / exploding gradients (ReLU + weight initialization with controlled variance), and it helps the gradients not to vanish/explode too quickly.
The initialization in this video is called "He initialization / Xavier initialization" and was published in a 2015 paper.
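A hedged sketch of He initialization for a whole network (the function and variable names are mine):

```python
import numpy as np

def initialize_he(layer_dims):
    """layer_dims = [n0, n1, ..., nL]; W[l] ~ N(0, 2/n[l-1]), b[l] = 0."""
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) \
                               * np.sqrt(2 / layer_dims[l - 1])
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_he([3, 4, 1])
print(params["W1"].shape, params["b1"].shape)   # (4, 3) (4, 1)
```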
Gradient checking approximates the gradients and is very helpful for finding the errors in your backpropagation
implementation but it's slower than gradient descent (so use only for debugging).
Implementation of this is very simple.
Gradient checking:
First take W[1],b[1],...,W[L],b[L] and reshape into one big vector ( theta )
The cost function will be J(theta)
Then take dW[1],db[1],...,dW[L],db[L] into one big vector ( d_theta )
Algorithm:
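(Filling in the algorithm as a hedged NumPy sketch; J is assumed to be callable on the flattened parameter vector theta, and d_theta is the analytic gradient from backprop.)

```python
import numpy as np

def gradient_check(J, theta, d_theta, eps=1e-7):
    """Compare analytic gradients with a two-sided numerical approximation."""
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    # relative difference; roughly: <= 1e-7 great, ~1e-5 suspicious, >= 1e-3 probably a bug
    return np.linalg.norm(d_theta - d_theta_approx) / (
        np.linalg.norm(d_theta) + np.linalg.norm(d_theta_approx))
```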
Initialization summary
The weights W[l] should be initialized randomly to break symmetry
It is however okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as W[l] is initialized randomly
Random initialization is used to break symmetry and make sure different hidden units can learn different things
Regularization summary
1. L2 Regularization
Observations:
The value of λ is a hyperparameter that you can tune using a dev set.
L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to "oversmooth", resulting
in a model with high bias.
L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights.
Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It
becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes
more slowly as the input changes.
cost computation:
A regularization term is added to the cost
backpropagation function:
There are extra terms in the gradients with respect to weight matrices
weights:
weights end up smaller ("weight decay") - are pushed to smaller values.
2. Dropout
Optimization algorithms
...
X{bs} = ...
t(1) = 40
t(2) = 49
t(3) = 45
...
t(180) = 60
...
This data (daily temperatures, say) is low in winter and high in summer. If we plot it, we will find it somewhat noisy.
Now lets compute the Exponentially weighted averages:
V0 = 0
V1 = 0.9 * V0 + 0.1 * t(1) = 4 # 0.9 and 0.1 are hyperparameters
V2 = 0.9 * V1 + 0.1 * t(2) = 8.5
V3 = 0.9 * V2 + 0.1 * t(3) = 12.15
...
General equation
We can implement this algorithm with more accurate results using a moving window. But the code is more efficient and
faster using the exponentially weighted averages algorithm.
Algorithm is very simple:
v = 0
Repeat
{
Get theta(t)
v = beta * v + (1-beta) * theta(t)
}
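A runnable sketch of the exponentially weighted average on temperatures like the ones above; the division by (1 - beta^t) is the optional bias correction that fixes the first few values (it is not spelled out in these notes):

```python
temperatures = [40, 49, 45, 60, 55]   # t(1), t(2), ... (first three values from the example above)
beta = 0.9

v = 0.0
for t, theta in enumerate(temperatures, start=1):
    v = beta * v + (1 - beta) * theta             # exponentially weighted average
    v_corrected = v / (1 - beta ** t)             # bias-corrected estimate
    print(t, round(v, 2), round(v_corrected, 2))  # v: 4.0, 8.5, 12.15, ... as in the example
```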
vdW = 0, vdb = 0
on iteration t:
# can be mini-batch or batch gradient descent
compute dw, db on current mini-batch
vdW = beta * vdW + (1 - beta) * dW
vdb = beta * vdb + (1 - beta) * db
W = W - learning_rate * vdW
b = b - learning_rate * vdb
Momentum helps the cost function reach the minimum point in a faster and more consistent way.
beta is another hyperparameter . beta = 0.9 is very common and works very well in most cases.
RMSprop
Stands for Root mean square prop.
This algorithm speeds up the gradient descent.
Pseudo code:
sdW = 0, sdb = 0
on iteration t:
    # can be mini-batch or batch gradient descent
    compute dw, db on current mini-batch
    sdW = beta * sdW + (1 - beta) * dW^2   # squaring is element-wise
    sdb = beta * sdb + (1 - beta) * db^2
    W = W - learning_rate * dW / sqrt(sdW)
    b = b - learning_rate * db / sqrt(sdb)
RMSprop will make the cost function move slower on the vertical direction and faster on the horizontal direction in the
following example:
Ensure that sdW is not zero by adding a small value epsilon (e.g. epsilon = 10^-8 ) to it:
W = W - learning_rate * dW / (sqrt(sdW) + epsilon)
vdW = 0, vdb = 0
sdW = 0, sdb = 0
on iteration t:
# can be mini-batch or batch gradient descent
compute dw, db on current mini-batch
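A hedged sketch of one full Adam update for a single parameter matrix W, combining the momentum and RMSprop terms with bias correction (typical defaults: beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):

```python
import numpy as np

def adam_update(W, dW, vdW, sdW, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step; t is the iteration number, starting at 1."""
    vdW = beta1 * vdW + (1 - beta1) * dW             # momentum-like term
    sdW = beta2 * sdW + (1 - beta2) * (dW ** 2)      # RMSprop-like term
    vdW_corrected = vdW / (1 - beta1 ** t)           # bias correction
    sdW_corrected = sdW / (1 - beta2 ** t)
    W = W - learning_rate * vdW_corrected / (np.sqrt(sdW_corrected) + epsilon)
    return W, vdW, sdW
```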
Some people perform learning rate decay discretely - repeatedly decrease after some number of epochs.
Some people are making changes to the learning rate manually.
decay_rate is another hyperparameter .
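One common decay schedule from the lectures, as a small sketch (alpha_0 and decay_rate are hyperparameters):

```python
def decayed_learning_rate(alpha_0, decay_rate, epoch_num):
    return alpha_0 / (1 + decay_rate * epoch_num)

# e.g. alpha_0 = 0.2, decay_rate = 1 gives 0.1, 0.067, 0.05, 0.04 over epochs 1..4
print([round(decayed_learning_rate(0.2, 1, e), 3) for e in range(1, 5)])
```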
Tuning process
We need to tune our hyperparameters to get the best out of them.
Hyperparameter importance (according to Andrew Ng):
i. Learning rate.
ii. Momentum beta.
iii. Mini-batch size.
iv. No. of hidden units.
v. No. of layers.
vi. Learning rate decay.
vii. Regularization lambda.
viii. Activation functions.
ix. Adam beta1 & beta2 .
It's hard to say which hyperparameter is the most important in a problem. It depends a lot on your problem.
One of the ways to tune is to sample a grid with N hyperparameter settings and then try all settings combinations on
your problem.
Try random values: don't use a grid.
You can use Coarse to fine sampling scheme :
When you find some hyperparameters values that give you a better performance - zoom into a smaller region
around these values and sample more densely within this space.
These methods can be automated.
a_log = -3
b_log = -1
r = (a_log - b_log) * np.random.rand() + b_log # r is uniform in [-3, -1]
beta = 1 - 10**r # because 1 - beta = 10^r
beta[1] , gamma[1] , ..., beta[L] , gamma[L] are updated using any optimization algorithms (like GD, RMSprop,
Adam)
If you are using a deep learning framework, you won't have to implement batch norm yourself:
Ex. in TensorFlow you can add this line: tf.nn.batch_normalization()
Batch normalization is usually applied with mini-batches.
If we are using batch normalization, the parameters b[1], ..., b[L] don't count because they are eliminated by the mean-subtraction step, so:
beta[l] - shape (n[l], 1), broadcast across the m examples
gamma[l] - shape (n[l], 1), broadcast across the m examples
Softmax Regression
In every example we have used so far we were talking about binary classification.
There is a generalization of logistic regression called softmax regression that is used for multiclass classification.
For example, suppose we are classifying the classes dog, cat, baby chick and none of those:
Dog class = 1
Cat class = 2
Baby chick class = 3
None class = 0
To represent a dog vector y = [0 1 0 0]
To represent a cat vector y = [0 0 1 0]
To represent a baby chick vector y = [0 0 0 1]
To represent a none vector y = [1 0 0 0]
Notations:
C = no. of classes
t = e^(Z[L]) # shape(C, m)
A[L] = e^(Z[L]) / sum(t) # shape(C, m), sum(t) - sum of t's for each example (shape (1, m))
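A minimal NumPy sketch of that softmax activation over a batch (columns are examples; the max-subtraction is a standard numerical-stability trick, not part of the notes):

```python
import numpy as np

def softmax(Z):
    """Z has shape (C, m); returns class probabilities of the same shape."""
    t = np.exp(Z - np.max(Z, axis=0, keepdims=True))   # shift for numerical stability
    return t / np.sum(t, axis=0, keepdims=True)

Z = np.array([[5.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [3.0, 1.0]])   # C = 4 classes, m = 2 examples
A = softmax(Z)
print(A.sum(axis=0))   # each column sums to 1
```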
Training a Softmax classifier
There's an activation which is called hard max, which gets 1 for the maximum value and zeros for the others.
If you are using NumPy, it's np.max over the vertical axis.
The softmax name comes from softening the values rather than hardening them like hard max.
Softmax is a generalization of logistic activation function to C classes. If C = 2 softmax reduces to logistic regression.
The loss function used with softmax is the cross-entropy: L(y, y_hat) = - sum over the C classes of y[j] * log(y_hat[j])
Backpropagation with softmax gives: dZ[L] = Y_hat - Y
The derivative of the softmax activation itself is: Y_hat * (1 - Y_hat)
Example:
TensorFlow
In this section we will learn the basic structure of TensorFlow programs.
Lets see how to implement a minimization function:
Code v.1:
import numpy as np
import tensorflow as tf

w = tf.Variable(0, dtype=tf.float32)                              # the parameter to optimize
cost = w**2 - 10*w + 25                                           # example cost from the lecture: (w - 5)^2
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
session.run(w) # Runs the definition of w, if you print this it will print zero
session.run(train)

for i in range(1000):
    session.run(train)        # after training, session.run(w) is close to 5
Code v.2 (we feed the inputs to the algorithm through coefficients):
import numpy as np
import tensorflow as tf

coefficients = np.array([[1.], [-10.], [25.]])           # the data we feed into the graph
x = tf.placeholder(tf.float32, [3, 1])                   # placeholder for the coefficients
w = tf.Variable(0, dtype=tf.float32)
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]                # same cost as before, now driven by data

train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
session.run(w) # Runs the definition of w, if you print this it will print zero
session.run(train, feed_dict={x: coefficients})

for i in range(1000):
    session.run(train, feed_dict={x: coefficients})
In TensorFlow you implement only the forward propagation and TensorFlow will do the backpropagation by itself.
In TensorFlow a placeholder is a variable you can assign a value to later.
If you are using a mini-batch training you should change the feed_dict={x: coefficients} to the current mini-batch
data.
Almost all TensorFlow programs use this:
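(Presumably the `with`-block session pattern, in the same TF 1.x style as the code above.)

```python
with tf.Session() as session:   # the session is cleaned up even if an exception occurs
    session.run(init)
    print(session.run(w))
```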
In deep learning frameworks there are a lot of things that you can do with one line of code like changing the optimizer.
Side notes:
Writing and running programs in TensorFlow has the following steps:
i. Create Tensors (variables) that are not yet executed/evaluated.
ii. Write operations between those Tensors.
iii. Initialize your Tensors.
iv. Create a Session.
v. Run the Session. This will run the operations you'd written above.
Instead of needing to write code to compute the cost function we know, we can use this line in TensorFlow :
tf.nn.sigmoid_cross_entropy_with_logits(logits = ..., labels = ...)
For 3-layer NN, it is important to note that the forward propagation stops at Z3 . The reason is that in TensorFlow the
last linear layer output is given as input to the function computing the loss. Therefore, you don't need A3 !
To reset the graph use tf.reset_default_graph()
Extra Notes
If you want good papers in deep learning, look at the ICLR proceedings (or the NIPS proceedings); that will give you a really good view of the field.
Who is Yuanqing Lin?
Head of Baidu research.
First one to win ImageNet
Works in PaddlePaddle deep learning platform.
I've seen teams waste months or years through not understanding the principles taught in this course. I hope this two
week course will save you months of time.
This is a standalone course, and you can take this so long as you have basic machine learning knowledge. This is the
third course in the Deep Learning Specialization.
ML Strategy 1
Why ML Strategy
You have a lot of ideas for how to improve the accuracy of your deep learning system:
Collect more data.
Collect more diverse training set.
Train algorithm longer with gradient descent.
Try different optimization algorithm (e.g. Adam).
Try bigger network.
Try smaller network.
Try dropout.
Add L2 regularization.
Change network architecture (activation functions, # of hidden units, etc.)
This course will give you some strategies to help analyze your problem to go in a direction that will help you get better
results.
Orthogonalization
Some deep learning developers know exactly what hyperparameter to tune in order to try to achieve one effect. This is a
process we call orthogonalization.
In orthogonalization, you have some controls, but each control does a specific task and doesn't affect other controls.
For a supervised learning system to do well, you usually need to tune the knobs of your system to make sure that four
things hold true - chain of assumptions in machine learning:
i. You'll have to fit training set well on cost function (near human level performance if possible).
If it's not achieved you could try bigger network, another optimization algorithm (like Adam)...
ii. Fit dev set well on cost function.
If it's not achieved you could try regularization, a bigger training set...
iii. Fit test set well on cost function.
If it's not achieved you could try a bigger dev set...
iv. Performs well in the real world.
If it's not achieved you could try changing the dev set or the cost function...
Suppose we run the classifier on 10 images, of which 5 are cats and 5 are non-cats. The classifier predicts 4 of them as cats, but 1 of those predictions is wrong.
Confusion matrix:

|                | Predicted cat | Predicted non-cat |
| -------------- | ------------- | ----------------- |
| Actual cat     | 3             | 2                 |
| Actual non-cat | 1             | 4                 |

Precision: percentage of true cats among the images predicted as cat: P = 3/(3 + 1)
Recall: percentage of the actual cats that were correctly recognized: R = 3/(3 + 2)
Accuracy: (3 + 4)/10
Using precision/recall for evaluation is good in a lot of cases, but separately they don't tell you which algorithm is better. Ex:
| Classifier | Precision | Recall |
| ---------- | --------- | ------ |
| A          | 95%       | 90%    |
| B          | 98%       | 85%    |
A better approach is to combine precision and recall into one single (real-valued) evaluation metric. There is a metric called the F1 score, which combines them.
You can think of the F1 score as the harmonic mean of precision and recall: F1 = 2 / ((1/P) + (1/R))
| Classifier | Accuracy | Running time |
| ---------- | -------- | ------------ |
| A          | 90%      | 80 ms        |
| B          | 92%      | 95 ms        |
| C          | 92%      | 1,500 ms     |

So we can solve that by choosing a single optimizing metric (e.g. accuracy) and deciding that the other metrics (e.g. running time) only need to be satisficing. Ex:
So as a general rule:
Train/dev/test distributions
Dev and test sets have to come from the same distribution.
Choose dev set and test set to reflect data you expect to get in the future and consider important to do well on.
Setting up the dev set, as well as the validation metric is really defining what target you want to aim at.
Algorithm A 3% error (But a lot of porn images are treated as cat images here)
Algorithm B 5% error
In the last example if we choose the best algorithm by metric it would be "A", but if the users decide it will be "B"
Thus in this case, we want and need to change our metric.
OldMetric = (1/m) * sum(y_pred[i] != y[i] ,m)
Where m is the number of Dev set items.
NewMetric = (1/sum(w[i])) * sum(w[i] * (y_pred[i] != y[i]), m)
where:
w[i] = 1 if x[i] is not porn, and a much larger weight (e.g. 10) if x[i] is porn.
This is actually an example of an orthogonalization where you should take a machine learning problem and break it into
distinct steps:
i. Figure out how to define a metric that captures what you want to do - place the target.
ii. Worry about how to actually do well on this metric - how to aim/shoot accurately at the target.
Conclusion: if doing well on your metric + dev/test set doesn't correspond to doing well in your application, change
your metric and/or dev/test set.
Avoidable bias
Suppose that the cat classification algorithm gives these results:
|                | Left example | Right example |
| -------------- | ------------ | ------------- |
| Humans         | 1%           | 7.5%          |
| Training error | 8%           | 8%            |
In the left example, because the human level error is 1% then we have to focus on the bias.
In the right example, because the human level error is 7.5% then we have to focus on the variance.
We use human-level error as a proxy (estimate) for Bayes optimal error. Bayes optimal error is always lower (better), but human-level error is usually not far from it.
You can't do better than Bayes error unless you are overfitting.
Avoidable bias = Training error - Human (Bayes) error
ML Strategy 2
In the cat classification example, if you have 10% error on your dev set and you want to decrease the error.
You discovered that some of the mislabeled data are dog pictures that look like cats. Should you try to make your
cat classifier do better on dogs (this could take some weeks)?
Error analysis approach:
Get 100 mislabeled dev set examples at random.
Count up how many are dogs.
if 5 of 100 are dogs then training your classifier to do better on dogs will decrease your error up to 9.5% (called
ceiling), which can be too little.
if 50 of 100 are dogs then you could decrease your error up to 5%, which is reasonable and you should work on
that.
Based on the last example, error analysis helps you analyze the error before taking an action that could take a lot of time for little gain.
Sometimes, you can evaluate multiple error analysis ideas in parallel and choose the best idea. Create a spreadsheet to
do that and decide, e.g.:
(Spreadsheet rows omitted: each mislabeled image gets a checkmark under the error categories it falls into - e.g. Dog, Great cat, Blurry - plus a free-text comment such as "Pitbull"; the column totals show which category accounts for most of the errors.)
In the last example you will decide to work on great cats or blurry images to improve your performance.
This quick counting procedure, which you can often do in, at most, small numbers of hours can really help you make
much better prioritization decisions, and understand how promising different approaches are to work on.
If you want to check for mislabeled data in dev/test set, you should also try error analysis with the mislabeled column.
Ex:
(Spreadsheet rows omitted: the same kind of table as before, with an extra "Incorrectly labeled" column checked for examples whose ground-truth label is wrong.)
Then:
If overall dev set error: 10%
Then errors due to incorrect data: 0.6%
Then errors due to other causes: 9.4%
Then you should focus on the 9.4% error rather than the incorrect data.
Apply the same process to your dev and test sets to make sure they continue to come from the same distribution.
Consider examining examples your algorithm got right as well as ones it got wrong. (Not always done if you
reached a good accuracy)
Train and dev/test data may now come from slightly different distributions.
It's very important to have dev and test sets to come from the same distribution. But it could be OK for a train set to
come from slightly other distribution.
1. Carry out manual error analysis to try to understand the difference between training and dev/test sets.
2. Make training data more similar, or collect more data similar to dev/test sets.
If your goal is to make the training data more similar to your dev set, one of the techniques you can use is artificial data synthesis, which can help you make more training data.
Combine some of your training data with something that can convert it to the dev/test set distribution.
Examples:
a. Combine normal audio with car noise to get audio with car noise example.
b. Generate cars using 3D graphics in a car classification example.
Be cautious and bear in mind whether or not you might be accidentally simulating data only from a tiny subset of
the space of all possible examples because your NN might overfit these generated data (like particular car noise or
a particular design of 3D graphics cars).
Transfer learning
Transfer learning: take the knowledge learned on a task A and apply it to another task B.
For example, if you have trained a cat classifier with a lot of data, you can reuse part of the trained NN to solve an x-ray classification problem.
To do transfer learning, delete the last layer of the NN and its weights, and then:
i. Option 1: if you have a small data set - keep all the other weights as a fixed weights. Add a new last layer(-s) and
initialize the new layer weights and feed the new data to the NN and learn the new weights.
ii. Option 2: if you have enough data you can retrain all the weights.
Option 1 and 2 are called fine-tuning and training on task A called pretraining.
When transfer learning makes sense:
Task A and B have the same input X (e.g. image, audio).
You have a lot of data for the task A you are transferring from and relatively less data for the task B you're transferring to.
Low level features from task A could be helpful for learning task B.
Multi-task learning
Whereas in transfer learning, you have a sequential process where you learn from task A and then transfer that to task B.
In multi-task learning, you start off simultaneously, trying to have one neural network do several things at the same
time. And then each of these tasks helps hopefully all of the other tasks.
Example:
You want to build an object recognition system that detects pedestrians, cars, stop signs, and traffic lights (image
has multiple labels).
Then Y shape will be (4,m) because we have 4 classes and each one is a binary one.
Then
Cost = (1/m) * sum(sum(L(y_hat(i)_j, y(i)_j))), i = 1..m, j = 1..4 , where
L = - y(i)_j * log(y_hat(i)_j) - (1 - y(i)_j) * log(1 - y_hat(i)_j)
In the last example you could have trained 4 neural networks separately but if some of the earlier features in neural
network can be shared between these different types of objects, then you find that training one neural network to do
four things results in better performance than training 4 completely separate neural networks to do the four tasks
separately.
Multi-task learning will also work if y isn't complete for some labels. For example:
Y = [1 ? 1 ...]
[0 0 1 ...]
[? 1 ? ...]
And in this case it will do good with the missing data, just the loss function will be different:
Loss = (1/m) * sum(sum(L(y_hat(i)_j, y(i)_j) for all j which y(i)_j != ?))
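A hedged NumPy sketch of that masked loss (summing only over the labels that are known; unknown '?' entries are encoded here as np.nan):

```python
import numpy as np

def multitask_loss(Y_hat, Y):
    """Y and Y_hat have shape (n_tasks, m); entries of Y that are np.nan are unknown and skipped."""
    mask = ~np.isnan(Y)                     # True where the label is known
    eps = 1e-12                             # avoid log(0)
    losses = -(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))
    return np.sum(losses[mask]) / Y.shape[1]

Y = np.array([[1, np.nan, 1],
              [0, 0, 1],
              [np.nan, 1, np.nan]])
Y_hat = np.full_like(Y, 0.7)                # toy predictions
print(multitask_loss(Y_hat, Y))
```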
Audio ---> Features --> Phonemes --> Words --> Transcript # non-end-to-end system
Audio ---------------------------------------> Transcript # end-to-end deep learning system
End-to-end deep learning gives the data more freedom; it might not use phonemes at all when training!
To build an end-to-end deep learning system that works well, we need a big dataset (more data than a non-end-to-end system needs). If we have a small dataset, the ordinary (pipeline) implementation may work just fine.
Example 2:
Face recognition system:
Example 3, machine translation:
English --> Text analysis --> ... --> French # non-end-to-end system
English ----------------------------> French # end-to-end deep learning system - best approach
Here the end-to-end deep learning system works better because we have enough data to build it.
Example 4:
Estimating child's age from the x-ray picture of a hand:
Image --> Bones --> Age # non-end-to-end system - best approach for now
Image ------------> Age # end-to-end system
In this example non-end-to-end system works better because we don't have enough data to train end-to-end
system.
Course summary
Here is the course summary as given on the course link:
This course will teach you how to build convolutional neural networks and apply it to image data. Thanks to deep
learning, computer vision is working far better than just two years ago, and this is enabling numerous exciting
applications ranging from safe autonomous driving, to accurate face recognition, to automatic reading of radiology
images.
You will:
Understand how to build a convolutional neural network, including recent variations such as residual networks.
Know how to apply convolutional networks to visual detection and recognition tasks.
Know to use neural style transfer to generate art.
Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data.
Foundations of CNNs
Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep
network to solve multi-class image classification problems.
Computer vision
Computer vision is one of the areas that is advancing rapidly thanks to deep learning.
Some of the applications of computer vision that use deep learning include:
Self driving cars.
Face recognition.
Deep learning is also enabling new types of art to be created.
Rapid changes to computer vision are making new applications that weren't possible a few years ago.
Computer vision deep learning techniques are always evolving, producing new architectures that can help us in areas other than computer vision.
For example, Andrew Ng took some ideas from computer vision and applied them to speech recognition.
Examples of computer vision problems include:
Image classification.
Object detection.
Detect objects and localize them.
Neural style transfer
Changes the style of an image using another image.
One of the challenges of computer vision is that images can be very large, and we want a fast and accurate algorithm that can handle that.
For example, a 1000x1000 RGB image gives 3 million features/inputs to a fully connected neural network. If the first hidden layer contains 1000 units, we would need to learn a weight matrix of shape [1000, 3 million], which is 3 billion parameters in the first layer alone; that is computationally very expensive!
One of the solutions is to build this using convolution layers instead of the fully connected layers.
In an image we can detect vertical edges, horizontal edges, or full edge detector.
In the last example a 6x6 matrix convolved with 3x3 filter/kernel gives us a 4x4 matrix.
If you do the convolution operation in TensorFlow you will find the function tf.nn.conv2d. In Keras you will find the Conv2D layer.
The vertical edge detection filter finds 3x3 places in an image where there is a bright region followed by a dark region.
If we apply this filter to a white region followed by a dark region, it finds the edge between the two colors as a positive value. If we apply the same filter to a dark region followed by a white region, it gives negative values. If we only care about the presence of an edge, we can take the abs of the result to make it positive.
1 1 1
0 0 0
-1 -1 -1
There are a lot of ways to choose the numbers inside a horizontal or vertical edge detector. For example, here is the vertical Sobel filter (the idea is to give more weight to the middle row):
1 0 -1
2 0 -2
1 0 -1
There is also the Scharr filter (the idea is to give even more weight to the middle row):
3 0 -3
10 0 -10
3 0 -3
What we learned in deep learning is that we don't need to hand-craft these numbers; we can treat them as weights and learn them. The network can then learn horizontal, vertical, angled, or any other edge type automatically rather than having it set by hand.
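A minimal NumPy sketch of the (cross-correlation style) convolution described above, applied to a toy image with the Sobel filter (the helper name and values are mine):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' convolution/cross-correlation of a 2D image with a 2D kernel (no padding, stride 1)."""
    n = image.shape[0]
    f = kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# 6x6 image: bright (10) on the left half, dark (0) on the right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
vertical_edge = np.array([[1, 0, -1],
                          [2, 0, -2],
                          [1, 0, -1]], dtype=float)   # the Sobel filter from above
print(conv2d(image, vertical_edge))   # 4x4 output with large positive values along the edge
```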
Padding
In order to use deep convolutional neural networks we really need to use padding.
In the last section we saw that a 6x6 matrix convolved with 3x3 filter/kernel gives us a 4x4 matrix.
As a general rule, if an n x n matrix is convolved with an f x f filter/kernel, it gives us an (n-f+1) x (n-f+1) matrix.
We want to apply the convolution operation multiple times, but if the image shrinks each time we will lose a lot of data in the process. Also, the edge pixels are used less than the other pixels in the image.
So the problems with convolutions are:
The output shrinks.
A lot of the information at the edges is thrown away.
To solve these problems we can pad the input image before convolution by adding some rows and columns to it. We call the padding amount P, the number of rows/columns inserted at the top, bottom, left and right of the image.
The general rule now: if an n x n matrix is convolved with an f x f filter/kernel and padding p, it gives us an (n+2p-f+1) x (n+2p-f+1) matrix.
If n = 6, f = 3, and p = 1 Then the output image will have n+2p-f+1 = 6+2-3+1 = 6 . We maintain the size of the image.
A "same" convolution is a convolution with padding chosen so that the output size is the same as the input size. It's given by the equation:
P = (f-1) / 2
In computer vision, f is usually odd. One of the reasons is that an odd-sized filter has a center value.
Strided convolution
Strided convolution is another building block used in CNNs.
In the convolution operation, we use s to denote the number of pixels we jump when sliding the filter/kernel. In the previous examples s was 1.
If an n x n matrix is convolved with an f x f filter/kernel, padding p, and stride s, it gives us a (floor((n+2p-f)/s) + 1) x (floor((n+2p-f)/s) + 1) matrix (see the small helper below).
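A tiny helper that evaluates that output-size formula (illustrative, using the examples from this section):

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  -> the 6x6 image with a 3x3 filter
print(conv_output_size(6, 3, p=1))       # 6  -> "same" convolution
print(conv_output_size(7, 3, p=0, s=2))  # 3
```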
In math textbooks the convolution operation flips the filter before using it. What we are doing is technically called cross-correlation, but the deep learning state of the art calls this operation convolution.
A "same" convolution with stride uses padding chosen so that the output size equals the input size. It's given by the equation:
p = (n*s - n + f - s) / 2
When s = 1 ==> P = (f-1) / 2
Hint: no matter the size of the input, the number of parameters stays the same if the filter size is the same. That makes convolutions less prone to overfitting.
Hyperparameters
f[l] = filter size
p[l] = padding # Default is zero
s[l] = stride
nc[l] = number of filters
number of filters = 10
number of filters = 20
number of filters = 40
In the last example you saw that the image gets smaller after each layer; that's the usual trend.
Types of layer in a convolutional network:
Convolution. #Conv
Pooling #Pool
Fully connected #FC
Pooling layers
Other than the conv layers, CNNs often use pooling layers to reduce the size of the inputs, speed up computation, and make some of the detected features more robust.
Max pooling example:
This example has f = 2 , s = 2 , and p = 0 hyperparameters
The intuition behind max pooling: if a feature is detected anywhere in the filter window, keep a high number. But the main reason people use pooling is that it works well in practice and reduces computation.
Max pooling has no parameters to learn.
Example of Max pooling on 3D input:
Input: 4x4x10
Max pooling size = 2 and stride = 2
Output: 2x2x10
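A minimal NumPy sketch of max pooling on a 3D input, matching the 4x4x10 -> 2x2x10 example above (the function name is mine):

```python
import numpy as np

def max_pool(A, f=2, s=2):
    """Max pooling over an input of shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = A.shape
    out_H = (n_H - f) // s + 1
    out_W = (n_W - f) // s + 1
    out = np.zeros((out_H, out_W, n_C))
    for h in range(out_H):
        for w in range(out_W):
            out[h, w, :] = A[h * s:h * s + f, w * s:w * s + f, :].max(axis=(0, 1))
    return out

A = np.random.randn(4, 4, 10)
print(max_pool(A).shape)   # (2, 2, 10)
```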
Average pooling is taking the averages of the values instead of taking the max values.
Max pooling is used more often than average pooling in practice.
If the pooling stride equals the pooling window size, the windows don't overlap and the input is shrunk by that factor.
Hyperparameters summary
f : filter size.
s : stride.
Padding is rarely used here.
Max or average pooling.
number of filters = 6
number of filters = 16
Why convolutions?
Two main advantages of Convs are:
Parameter sharing.
A feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in
another part of the image.
Sparsity of connections.
In each layer, each output value depends only on a small number of inputs. This also helps the network be translation invariant.
Putting it all together:
Classic networks
In this section we will talk about classic networks which are LeNet-5, AlexNet, and VGG.
LeNet-5
The goal for this model was to identify handwritten digits in a 32x32x1 gray image. Here is the drawing of it:
This model was published in 1998. The last layer wasn't using softmax back then.
It has 60k parameters.
The dimensions of the image decrease as the number of channels increases.
Conv ==> Pool ==> Conv ==> Pool ==> FC ==> FC ==> softmax this type of arrangement is quite common.
The activation functions used in the paper were sigmoid and tanh. Modern implementations use RELU in most cases.
[LeCun et al., 1998. Gradient-based learning applied to document recognition]
AlexNet
Named after Alex Krizhevsky, the first author of the paper. The other authors include Geoffrey Hinton.
The goal of the model was the ImageNet challenge, which classifies images into 1000 classes. Here is the drawing of the model:
Summary:
Conv => Max-pool => Conv => Max-pool => Conv => Conv => Conv => Max-pool ==> Flatten ==> FC ==> FC
==> Softmax
The original paper used multiple GPUs and Local Response Normalization (LRN).
Multiple GPUs were used because GPUs were not so fast back then.
Researchers later showed that Local Response Normalization doesn't help much, so for now don't bother understanding or implementing it.
This paper convinced the computer vision researchers that deep learning is so important.
[Krizhevsky et al., 2012. ImageNet classification with deep convolutional neural networks]
VGG-16
This network is large even by modern standards. It has around 138 million parameters.
Most of the parameters are in the fully connected layers.
It has a total memory of 96MB per image for only forward propagation!
Most of the memory is used in the earlier layers.
The number of filters increases from 64 to 128 to 256 to 512 (512 is used twice).
Pooling is the only operation responsible for shrinking the spatial dimensions.
There is another, bigger version called VGG-19, but most people use VGG-16 because it performs about the same.
The VGG paper is attractive because it tries to establish some rules for designing CNNs.
[Simonyan & Zisserman 2015. Very deep convolutional networks for large-scale image recognition]
These networks can go deeper without hurting performance. In normal NNs - plain networks - the theory tells us that going deeper should give a better solution to our problem, but because of the vanishing and exploding gradient problems the performance of a plain network suffers as it goes deeper. Thanks to residual networks we can now go as deep as we want.
On the left is the normal NN and on the right are the ResNet. As you can see the performance of ResNet increases
as the network goes deeper.
In some cases going deeper won't affect the performance; that depends on the problem at hand.
Some people are trying to train networks of 1000 layers now, which aren't used in practice.
[He et al., 2015. Deep residual networks for image recognition]
X --> Big NN --> a[l] --> Layer1 --> Layer2 --> a[l+2]
Then:
Then, if we are using L2 regularization for example, W[l+2] can shrink to zero. Let's say that b[l+2] is zero too.
This shows that the identity function is easy for a residual block to learn, and that's why residual networks can train deeper NNs.
Also, the two layers we added don't hurt the performance of the big NN we started with.
Hint: the dimensions of z[l+2] and a[l] have to be the same in ResNets. If they have different dimensions, we insert a matrix of parameters Ws (which can be learned or fixed):
a[l+2] = g( z[l+2] + ws * a[l] ) # The added Ws should make the dimensions equal
Using a skip-connection helps the gradient to backpropagate and thus helps you to train deeper networks
Identity block:
Hint: each conv is followed by a batch norm (BN) before the RELU. Dimensions here stay the same.
This skip is over 2 layers. The skip connection can jump n connections where n>2
This drawing represents Keras layers.
The convolutional block:
It has been used in a lot of modern CNN implementations like ResNet and Inception models.
We want to shrink the number of channels. We also call this feature transformation.
In the second example discussed above we have shrunk the input from 32 to 5 channels.
We will later see that by shrinking it we can save a lot of computations.
If we set the number of 1 x 1 conv filters equal to the input number of channels, the output will contain the same number of channels. The 1 x 1 conv then acts as a non-linearity applied across the channels and lets the network learn a non-linear operator.
Replace fully connected layers with 1 x 1 convolutions as Yann LeCun believes they are the same.
In Convolutional Nets, there is no such thing as "fully-connected layers". There are only convolution layers with
1x1 convolution kernels and a full connection table. Yann LeCun
Transfer Learning
If you are using a specific NN architecture that has been trained before, you can use its pretrained parameters/weights instead of random initialization to solve your problem.
It can help you boost the performance of the NN.
The pretrained models might have been trained on large datasets like ImageNet, MS COCO, or Pascal VOC, taking a lot of time to learn their parameters/weights with optimized hyperparameters. Reusing them can save you a lot of time.
Lets see an example:
Lets say you have a cat classification problem which contains 3 classes Tigger, Misty and neither.
You don't have a lot of data to train a NN on these images.
Andrew recommends going online and downloading a good NN with its weights, removing the softmax activation layer, putting your own in its place, and having the network learn only the new layer while the other layers' weights are fixed/frozen.
Frameworks have options to make the parameters frozen in some layers using trainable = 0 or freeze = 0
One of the tricks that can speed up your training is to run the pretrained NN without its final softmax layer, get the intermediate representation of your images, and save it to disk. Then train a shallow NN on these representations. This saves you the time needed to run each image through all the layers repeatedly.
It's like converting your images into feature vectors.
Another example:
What if, in the last example, you have a lot of pictures of your cats?
One thing you can do is freeze a few layers at the beginning of the pretrained network and learn the other weights in the network.
Another idea is to throw away the layers that aren't frozen and put your own layers there instead.
Another example:
If you have enough data, you can fine-tune all the layers in your pretrained network, but don't randomly initialize the parameters; leave the learned parameters as they are and continue learning from there.
Data Augmentation
With more data, your deep NN will perform better. Data augmentation is one of the techniques deep learning uses to improve the performance of a deep NN.
The majority of computer vision applications need more data right now.
Some data augmentation methods used for computer vision tasks include:
Mirroring.
Random cropping.
The issue with this technique is that you might take a wrong crop.
The solution is to make your crops big enough.
Rotation.
Shearing.
Local warping.
Color shifting.
For example, we add some distortions to R, G, and B so that a human still identifies the image as the same,
while to the computer it looks different.
In practice the added values are drawn from some probability distribution, and the shifts are small.
This makes your algorithm more robust to color changes in images.
There is an algorithm called PCA color augmentation that decides the needed shifts automatically.
Implementing distortions during training:
You can use a different CPU thread to create distorted mini-batches while you are training your NN.
Data Augmentation has also some hyperparameters. A good place to start is to find an open source data augmentation
implementation and then use it or fine tune these hyperparameters.
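A minimal sketch of the mirroring / cropping-like shifts / rotation / shearing / color-shifting methods above, using Keras' ImageDataGenerator (the parameter values are arbitrary starting points, not recommendations):
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,       # mirroring
    width_shift_range=0.1,      # random shifts approximate random cropping
    height_shift_range=0.1,
    rotation_range=15,          # rotation
    shear_range=0.1,            # shearing
    channel_shift_range=20.0,   # color shifting
)
# datagen.flow(X_train, y_train, batch_size=32) yields distorted mini-batches on the fly.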
Object detection
Learn how to apply your knowledge of CNNs to one of the toughest but hottest fields of computer vision: object
detection.
Object Localization
Object detection is one of the areas in which deep learning has been doing great over the past few years.
Image Classification:
Classify an image into a specific class. The whole image represents one class. We don't need to know exactly
where the object is. Usually only one object is present.
Semantic Segmentation:
We want to label each pixel in the image with a category label. Semantic segmentation doesn't differentiate
instances; it only cares about pixels, not individual objects.
If two objects of the same class intersect, we won't be able to separate them.
Instance Segmentation
This is the full problem: rather than only predicting a bounding box, we want to label every pixel with its class
and also distinguish different instances.
To make image classification we use a Conv Net with a Softmax attached to the end of it.
To make classification with localization we use a Conv Net with a softmax attached to the end of it plus four numbers
bx , by , bh , and bw that tell you the location of the object in the image. The dataset should contain these four
numbers along with the class.
Y = [
Pc # Probability that an object is present
bx # Bounding box
by # Bounding box
bh # Bounding box
bw # Bounding box
c1 # The classes
c2
...
]
Y = [
1 # Object is present
0
0
100
100
0
1
0
]
Y = [
0 # Object isn't present
? # ? means we don't care about the other values
?
?
?
?
?
?
]
The loss function for the Y we have created (Example of the square error):
L(y',y) = {
(y1'-y1)^2 + (y2'-y2)^2 + ... if y1 = 1
(y1'-y1)^2 if y1 = 0
}
In practice we use a logistic regression loss for Pc , a log-likelihood loss for the classes, and squared error for the
bounding box.
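A toy sketch of the squared-error version of this loss, just to show the two cases (object present vs. absent); the real setup mixes the loss types mentioned above:
import numpy as np

def localization_loss(y_true, y_pred):
    if y_true[0] == 1:                          # object present: penalize all components
        return np.sum((y_pred - y_true) ** 2)
    return (y_pred[0] - y_true[0]) ** 2         # no object: only Pc matters, the rest are "don't care"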
Landmark Detection
In some of the computer vision problems you will need to output some points. That is called landmark detection.
For example, if you are working on a face recognition problem you might want some points on the face like the corners of
the eyes, the corners of the mouth, the corners of the nose, and so on. This can help in a lot of applications like
detecting the pose of the face.
Y shape for the face recognition problem that needs to output 64 landmarks:
Y = [
ThereIsAFace # Probability that a face is present: 0 or 1
l1x,
l1y,
....,
l64x,
l64y
]
Another application is detecting the skeleton of a person using different landmarks/points on the body, which
helps in some applications.
Hint: in your labeled data, if (l1x, l1y) is the left corner of the left eye, then (l1x, l1y) has to refer to the same
point in all the other examples.
Object Detection
We will use a Conv net to solve the object detection problem using a technique called the sliding windows detection
algorithm.
For example lets say we are working on Car object detection.
First, we will train a Conv net on cropped car images and non-car images.
After we finish training of this Conv net we will then use it with the sliding windows technique.
Sliding windows detection algorithm:
i. Decide a rectangle size.
ii. Split your image into rectangles of the size you picked. Each region should be covered. You can use some strides.
iii. For each rectangle, feed the cropped region into the Conv net and decide if it's a car or not.
iv. Pick larger/smaller rectangles and repeat steps ii and iii.
v. Store the rectangles that contains the cars.
vi. If two or more rectangles intersect, choose the rectangle with the best prediction score.
Disadvantage of sliding window is the computation time.
In the era of machine learning before deep learning, people used hand-crafted linear classifiers to classify the
object and then applied the sliding windows technique. The linear classifier made the computation cheap. But in the deep
learning era this is computationally expensive due to the complexity of the deep learning model.
To solve this problem, we can implement the sliding windows with a Convolutional approach.
One other idea is to compress your deep learning model.
Say now we have a 16 x 16 x 3 image on which we need to apply sliding windows. With the normal implementation
mentioned in the previous section, we would run the Conv net four times, once for each window.
The convolution implementation is as follows:
We simply feed the whole image into the same Conv net we have trained.
The top-left cell of the result (the blue one) represents the first sliding window of the normal implementation.
The other cells represent the others.
It's more efficient because it shares the computation between the four passes that would otherwise be needed.
Another example would be:
This example has a total of 16 sliding windows that share the computation.
[Sermanet et al., 2014, OverFeat: Integrated recognition, localization and detection using convolutional networks]
The weakness of the algorithm is that the position of the rectangle won't be very accurate. Maybe none of the rectangles
falls exactly on the object you want to recognize.
In the figure, one rectangle is the algorithm's output and the other is the rectangle that actually fits the car.
YOLO stands for you only look once and was developed back in 2015.
Yolo Algorithm:
We have a problem if more than one object falls in one grid cell.
One of the best advantages that makes the YOLO algorithm popular is its great speed and the fact that it is a single
Conv net implementation.
How is YOLO different from other Object detectors? YOLO uses a single CNN network for both classification and
localizing the object using bounding boxes.
In the next sections we will see some ideas that can make the YOLO algorithm better.
Non-max Suppression
One of the problems we have addressed in YOLO is that it can detect an object multiple times.
Non-max Suppression is a way to make sure that YOLO detects the object just once.
For example:
Each car has two or more detections with different probabilities. This happens because several grid cells think
they contain the center point of the object.
Non-max suppression algorithm:
i. Lets assume that we are targeting one class as an output class.
ii. Y shape should be [Pc, bx, by, bh, bw] where Pc is the probability that the object is present.
iii. Discard all boxes with Pc < 0.6
iv. While there are any remaining boxes:
a. Pick the box with the largest Pc Output that as a prediction.
b. Discard any remaining box with IoU > 0.5 with the box output in the previous step, i.e. any box with high
overlap (greater than the overlap threshold of 0.5).
If there are multiple classes/object types c you want to detect, you should run the Non-max suppression c times,
once for every output class.
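A minimal single-class non-max suppression sketch in NumPy, following the thresholds above; boxes are assumed to be [x1, y1, x2, y2] corners and scores are the Pc values:
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    keep_mask = scores >= score_thresh                      # discard boxes with Pc < 0.6
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order, keep = np.argsort(-scores), []
    while order.size > 0:
        best = order[0]
        keep.append(best)                                   # pick the box with the largest Pc
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]  # drop high-overlap boxes
    return boxes[keep], scores[keep]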
Anchor Boxes
In YOLO, a grid cell only detects one object. What if a grid cell needs to detect multiple objects?
Previously, each object in a training image was assigned to the grid cell that contains that object's midpoint.
With two anchor boxes, each object in a training image is assigned to the grid cell that contains the object's midpoint
and to the anchor box with the highest IoU with the object's rectangle. You check which anchor box the object's
rectangle is closest to.
Example of data:
YOLO Algorithm
YOLO is a state-of-the-art object detection model that is fast and accurate
Let's sum up and introduce the whole YOLO algorithm with an example.
Suppose we need to do object detection for our autonomous driving system. It needs to identify three classes:
We decided to choose two anchor boxes, a taller one and a wide one.
As we said, in practice five or more anchor boxes are used, either hand-made or generated using k-means.
Our labeled Y shape will be [Ny, HeightOfGrid, WidthOfGrid, 16] , where Ny is number of instances and each row (of
size 16) is as follows:
[Pc, bx, by, bh, bw, c1, c2, c3, Pc, bx, by, bh, bw, c1, c2, c3]
Your dataset could be images with multiple labels and a rectangle for each label; we should go through the dataset and
build Y with the shape and values we agreed on.
An example:
We first initialize all values to zeros and ?, then for each label and rectangle choose its closest grid cell, fill in the
bounding-box values, and pick the best anchor box based on the IoU, so that the shape of Y for one image is
[HeightOfGrid, WidthOfGrid, 16].
Train the labeled images on a Conv net. You should receive an output of [HeightOfGrid, WidthOfGrid, 16] for our case.
To make predictions, run the Conv net on an image and run the Non-max suppression algorithm for each class you have;
in our case there are 3 classes.
Summary:
________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
========================================================================================
input_1 (InputLayer) (None, 608, 608, 3) 0
________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
________________________________________________________________________________________
batch_normalization_1 (BatchNorm (None, 608, 608, 32) 128 conv2d_1[0][0]
________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
________________________________________________________________________________________
batch_normalization_2 (BatchNorm (None, 304, 304, 64) 256 conv2d_2[0][0]
________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
_______________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 152, 152, 128) 73728 max_pooling2d_2[0][0]
________________________________________________________________________________________
batch_normalization_3 (BatchNorm (None, 152, 152, 128) 512 conv2d_3[0][0]
________________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128) 0 batch_normalization_3[0][0]
________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
________________________________________________________________________________________
batch_normalization_4 (BatchNorm (None, 152, 152, 64) 256 conv2d_4[0][0]
________________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 152, 152, 128) 73728 leaky_re_lu_4[0][0]
________________________________________________________________________________________
batch_normalization_5 (BatchNorm (None, 152, 152, 128) 512 conv2d_5[0][0]
________________________________________________________________________________________
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128) 0 batch_normalization_5[0][0]
________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
_______________________________________________________________________________________
batch_normalization_6 (BatchNorm (None, 76, 76, 256) 1024 conv2d_6[0][0]
________________________________________________________________________________________
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
_______________________________________________________________________________________
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
________________________________________________________________________________________
batch_normalization_7 (BatchNorm (None, 76, 76, 128) 512 conv2d_7[0][0]
_______________________________________________________________________________________
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
________________________________________________________________________________________
batch_normalization_8 (BatchNorm (None, 76, 76, 256) 1024 conv2d_8[0][0]
________________________________________________________________________________________
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
________________________________________________________________________________________
batch_normalization_9 (BatchNorm (None, 38, 38, 512) 2048 conv2d_9[0][0]
________________________________________________________________________________________
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
________________________________________________________________________________________
batch_normalization_10 (BatchNor (None, 38, 38, 256) 1024 conv2d_10[0][0]
________________________________________________________________________________________
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
________________________________________________________________________________________
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
________________________________________________________________________________________
batch_normalization_11 (BatchNor (None, 38, 38, 512) 2048 conv2d_11[0][0]
________________________________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
_______________________________________________________________________________________
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
________________________________________________________________________________________
batch_normalization_12 (BatchNor (None, 38, 38, 256) 1024 conv2d_12[0][0]
________________________________________________________________________________________
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
________________________________________________________________________________________
batch_normalization_13 (BatchNor (None, 38, 38, 512) 2048 conv2d_13[0][0]
________________________________________________________________________________________
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
_______________________________________________________________________________________
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
________________________________________________________________________________________
batch_normalization_14 (BatchNor (None, 19, 19, 1024) 4096 conv2d_14[0][0]
________________________________________________________________________________________
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
________________________________________________________________________________________
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
________________________________________________________________________________________
batch_normalization_15 (BatchNor (None, 19, 19, 512) 2048 conv2d_15[0][0]
________________________________________________________________________________________
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
________________________________________________________________________________________
batch_normalization_16 (BatchNor (None, 19, 19, 1024) 4096 conv2d_16[0][0]
________________________________________________________________________________________
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
________________________________________________________________________________________
batch_normalization_17 (BatchNor (None, 19, 19, 512) 2048 conv2d_17[0][0]
________________________________________________________________________________________
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
_______________________________________________________________________________________
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
________________________________________________________________________________________
batch_normalization_18 (BatchNor (None, 19, 19, 1024) 4096 conv2d_18[0][0]
________________________________________________________________________________________
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
________________________________________________________________________________________
batch_normalization_19 (BatchNor (None, 19, 19, 1024) 4096 conv2d_19[0][0]
________________________________________________________________________________________
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
________________________________________________________________________________________
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
________________________________________________________________________________________
batch_normalization_21 (BatchNor (None, 38, 38, 64) 256 conv2d_21[0][0]
________________________________________________________________________________________
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
________________________________________________________________________________________
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
________________________________________________________________________________________
batch_normalization_20 (BatchNor (None, 19, 19, 1024) 4096 conv2d_20[0][0]
________________________________________________________________________________________
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
________________________________________________________________________________________
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
leaky_re_lu_20[0][0]
________________________________________________________________________________________
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
________________________________________________________________________________________
batch_normalization_22 (BatchNor (None, 19, 19, 1024) 4096 conv2d_22[0][0]
________________________________________________________________________________________
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
________________________________________________________________________________________
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
===============================================================================================
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672
_______________________________________________________________________________________________
https://github.com/allanzelener/YAD2K
https://github.com/thtrieu/darkflow
https://pjreddie.com/darknet/yolo/
Our model has several advantages over classifier-based systems. It looks at the whole image at test time so its
predictions are informed by global context in the image. It also makes predictions with a single network
evaluation unlike systems like R-CNN which require thousands for a single image. This makes it extremely fast,
more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. See our paper for more details on the
full system.
But one of the downsides of YOLO is that it processes a lot of areas where no objects are present.
R-CNN tries to pick a few windows (region proposals) and run a Conv net classifier on top of them.
The algorithm R-CNN uses to pick windows is called a segmentation algorithm. It outputs something like this:
If, for example, the segmentation algorithm produces 2000 blobs, then we run our classifier/CNN on top of each of these
blobs.
There has been a lot of work on making R-CNN faster:
R-CNN:
Propose regions. Classify proposed regions one at a time. Output label + bounding box.
The downside is that it's slow.
[Girshick et al., 2013. Rich feature hierarchies for accurate object detection and semantic segmentation]
Fast R-CNN:
Propose regions. Use convolution implementation of sliding windows to classify all the proposed regions.
[Girshick, 2015. Fast R-CNN]
Faster R-CNN:
Use convolutional network to propose regions.
[Ren et. al, 2016. Faster R-CNN: Towards real-time object detection with region proposal networks]
Mask R-CNN:
https://arxiv.org/abs/1703.06870
Most implementations of Faster R-CNN are still slower than YOLO.
Andrew Ng thinks that the idea behind YOLO is better than R-CNN because you are able to do everything in one pass
instead of two.
Other algorithms that use one shot to get the output include SSD and MultiBox.
[Wei Liu, et. al 2015 SSD: Single Shot MultiBox Detector]
[Jifeng Dai, et. al 2016 R-FCN: Object Detection via Region-based Fully Convolutional Networks ]
Face Recognition
A face recognition system identifies a person's face. It can work on both images and videos.
Liveness detection within a video face recognition system prevents the system from being fooled by a static image of a
face. It can be learned with supervised deep learning, using a dataset of live and non-live humans and sequence learning.
Face verification vs. face recognition:
Verification:
Input: image, name/ID. (1 : 1)
Output: whether the input image is that of the claimed person.
"is this the claimed person?"
Recognition:
Has a database of K persons
Get an input image
Output ID if the image is any of the K persons (or not recognized)
"who is this person?"
We can use a face verification system to build a face recognition system. The accuracy of the verification system has to
be high (around 99.9% or more) to be used accurately within a recognition system, because the recognition system's
accuracy will be lower than the verification system's given K persons.
One of the face recognition challenges is to solve one shot learning problem.
One Shot Learning: A recognition system is able to recognize a person, learning from one image.
Historically, deep learning doesn't work well with a small amount of data.
Instead to make this work, we will learn a similarity function:
d( img1, img2 ) = degree of difference between images.
We want d result to be low in case of the same faces.
We use tau T as a threshold for d:
If d( img1, img2 ) <= T Then the faces are the same.
The similarity function helps us solve the one-shot learning problem. It is also robust to new inputs.
Siamese Network
We will implement the similarity function using a type of NN called a Siamese network, in which we pass multiple
inputs through two or more networks with the same architecture and parameters.
The Siamese network architecture is as follows:
We make 2 identical conv nets which encode an input image into a vector. In the above image the vector shape is
(128, )
The distance function is d(x1, x2) = || f(x1) - f(x2) ||^2
If X1 , X2 are the same person, we want d to be low. If they are different persons, we want d to be high.
[Taigman et. al., 2014. DeepFace closing the gap to human level performance]
Triplet Loss
Triplet Loss is one of the loss functions we can use to solve the similarity distance in a Siamese network.
Our learning objective in the triplet loss function is to make the distance between an anchor image and a positive
image small and the distance between the anchor and a negative image large.
Positive means same person, while negative means different person.
The triplet name came from that we are comparing an anchor A with a positive P and a negative N image.
Formally we want:
Positive distance to be less than negative distance
||f(A) - f(P)||^2 <= ||f(A) - f(N)||^2
Then
||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 <= 0
You need multiple images of the same person in your dataset. Then get some triplets out of your dataset. Dataset
should be big enough.
Choosing the triplets A, P, N:
During training, if A, P, N are chosen randomly (subject to A and P being the same person and A and N being
different persons), then the constraint below is easily satisfied and the network won't learn much, so we should
pick triplets that are hard to train on:
d(A, P) + alpha <= d(A, N)
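A sketch of the triplet loss with margin alpha (the max with 0 keeps the loss non-negative); f_a, f_p, f_n are assumed to be the Siamese-network encodings of A, P, N:
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    pos_dist = np.sum((f_a - f_p) ** 2)     # ||f(A) - f(P)||^2
    neg_dist = np.sum((f_a - f_n) ** 2)     # ||f(A) - f(N)||^2
    return max(pos_dist - neg_dist + alpha, 0.0)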
Triplet loss is one way to learn the parameters of a conv net for face recognition; there's another way to learn these
parameters, as a straight binary classification problem.
Learning the similarity function another way:
In order to implement this you need to look at the features extracted by the Conv net at the shallower and deeper
layers.
It uses a previously trained convolutional network like VGG, and builds on top of that. The idea of using a network
trained on a different task and applying it to a new task is called transfer learning.
Pick a unit in layer l. Find the nine image patches that maximize the unit's activation.
Notice that a hidden unit in layer one sees a relatively small portion of the input image, so if you plot it, it will
match a small patch in the shallower layers, while units in deeper layers see larger image regions.
Repeat for other units and layers.
It turns out that layer 1 is learning low-level representations like colors and edges.
You will find that each deeper layer learns more complex representations.
The first visualization was created directly from the weights of the first layer. The other images are generated using
the receptive field in the input image that maximally activates the neuron.
[Zeiler and Fergus., 2013, Visualizing and understanding convolutional networks]
A good explanation on how to get receptive field given a layer:
From A guide to receptive field arithmetic for Convolutional Neural Networks
Cost Function
We will define a cost function for the generated image that measures how good it is.
Given a content image C, a style image S, and a generated image G:
J(G) = alpha * J(C,G) + beta * J(S,G)
J(C, G) measures how similar is the generated image to the Content image.
J(S, G) measures how similar is the generated image to the Style image.
alpha and beta are relative weighting to the similarity and these are hyperparameters.
Find the generated image G:
i. Initialize G randomly
For example G: 100 X 100 X 3
ii. Use gradient descent to minimize J(G)
G = G - dG . We compute the gradient of the cost with respect to the image and use gradient descent to minimize the cost function.
In the previous section we showed that we need a cost function for the content image and the style image to measure
how similar the generated image is to each of them.
Say you use hidden layer l to compute content cost.
If we choose l to be small (like layer 1), we will force the network to get similar output to the original content
image.
In practice l is not too shallow and not too deep but in the middle.
Use pre-trained ConvNet. (E.g., VGG network)
Let a(c)[l] and a(G)[l] be the activation of layer l on the images.
If a(c)[l] and a(G)[l] are similar then they will have the same content
J(C, G) at a layer l = 1/2 || a(c)[l] - a(G)[l] ||^2
The style (Gram) matrix entry G[l](k, k') is the sum over all positions of the products of the activations in channels
k and k', so it measures how correlated the channels are.
To compute the Gram matrix efficiently:
Reshape the activation from H X W X C to HW X C
Name the reshaped activation F.
G[l] = F.T * F # shape (C, C)
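A NumPy sketch of the Gram-matrix computation just described (the activation shape is assumed):
import numpy as np

def gram_matrix(a):
    """a: activation of shape (H, W, C) -> Gram matrix of shape (C, C)."""
    H, W, C = a.shape
    F = a.reshape(H * W, C)     # reshape H x W x C to HW x C
    return F.T @ F              # entry (k, k') correlates channels k and k'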
Steps to be made if you want to create a tensorflow model for neural style transfer:
i. Create an Interactive Session.
ii. Load the content image.
iii. Load the style image
iv. Randomly initialize the image to be generated
v. Load the VGG16 model
vi. Build the TensorFlow graph:
Run the content image through the VGG16 model and compute the content cost
Run the style image through the VGG16 model and compute the style cost
Compute the total cost
Define the optimizer and the learning rate
vii. Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every
step.
1D and 3D Generalizations
So far we have used the Conv nets for images which are 2D.
Conv nets can work with 1D and 3D data as well.
An example of 1D convolution:
Input shape (14, 1)
Applying 16 filters with F = 5 , S = 1
Output shape will be 10 X 16
Applying 32 filters with F = 5, S = 1
Output shape will be 6 X 32
The general equation (N - F)/S + 1 can be applied here as well, but the output length is one-dimensional rather than a
2D size.
1D data comes from a lot of sources such as waves, sounds, and heartbeat signals.
In most of the applications that use 1D data we use a Recurrent Neural Network (RNN).
3D data is also available in some applications like CT scans:
Example of 3D convolution:
Input shape (14, 14,14, 1)
Applying 16 filters with F = 5 , S = 1
Output shape (10, 10, 10, 16)
Applying 32 filters with F = 5, S = 1
Output shape will be (6, 6, 6, 32)
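A quick Keras check of the 1D and 3D shape arithmetic above, using (N - F)/S + 1 with "valid" padding (the filter counts follow the examples in the notes):
from tensorflow.keras import layers, models

model_1d = models.Sequential([
    layers.Conv1D(16, kernel_size=5, strides=1, input_shape=(14, 1)),          # -> (10, 16)
    layers.Conv1D(32, kernel_size=5, strides=1),                               # -> (6, 32)
])

model_3d = models.Sequential([
    layers.Conv3D(16, kernel_size=5, strides=1, input_shape=(14, 14, 14, 1)),  # -> (10, 10, 10, 16)
    layers.Conv3D(32, kernel_size=5, strides=1),                               # -> (6, 6, 6, 32)
])

model_1d.summary()
model_3d.summary()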
Extras
Keras
Keras is a high-level neural networks API (programming framework), written in Python and capable of running on top of
several lower-level frameworks including TensorFlow, Theano, and CNTK.
Keras was developed to enable deep learning engineers to build and experiment with different models very quickly.
Just as TensorFlow is a higher-level framework than Python, Keras is an even higher-level framework and provides
additional abstractions.
Keras will work fine for many common models.
Layers in Keras:
Dense (Fully connected layers).
A linear function followed by a non linear function.
Convolutional layer.
Pooling layer.
Normalisation layer.
A batch normalization layer.
Flatten layer
Flatten a matrix into vector.
Activation layer
Different activations include: relu, tanh, sigmoid, and softmax.
To train and test a model in Keras there are four steps:
i. Create the model.
ii. Compile the model by calling model.compile(optimizer = "...", loss = "...", metrics = ["accuracy"])
iii. Train the model on train data by calling model.fit(x = ..., y = ..., epochs = ..., batch_size = ...)
You can add a validation set while training too.
iv. Test the model on test data by calling model.evaluate(x = ..., y = ...)
Summarize of step in Keras: Create->Compile->Fit/Train->Evaluate/Test
model.summary() gives a lot of useful information about your model, including each layer's inputs, outputs, and
number of parameters.
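A minimal sketch of the Create -> Compile -> Fit -> Evaluate workflow (the input shape, layer sizes, and the data arrays X_train, Y_train, X_test, Y_test are placeholders):
from tensorflow.keras import layers, models

model = models.Sequential([                        # i. create the model
    layers.Flatten(input_shape=(64, 64, 3)),       # hypothetical 64 x 64 RGB input
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",                    # ii. compile
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, Y_train, epochs=10, batch_size=32)   # iii. train
# model.evaluate(X_test, Y_test)                          # iv. evaluate
model.summary()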
To choose the Keras backend you should go to $HOME/.keras/keras.json and change the file to the desired backend
like Theano or Tensorflow or whatever backend you want.
After you create the model you can run it in a tensorflow session without compiling, training, and testing capabilities.
You can save your model with model.save() and load it again with load_model() . This will save your whole trained
model to disk, including the trained weights.
Course summary
Here is the course summary as it's given on the course link:
This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep
learning, sequence algorithms are working far better than just two years ago, and this is enabling numerous exciting
applications in speech recognition, music synthesis, chatbots, machine translation, natural language understanding, and
many others.
You will:
Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used variants such as GRUs
and LSTMs.
Be able to apply sequence models to natural language problems, including text synthesis.
Be able to apply sequence models to audio applications, including speech recognition and music synthesis.
This is the fifth and final course of the Deep Learning Specialization.
Notation
In this section we will discuss the notations that we will use through the course.
Motivating example:
We will index the first element of x by x<1>, the second x<2> and so on.
x<1> = Harry
x<2> = Potter
Similarly, we will index the first element of y by y<1>, the second y<2> and so on.
y<1> = 1
y<2> = 1
Tx is the size of the input sequence and Ty is the size of the output sequence.
x(i)<t> is the element t of the input sequence of training example i. Similarly, y(i)<t> means the t-th element in the
output sequence of the i-th training example.
Tx(i) is the input sequence length for training example i. It can be different across examples. Similarly, Ty(i) is the
length of the output sequence in the i-th training example.
Representing words:
We will now work in this course with NLP which stands for natural language processing. One of the challenges of
NLP is how can we represent a word?
i. We need a vocabulary list that contains all the words in our target sets.
Example:
[a ... And ... Harry ... Potter ... Zulu]
Each word will have a unique index that it can be represented with.
The sorting here is in alphabetical order.
Vocabulary sizes in modern applications are from 30,000 to 50,000. 100,000 is not uncommon. Some of the
bigger companies use even a million.
To build the vocabulary list, you can read all the texts you have and take the m most frequent words, or
search online for the m most frequent words.
ii. Create a one-hot encoding sequence for each word in your dataset given the vocabulary you have created.
While converting, what if we meet a word that's not in your vocabulary?
We can add a token to the vocabulary named <UNK> , which stands for unknown text, and use its index in
your one-hot vector.
Full example:
The goal is, given this representation of x, to learn a mapping to the target output y using a sequence model, as a
supervised learning problem.
In this problem Tx = Ty. In other problems where they aren't equal, the RNN architecture may be different.
a<0> is usually initialized with zeros, but some others may initialize it randomly in some cases.
There are three weight matrices here: Wax, Waa, and Wya with shapes:
Wax: (NoOfHiddenNeurons, nx)
Waa: (NoOfHiddenNeurons, NoOfHiddenNeurons)
Wya: (ny, NoOfHiddenNeurons)
The weight matrix Waa is the memory the RNN is trying to maintain from the previous layers.
A lot of papers and books write the same architecture this way:
It's harder to interpret. It's easier to unroll this drawing into the unrolled version.
In the discussed RNN architecture, the current output ŷ<t> depends on the previous inputs and activations.
Let's have this example 'He Said, "Teddy Roosevelt was a great president"'. In this example Teddy is a person name but
we know that from the word president that came after Teddy not from He and said that were before it.
So a limitation of the discussed architecture is that it cannot use information from elements later in the sequence. To
address this problem we will later discuss Bidirectional RNNs (BRNN).
Now let's discuss the forward propagation equations on the discussed architecture:
The activation function for a is usually tanh or ReLU, and the one for y depends on your task (e.g. sigmoid or
softmax). In the name entity recognition task we will use a sigmoid because we only have two classes.
In order to help us develop complex RNN architectures, the last equations need to be simplified a bit.
Simplified RNN notation:
Where wa, ba, wy, and by are shared across each element in a sequence.
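A NumPy sketch of one forward step with the simplified notation: the input x<t> is stacked with a<t-1> and multiplied by the shared matrix Wa (shapes follow the notes; this is an illustration, not the course's exact code):
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(a_prev, x_t, Wa, ba, Wy, by):
    concat = np.concatenate([a_prev, x_t])      # [a<t-1>, x<t>]
    a_t = np.tanh(Wa @ concat + ba)             # a<t> = tanh( Wa [a<t-1>, x<t>] + ba )
    y_t = softmax(Wy @ a_t + by)                # y^<t> = softmax( Wy a<t> + by )
    return a_t, y_t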
We will use the cross-entropy loss function:
Where the first equation is the loss for one example and the loss for the whole sequence is given by the summation
over all the calculated single example losses.
Graph with losses:
The backpropagation here is called backpropagation through time because we pass activation a from one sequence
element to another like backwards in time.
Note that starting from the second time step we feed the generated output back into the network.
There is another interesting Many-to-Many architecture. In applications like machine translation the input and output
sequences have different lengths in most cases, so an alternative Many-to-Many architecture that fits translation is as
follows:
There are encoder and decoder parts in this architecture. The encoder encodes the input sequence into a single
representation and feeds it to the decoder to generate the outputs. The encoder and decoder have different weight matrices.
Summary of RNN types:
There is another architecture which is the attention architecture which we will talk about in chapter 3.
ii. We first pass a<0> = zeros vector, and x<1> = zeros vector.
iii. Then we choose a prediction randomly from distribution obtained by ŷ<1>. For example it could be "The".
In numpy this can be implemented using: numpy.random.choice(...)
This is the step where you get a random beginning of the sentence each time you run a sampling pass.
iv. We pass the last predicted word with the calculated a<1>
v. We keep doing 3 & 4 steps for a fixed length or until we get the <EOS> token.
vi. You can reject any <UNK> token if you mind finding it in your output.
So far we have built a word-level language model. It's also possible to implement a character-level language model.
In the character-level language model, the vocabulary will contain [a-zA-Z0-9] , punctuation, special characters and
possibly special tokens.
Character-level language model has some pros and cons compared to the word-level language model
Pros:
a. There will be no <UNK> token - it can create any word.
Cons:
a. The main disadvantage is that you end up with much longer sequences.
b. Character-level language models are not as good as word-level language models at capturing long-range
dependencies between the earlier and later parts of the sentence.
c. Also more computationally expensive and harder to train.
The trend Andrew has seen in NLP is that for the most part, a word-level language model is still used, but as computers
get faster there are more and more applications where people are, at least in some special cases, starting to look at
more character-level models. Also, they are used in specialized applications where you might need to deal with
unknown words or other vocabulary words a lot. Or they are also used in more specialized applications where you have
a more specialized vocabulary.
An RNN that processes a sequence with 10,000 time steps is effectively a 10,000-layer deep network, which is very hard
to optimize.
Let's take an example. Suppose we are working on a language modeling problem and there are two sequences the
model tries to learn:
What we need to learn here is that "was" goes with "cat" and that "were" goes with "cats". The naive RNN is not very
good at capturing very long-term dependencies like this.
As we have discussed in Deep neural networks, deeper networks are getting into the vanishing gradient problem. That
also happens with RNNs with a long sequence size.
- For computing the gradient at the word "was", we need to backpropagate through everything that comes before it.
Multiplying many fractions tends to make the gradient vanish, while multiplying many large numbers tends to make it
explode.
In the problem we described, this means it's hard for the network to carry the information about "cat" all the way to
where "was" is predicted. So in this case, the network won't identify singular/plural subjects and so won't pick the
right verb form, was/were.
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick
parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn
them. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Vanishing gradients problem tends to be the bigger problem with RNNs than the exploding gradients problem. We will
discuss how to solve it in next sections.
Exploding gradients can be easily detected because your weight values become NaN . One of the ways to solve the
exploding gradient problem is to apply gradient clipping: if your gradient exceeds some threshold, re-scale or clip the
gradient vector so that it is not too big, i.e. clip it according to some maximum value.
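A minimal sketch of clipping gradients by a maximum value, as described above (gradients is assumed to be a dict of NumPy arrays):
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    # Clip every gradient array element-wise to [-max_value, max_value]
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}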
Extra:
Each GRU unit has a new variable C , which is the memory cell. It can tell us whether to memorize something or not.
Let's take the cat sentence example and use it to understand these equations:
We will suppose that U is 0 or 1 and is a bit that tells us whether the singular word needs to be memorized.
Splitting the sentence into words and listing the values of U and C at each position:
Word      U    C
The       0    val
cat       1    new_val
which     0    new_val
already   0    new_val
...       0    new_val
full      ..   ..
Because the update gate U can be a very small number like 0.00001 (so C<t> stays close to C<t-1>), GRUs don't suffer
from the vanishing gradient problem.
Shapes:
What has been descried so far is the Simplified GRU unit. Let's now describe the full one:
The full GRU contains a new gate that is used to calculate the candidate C~ . The gate tells you how relevant C<t-1>
is to the candidate C~<t> .
Equations:
So why do we use these architectures? Why don't we change them, how do we know they will work, why not add another gate,
why not use the simpler GRU instead of the full GRU? Well, researchers have experimented over the years with all the
various types of these architectures, with many, many different versions, while also addressing the vanishing gradient
problem. They have found that full GRUs are among the best RNN architectures to use for many different problems. You can
make your own design, but keep in mind that GRUs and LSTMs are the standards.
In a GRU we have an update gate U , a relevance gate r , and a candidate cell variable C~<t> , while in an LSTM we have
an update gate U (sometimes called the input gate I ), a forget gate F , an output gate O , and a candidate cell
variable C~<t> .
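Putting the full GRU step together, here is a NumPy sketch (gate names follow the notes; weight shapes and the concatenation convention are assumptions for illustration):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wu, bu, Wr, br, Wc, bc):
    concat = np.concatenate([c_prev, x_t])
    u = sigmoid(Wu @ concat + bu)                                   # update gate U
    r = sigmoid(Wr @ concat + br)                                   # relevance gate r
    c_tilde = np.tanh(Wc @ np.concatenate([r * c_prev, x_t]) + bc)  # candidate C~<t>
    c_t = u * c_tilde + (1 - u) * c_prev                            # keep or overwrite the memory
    return c_t                                                      # a<t> = c<t> in a GRU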
Bidirectional RNN
There are still some ideas to let you build much more powerful sequence models. One of them is bidirectional RNNs and
another is Deep RNNs.
As we saw before, here is an example of the Name entity recognition task:
The name Teddy cannot be learned from He and said, but can be learned from bears.
BiRNNs fix this issue.
Here is BRNNs architecture:
Deep RNNs
In a lot of cases a standard one-layer RNN will solve your problem. But in some problems it's useful to stack some
RNN layers to make a deeper network.
For example, a deep RNN with 3 layers would look like this:
In feed-forward deep nets, there could be 100 or even 200 layers. In deep RNNs stacking 3 layers is already considered
deep and expensive to train.
In some cases you might see some feed-forward network layers connected after recurrent cell.
The quote is taken from this notebook. If you want the details of the back propagation with programming notes look at
the linked notebook.
Word Representation
NLP has been revolutionized by deep learning and especially by RNNs and deep RNNs.
Word embeddings is a way of representing words. It lets your algorithm automatically understand the analogies
between words like "king" and "queen".
So far we have defined our language by a vocabulary. Then represented our words with a one-hot vector that represents
the word in the vocabulary.
An image example would be:
We will use the notation O idx for any word that is represented with a one-hot vector, as in the image.
One of the weaknesses of this representation is that it treats each word as a thing in itself and doesn't allow an
algorithm to generalize across words.
For example: "I want a glass of orange ______", a model should predict the next word as juice.
A similar example: "I want a glass of apple ______". A model won't easily predict juice here if it wasn't trained on
that sentence, because with one-hot vectors the two examples aren't related even though orange and apple are similar.
The inner product between any two different one-hot vectors is zero, and the distances between them are all the same.
So, instead of a one-hot representation, wouldn't it be nice if we could learn a featurized representation for each of
these words: man, woman, king, queen, apple, and orange?
- Each word will have, for example, 300 floating point features.
Each word column will be a 300-dimensional vector which will be the representation.
We will use the notation e5391 to denote the feature vector of the word man (whose index is 5391).
Now, if we return to the examples we described again:
"I want a glass of orange ______"
I want a glass of apple ______
Orange and apple now share a lot of similar features which makes it easier for an algorithm to generalize between
them.
We call this representation Word embeddings.
To visualize word embeddings we use the t-SNE algorithm to reduce the features to 2 dimensions, which makes them easy to
visualize:
You will get a sense that more related words are closer to each other.
The name word embeddings comes from the idea that we embed each word as a point (vector) in an n-dimensional space.
Let's see how we can take the feature representation we have extracted from each word and apply it in the Named
entity recognition problem.
Given this example (from named entity recognition):
This is similar to the face recognition problem, where we encode each face into a vector and then check how similar
these vectors are.
The words encoding and embedding have a similar meaning here.
In the word embeddings task, we are learning a representation for each word in our vocabulary (unlike in image
encoding where we have to map each new image to some n-dimensional vector). We will discuss the algorithm in next
sections.
One of the most fascinating properties of word embeddings is that they can also help with analogy reasoning. While
analogy reasoning may not by itself be the most important NLP application, it might help convey a sense of what
these word embeddings can do.
Analogies example:
Given this word embeddings table:
In the cosine similarity formula, the numerator is the inner product of the u and v vectors. It will be large if the vectors are very similar.
You can also use Euclidean distance as a similarity function (but it rather measures a dissimilarity, so you should take it
with negative sign).
We can use this equation to calculate the similarities between word embeddings and on the analogy problem where u
= ew and v = eking - eman + ewoman
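A sketch of cosine similarity and the "man : king :: woman : ?" analogy lookup (embeddings is assumed to be a dict mapping word -> NumPy vector):
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def find_analogy(a, b, c, embeddings):
    """Answers a : b :: c : ? by maximizing cos(e_w, e_b - e_a + e_c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))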
Embedding matrix
When you implement an algorithm to learn a word embedding, what you end up learning is an embedding matrix.
Let's take an example:
Suppose we are using 10,000 words as our vocabulary (plus an <UNK> token).
The algorithm should create a matrix E of the shape (300, 10000) in case we are extracting 300 features.
If O6257 is the one hot encoding of the word orange of shape (10000, 1), then
np.dot( E ,O6257) = e6257 which shape is (300, 1).
Generally np.dot( E , Oj) = ej
In the next sections, you will see that we first initialize E randomly and then try to learn all the parameters of this
matrix.
In practice it's not efficient to use a dot multiplication when you are trying to extract the embeddings of a specific word,
instead, we will use slicing to slice a specific column. In Keras there is an embedding layer that extracts this column with
no multiplication.
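A small NumPy check of the one-hot multiplication versus the slicing shortcut (E and the index 6257 follow the orange example above; the random values are placeholders):
import numpy as np

E = np.random.randn(300, 10000)          # embedding matrix: features x vocabulary
o = np.zeros((10000, 1)); o[6257] = 1    # one-hot vector for "orange"

e_dot = E @ o                            # np.dot(E, O6257): shape (300, 1), but wasteful
e_slice = E[:, [6257]]                   # slicing the column gives the same (300, 1) vector
assert np.allclose(e_dot, e_slice)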
Let's start learning some algorithms that can learn word embeddings.
At the start, word embeddings algorithms were complex but then they got simpler and simpler.
We will start by learning the complex examples to make more intuition.
Neural language model:
Let's start with an example:
We want to build a language model so that we can predict the next word.
So we use this neural network to learn the language model
Word2Vec
For example, we have the sentence: "I want a glass of orange juice to go along with my cereal"
orange juice +1
orange glass -2
orange my +6
This is not an easy learning problem because predicting a word within a window of -10/+10 words (10 is just an example)
is hard.
Word2Vec model:
Vocabulary size = 10,000 words
Let's say that the context word is c and the target word is t
We want to learn a mapping from c to t
We get ec by E . oc
We then use a softmax layer to get P(t|c) which is ŷ
Also we will use the cross-entropy loss function.
This model is called skip-grams model.
The last model has a problem with the softmax layer:
Here we are summing 10,000 numbers which corresponds to the number of words in our vocabulary.
If this number is larger say 1 million, the computation will become very slow.
One of the solutions for the last problem is to use "Hierarchical softmax classifier" which works as a tree classifier.
In practice, the hierarchical softmax classifier doesn't use a balanced tree like the drawn one. Common words are at the
top and less common are at the bottom.
How to sample the context c?
One way is to choose the context by random from your corpus.
If you have done it that way, there will be frequent words like "the, of, a, and, to, .." that can dominate other words
like "orange, apple, durian,..."
In practice, we don't take the context uniformly random, instead there are some heuristics to balance the common
words and the non-common words.
The word2vec paper includes two ideas for learning word embeddings: one is the skip-gram model and the other is CBoW
(continuous bag of words).
Negative Sampling
Negative sampling allows you to do something similar to the skip-gram model, but with a much more efficient learning
algorithm. We will create a different learning problem.
orange juice 1
orange king 0
orange book 0
orange the 0
orange of 0
We get a positive example by using the same skip-gram technique, with a fixed window around the context word.
Notice, that we got word "of" as a negative example although it appeared in the same sentence.
We will have a ratio of k negative examples to 1 positive example in the data we are collecting.
Now let's define the model that will learn this supervised learning problem:
Let's say that the context word is c , the candidate word is t , and y is the label.
We will apply the simple logistic regression model.
So it is as if we have 10,000 binary classification problems, and we only train k+1 of these classifiers in each
iteration.
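A sketch of the negative-sampling objective: for each (context, target) pair the model is a simple logistic regression sigmoid(theta_t · e_c), trained on 1 positive and k negative pairs (names and shapes here are illustrative assumptions):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(theta_targets, e_c, labels):
    """theta_targets: (k+1, d) rows for the sampled target words; labels: 1 positive followed by k zeros."""
    p = sigmoid(theta_targets @ e_c)          # P(y = 1 | c, t) for each candidate pair
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))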
GloVe is another algorithm for learning the word embedding. It's the simplest of them.
This is not used as much as word2vec or skip-gram models, but it has some enthusiasts because of its simplicity.
Let's use our previous example: "I want a glass of orange juice to go along with my cereal".
We will choose a context and a target from the choices we have mentioned in the previous sections.
Then we will calculate this for every pair: Xct = # times t appears in the context of c
Xct = Xtc if we choose a symmetric window, but they will not be equal if the context is, for example, only the previous
words. In GloVe a symmetric window is used, which means they are equal.
f(x) - the weighting term, used for many reasons which include:
The log(0) problem, which might occur if there are no pairs for the given target and context values.
Giving not too much weight for stop words like "is", "the", and "this" which occur many times.
Giving not too little weight for infrequent words.
Theta and e are symmetric which helps getting the final word embedding.
If this is your first try, you should try to download a pre-trained model that has been made and actually works best.
If you have enough data, you can try to implement one of the available algorithms.
Because word embeddings are very computationally expensive to train, most ML practitioners will load a pre-
trained set of embeddings.
A final note: you can't guarantee that the axes used to represent the features will be well aligned with easily
interpretable axes like gender, royalty, or age.
Sentiment Classification
As we have discussed before, sentiment classification is the process of determining whether a text expresses a positive
or a negative opinion. It's very useful in NLP and is used in many applications. An example would be:
One of the challenges with it is that you might not have a huge labeled training set for it, but using word embeddings
can help make up for that.
Common dataset sizes vary from 10,000 to 100,000 words.
A simple sentiment classification model would be like this:
The embedding matrix may have been trained on say 100 billion words.
Number of features in word embedding is 300.
We can sum or average the embeddings of all the words and then pass the result to a softmax classifier. That makes this
classifier work for short or long sentences.
One of the problems with this simple model is that it ignores word order. For example "Completely lacking in good
taste, good service, and good ambience" has the word good 3 times but it's a negative review.
A better model uses an RNN for solving this problem:
And so if you train this algorithm, you end up with a pretty decent sentiment classification algorithm.
Also, it will generalize better even if words weren't in your dataset. For example you have the sentence "Completely
absent of good taste, good service, and good ambience", then even if the word "absent" is not in your label training
set, if it was in your 1 billion or 100 billion word corpus used to train the word embeddings, it might still get this
right and generalize much better even to words that were in the training set used to train the word embeddings but
not necessarily in the label training set that you had for specifically the sentiment classification problem.
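A minimal Keras sketch of the RNN-based sentiment classifier described above: frozen pretrained embeddings feed an LSTM and a sigmoid output (the vocabulary size, embedding dimension, and layer sizes are placeholders):
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=300, trainable=False),  # pretrained word embeddings, frozen
    layers.LSTM(128),                                                    # takes word order into account
    layers.Dense(1, activation="sigmoid"),                               # positive / negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])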
We want to make sure that our word embeddings are free from undesirable forms of bias, such as gender bias, ethnicity
bias and so on.
Horrifying results on the trained word embeddings in the context of Analogies:
Man : Computer_programmer as Woman : Homemaker
Father : Doctor as Mother : Nurse
Word embeddings can reflect gender, ethnicity, age, sexual orientation, and other biases of text used to train the model.
Learning algorithms in general are making important decisions, and they mustn't be biased.
Andrew thinks we actually have better ideas for quickly reducing the bias in AI than for quickly reducing the bias in the
human race, although it still needs a lot of work to be done.
Addressing bias in word embeddings steps:
Idea from the paper: https://arxiv.org/abs/1607.06520
Given these learned embeddings:
We need to solve the gender bias here. The steps we will discuss can help solve any bias problem but we are
focusing here on gender bias.
Here are the steps:
a. Identify the direction:
Calculate the difference between:
ehe - eshe
emale - efemale
....
Choose some k differences and average them.
This will help you find this:
By that we have found the bias direction, which is a 1D vector, and the non-bias direction, which is 299-dimensional.
b. Neutralize: For every word that is not definitional, project it to get rid of bias.
"Babysitter" and "doctor" need to be gender neutral, so we project them onto the non-bias axis, removing their component along the bias direction:
After that they will be equal in terms of gender.
To decide which words should be neutralized, the authors of the paper trained a classifier that tells whether a word is definitional (and should keep its gender component) or not.
c. Equalize pairs
We want each pair to differ only in gender. Like:
Grandfather - Grandmother
He - She
Boy - Girl
We want to do this because the distance between "grandfather" and "babysitter" is bigger than the distance between "grandmother" and "babysitter":
To do that, we move "grandfather" and "grandmother" to points that are equidistant from the non-bias axis (symmetric about it along the bias direction).
There are only a relatively small number of word pairs you need to do this for (a small numpy sketch of these operations follows below).
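A minimal numpy sketch of the identify / neutralize / equalize steps, assuming a dict `E` that maps words to their 300-dimensional embedding vectors (the equalization formula in the paper is a bit more involved; this is a simplified version):

```python
import numpy as np

def bias_direction(E, pairs=(("he", "she"), ("male", "female"))):
    """Average the differences of a few gendered pairs to estimate the bias direction."""
    g = np.mean([E[a] - E[b] for a, b in pairs], axis=0)
    return g / np.linalg.norm(g)                  # unit-length 1D bias direction

def neutralize(e, g):
    """Remove the component of e that lies along the (unit) bias direction g."""
    return e - np.dot(e, g) * g

def equalize(e1, e2, g):
    """Simplified equalization: keep the shared non-bias component and give the two
    words equal-and-opposite components along the bias direction g."""
    mu_orth = neutralize((e1 + e2) / 2, g)        # shared component, free of bias
    r = (np.dot(e1, g) - np.dot(e2, g)) / 2       # half their separation along g
    return mu_orth + r * g, mu_orth - r * g

# Usage sketch:
# g = bias_direction(E)
# E["babysitter"] = neutralize(E["babysitter"], g)
# E["grandfather"], E["grandmother"] = equalize(E["grandfather"], E["grandmother"], g)
```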
Basic Models
In this section we will learn about sequence to sequence - Many to Many - models which are useful in various
applications including machine translation and speech recognition.
Let's start with the basic model:
Given this machine translation problem, in which X is a French sentence and Y is its English translation: the model has two parts, an encoder RNN that reads the French sentence and compresses it into a vector, and a decoder RNN that takes that vector and generates the English translation one word at a time.
A similar encoder-decoder architecture also works for image captioning: the encoder is a pretrained CNN (like AlexNet) that encodes the image into a vector, and the decoder is an RNN that generates the caption.
Ideas are from the following papers (they share similar ideas):
Mao et al., 2014. Deep captioning with multimodal recurrent neural networks
Vinyals et al., 2014. Show and tell: A neural image caption generator
Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions
There are some similarities between the language model we have learned previously, and the machine translation model
we have just discussed, but there are some differences as well.
The language model we have learned is very similar to the decoder part of the machine translation model, except that a<0> is no longer a vector of zeros: it is the vector produced by the encoder. For that reason machine translation is sometimes called a conditional language model: instead of modeling P(y<1>, ..., y<Ty>), it models P(y<1>, ..., y<Ty> | x).
What we want is the single most likely English translation, i.e. the sentence that maximizes P(y<1>, ..., y<Ty> | x). The most common algorithm for finding it is beam search, which we will explain in the next section.
Why not use greedy search? Why not get the best choices each time?
It turns out that this approach doesn't really work!
Let's explain it with an example:
The best output for the example we talked about is "Jane is visiting Africa in September."
Suppose that when you are choosing with the greedy approach, the first two words were "Jane is". The word most likely to come next might be "going", because "going" is a very common word after "is", so the result may look like: "Jane is going to be visiting Africa in September." And that isn't the best/optimal translation.
So instead of the greedy approach, we look for an approximate solution that tries to maximize the whole-sentence probability P(y<1>, ..., y<Ty> | x) (the objective above).
Beam Search
Beam search is the most widely used algorithm to get the best output sequence. It's a heuristic search algorithm.
To illustrate the algorithm we will stick with the example from the previous section. We need Y = "Jane is visiting Africa
in September."
The algorithm has a parameter B, the beam width. Let's take B = 3, which means the algorithm keeps the 3 best partial outputs at each step.
For the first step you will get ["in", "jane", "september"] as the best candidate first words.
Then, for each word in the first output, get the B most likely next (second) words and keep the top B combinations overall, where the best combinations are those that give the highest value of the product of both probabilities: P(y<1>|x) * P(y<2>|x, y<1>). So we will then have ["in september", "jane is", "jane visit"]. Notice that "september" was automatically discarded as a first word.
Repeat the same process and get the best B words for ["september", "is", "visit"] and so on.
In this algorithm, keep only B instances of your network.
If B = 1 this will become the greedy search.
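A minimal sketch of beam search in Python, assuming a hypothetical model function `next_word_log_probs(x, prefix)` that returns a dict mapping each candidate next word (including an end-of-sentence token "<EOS>") to log P(word | x, prefix):

```python
import heapq

def beam_search(x, next_word_log_probs, B=3, max_len=30):
    """Keep the B most likely partial translations at every step."""
    beams = [(0.0, [])]                                   # (sum of log-probs, word list)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for word, logp in next_word_log_probs(x, prefix).items():
                candidates.append((score + logp, prefix + [word]))
        beams = heapq.nlargest(B, candidates, key=lambda c: c[0])   # top B combinations
        completed += [b for b in beams if b[1][-1] == "<EOS>"]      # finished sentences
        beams = [b for b in beams if b[1][-1] != "<EOS>"]
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[0])[1]

# With B = 1 this reduces to greedy search.
```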
Refinements to Beam Search
In the previous section, we have discussed the basic beam search. In this section, we will try to do some refinements to
it.
The first refinement is length normalization.
In beam search we are trying to maximize the product of the conditional probabilities:
arg max over y of P(y<1>|x) * P(y<2>|x, y<1>) * ... * P(y<Ty>|x, y<1>, ..., y<Ty-1>)
Each of these probabilities is a small fraction, and multiplying many small fractions can cause numerical underflow, so in practice we maximize the sum of their logs instead:
arg max over y of sum over t of log P(y<t>|x, y<1>, ..., y<t-1>)
But there's another problem. Both of these objectives prefer short sequences over long ones, because multiplying more fractions (or adding more negative log terms) gives a smaller value; fewer terms means a bigger score.
So there's another step: dividing the score by the number of words in the sequence, Ty. In practice a softer normalization is used, dividing by Ty^α with α around 0.7 (a heuristic hyperparameter).
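A small sketch of the length-normalized score used to compare finished candidates (α = 1 is full normalization, α = 0 is none, and a value around 0.7 is a common heuristic):

```python
def normalized_score(log_probs, alpha=0.7):
    """log_probs: list of log P(y<t> | x, y<1..t-1>) for one candidate translation.
    Dividing by Ty**alpha removes the preference for short sequences."""
    Ty = len(log_probs)
    return sum(log_probs) / (Ty ** alpha)
```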
Unlike exact search algorithms like BFS (Breadth First Search) or DFS (Depth First Search), Beam Search runs faster
but is not guaranteed to find the exact solution.
We have talked before on Error analysis in "Structuring Machine Learning Projects" course. We will apply these concepts
to improve our beam search algorithm.
We will use error analysis to figure out whether the problem is the B hyperparameter of the beam search (i.e., the search doesn't find the most likely sentence) or the RNN part of the model.
Let's take an example:
Initial info:
x = "Jane visite l’Afrique en septembre."
y* = "Jane visits Africa in September." - right answer
ŷ = "Jane visited Africa last September." - answer produced by model
Our model has produced a poor result.
We now want to know who to blame - the RNN or the beam search.
To do that, we calculate P(y* | X) and P(ŷ | X). There are two cases:
Case 1 (P(y* | X) > P(ŷ | X)):
Beam search chose ŷ even though y* attains a higher probability under the model, so the search missed the better sentence.
Conclusion: Beam search is at fault.
Case 2 (P(y* | X) <= P(ŷ | X)):
y* is a better translation than ŷ, but the RNN assigned it a lower probability.
Conclusion: The RNN model is at fault.
The error analysis process is as follows:
You choose N error examples and, for each one, compute P(y* | x) and P(ŷ | x) and record which component is at fault in a table. Then count the fraction of errors attributable to beam search versus the RNN, and work on the component responsible for most of them.
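A tiny sketch of this bookkeeping, assuming `errors` is a list of (P(y*|x), P(ŷ|x)) pairs for the N error examples:

```python
def attribute_errors(errors):
    """Decide, for each error example, whether beam search or the RNN is at fault,
    and report the fraction of errors attributed to each component."""
    faults = ["beam search" if p_star > p_hat else "RNN" for p_star, p_hat in errors]
    return {name: faults.count(name) / len(faults) for name in ("beam search", "RNN")}
```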
BLEU Score
One of the challenges of machine translation is that, for a given sentence in one language, there can be several equally good translations in another language. So how do we evaluate our results?
The way we do this is by using BLEU score. BLEU stands for bilingual evaluation understudy.
The intuition is: as long as the machine-generated translation is pretty close to any of the references provided by
humans, then it will get a high BLEU score.
Let's take an example:
X = "Le chat est sur le tapis."
Y1 = "The cat is on the mat." (human reference 1)
Y2 = "There is a cat on the mat." (human reference 2)
Suppose that the machine outputs: "the the the the the the the."
One way to evaluate the machine output is to look at each word in the output and check if it is in the references.
This is called precision:
precision = 7/7, because every word of the output ("the", 7 times) appears in Y1 or Y2.
This is not a useful measure!
We can use a modified precision, in which we look across the references for the maximum number of times a particular word appears and clip the credit for that word at this number. So:
modified precision = 2/7, because the maximum count of "the" in any single reference is 2 (in Y1).
We clipped the count of 7 down to that maximum, which is 2.
Here we were looking at one word at a time (unigrams); we can look at n-grams too.
BLEU score on bigrams
The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be
called shingles. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a
"digram"); size 3 is a "trigram".
Suppose that the machine outputs: "the cat the cat on the mat."
Pairs      Count   Count clip
the cat    2       1 (Y1)
cat the    1       0
cat on     1       1 (Y2)
on the     1       1 (Y1)
the mat    1       1 (Y1)
Totals     6       4
So the modified bigram precision is 4/6.
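A minimal sketch of the clipped (modified) n-gram precision, which reproduces the 2/7 unigram and 4/6 bigram numbers above:

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n):
    """Clip each n-gram's count by its maximum count in any single reference."""
    tokenize = lambda s: s.lower().replace(".", "").split()
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    ref_counts = [Counter(ngrams(tokenize(r), n)) for r in references]
    clipped = sum(min(count, max(rc[gram] for rc in ref_counts))
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

refs = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_precision("the the the the the the the.", refs, 1))   # 2/7 ~= 0.286
print(modified_precision("The cat the cat on the mat.", refs, 2))    # 4/6 ~= 0.667
```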
So far we have been using sequence to sequence models with an encoder and a decoder. There is a technique called attention which makes these models even better.
The attention idea has been one of the most influential ideas in deep learning.
The problem of long sequences:
Given this model, inputs, and outputs.
The encoder should memorize this long sequence into one vector, and the decoder has to process this vector to
generate the translation.
If a human were to translate this sentence, they wouldn't read the whole sentence, memorize it, and then try to translate it; they would translate a part at a time.
The performance of this encoder-decoder model decreases as sentences get longer.
We will discuss the attention model, which works like a human who looks at parts of the sentence at a time. That significantly increases the accuracy even for longer sequences:
Blue is the normal model, while green is the model with attention mechanism.
In this section we will give just some intuition about the attention model, and in the next section we will discuss its details.
At first the attention model was developed for machine translation but then other applications used it like computer
vision and new architectures like Neural Turing machine.
The attention model was described in this paper:
Bahdanau et al., 2014. Neural machine translation by jointly learning to align and translate
Now for the intuition:
Suppose that our encoder is a bidirectional RNN:
We give the French sentence to the encoder and it should generate a vector that represents the inputs.
Now to generate the first word in English which is "Jane" we will make another RNN which is the decoder.
Attention weights are used to specify which input words are needed when generating each output word. So to generate "Jane" we will look mostly at "jane", "visite", and "l'Afrique".
alpha<1,1>, alpha<1,2>, and alpha<1,3> are the attention weights being used.
And so to generate any word there will be a set of attention weights that controls which words we are looking at
right now.
Attention Model
Let's formalize the intuition from the last section into the exact details of how this can be implemented.
First we will have a bidirectional RNN (most commonly LSTMs) that encodes the French sentence:
For learning purposes, let's assume that a<t'> includes the activations from both directions at time step t'.
We will have a unidirectional RNN that produces the output using a context vector c<t>, which is computed from the attention weights; these weights denote how much the output at step t should look at the activation a<t'>:
c<t> = sum over t' of α<t, t'> * a<t'>
The attention weights for each output step sum to 1, sum over t' of α<t, t'> = 1, because they are computed with a softmax:
α<t, t'> = exp(e<t, t'>) / sum over t'' of exp(e<t, t''>)
Now we need to know how to calculate e<t, t'>. We compute it with a small neural network (usually just 1 hidden layer, because we will need to compute this a lot):
its inputs are s<t-1>, the hidden state of the decoder RNN at the previous step, and a<t'>, the activation of the bidirectional encoder RNN.
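A minimal numpy sketch of one attention step, assuming the encoder activations are stacked in `a` with shape (Tx, 2*n_a), and that the small scoring network is a hypothetical one-hidden-layer MLP with weights `W1` (shape (n_s + 2*n_a, hidden)) and `W2` (shape (hidden,)):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(s_prev, a, W1, W2):
    """Compute the context vector c<t> for one decoder time step t.
    s_prev: decoder hidden state s<t-1>, shape (n_s,)
    a:      encoder activations a<t'> for t' = 1..Tx, shape (Tx, 2*n_a)"""
    Tx = a.shape[0]
    s_rep = np.repeat(s_prev[None, :], Tx, axis=0)               # copy s<t-1> for every t'
    e = np.tanh(np.concatenate([s_rep, a], axis=1) @ W1) @ W2    # e<t, t'>, shape (Tx,)
    alphas = softmax(e)                                          # α<t, t'>, sums to 1
    context = alphas @ a                                         # c<t> = Σ_t' α<t, t'> a<t'>
    return context, alphas
```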
One of the disadvantages of this algorithm is that it takes quadratic time or quadratic cost to run.
One fun way to see how attention works is by visualizing the attention weights:
Speech recognition
One of the most exciting developments using sequence-to-sequence models has been the rise of very accurate speech
recognition.
Let's define the speech recognition problem:
X: audio clip
Y: transcript
If you plot an audio clip it will look like this:
The horizontal axis is time while the vertical is changes in air pressure.
What really is an audio recording? A microphone records little variations in air pressure over time, and it is these little variations in air pressure that your ear perceives as sound. You can think of an audio recording as a long list of numbers measuring the little air pressure changes detected by the microphone. We will use audio sampled at 44100 Hz (44100 Hertz), which means the microphone gives us 44100 numbers per second. Thus, a 10 second audio clip is represented by 441,000 numbers (= 10 * 44100).
It is quite difficult to work with the "raw" representation of audio, because even the human ear doesn't process raw waveforms; the ear decomposes sound into different frequencies.
There's a common preprocessing step for an audio - generate a spectrogram which works similarly to human ears.
The horizontal axis is time while the vertical is frequencies. Intensity of different colors shows the amount of
energy - how loud is the sound for different frequencies (a human ear does a very similar preprocessing step).
A spectrogram is computed by sliding a window over the raw audio signal and calculating the most active frequencies in each window using a Fourier transform.
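A minimal numpy sketch of this preprocessing step (the window length, hop size, and log scaling are arbitrary choices here, just to show the sliding-window Fourier transform idea):

```python
import numpy as np

def spectrogram(audio, window=400, hop=160):
    """Slide a window over the raw samples and take the FFT magnitude in each window.
    audio: 1D array of air-pressure samples (e.g. 44100 per second).
    Returns an array of shape (num_windows, num_frequency_bins)."""
    frames = []
    for start in range(0, len(audio) - window + 1, hop):
        frame = audio[start:start + window] * np.hanning(window)   # taper window edges
        frames.append(np.abs(np.fft.rfft(frame)))                   # energy per frequency
    return np.log1p(np.array(frames))                               # rough loudness scale
```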
In the past, speech recognition systems were built using phonemes, which are hand-engineered basic units of sound. Linguists used to hypothesize that writing down audio in terms of these basic units of sound, called phonemes, would be the best way to do speech recognition.
End-to-end deep learning showed that phonemes are no longer needed. One of the things that made this possible is large audio datasets.
Research papers have around 300 - 3000 hours of training data while the best commercial systems are now trained
on over 100,000 hours of audio.
You can build an accurate speech recognition system using the attention model that we have described in the previous section:
One of the methods that seem to work well is CTC cost which stands for "Connectionist temporal classification"
To explain this let's say that Y = "the quick brown fox"
We are going to use an RNN with input, output structure:
The _ is a special character called "blank" and <SPC> is for the "space" character.
Basic rule for CTC: collapse repeated characters not separated by "blank"
So the 19 characters in our Y can be generated from a 1000-character RNN output using CTC and its special blanks.
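A minimal sketch of the collapsing rule, operating on a list of output symbols ("_" is the blank):

```python
def ctc_collapse(symbols, blank="_"):
    """Collapse repeated symbols not separated by a blank, then drop the blanks."""
    out, prev = [], None
    for s in symbols:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

# e.g. an RNN output like  t t t _ h _ e e e <SPC> q q q  collapses to "the q"
print(ctc_collapse(list("ttt_h_eee") + ["<SPC>"] + list("qqq")))   # ['t', 'h', 'e', '<SPC>', 'q']
```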
The ideas were taken from this paper:
Graves et al., 2006. Connectionist Temporal Classification: Labeling unsegmented sequence data with recurrent
neural networks
This paper's ideas were also used by Baidu's DeepSpeech.
Using both attention model and CTC cost can help you to build an accurate speech recognition system.
With the rise of deep learning for speech recognition, there are a lot of devices that can be woken up by saying certain words with your voice. These systems are called trigger word detection systems.
For example, Alexa - a smart device made by Amazon - can answer your call "Alexa, what time is it?" and then Alexa will
respond to you.
Trigger word detection systems include:
For now, the trigger word detection literature is still evolving, so there isn't yet a single universally agreed-upon algorithm for trigger word detection. But let's discuss an algorithm that can be used.
Let's now build a model that can solve this problem:
X: audio clip
X has been preprocessed into spectrogram features:
X<1>, X<2>, ... , X<t>
Y will be labels 0 or 1: 0 means the trigger word has not just been said, while 1 means the trigger word we want to detect has just been said.
The model architecture can be like this:
The vertical lines in the audio clip represent the moments just after the trigger word was said; the target labels corresponding to those moments will be 1.
One disadvantage of this labeling is that it creates a very imbalanced training set: there will be a lot of zeros and very few ones.
A hack to solve this is to output 1 not just once but several times, for a fixed period of time after the trigger word, before reverting back to zero.
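A minimal numpy sketch of this labeling hack, assuming we know the spectrogram time step at which each trigger word ends (the window of 50 steps is an arbitrary choice here):

```python
import numpy as np

def make_labels(Ty, trigger_end_steps, ones_window=50):
    """Build the label sequence y: 0 everywhere, except a run of 1s for `ones_window`
    time steps right after each trigger word ends, to fight the 0/1 imbalance."""
    y = np.zeros(Ty, dtype=int)
    for t in trigger_end_steps:
        y[t + 1 : min(t + 1 + ones_window, Ty)] = 1
    return y

# Usage sketch: y = make_labels(Ty=1375, trigger_end_steps=[220, 930])
```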
Extras
The diagram uses a RepeatVector node to copy s<t-1>'s value Tx times, and then a Concatenation node to concatenate s<t-1> and a<t'> to compute e<t, t'>, which is then passed through a softmax to compute α<t, t'>.