Mod 4
Error surface
Recap: patterns to be learnt at the lower levels
Example: a fully connected 8-9-9-9-4 network
8 × 9 = 72
9 × 9 = 81
9 × 9 = 81
9 × 4 = 36
Total = 270 weight parameters (biases not counted)
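A quick check of this count, as a minimal Python sketch (the 8-9-9-9-4 sizes come from the slide; biases are not counted):

```python
# Count the weights of a fully connected 8-9-9-9-4 network (no biases).
layer_sizes = [8, 9, 9, 9, 4]
weights_per_layer = [n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(weights_per_layer)        # [72, 81, 81, 36]
print(sum(weights_per_layer))   # 270 weight parameters
```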
δ_j = err · f′(net_j)                    (delta at output unit j)
Δw_ij = η · δ_j · a_i                    (update for weight w_ij, using activation a_i)
δ_i = ( Σ_j δ_j · w_ij ) · f′(net_i)     (delta back-propagated to hidden unit i)
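A minimal NumPy sketch of these delta equations on toy values (the array sizes, learning rate η = 0.1 and random numbers are illustrative assumptions, not from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
eta = 0.1                        # learning rate (assumed value)
a_i = rng.normal(size=3)         # activations of layer i (toy values)
W = rng.normal(size=(3, 2))      # weights w_ij from layer i to layer j
net_i = rng.normal(size=3)       # net inputs of layer i (toy values)
net_j = a_i @ W                  # net inputs of layer j
err = rng.normal(size=2)         # output error, e.g. (target - output)

delta_j = err * sigmoid_prime(net_j)              # delta_j = err * f'(net_j)
delta_W = eta * np.outer(a_i, delta_j)            # delta_w_ij = eta * delta_j * a_i
delta_i = (W @ delta_j) * sigmoid_prime(net_i)    # delta_i = (sum_j delta_j * w_ij) * f'(net_i)
```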
Training the multi-layer neural network
When using sigmoid neurons, the gradient will usually vanish (the vanishing-gradient problem).
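A small illustration of why: the sigmoid derivative is at most 0.25, so the product of many such factors during back-propagation shrinks rapidly.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

print(sigmoid_prime(0.0))   # 0.25 -- the largest value the sigmoid derivative can take
print(0.25 ** 10)           # ~9.5e-07 -- ten such factors multiplied during back-propagation
```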
Activation functions – which is better?
Reference:
https://medium.com/datadriveninvestor/deep-learning-best-practices-activation-functions-weight-initialization-methods-part-1-c235ff976ed
Activation functions – which is better?
Classification
Cost (loss) function: Cross-entropy
J(θ) = −(1/n) Σ_{i=1..n} Σ_{k=1..K} [ y_k^(i) log ŷ_k^(i) + (1 − y_k^(i)) log(1 − ŷ_k^(i)) ]
(for a binary classifier with scalar output y = 0/1, or for a desired one-hot output y = [0, 0, …, 1, …, 0])
Regression
Output activation: f(x) = x
Cost (loss) function: Mean Squared Error
J(θ) = (1/n) Σ_{i=1..n} ( y^(i) − ŷ^(i) )²
or Mean Absolute Error
J(θ) = (1/n) Σ_{i=1..n} | y^(i) − ŷ^(i) |
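A minimal NumPy sketch of these three cost functions (the toy targets and predicted probabilities are illustrative assumptions):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # y, y_hat: shape (n, K); 0/1 (or one-hot) targets and predicted probabilities.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat), axis=1))

def mse(y, y_hat):
    # Mean Squared Error: J = (1/n) * sum (y - y_hat)^2
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error: J = (1/n) * sum |y - y_hat|
    return np.mean(np.abs(y - y_hat))

y = np.array([[0, 1, 0], [1, 0, 0]], dtype=float)      # one-hot targets (toy)
y_hat = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])   # predicted probabilities (toy)
print(cross_entropy(y, y_hat), mse(y, y_hat), mae(y, y_hat))
```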
Overfitting
Weight decay: To prevent overfitting, every time we update a weight w with the gradient ∇E with respect to w, we also subtract λ∙w from it. This gives the weights a tendency to decay towards zero, hence the name.
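A minimal sketch of this update for a single weight (the values of η, λ, w and ∇E are illustrative assumptions):

```python
# Weight-decay update: usual gradient step, plus the extra lambda*w decay term.
eta, lam = 0.1, 0.01     # learning rate and decay strength (assumed values)
w, grad_E = 0.5, 0.2     # a single weight and its gradient dE/dw (toy values)
w = w - eta * grad_E - lam * w
print(w)
```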
Regularization to prevent over-fit
The L2 constraint shrinks the coefficients towards zero (aka shrinkage); the larger the value of lambda, the stronger the shrinkage. Shrinking the coefficients leads to lower variance and in turn a lower error value. Ridge regression therefore decreases the complexity of a model, but it does not reduce the number of variables; it only shrinks their effect.
Ridge Regression
Given a suitable lambda value, lasso regression can drive some coefficients to zero. The larger the value of lambda, the more features are shrunk to zero. This can eliminate some features entirely and give us a subset of predictors, which helps mitigate multi-collinearity and model complexity. Predictors not shrunk towards zero signify that they are important, and thus L1 regularization allows for feature selection (sparse selection).
Lasso Regression
Lasso method overcomes the disadvantage of Ridge
regression by not only punishing high values of the
coefficients β but actually setting them to zero if
they are not relevant. Therefore, you might end up
with fewer features included in the model than you
started with, which is a huge advantage.
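A minimal sketch contrasting the two penalty terms (the coefficient vector and λ = 0.1 are illustrative assumptions):

```python
import numpy as np

def ridge_penalty(beta, lam):
    # L2 penalty: lam * sum(beta_j**2) -- shrinks coefficients but rarely makes them exactly zero
    return lam * np.sum(beta ** 2)

def lasso_penalty(beta, lam):
    # L1 penalty: lam * sum(|beta_j|) -- can drive some coefficients exactly to zero
    return lam * np.sum(np.abs(beta))

beta = np.array([3.0, -0.5, 0.0, 1.2])
print(ridge_penalty(beta, lam=0.1), lasso_penalty(beta, lam=0.1))
```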
Regularization to prevent overfit
L1 regularization (LASSO regression) produces sparse weight matrices: most of the values are zero (or very close to zero) and only a few are larger, non-zero values.
If we find a model whose neurons have weights close to zero, it means we do not need those neurons: the model effectively deactivates them, and we may not need a specific feature/input, leading to a simpler model. For instance, if we have 50 coefficients but only 10 are non-zero, the other 40 are irrelevant to our predictions. This is interesting not only from the efficiency point of view but also from the economic point of view: gathering data and extracting its features can be a very expensive task (in terms of time and money), so reducing the number of required features benefits us.
Due to the absolute value, L1 regularization introduces a non-differentiable term, but despite that, there are methods to minimize it.
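One such method is the proximal-gradient (soft-thresholding) step; a minimal sketch, where the threshold t would play the role of the step size times λ:

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of the L1 norm: shrinks every weight towards zero and
    # sets weights with |w| <= t exactly to zero.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.02, -1.3])
print(soft_threshold(w, t=0.1))    # small weights become exactly 0 -> sparsity
```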
Regularization to prevent overfit
Internal Covariate Shift: during training, each layer tries to correct itself for the error made during forward propagation, but every single layer acts separately when doing so.
For example, in the network given above, the 2nd layer adjusts its weights and biases to correct the output. But due to this readjustment, the output of the 2nd layer, i.e. the input of the 3rd layer, changes for the same initial input. So the third layer has to learn from scratch to produce the correct outputs for the same data.
This creates the problem of a layer only starting to learn after its previous layer, i.e. the 3rd layer learns after the 2nd has finished, the 4th starts learning after the 3rd, and so on.
Batch Normalization
More specifically, due to changes in the weights of previous layers, the distribution of input values for the current layer changes, forcing it to learn from a new "input distribution".
Normalization: in a dataset, all the features (columns) may not be in the same range, e.g. the price of a house (in the thousands) versus the age of a house (under 100). Training takes a lot longer on such datasets.
Usually, in simpler ML algorithms like linear regression, the inputs are "normalized" before training to bring them onto a common scale. Normalization converts the distribution of each input to have mean = 0 and standard deviation = 1, so most of the values lie between -1 and 1.
Raw and normalized data
Batch Normalization
We can apply the same normalization to the inputs of a neural network. It speeds up training, just as in linear regression. But since the 2nd layer changes this distribution, the subsequent layers do not benefit. So, what can we do?
Why not add normalization between the layers? This is what Batch Normalization does. To reduce the problem of internal covariate shift, Batch Normalization adds a normalization "layer" between the layers. An important point is that the normalization is done separately for each dimension (input neuron), over the mini-batches, and not over all dimensions together. Hence the name "batch" normalization.
Batch Normalization
Because of these normalization "layers" between the fully connected layers, the range of the input distribution of each layer stays the same, no matter how the previous layers change.
Given the values x of the k-th neuron over a mini-batch, the layer computes x̂^(k) = (x^(k) − E[x^(k)]) / √Var[x^(k)].
Normalization keeps all the inputs centred around 0, so there is not much change in each layer's input. Layers in the network can therefore learn from back-propagation simultaneously, without waiting for the previous layer to learn. This speeds up the training of networks.
Batch Normalization
There are usually two places where Batch Normalization can be applied:
1. Before the activation function (non-linearity)
2. After the non-linearity
Most activation functions have problems when the normalized values are fed in directly: for sigmoid and tanh activations, the normalized region is more linear than non-linear, and for ReLU activation, half of the inputs are zeroed out. So a further transformation (a learned scale and shift) is applied to move the distribution away from 0.
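A minimal NumPy sketch of this per-dimension normalization over a mini-batch, including the learned scale (γ) and shift (β) that move the values away from 0 (the toy mini-batch, ε and the initial γ, β values are illustrative assumptions):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: mini-batch of shape (batch_size, num_neurons); each neuron (column) is
    # normalized separately over the mini-batch, then scaled and shifted.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance per dimension
    return gamma * x_hat + beta                # learned scale/shift moves the values away from 0

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # toy mini-batch: 32 samples, 4 neurons
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per neuron
```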
Batch Normalization
1. Data Augmentation: Suppose we are building an image classification model and are
lacking the requisite data due to various reasons. In such cases, we can use data
augmentation, i.e., applying some changes such as flipping the image, taking random crops
of the image, randomly rotating images, etc. These can potentially help us get more training
data and hence reduce overfitting.
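A minimal sketch of two such augmentations, a random horizontal flip and a random crop, on a toy 28×28 grayscale array (the 4-pixel crop margin is an arbitrary illustrative choice):

```python
import numpy as np

def augment(img, rng):
    # img: H x W grayscale array; random horizontal flip, then a random 4-pixel crop.
    if rng.random() < 0.5:
        img = img[:, ::-1]                 # horizontal flip
    h, w = img.shape
    top = rng.integers(0, 5)               # crop offsets between 0 and 4 pixels
    left = rng.integers(0, 5)
    return img[top:h - (4 - top), left:w - (4 - left)]

rng = np.random.default_rng(0)
image = rng.random((28, 28))               # toy "image"
print(augment(image, rng).shape)           # (24, 24)
```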
Other regularization methods
2. Early Stopping: Early Stopping is one of the most popular, and also effective, techniques
to prevent overfitting. Use the validation data set to compute the loss function at the end of
each training epoch, and once the loss stops decreasing, stop the training and use the test
data to compute the final classification accuracy. In practice it is more robust to wait until
the validation loss has stopped decreasing for four or five successive epochs before
stopping. The justification for this rule is quite simple: The point at which the validation
loss starts to increase is when the model starts to overfit the training data, since from this
point onwards its generalization ability starts to decrease. Early Stopping can be used by
itself or in combination with other Regularization techniques.
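A minimal sketch of this rule with a patience of five epochs; train_one_epoch and validation_loss are hypothetical stand-ins for the real training and validation steps:

```python
def train_one_epoch():            # hypothetical stand-in for one epoch of training
    pass

def validation_loss(epoch):       # hypothetical stand-in: loss falls, then starts rising
    return abs(epoch - 10) * 0.1 + 1.0

patience, max_epochs = 5, 100     # wait 4-5 epochs without improvement, as suggested above
best_loss, bad_epochs, best_epoch = float("inf"), 0, 0
for epoch in range(max_epochs):
    train_one_epoch()
    val_loss = validation_loss(epoch)
    if val_loss < best_loss:
        best_loss, bad_epochs, best_epoch = val_loss, 0, epoch   # new best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                  # validation loss has stopped decreasing
print("stopped at epoch", epoch, "best epoch was", best_epoch)
```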
Other regularization methods
Once the DLN model has been trained, its true test is how well it is able to classify inputs that it
has not seen before, which is also known as its Generalization Ability.
There are two kinds of problems that can afflict ML models in general:
i) Even after the model has been fully trained such that its training error is small,
it exhibits a high test error rate. This is known as the problem of Over-fitting.
ii) The training error fails to come down in spite of several epochs of training. This is known as the problem of Under-fitting.
The following factors determine how well a model is able to generalize from the training dataset to
the test dataset:
The model capacity and its relation to data complexity:
Model’s capacity is its ability to fit a wide variety of functions. In general if the model capacity is
less than the data complexity then it leads to under-fitting, while if the converse is true, then it can
lead to over-fitting. Hence we should try to choose a model whose capacity matches the
complexity of the training and test datasets.
Even if the model capacity and the data complexity are well matched, we can still encounter the
overfitting problem. This is due to an insufficient amount of training data.
Based on these observations, a useful rule of thumb is the following: The chances of encountering
the overfitting problem increases as the model capacity grows, but decreases as more training data
is available.
Other regularization methods
Note that we are attempting to reduce the training error (to avoid the underfitting problem) and the
test error (to avoid the overfitting problem) at the same time. This leads to conflicting demands on
the model capacity, since training error reduces monotonically as the model capacity increases, but
the test error starts to increase if the model capacity is too high. In general if we plot the test error
as a function of model capacity, it exhibits a characteristic U shaped curve. The ideal model
capacity is the point at which the test error starts to increase. This criterion is used extensively in DLNs to determine the best set of hyper-parameters to use.
Other regularization methods
The generalization ability of DLN models relies on a very important assumption, which is that both the training and test datasets can be generated by the same probabilistic model.
In practice what this means is that if we train the model to recognize a certain type of
object, human faces for example, we cannot expect it to perform well if the test data
consists entirely of cat faces.
There is a famous result called the No Free Lunch Theorem, which states that if this assumption is not satisfied, i.e. the training and test dataset distributions are unconstrained, then every classification algorithm has the same error rate when classifying previously unobserved points. Hence the only way to do better is to constrain the training and test datasets to a narrower class of data that is relevant to the problem being solved.
Bias- Variance Trade-off
Bias is the difference between the average prediction of our model and the correct
value which we are trying to predict. Model with high bias pays very little attention
to the training data and oversimplifies the model. It always leads to high error on
training and test data.
Variance is the variability of model prediction for a given data point or a value
which tells us spread of our data. Model with high variance pays a lot of attention
to training data and does not generalize on the data which it hasn’t seen before. As
a result, such models perform very well on training data but have high error rates on
test data.
Mathematically, let the variable we are trying to predict be Y and the other covariates be X. We assume there is a relationship between the two such that
Y = f(X) + e
where e is the error term, normally distributed with a mean of 0.
We will build a model f̂(X) of f(X) using linear regression or any other modelling technique.
Bias- Variance Trade-off
So the expected squared error at a point x decomposes as
Err(x) = ( E[f̂(x)] − f(x) )² + E[ ( f̂(x) − E[f̂(x)] )² ] + σₑ² = Bias² + Variance + Irreducible error
The irreducible error is the error that cannot be reduced by creating good models. It is a measure of the amount of noise in our data. It is important to understand that no matter how good we make our model, our data will have a certain amount of noise, or irreducible error, that cannot be removed.
Bias- Variance Trade-off
In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to build a linear model with nonlinear data. Such models (like linear and logistic regression) are too simple to capture the complex patterns in the data.
In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model extensively on a noisy dataset. These models have low bias and high variance. They are very complex models, like decision trees, which are prone to overfitting.
Bias- Variance Trade-off
(Figure: error plotted against model complexity, from less complexity to more complexity)
New optimization methods
Effect of learning rate
https://srdas.github.io/DLBook/GradientDescentTechniques.html
If the learning rate η is set to a large value, the algorithm moves quickly at the start of the iteration, but the large step size can cause the parameters to overshoot as the system approaches the minimum, which can lead to oscillations. If it is set too small, the algorithm converges with high likelihood, but it can take a very long time to do so. Hence η should ideally be set adaptively, so that it is large in the initial stages of the optimization and becomes smaller as it gets closer to the minimum.
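A toy illustration (not from the reference) of these effects, running gradient descent on f(w) = w² with three different learning rates:

```python
def descend(eta, steps=20, w=1.0):
    # Gradient descent on f(w) = w**2, whose gradient is 2*w; minimum is at w = 0.
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(descend(0.01))   # too small: still far from the minimum after 20 steps
print(descend(0.4))    # reasonable: very close to 0
print(descend(1.1))    # too large: overshoots and oscillates with growing |w|
```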
New optimization methods
Effect of learning rate
A good Learning Rate, on the other hand, combines a quick decrease during the initial epochs with a lower steady-state value. Learning Rate Annealing is the strategy of reducing the Learning Rate as the system approaches the minimum, such that the rate is high at the start of the training and gradually falls as the training progresses. This reduction can be done in several ways; popular approaches are:
• Track the validation accuracy and decrease the Learning Rate when it appears to plateau.
• Automatically anneal the Learning Rate based on the number of epochs that the Gradient Descent
algorithm has been through.
New optimization methods
Improvements to the parameter update equations
The base parameter update equation is w ← w − η·∂L/∂w. The Momentum method adds a velocity term that accumulates past gradients, and with the Nesterov (look-ahead) variant of Momentum the Gradient Descent process speeds up considerably when compared to the plain Momentum method.
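A minimal sketch of the plain Momentum and Nesterov look-ahead updates on a toy objective f(w) = w² (the values of η, ρ and the iteration count are illustrative assumptions):

```python
def grad(w):                      # toy gradient of f(w) = w**2
    return 2 * w

eta, rho = 0.1, 0.9               # learning rate and momentum coefficient (assumed values)

# Plain Momentum: accumulate a velocity from past gradients and step with it.
w, v = 1.0, 0.0
for _ in range(50):
    v = rho * v - eta * grad(w)
    w = w + v

# Nesterov Momentum: evaluate the gradient at the look-ahead point w + rho*v.
w, v = 1.0, 0.0
for _ in range(50):
    v = rho * v - eta * grad(w + rho * v)
    w = w + v
```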
New optimization methods- Adaptive learning rate
AdaGrad – Adaptive Gradient algorithm
Modifies the learning rate at each time step t for every parameter i based on the past gradients that have
been computed for i
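A minimal sketch of the AdaGrad update on a toy objective f(w) = w² (η and ε are illustrative values):

```python
def grad(w):                       # toy gradient of f(w) = w**2
    return 2 * w

eta, eps = 0.5, 1e-8
w, g_acc = 1.0, 0.0
for t in range(100):
    g = grad(w)
    g_acc += g ** 2                            # accumulate all past squared gradients
    w -= eta * g / (g_acc ** 0.5 + eps)        # per-parameter effective learning rate keeps shrinking
print(w)
```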
New optimization methods- Adaptive learning rate: Root Mean Squared Propagation (RMSProp)
RMS Prop modifies AdaGrad to perform better in the non-convex setting by changing gradient accumulation
into an exponentially weighted moving average
The exponentially decaying average discards history from the extreme past so that it can converge rapidly.
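A minimal sketch of this modification on the same toy objective; ρ = 0.9 is an illustrative decay rate for the moving average:

```python
def grad(w):                       # toy gradient of f(w) = w**2
    return 2 * w

eta, rho, eps = 0.01, 0.9, 1e-8
w, g_avg = 1.0, 0.0
for t in range(200):
    g = grad(w)
    g_avg = rho * g_avg + (1 - rho) * g ** 2   # exponentially weighted moving average
    w -= eta * g / (g_avg ** 0.5 + eps)        # history from the extreme past is gradually discarded
print(w)
```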
New optimization methods- Adaptive learning rate
Adam (adaptive moments) also stores a running average of past squared gradients; it is a variant on the combination of RMSProp and momentum, with a few important distinctions. In Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient, rather than by applying momentum to the rescaled gradients as in RMSProp with momentum.
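A minimal sketch of the Adam update on the same toy objective, using the commonly quoted defaults β₁ = 0.9 and β₂ = 0.999:

```python
def grad(w):                       # toy gradient of f(w) = w**2
    return 2 * w

eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g            # first moment: momentum applied to the gradient itself
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment: as in RMSProp
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (v_hat ** 0.5 + eps)
print(w)
```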
Choosing the models and algorithms
Choosing the models
•If the input data is such that the various classes are approximately linearly separated (which can be verified by plotting a
subset of input data points in two dimensions using projection techniques), then a linear model will work.
•Deep Learning Networks are needed for more complex datasets with non-linear boundaries between classes. If the
input data has a 1-D structure, then a Deep Feed Forward Network will suffice.
•If the input data has a 2-D structure (such as black and white images), or a 3-D structure (such as color images), then a
Convolutional Neural Network or ConvNet is called for. In some cases 1-D or 2-D data can be aggregated into a higher
level structure which then becomes amenable to processing using ConvNets. ConvNets excel at object detection and
recognition in images and they have also been applied to tasks such as DLN based image generation.
•If the input data forms a sequence with dependencies between the elements of the sequence, then a Recurrent Neural
Network or RNN is required. Typical examples of this kind of data include: Speech waveforms, natural language
sentences, stock prices, etc. RNNs are ideal for tasks such as speech recognition, machine translation, captioning, etc.
Choosing the algorithms
Choice of SGD parameter update rule
Adam – combines the advantages of both (momentum and RMSProp) and is the default choice
Choosing the algorithms
Choice of learning rate
• Keep track of the validation error, and reduce the Learning Rate by a factor of 2 when the error
appears to plateau.
• Automatically reduce the Learning Rate using a predetermined schedule. Popular schedules are:
(a) Exponential decrease: η = η₀ · 10^(−t/r), so that the Learning Rate drops by a factor of 10 every r steps.
(b) Inverse decrease: η = η₀ · (1 + t/r)^(−c), which leads to a smaller rate of decrease compared to the exponential schedule.
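A minimal sketch of these two schedules (η₀, r and c are hyper-parameters to be chosen):

```python
def exponential_decay(eta0, t, r):
    # (a) eta = eta0 * 10**(-t/r): drops by a factor of 10 every r steps
    return eta0 * 10 ** (-t / r)

def inverse_decay(eta0, t, r, c):
    # (b) eta = eta0 * (1 + t/r)**(-c): a slower decrease than the exponential schedule
    return eta0 * (1 + t / r) ** (-c)

print(exponential_decay(0.1, t=1000, r=1000))    # 0.01
print(inverse_decay(0.1, t=1000, r=1000, c=1))   # 0.05
```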
Choosing the algorithms
Choice of activation function
The general rules in this area are: Avoid Sigmoid and Tanh functions, use ReLU as a default
and Leaky ReLU to improve performance, try out MaxOut and ELU on an experimental
basis.
Choosing the algorithms
Choice of weight initialization
Weight Initialization Rules: use the Xavier/He initializations as the default.
For a linear neuron with random weights W, pick the weights from a Gaussian distribution with zero mean and a variance of 1/n_in, where n_in is the number of input neurons.
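A minimal sketch of this rule: weights drawn from a zero-mean Gaussian with variance 1/n_in (the 8→9 layer sizes reuse the earlier example; the He variant for ReLU layers would use variance 2/n_in instead):

```python
import numpy as np

def init_weights(n_in, n_out, rng):
    # Zero-mean Gaussian with variance 1/n_in, i.e. standard deviation sqrt(1/n_in).
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W = init_weights(8, 9, rng)
print(W.mean().round(3), W.var().round(3))   # mean ~ 0, variance ~ 1/8
```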