6 CNN

This document provides an overview of convolutional neural networks (CNNs). It discusses key concepts like convolution, sparse interaction, parameter sharing, and equivariance that make CNNs effective for processing grid-like data like images and time-series. The document also covers data preprocessing techniques for neural networks like vectorization and value normalization. It describes overfitting and underfitting issues and different regularization techniques to address overfitting like reducing the network size.
Neural Network and Deep Learning
Samatrix Consulting Pvt Ltd
Convolutional Neural Network
Convolutional Neural Network
• In this chapter, we will learn about a deep learning technique called
convolution.
• Convolution has become the standard method of classifying,
manipulating, and generating images.
• It is easy to implement convolution in deep learning.
Convolutional Neural Network
• In this chapter, we will focus on the key ideas behind convolution and
the related techniques that can be used to make convolution work on
images.
• Convolutional neural networks have been used to recognize the
people in a photograph, detect and classify different types of skin
cancers, repair image damage such as dust, scratches, and blur, and
classify people’s age and gender from their photos.
• Convolutional neural networks are also used in natural language
processing.
Convolutional Neural Network
• Convolutional neural networks are specialized for processing data
that has a grid-like topology.
• Examples include time-series data, which is a 1-D grid taking samples
at regular time intervals, and image data, which is a 2-D grid of pixels.
• The network uses a mathematical operation called convolution, hence
the name “convolutional neural network”.
What is Convolution
In the following equation, the convolution operation is denoted by an
asterisk:

𝑠(𝑡) = (𝑥 ∗ 𝑤)(𝑡)

The first argument, the function 𝑥, is often referred to as the input. The
second argument, the function 𝑤, is referred to as the kernel. The output is
referred to as the feature map.

Figure 6.1 illustrates an example of a simple convolution applied to a 2-D
tensor.
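As a minimal sketch of the equation above for discrete 1-D signals, assuming the input and kernel are finite Python lists (the function name `conv1d` and the sample values are illustrative and not taken from the slides):

```python
def conv1d(x, w):
    """Discrete 1-D convolution, keeping only positions where the kernel fits.

    Implements s[t] = sum_a x[a] * w[t - a], restricted to the "valid" region.
    """
    k = len(w)
    flipped = w[::-1]  # convolution flips the kernel before sliding it
    return [
        sum(x[t + a] * flipped[a] for a in range(k))
        for t in range(len(x) - k + 1)
    ]


x = [1, 2, 3, 4, 5]   # input
w = [1, 0, -1]        # kernel
feature_map = conv1d(x, w)
print(feature_map)    # [2, 2, 2]
```

Many deep learning libraries skip the kernel flip and compute cross-correlation instead. Because the kernel values are learned, the distinction rarely matters in practice.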
Motivation
• Convolution uses three important ideas that can help improve a
machine learning system: sparse interaction, parameter sharing, and
equivariant representations.
• Convolution also provides a means of working with inputs of variable
size.
Sparse Interaction
• In traditional neural networks, every output unit interacts with every input
unit.
• Convolutional networks, by contrast, have sparse interactions (also known
as sparse connectivity or sparse weights).
• We can accomplish this by making the kernel smaller than the input.
• For example, if the input image has thousands or millions of pixels, we can
use a kernel that has only tens or hundreds of pixels to detect small and
meaningful features such as edges.
• Therefore, we need to store fewer parameters.
• This reduces the memory requirements of the model and improves
its statistical efficiency. We also need fewer operations to compute the output.
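The savings can be made concrete with some simple arithmetic. The sketch below assumes a 1-D layer with 1,000 inputs and 1,000 outputs and a kernel of width 3, ignores bias terms, and uses hypothetical helper names:

```python
def dense_weight_count(n_inputs, n_outputs):
    """Every output unit has its own weight for every input unit."""
    return n_inputs * n_outputs


def conv_weight_count(kernel_width):
    """A single kernel is stored once and reused at every position."""
    return kernel_width


def conv_multiplications(n_outputs, kernel_width):
    """Each output needs one multiplication per kernel element."""
    return n_outputs * kernel_width


n_inputs = n_outputs = 1000   # assumed layer size, for illustration only
kernel_width = 3

print(dense_weight_count(n_inputs, n_outputs))        # 1000000 stored weights
print(conv_weight_count(kernel_width))                # 3 stored weights
print(dense_weight_count(n_inputs, n_outputs))        # 1000000 multiplications (dense)
print(conv_multiplications(n_outputs, kernel_width))  # 3000 multiplications (conv)
```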
Sparse Interaction
• The graphical representation of sparse interaction is illustrated in
figure 6.2 in which we have highlighted one input unit 𝑥3 , and output
units in 𝑠 that are affected by this unit.
• When 𝑠 is formed by convolution with a kernel of width 3, only three
outputs are affected by 𝑥3.
• For a fully connected network, connectivity is no longer sparse and all
the outputs are affected by 𝑥3
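The connectivity pattern described above can be sketched in a few lines. This assumes five units, stride 1, an odd kernel width, and an output of the same length as the input; the function name `affected_outputs` is hypothetical:

```python
def affected_outputs(input_index, n_units, kernel_width):
    """Return the 1-based indices of outputs that depend on one input unit.

    Assumes stride 1, an odd kernel width, and an output that has the same
    length as the input.
    """
    half = kernel_width // 2
    low = max(1, input_index - half)
    high = min(n_units, input_index + half)
    return list(range(low, high + 1))


# Convolution with a kernel of width 3: x3 reaches only three outputs.
print(affected_outputs(3, n_units=5, kernel_width=3))   # [2, 3, 4]

# A fully connected layer behaves like a kernel that spans the whole input.
print(list(range(1, 6)))                                # [1, 2, 3, 4, 5]
```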
Equivariance
• Equivariance means that if the input changes, the output
changes in the same way.
• The convolution creates a 2-D map of where certain features appear
in the input.
• If we move the object in the input, its representation will also move
the same amount in the output.
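A small experiment can illustrate this property for translation. The sketch below slides a kernel over a 1-D signal without flipping it (cross-correlation, as most deep learning libraries do); the signal, kernel, and function names are illustrative assumptions:

```python
def correlate1d(x, w):
    """Slide the kernel over the signal without flipping it."""
    k = len(w)
    return [
        sum(x[t + a] * w[a] for a in range(k))
        for t in range(len(x) - k + 1)
    ]


def shift_right(x, steps=1):
    """Move every value `steps` positions to the right, padding with zeros."""
    return [0] * steps + x[:len(x) - steps]


signal = [0, 0, 1, 2, 1, 0, 0, 0]
kernel = [-1, 0, 1]

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shift_right(signal), kernel)

print(out)          # [1, 2, 0, -2, -1, 0]
print(out_shifted)  # [0, 1, 2, 0, -2, -1]
```

Shifting the input by one step produces the same feature map shifted by one step.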
The Convolution Operation
• One of the major differences between a densely connected layer and
a convolution layer is as follows: the dense layers learn global
patterns in their input feature space, while the convolution layer
learns the local patterns.
• As shown in figure 6.3, the patterns are found in small 2-D windows
of the inputs.
• These windows could be, for example, 3 x 3. The image can be broken into local
patterns such as edges, textures, and so on.
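To make the idea of local windows concrete, the following sketch lists every 3 x 3 window of a small image. The 5 x 5 image of consecutive integers and the function name `extract_windows` are assumptions for illustration:

```python
def extract_windows(image, size=3):
    """Return every size x size window of a 2-D list, scanning row by row."""
    rows, cols = len(image), len(image[0])
    return [
        [row[c:c + size] for row in image[r:r + size]]
        for r in range(rows - size + 1)
        for c in range(cols - size + 1)
    ]


# A 5 x 5 "image" whose pixel values are simply 0..24.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
windows = extract_windows(image)

print(len(windows))   # 9
print(windows[0])     # [[0, 1, 2], [5, 6, 7], [10, 11, 12]]
```

A convolution layer applies the same kernel to each of these windows, whereas a dense layer would look at all 25 pixels at once.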
Data Preprocessing for Neural
Networks
Vectorization
• The input and labels of the neural network should be tensors of
floating-point data.
• In some cases, it could be tensors of integers.
• Every data type such as sound, images, and text, should be turned
into a tensor, which we call “data vectorization”.
Vectorization
• In our MNIST classification example, the image data was encoded as
integers in the 0-255 range.
• Before feeding the data into the network, we had to cast it to float32
and divide it by 255.
• Thus, we got the floating-point values in the 0-1 range.
Value Normalization
• Generally, we should not feed the network values, such as multi-digit
integers, that are much larger than the initial values of the
weights.
• We should also avoid heterogeneous data, for example, a dataset in
which one feature is in the 0-1 range whereas another feature is in
the 100-200 range.
• Such datasets can trigger large gradient updates, which will prevent
the networks from converging.
• Therefore, the data should
• Take “small” values. Most values should be in the 0-1 range
• Be homogeneous. All the features should take values in the same range
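One simple way to meet both requirements is per-feature min-max scaling. The sketch below is one option among several (standardizing each feature to zero mean and unit variance is another); the sample values and the function name `min_max_scale` are illustrative:

```python
def min_max_scale(rows):
    """Rescale every feature (column) independently to the 0-1 range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant feature carries no information; map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# One feature in the 0-1 range, another in the 100-200 range.
samples = [[0.2, 100.0], [0.5, 150.0], [0.9, 200.0]]
scaled = min_max_scale(samples)

for row in scaled:
    print(row)
```

In practice the minimum and maximum should be computed on the training data only and then reused for the validation and test data.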
Feature Engineering
• The process in which you use your own knowledge about the data and the
machine learning algorithm at hand and apply hard-coded (non-learned)
transformation to the data before the data is fed into the model, is known
as Feature Engineering.
• We use feature engineering in cases where we do not expect the machine
learning model to learn from the data completely.
• So, to make the job of the model easier, the data is transformed before it is
presented to the model.
• The essence of feature engineering is to make the problem easier by
expressing it in a simpler way. However, it needs an in-depth understanding
of the problem.
Feature Engineering
• Earlier some of the machine learning models could not learn useful
features from the data by themselves.
• Therefore, feature engineering was critical for the success of the
project.
• For example, before convolutional neural networks, to solve the MNIST
digit classification problem, we needed to hard-code features such as
the number of loops in the digit image, the height of each digit in
the image, a histogram of pixel values, and so on.
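Two of the hand-coded features mentioned above can be sketched as follows. The tiny grayscale image, the bin count, and the function names are assumptions made for illustration:

```python
def digit_height(image):
    """Number of rows spanned by the non-zero ("ink") pixels."""
    rows_with_ink = [i for i, row in enumerate(image) if any(row)]
    if not rows_with_ink:
        return 0
    return rows_with_ink[-1] - rows_with_ink[0] + 1


def pixel_histogram(image, n_bins=4, max_value=255):
    """Count how many pixels fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    bin_width = (max_value + 1) / n_bins
    for row in image:
        for pixel in row:
            counts[int(pixel // bin_width)] += 1
    return counts


# A 5 x 4 grayscale image with values in the 0-255 range.
image = [
    [0,   0,   0, 0],
    [0, 255, 128, 0],
    [0, 200,   0, 0],
    [0,  64,   0, 0],
    [0,   0,   0, 0],
]

print(digit_height(image))      # 3
print(pixel_histogram(image))   # [16, 1, 1, 2]
```

The resulting numbers, rather than the raw pixels, would then be fed to a classical machine learning model.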
Feature Engineering
• Modern deep learning algorithms do not require most feature
engineering.
• The neural networks can extract useful features from raw data.
• Still, you may need feature engineering in certain cases for the
following reasons:
• You can solve problems more elegantly while using fewer resources by using
feature engineering.
• By using good features, you can solve the problem using far less data.
Deep learning models need lots of training data to learn features on their
own. However, if we have few samples, feature engineering may be required.
Overfitting and Underfitting
• We have seen in the MNIST classification problem that the
performance of the model on the held-out validation data peaks after
a few epochs and then starts degrading.
• In other words, our model quickly starts to overfit the training data.
• We face the overfitting problem in every single machine learning
problem.
• In order to master machine learning, we need to learn how to deal
with overfitting.
Overfitting and Underfitting
• The fundamental issue in machine learning is to balance between
optimization and generalization.
• Optimization refers to the process in which we adjust the model to
get the best performance on the training data.
• Generalization refers to the performance of the model on the data it
has never seen before.
• The goal is to get good generalization, but we cannot control
generalization directly.
• We can only adjust the model based on the training data.
Overfitting and Underfitting
• At the beginning of training, optimization and generalization are
correlated.
• The lower the loss on the training data, the lower the loss on the test data.
• In such a case, we consider our model to be under-fit.
• This means that the model has not yet learned all the relevant patterns in the
training data set.
• After a certain number of iterations on the training data, generalization
stops improving and the validation metrics start
degrading.
• This means that the model has started to over-fit.
• It has started to learn patterns that are specific to the training data.
• Such patterns are not relevant to the new data.
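The turning point can be located from the recorded losses. The sketch below uses invented loss values purely for illustration (they are not results from the slides) and a hypothetical helper `best_epoch`:

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Invented per-epoch losses, for illustration only.
train_losses = [0.60, 0.40, 0.30, 0.22, 0.16, 0.11, 0.08]
val_losses = [0.55, 0.42, 0.36, 0.34, 0.35, 0.38, 0.42]

epoch = best_epoch(val_losses)
print(epoch)   # 4

# Before this epoch both losses fall together (under-fit regime); afterwards
# the training loss keeps falling while the validation loss rises (over-fit).
for e in range(epoch, len(val_losses)):
    print(e + 1, round(val_losses[e] - train_losses[e], 2))
```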
Overfitting and Underfitting
• The best solution to address the overfitting problem is to get more
training data.
• A model that is trained on more data generalizes better.
• When it is not possible to get more data, the next possible solution is
to control the quantity of information that the model can store.
• We can also add constraints on what information it is allowed to store.
• If the model can only afford to memorize a small number of patterns, it
will be forced to focus on the most prominent patterns.
• Therefore, it will have a better chance of generalizing well.
Overfitting and Underfitting
• The process of fighting overfitting in this way is known as
regularization.
• In the next section, we will review some of the most common
regularization techniques.
Addressing Overfitting
Reduce Network’s size
• The simplest way to address overfitting is to reduce the size of the
model.
• This means that we reduce the number of learnable parameters, which is
determined by the number of layers and the number of units in each layer.
• A model with more parameters will have more memorization capacity.
• At the same time, we need to ensure that the model has enough
parameters; otherwise, it will lack memorization
capacity and will underfit.
• We need to optimize between “too much capacity” and “not enough
capacity”.
Reduce Network’s size
• There is no magical formula that helps you decide the right number
of layers and the right number of units in each layer.
• Therefore, to find the right model for your data, you need to evaluate
an array of different architectures.
• You can start with a small number of layers and parameters and then
add new layers until you see diminishing returns with respect to the
validation loss.
Reduce Network’s size
Suppose our original model is as follows.

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(1000,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Reduce Network’s size
We can replace it with a smaller network.

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(4, activation='relu', input_shape=(1000,)),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
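To see how much capacity is removed, the learnable parameters of the two networks can be counted by hand. The sketch below assumes each Dense layer has one weight per input-unit pair plus one bias per unit; the helper name `dense_param_count` is hypothetical:

```python
def dense_param_count(input_size, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = input_size
    for units in layer_units:
        total += previous * units + units   # weight matrix + bias vector
        previous = units
    return total


original = dense_param_count(1000, [16, 16, 1])
smaller = dense_param_count(1000, [4, 4, 1])

print(original)   # 16305
print(smaller)    # 4029
```

The smaller network has roughly a quarter of the parameters, and therefore less room to memorize the training data.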
Reduce Network’s size
• The comparison of the validation loss for the original network and a smaller
network is as follows.
• We can see that the smaller network starts overfitting later than the reference
one (after 6 epochs rather than 4).
• The performance of the smaller network degrades much more slowly after it
starts overfitting.
Reduce Network’s size
We can try a different network that has a bigger capacity.

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(1000,)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Reduce Network’s size
• The comparison of the validation loss for the bigger network with the original
network is as follows.
• The bigger network starts overfitting after the first epoch. It overfits much more
severely.
Reduce Network’s size
• We can also compare the training losses of both networks.
• The training loss of the bigger network approaches zero very quickly.
• The more capacity the network has, the quicker the model will be able to model
the training data (which results in low training loss), but it is more vulnerable to
overfitting (resulting in a large difference between training and validation loss).
Adding Weight Regularization
• Given some training data and network architecture, simpler models are less likely
to overfit than complex ones.
• We can force the weights in the network to take only small values.
• This makes the distribution of weight values more regular.
• This is called “weight regularization”, which is done by adding a cost that is
associated with having large weights.
Adding Weight Regularization
• The cost comes in two flavors:
• L1 regularization: The added cost is proportional to the absolute value of the weight
coefficients (the L1 norm of the weights).
• L2 regularization: The added cost is proportional to the square of the value of the weight
coefficients (the squared L2 norm of the weights). It is also called weight decay.
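As a rough sketch of the two costs, assuming the penalty is a factor multiplied by the sum over all weights of a layer (the weight values and function names below are illustrative):

```python
def l1_penalty(weights, factor):
    """Added cost proportional to the absolute values of the weights."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """Added cost proportional to the squares of the weights."""
    return factor * sum(w * w for w in weights)


weights = [0.5, -1.0, 2.0]   # hypothetical layer weights
factor = 0.001

print(round(l1_penalty(weights, factor), 6))   # 0.0035
print(round(l2_penalty(weights, factor), 6))   # 0.00525
```

The penalty is added to the loss during training, so large weights are discouraged. The L2 cost grows much faster for large weights than the L1 cost does.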
Adding Weight Regularization
We can add the L2 weight regularization as follows

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, kernel_regularizer=tf.keras.regularizers.l2(0.001), activation='relu', input_shape=(1000,)),
    tf.keras.layers.Dense(16, kernel_regularizer=tf.keras.regularizers.l2(0.001), activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Adding Weight Regularization
• The impact of L2 regularization is as follows.
• We can see that even though both the models have the same number
of parameters, the model with L2 regularization has become more
resistant to overfitting than the reference model.
Adding Weight Regularization
As an alternative to L2 regularization, you can use one of the following
Keras weight regularizers.

# L1 regularization
tf.keras.regularizers.l1(0.001)

# L1 and L2 regularization at the same time
tf.keras.regularizers.l1_l2(l1=0.001, l2=0.001)
Dropout
• Dropout is a popular regularization method.
• We apply dropout in a deep network in the form of a dropout layer.
• The dropout layer is also called an accessory layer or a supplemental
layer because it does not do any computation on its own.
• We call it a layer because it allows us to include dropout in the
drawing of the network.
• However, it is not considered a real layer (hidden or otherwise).
• This layer is not counted when we describe the number of layers in a
particular network.
Dropout
• The dropout layer temporarily disconnects some of the neurons on
the previous layer.
• For example, a given layer would have returned a vector [0.2, 0.5, 1.1,
0.7, 1.3] for a given input sample during training.
• After we apply the dropout, the vector will have a few zero entries
distributed at random, e.g. [0, 0.5, 1.1, 0, 1.3].
• We provide a “dropout rate”, which is the fraction of the features that
are zeroed out; it is usually set between 0.2 and 0.5. At test
time, no units are dropped out.
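The behavior described above can be sketched in plain Python. This version also rescales the surviving values by 1 / (1 - rate) during training so that the expected sum stays the same, which is how the Keras Dropout layer behaves; the function name, seed, and rate are illustrative assumptions:

```python
import random


def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` during training.

    Surviving values are scaled by 1 / (1 - rate). At test time the input
    is returned unchanged.
    """
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)                  # fixed seed so runs are repeatable
activations = [0.2, 0.5, 1.1, 0.7, 1.3]

train_out = dropout(activations, rate=0.4, training=True, rng=rng)
test_out = dropout(activations, rate=0.4, training=False, rng=rng)

print(train_out)   # some entries are zero, the rest are scaled up
print(test_out)    # [0.2, 0.5, 1.1, 0.7, 1.3]
```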
Dropout
• The dropped-out units do not participate in any forward calculations.
• They are also not included in backprop.
• The optimizer does not update their weights.
• After the batch is completed and the rest of the weights have been updated,
the dropped-out neuron connections are restored.
• At the start of the next batch, the layer again chooses a new random set of
neurons and temporarily removes them.
Dropout
In Keras, you can introduce dropout in a network via the Dropout layer, which gets applied to the
output of the layer right before it.

tf.keras.layers.Dropout(rate=0.2)

Let’s add two Dropout layers to our model and see how well they reduce overfitting.

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(1000,)),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Dropout
• Let’s plot the results.
• We can see a clear improvement over the reference network.
Summary
• To summarize, the most common ways to prevent overfitting in neural
networks are:
• Getting more training data.
• Reducing the capacity of the network.
• Adding weight regularization.
• Adding dropout.
Universal Workflow – Deep Learning
1. Define the problem at hand and assemble a dataset
a. What is the input data? What do you want to predict?
b. What type of problem are you facing – binary classification, multi-class
classification, regression, or something else?
2. Pick a measure of success
a. How do you define success – accuracy? Precision-Recall? Customer retention
rate?
b. The metric of success will help you decide the loss function
Universal Workflow – Deep Learning
3. Prepare your data - The data should be formatted before it can be fed
into the machine learning model
a. The data should be formatted as tensors
b. The values taken by the tensors should be small, for example, in the [-1, 1]
range or the [0, 1] range
c. For heterogeneous data, the data should be normalized
d. If the dataset is small, you may consider feature engineering
4. Develop the model – three key choices that you need to make
a. Choice of the last-layer activation
b. Choice of the loss function. This should match the problem that you are trying to
solve
c. Choice of an optimizer. What would be the optimizer? What would be the learning
rate? Can we go with the default optimizer for Keras, rmsprop, and its default learning
rate?
Universal Workflow – Deep Learning
You can also pick a last layer activation and a loss function from the
following table
Problem type                              Last-layer activation   Loss function
Binary classification                     sigmoid                 binary_crossentropy
Multi-class, single-label classification  softmax                 categorical_crossentropy
Multi-class, multi-label classification   sigmoid                 binary_crossentropy
Regression to arbitrary values            None                    mse
Regression to values between 0 and 1      sigmoid                 mse or binary_crossentropy
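The reason softmax suits single-label problems while sigmoid suits multi-label problems can be seen numerically. The sketch below uses made-up raw scores for three classes and hand-written activation functions, not the Keras implementations:

```python
import math


def sigmoid(z):
    """Squash one score into (0, 1), independently of the other scores."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(scores)   # subtract the maximum for numerical stability
    exps = [math.exp(z - highest) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]   # made-up raw outputs for three classes

probs_softmax = softmax(scores)
probs_sigmoid = [sigmoid(z) for z in scores]

print([round(p, 3) for p in probs_softmax], round(sum(probs_softmax), 3))
print([round(p, 3) for p in probs_sigmoid], round(sum(probs_sigmoid), 3))
```

The softmax outputs compete with each other and sum to 1, so exactly one class is favored. Each sigmoid output is an independent yes/no probability, so several labels can be active at once.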
Universal Workflow – Deep Learning
5. Scale-up: develop a model that overfits – To figure out how big a model
you will need, you must develop a model that overfits using the following
methods. You should always monitor training loss and validation loss
a. Add layers
b. Make your layers bigger
c. Train for more epochs
6. Regularize Model and Tune Hyperparameters
a. Add dropout
b. Try different architectures by adding or removing layers
c. Add L1/L2 regularization
d. Try different hyperparameters (number of units per layer, the learning rate of the
optimizer, etc.)
e. Optionally, iterate on feature engineering
Thanks
Samatrix Consulting Pvt Ltd
