DL Unit-2
SVEC TIRUPATI
COURSE MATERIAL
UNIT 2
COURSE B.TECH
SEMESTER 4-1
Version V-1
TABLE OF CONTENTS – UNIT 2

S. NO   CONTENTS                                                    PAGE NO.
1       COURSE OBJECTIVES                                           1
2       PREREQUISITES                                               1
3       SYLLABUS                                                    1
4       COURSE OUTCOMES                                             1
5       CO-PO/PSO MAPPING                                           1
6       LESSON PLAN                                                 2
7       ACTIVITY BASED LEARNING                                     2
8       LECTURE NOTES                                               6
        2.1  INTRODUCTION TO MACHINE LEARNING                       6
        2.2  BASICS AND UNDERFITTING                                6
        2.3  HYPERPARAMETERS AND VALIDATION SETS                    7
        2.4  ESTIMATORS                                             10
        2.5  BIAS AND VARIANCE                                      12
        2.6  MAXIMUM LIKELIHOOD                                     13
        2.7  BAYESIAN STATISTICS                                    14
        2.8  SUPERVISED AND UNSUPERVISED LEARNING                   15
        2.9  STOCHASTIC GRADIENT DESCENT                            17
        2.10 CHALLENGES MOTIVATING DEEP LEARNING                    18
        2.11 DEEP FEEDFORWARD NETWORKS: LEARNING XOR                19
        2.12 GRADIENT BASED LEARNING                                20
        2.13 HIDDEN UNITS                                           20
        2.14 ARCHITECTURE DESIGN                                    22
        2.15 BACK-PROPAGATION AND OTHER DIFFERENTIATION ALGORITHMS  24
9       PRACTICE QUIZ                                               32
10      ASSIGNMENTS                                                 34
11      PART A QUESTIONS & ANSWERS (2 MARKS QUESTIONS)              35
12      PART B QUESTIONS                                            35
13      SUPPORTIVE ONLINE CERTIFICATION COURSES                     35
14      REAL TIME APPLICATIONS                                      35
15      CONTENTS BEYOND THE SYLLABUS                                37
16      PRESCRIBED TEXT BOOKS & REFERENCE BOOKS                     37
17      MINI PROJECT SUGGESTION                                     37
1. Course Objectives
The objectives of this course are to:
1. Demonstrate the major technology trends driving Deep Learning.
2. Build, train and apply fully connected neural networks.
3. Implement efficient neural networks.
4. Analyze the key parameters and hyperparameters in a neural network's architecture.
5. Apply concepts of Deep Learning to solve real-world problems.
2. Prerequisites
This course is intended for senior undergraduate and junior graduate students
who have a proper understanding of:
• Python Programming Language
• Calculus
• Linear Algebra
• Probability Theory
Although it would be helpful, knowledge of classical machine learning is NOT
required.
3. Syllabus
UNIT II
Machine Learning: Basics and Underfitting, Hyperparameters and Validation Sets,
Estimators, Bias and Variance, Maximum Likelihood, Bayesian Statistics, Supervised and
Unsupervised Learning, Stochastic Gradient Descent, Challenges Motivating Deep Learning.
4. Course outcomes
1. Demonstrate the mathematical foundations of neural networks.
2. Describe machine learning basics.
3. Differentiate architectures of deep neural networks.
4. Build convolutional neural networks.
5. Build and train RNNs and LSTMs.
5. CO-PO/PSO Mapping

       PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
CO1    3    2
CO2    3    2
CO3    3    3    2    2    3    2    2
CO4    3    3    2    2    3    2    2
CO5
6. Lesson Plan

Lecture No.  Weeks  Topics to be covered                                         References
1            1      Introduction to Machine Learning: Basics and Underfitting,   T1
                    Hyperparameters and Validation Sets
2            1      Estimators, Bias and Variance, Maximum Likelihood,            T1, R1
                    Bayesian Statistics
3            1      Supervised and Unsupervised Learning, Stochastic              T1, R1
                    Gradient Descent
4            1      Challenges Motivating Deep Learning                           T1, R1
7. Activity Based Learning
2. You will work on case studies from healthcare, autonomous driving, sign language
reading, music generation, and natural language processing. You will master not only
the theory, but also see how it is applied in industry.
8. Lecture Notes
INTRODUCTION TO MACHINE LEARNING
Introduction: Machine learning is essentially a form of applied statistics
with increased emphasis on the use of computers to statistically estimate
complicated functions and a decreased emphasis on proving confidence intervals
around these functions; we therefore present the two central approaches to
statistics: frequentist estimators and Bayesian inference. Most machine learning
algorithms can be divided into the categories of supervised learning and
unsupervised learning; we describe these categories and give some examples of
simple learning algorithms from each category. Most deep learning algorithms are
based on an optimization algorithm called stochastic gradient descent.
The central challenge in machine learning is that we must perform well on new,
previously unseen inputs—not just those on which our model was trained. The ability
to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a training
set, we can compute some error measure on the training set called the training
error, and we reduce this training error. So far, what we have described is simply an
optimization problem. What separates machine learning from optimization is that
we want the generalization error, also called the test error, to be low as well.
The factors determining how well a machine learning algorithm will perform
are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
Most machine learning algorithms have several settings that we can use to
control the behavior of the learning algorithm. These settings are called
hyperparameters. Hyperparameter values are not adapted by the learning algorithm
itself; if they were learned on the training set, they would always be chosen to
maximize model capacity, resulting in overfitting.
Validation Set
• To solve this problem we use a validation set – examples that the training
  algorithm does not observe.
• Test examples should not be used to make choices about the model hyperparameters.
• The training data is split into two disjoint parts – the first is used to learn the
  parameters, and the other is the validation set used to estimate the generalization
  error during or after training, allowing the hyperparameters to be updated.
• Typically 80% of the training data is used for training and 20% for validation.
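As an illustration (not part of the prescribed text), the 80%/20% split described above can be sketched in NumPy; the array names X, y and the data themselves are placeholder assumptions.

import numpy as np

# Hypothetical dataset: 100 examples with 3 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Shuffle indices, then split 80% for training and 20% for validation
indices = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, val_idx = indices[:split], indices[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(X_train.shape, X_val.shape)   # (80, 3) (20, 3)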
Cross-Validation
• When the dataset is too small, dividing it into a fixed training set and a fixed
  test set is problematic if it results in a small test set.
• A small test set implies statistical uncertainty around the estimated average test
  error.
• We then cannot claim that algorithm A works better than algorithm B for a given task.
• k-fold cross-validation addresses this: the dataset is split into k non-overlapping
  subsets; on trial i, the i-th subset is used as the test set and the remaining data
  for training, and the test error is estimated as the average over the k trials.
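A minimal sketch of k-fold cross-validation, assuming generic placeholder fit and error functions (the trivial "predict the mean" model below is only for illustration):

import numpy as np

def k_fold_cv_error(X, y, k, fit, error):
    """Estimate test error as the average error over k held-out folds."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(error(model, X[test_idx], y[test_idx]))
    return np.mean(errors)

# Illustrative usage with a trivial "predict the mean" model
X = np.random.normal(size=(50, 2))
y = X[:, 0] + 0.1 * np.random.normal(size=50)
fit = lambda X, y: y.mean()
error = lambda m, X, y: np.mean((y - m) ** 2)
print(k_fold_cv_error(X, y, k=5, fit=fit, error=error))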
Estimators
Point estimation is the attempt to provide the single best prediction of some quantity
of interest. A point estimator or statistic is any function of the data,
θ̂m = g(x(1), . . . , x(m)). The quantity of interest can be:
• A single parameter
• A vector of parameters
• A whole function
Bias
The bias of an estimator is defined as
bias(θ̂m) = E(θ̂m) − θ,
where the expectation is over the data and θ is the true underlying value. An estimator
is unbiased if bias(θ̂m) = 0, which implies E(θ̂m) = θ.
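The bias formula can be checked empirically. The sketch below (an illustration, not from the text) compares the biased and the unbiased sample-variance estimators for Gaussian data with known variance:

import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0          # true parameter theta (variance of N(0, 2^2))
m, trials = 10, 20000   # m samples per trial

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(0.0, 2.0, size=m)
    biased.append(np.mean((x - x.mean()) ** 2))   # divides by m
    unbiased.append(np.var(x, ddof=1))            # divides by m - 1

# bias(theta_hat) = E(theta_hat) - theta
print("bias of biased estimator  :", np.mean(biased) - true_var)    # ~ -true_var/m = -0.4
print("bias of unbiased estimator:", np.mean(unbiased) - true_var)  # ~ 0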
Having discussed the definition of an estimator, let us now discuss a commonly used
estimator: the maximum likelihood estimator.
Consider a set of m examples X = {x(1), . . . , x(m)} drawn independently from the true
but unknown data-generating distribution Pdata(x). Let Pmodel(x; θ) be a parametric
family of probability distributions over the same space, indexed by θ. In other words,
Pmodel(x; θ) maps any configuration x to a real number estimating the true probability
Pdata(x).
The maximum likelihood estimator for θ is then defined as:
θML = argmaxθ Pmodel(X; θ)
Since we assumed the examples to be i.i.d., the above equation can be written in
product form as:
θML = argmaxθ ∏ i=1..m Pmodel(x(i); θ)
This product over many probabilities can be inconvenient for a variety of reasons. For
example, it is prone to numerical underflow. Also, to find the maxima/minima of this
function, we can take its derivative w.r.t. θ and equate it to 0; since we have terms
in a product here, differentiating is quite cumbersome. To obtain a more convenient but
equivalent optimization problem, we observe that taking the logarithm of the likelihood
does not change its arg max but conveniently transforms a product into a sum, and
since log is a strictly increasing function (the natural log is a monotone
transformation), it does not affect the resulting value of θ.
So we have:
θML = argmaxθ Σ i=1..m log Pmodel(x(i); θ)
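To make the log-likelihood form concrete, a small hedged sketch: for i.i.d. Gaussian data, the θ (here the mean) that maximizes the summed log-probabilities can be found by a simple grid search, and it agrees with the sample mean; the data and the search grid are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.0, size=500)   # data from p_data with true mean 3

def log_likelihood(mu, x, sigma=1.0):
    # sum of log Pmodel(x_i; mu) for a Gaussian with fixed sigma
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

candidates = np.linspace(0.0, 6.0, 601)
ll = [log_likelihood(mu, x) for mu in candidates]
mu_ml = candidates[int(np.argmax(ll))]

print("MLE of the mean:", mu_ml)       # close to 3.0
print("sample mean    :", x.mean())    # the closed-form MLE for a Gaussian mean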
BAYESIAN STATISTICS
While not applicable to every deep learning technique, this statistical approach
affects three key fields of machine learning:
Statistical Inference
- Bayesian inference uses Bayesian probability to summarize evidence for the likelihood
of a prediction.
Statistical Modeling
- Bayesian statistics helps some models by classifying and specifying the prior
distributions of any unknown parameters.
Experiment Design
While most machine learning models try to predict outcomes from large
datasets, the Bayesian approach is helpful for several classes of problems that
aren't easily solved with other probability models. The table below contrasts
Bayesian and frequentist inference.
S.NO  Bayesian inference                              Frequentist inference
1     It uses probabilities for both hypotheses       It doesn't use or render probabilities of a
      and data.                                       hypothesis, i.e. no prior or posterior.
3     It demands an individual to learn or make       It never seeks a prior.
      a subjective prior.
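As a hedged illustration of the contrast in the table above, the coin-flip sketch below compares a frequentist point estimate (the MLE) with a Bayesian posterior obtained from a prior; the Beta(2, 2) prior is an assumption made only for the example.

import numpy as np

heads, flips = 7, 10                      # observed data

# Frequentist inference: a single point estimate, no prior or posterior
p_mle = heads / flips                     # maximum likelihood estimate = 0.7

# Bayesian inference: prior Beta(2, 2) over p, updated to a posterior Beta
a_prior, b_prior = 2, 2                   # assumed subjective prior
a_post = a_prior + heads
b_post = b_prior + (flips - heads)
p_posterior_mean = a_post / (a_post + b_post)

print("frequentist MLE        :", p_mle)               # 0.7
print("Bayesian posterior mean:", p_posterior_mean)    # 9/14 ≈ 0.643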
SUPERVISED LEARNING
Supervised learning uses labeled datasets: the algorithm is trained on input examples
together with their correct outputs, and it learns to predict the output for new,
unseen inputs.
UNSUPERVISED LEARNING
Unsupervised learning uses machine learning algorithms to analyze and
cluster unlabeled data sets. These algorithms discover hidden patterns in data
without the need for human intervention (hence, they are “unsupervised”).
Unsupervised learning models are used for three main tasks: clustering,
association and dimensionality reduction.
• Clustering is a data mining technique for grouping unlabeled data based on their
similarities or differences. For example, K-means clustering algorithms assign
similar data points into groups, where the K value represents the number of groups
and thus the granularity. This technique is helpful for market segmentation,
image compression, etc. (a minimal sketch follows this list).
• Association is another type of unsupervised learning method that uses different
rules to find relationships between variables in a given dataset. These methods are
frequently used for market basket analysis and recommendation engines, along
the lines of “Customers Who Bought This Item Also Bought” recommendations.
• Dimensionality reduction is a learning technique used when the number of features
(or dimensions) in a given dataset is too high. It reduces the number of data
inputs to a manageable size while also preserving the data integrity. Often, this
technique is used in the preprocessing data stage, such as when autoencoders
remove noise from visual data to improve picture quality.
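A minimal K-means sketch in plain NumPy, illustrating the clustering idea above; the two synthetic blobs and K = 2 are assumptions.

import numpy as np

rng = np.random.default_rng(3)
# Two synthetic, unlabeled blobs of points
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])

K = 2
centers = X[rng.choice(len(X), K, replace=False)]   # random initial centroids
for _ in range(20):
    # Assign each point to its nearest centroid
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    # Move each centroid to the mean of its assigned points
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print("cluster centers:\n", centers)   # roughly (0, 0) and (3, 3)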
The main difference between supervised and unsupervised
learning: Labeled data
The main distinction between the two approaches is the use of labeled
datasets. To put it simply, supervised learning uses labeled input and output data,
while an unsupervised learning algorithm does not.
• Accuracy: While supervised learning models tend to be more accurate than unsupervised
learning models, they require upfront human intervention to label the data
appropriately. For example, a supervised learning model can predict how long
your commute will be based on the time of day, weather conditions and so on.
But first, you'll have to train it to know that rainy weather extends the driving
time.
• Goals: In supervised learning, the goal is to predict outcomes for new data. You
know up front the type of results to expect. With an unsupervised learning
algorithm, the goal is to get insights from large volumes of new data. The machine
learning algorithm itself determines what is different or interesting in the dataset.
• Applications: Supervised learning models are ideal for spam detection,
sentiment analysis, weather forecasting and pricing predictions, among other
things. In contrast, unsupervised learning is a great fit for anomaly detection,
recommendation engines, customer personas and medical imaging.
• Complexity: Supervised learning is a simple method for machine learning,
typically calculated through the use of programs like R or Python. In
unsupervised learning, you need powerful tools for working with large amounts of
unclassified data. Unsupervised learning models are computationally complex
because they need a large training set to produce intended outcomes.
• Drawbacks: Supervised learning models can be time-consuming to train, and
the labels for input and output variables require expertise. Meanwhile,
unsupervised learning methods can have wildly inaccurate results unless you have
human intervention to validate the output variables.
STOCHASTIC GRADIENT DESCENT
Gradient descent repeatedly updates the parameters in the direction of the negative
gradient, scaled by a learning rate. If the learning rate is too low,
then the algorithm will have to go through many iterations to converge, which
will take a long time, and if it is too high we may jump over the optimal value.
Types of Gradient Descent:
• Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
SGD algorithm:
• In SGD, we find the gradient of the cost function for a single example at
each iteration, instead of the sum of the gradients of the cost function over all
examples.
• Since only one sample from the dataset is chosen at random for each iteration,
the path taken by the algorithm to reach the minimum is usually noisier than that of
typical (batch) Gradient Descent. But that does not matter much, because the exact
path does not matter as long as we reach the minimum, and SGD does so with a
significantly shorter training time (a small sketch follows).
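A sketch of the single-example update loop described above, fitting a one-parameter least-squares model; the toy data, learning rate and epoch count are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 2.5 * x + 0.1 * rng.normal(size=200)     # data generated with true slope 2.5

theta, lr = 0.0, 0.05
for epoch in range(5):
    for i in rng.permutation(len(x)):        # one randomly chosen example at a time
        grad = (theta * x[i] - y[i]) * x[i]  # gradient of 0.5*(theta*x_i - y_i)^2
        theta -= lr * grad                   # noisy step toward the minimum
print("estimated slope:", theta)             # close to 2.5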
The path taken by Batch Gradient Descent is shown below:
One thing to be noted is that, since SGD is generally noisier than typical Gradient
Descent, it usually takes a higher number of iterations to reach the minimum
because of the randomness in its descent. Even though it requires more iterations
to reach the minimum than typical Gradient Descent, it is still
computationally much less expensive than typical Gradient Descent. Hence, in
most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a
learning algorithm.
CHALLENGES MOTIVATING DEEP LEARNING
• Shortcomings of conventional ML:
1. The curse of dimensionality
2. Local constancy and smoothness regularization
3. Manifold learning
Curse of Dimensionality
• The number of possible distinct configurations of a set of variables increases
exponentially with the number of variables.
– This poses a statistical challenge.
• Example: with one variable we may need to distinguish only 10 regions of interest;
– we need to track 100 regions with two variables,
– and 1000 regions with three variables (see the count sketched below).
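The exponential growth of the number of regions can be seen with a one-line count (10 bins per variable, as in the example above):

# Number of distinct cells when each of d variables is split into 10 bins
for d in (1, 2, 3, 5, 10):
    print(d, "variables ->", 10 ** d, "regions to cover")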
Manifold Learning
• A manifold is an important concept underlying many ideas in machine learning.
• A manifold is a connected region.
– Mathematically, it is a set of points associated with a neighborhood around each point.
– Locally, it appears to be a Euclidean space.
• E.g., we experience the world as a 2-D plane, while it is in fact a spherical
manifold in 3-D space.
DEEP FEEDFORWARD NETWORKS: LEARNING XOR
XOR is a classification problem and one for which the expected outputs are
known in advance. It is therefore appropriate to use a supervised learning
approach.
Perceptrons
Like all ANNs, the perceptron is composed of a network of units, which are
analogous to biological neurons. A unit can receive an input from other units.
On doing so, it takes the sum of all values received and decides whether it is
going to forward a signal on to other units to which it is connected. This is
called activation. The activation function uses some means or other to reduce
the sum of input values to a 1 or a 0 (or a value very close to a 1 or 0) in
order to represent activation or lack thereof. Another form of unit, known as
a bias unit, always activates, typically sending a hard coded 1 to all units to
which it is connected.
Perceptrons include a single layer of input units — including one bias unit —
and a single output unit (see figure 2). Here a bias unit is depicted by a
dashed circle, while other units are shown as blue circles. There are two non-
bias input units representing the two binary input values for XOR. Any number
of input units can be included.
It is the setting of the weight variables that gives the network's author control over
the process of converting input values to an output value. It is the weights that
determine where the classification line, the line that separates data points into
classification groups, is drawn. All data points on one side of the classification line
are assigned the class 0; all others are classified as 1.
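A minimal sketch of a single perceptron unit as described above: a weighted sum of inputs plus a bias, passed through a 0/1 step activation. The weights shown compute OR, which is linearly separable; they are placeholders, not a working XOR solution.

import numpy as np

def perceptron(x, w, b):
    """One unit: weighted sum of inputs plus bias, then a 0/1 step activation."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative weights: this single line of separation computes OR
w, b = np.array([1.0, 1.0]), -0.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, b))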
A limitation of this architecture is that it is only capable of separating data points with
a single line. This is unfortunate because the XOR inputs are not linearly separable.
This is particularly visible if you plot the XOR input values to a graph. As shown in
figure 3, there is no way to separate the 1 and 0 predictions with a single
classification line.
Multilayer Perceptrons
It is worth noting that an MLP can have any number of units in its input,
hidden and output layers. There can also be any number of hidden layers.
This architecture, while more complex than that of the classic perceptron
network, is capable of achieving non-linear separation. Thus, with the right set
of weight values, it can provide the necessary separation to accurately classify
the XOR inputs.
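One well-known hand-constructed set of weights (the standard textbook example with a single ReLU hidden layer) can be checked directly; the sketch below reproduces that known solution rather than deriving it from the figures of this material.

import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-chosen weights that solve XOR with one ReLU hidden layer
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W.T @ np.array(x, dtype=float) + c)   # hidden layer
    y = w @ h + b                                   # output layer
    print(x, "->", int(y))                          # matches XOR: 0, 1, 1, 0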
GRADIENT BASED LEARNING
Batch Gradient Descent: This is a type of gradient descent which processes all
the training examples for each iteration of gradient descent. But if the
number of training examples is large, then batch gradient descent is
computationally very expensive. Hence if the number of training examples is
large, then batch gradient descent is not preferred. Instead, we prefer to use
stochastic gradient descent or mini-batch gradient descent.
Mini-Batch Gradient Descent: This is a type of gradient descent which works
faster than both batch gradient descent and stochastic gradient descent.
Here b examples, where b < m, are processed per iteration. So even if the
number of training examples is large, it is processed in batches of b training
examples at a time. Thus, it works for larger training sets, and with a smaller
number of iterations.
Variables used:
Let m be the number of training examples.
Let n be the number of features.
Repeat {
    θj := θj − (α/m) · Σ i=1..m (hθ(x(i)) − y(i)) · xj(i)
    (simultaneously for every j = 0 … n)
}
where xj(i) represents the jth feature of the ith training example, hθ is the
hypothesis, and α is the learning rate. So if m is very large (e.g. 5 million training
samples), then it takes hours or even days to converge to the global minimum. That's
why for large datasets it is not recommended to use batch gradient descent, as it
slows down the learning.
Hence, stochastic gradient descent updates the parameters using one example at a time:
Repeat {
    For i = 1 to m {
        θj := θj − α · (hθ(x(i)) − y(i)) · xj(i)   (for every j = 0 … n)
    }
}
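A compact NumPy version of the updates above for linear regression, comparing the full-batch rule with mini-batches of size b; the synthetic data, learning rate and batch size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)
m, n = 1000, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # x0 = 1 for the intercept
true_theta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=m)

def batch_gd(X, y, lr=0.1, epochs=200):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)      # uses all m examples per update
        theta -= lr * grad
    return theta

def minibatch_gd(X, y, b=32, lr=0.1, epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), b):           # b examples per update
            j = idx[start:start + b]
            grad = X[j].T @ (X[j] @ theta - y[j]) / len(j)
            theta -= lr * grad
    return theta

print("batch     :", np.round(batch_gd(X, y), 3))
print("mini-batch:", np.round(minibatch_gd(X, y), 3))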
HIDDEN UNITS
The design of hidden units is an extremely active area of research and does
not yet have many definitive guiding theoretical principles. Rectified linear
units are an excellent default choice of hidden unit.
Some hidden units are not differentiable at all input points. For example, the
rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may
seem like it invalidates g for use with a gradient-based learning algorithm. In
practice, gradient descent still performs well enough for these models to be used for
machine learning tasks.
Rectified linear units use the activation function g(z) = max{0, z}. They are easy to
optimize due to their similarity with linear units; the only difference is that they
output 0 across half their domain. Thus the gradient direction is far more useful for
learning than it would be with activation functions that introduce second-order effects.
It is good practice to set all elements of b to a small value such as 0.1. This makes
it likely that the ReLU units will be initially active for most training samples and
allow derivatives to pass through.
Sigmoid and tanh activation functions cannot be used with many layers due to the
vanishing gradient problem.
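The ReLU behaviour and the suggested small positive bias of 0.1 can be illustrated as follows (the layer size and random inputs are assumptions):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(6)
x = rng.normal(size=5)
W = rng.normal(scale=0.1, size=(5, 4))
b = np.full(4, 0.1)            # small positive bias: units start off mostly active

z = W.T @ x + b
h = relu(z)
print("pre-activations :", np.round(z, 3))
print("activations     :", np.round(h, 3))
print("derivative g'(z):", (z > 0).astype(float))  # gradient passes through active units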
One drawback of rectified linear units is that they cannot learn via
gradient-based methods on examples for which their activation is zero.
Prior to rectified linear units, most neural networks used the logistic sigmoid
activation function,
g(z) = σ(z),
or the hyperbolic tangent,
g(z) = tanh(z).
These two are closely related, since tanh(z) = 2σ(2z) − 1.
We have already seen sigmoid units as output units, used to predict the
probability that a binary variable is 1.
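A quick numeric check of the identity tanh(z) = 2σ(2z) − 1 quoted above:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
z = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True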
ARCHITECTURE DESIGN
The word architecture refers to the overall structure of the network: how many
units it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers. Most
neural network architectures arrange these layers in a chain structure, with
each layer being a function of the layer that preceded it. In this structure, the
first layer is given by
h(1) = g(1)(W(1)T x + b(1)),
and the second layer by
h(2) = g(2)(W(2)T h(1) + b(2)),
and so on.
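A direct transcription of the two layer equations above; the weight shapes and the choice of activations are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=4)                            # input vector

W1, b1 = rng.normal(size=(4, 6)), np.zeros(6)     # first-layer parameters
W2, b2 = rng.normal(size=(6, 3)), np.zeros(3)     # second-layer parameters
g1 = lambda z: np.maximum(0, z)                   # g(1): ReLU
g2 = lambda z: np.tanh(z)                         # g(2): tanh (illustrative choice)

h1 = g1(W1.T @ x + b1)     # h(1) = g(1)(W(1)^T x + b(1))
h2 = g2(W2.T @ h1 + b2)    # h(2) = g(2)(W(2)^T h(1) + b(2))
print(h1.shape, h2.shape)  # (6,) (3,)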
The optimization algorithm may not be able to find the value of the parameters
that corresponds to the desired function.
The universal approximation theorem says that there exists a network large
enough to achieve any degree of accuracy we desire, but the theorem does
not say how large this network will be; some results provide bounds on the size of a
single-layer network needed to approximate a broad class of functions.
Unfortunately, in the worst case, an exponential number of hidden units may
be required. This is easiest to see in the binary case: the number of possible
binary functions on vectors v ∈ {0,1}^n is 2^(2^n), and selecting one
such function requires 2^n bits, which will in general require O(2^n)
degrees of freedom.
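For intuition, the counts used in this argument can be printed directly:

# Number of binary functions on {0,1}^n is 2^(2^n); picking one needs 2^n bits
for n in (1, 2, 3, 4, 5):
    print(f"n={n}: 2^n = {2**n} bits, 2^(2^n) = {2**(2**n)} possible functions")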
BACK-PROPAGATION AND OTHER DIFFERENTIATION ALGORITHMS
1. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.
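A hedged sketch of that backward step for a tiny two-layer network with a squared-error loss; all sizes, data and the learning rate are assumptions, and the gradients are written out by hand rather than taken from an autodiff library.

import numpy as np

rng = np.random.default_rng(8)
x, y = rng.normal(size=3), np.array([1.0])        # one training example

W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)) * 0.5, np.zeros(1)
lr = 0.1

for step in range(100):
    # Forward pass
    z1 = W1.T @ x + b1
    h1 = np.maximum(0, z1)             # ReLU hidden layer
    y_hat = W2.T @ h1 + b2             # linear output
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: travel from the output layer back to the hidden layer
    d_yhat = y_hat - y                             # dL/dy_hat
    dW2 = np.outer(h1, d_yhat); db2 = d_yhat       # output-layer gradients
    d_h1 = W2 @ d_yhat                             # error sent back to hidden layer
    d_z1 = d_h1 * (z1 > 0)                         # through the ReLU
    dW1 = np.outer(x, d_z1); db1 = d_z1            # hidden-layer gradients

    # Adjust the weights so that the error decreases
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", loss)   # close to 0 after training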
The most prominent advantages of backpropagation are:
• It is a flexible method, as it does not require prior knowledge about the network.
• It is a standard method that generally works well.
• It does not need any special mention of the features of the function to be learned.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
• Static Back-propagation
• Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a
static input for static output. It is useful to solve static classification issues
like optical character recognition.
Recurrent Backpropagation:
Recurrent backpropagation is fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between these two methods is that the mapping is rapid and
static in static back-propagation, while it is non-static in recurrent
backpropagation.
History of Backpropagation
• Automatic differentiation: back-propagation is a special case of a broader class of
techniques called automatic differentiation.
• The deep learning community has developed somewhat outside the computer science
community that deals with automatic differentiation.
Computational Complexity
9. Practice Quiz
A) Procedure-oriented
B) Object-oriented
C) Logic-oriented
D) Rule-oriented
A) Training Data
B) Transfer Data
C) Data Training
A) Deep Learning
B) Artificial Intelligence
C) Data Learning
A) Deep Learning
B) Machine Learning
C) Artificial Intelligence
A) Supervised Learning
B) Unsupervised Learning
C) Reinforcement Learning
A) PCA
B) Naive Bayesian
C) Linear Regression