
DL Unit-2

This document provides information on the course material for the Deep Learning unit 2 course. It includes the course objectives, prerequisites, syllabus, outcomes, lesson plan, and lecture notes. The objectives are to demonstrate major trends in deep learning, build and train neural networks, analyze network architecture parameters, and apply deep learning concepts. The syllabus covers machine learning basics, deep feedforward networks, gradient descent, and backpropagation algorithms. The lesson plan has 8 lectures over 2 weeks addressing key topics. Lecture notes provide details on introductions, estimators, regularization, and network architectures.

Uploaded by

SYEDA
SVEC TIRUPATI

COURSE MATERIAL

SUBJECT: DEEP LEARNING (20A05703C)
UNIT: 2
COURSE: B.TECH
DEPARTMENT: COMPUTER SCIENCE & ENGINEERING
SEMESTER: 4-1
PREPARED BY (Faculty Name/s): Mrs. G T PRASANNA KUMARI, Mrs. N. DIVYA
VERSION: V-1
PREPARED / REVISED DATE: 20-08-2023

1|D L - U N I T - 2

BTECH_CSE-SEM 4 1
TABLE OF CONTENTS – UNIT 1
S. NO CONTENTS
1 COURSE OBJECTIVES
2 PREREQUISITES
3 SYLLABUS
4 COURSE OUTCOMES
5 CO - PO/PSO MAPPING
6 LESSON PLAN
7 ACTIVITY BASED LEARNING
8 LECTURE NOTES
2.1 INTRODUCTION TO MACHINE LEARNING
2.2 BASICS AND UNDER FITTING
2.3 HYPER PARAMETERS AND VALIDATION SETS
2.4 ESTIMATORS
2.5 BIAS AND VARIANCE
2.6 MAXIMUM LIKELIHOOD
2.7 BAYESIAN STATISTICS
2.8 SUPERVISED AND UNSUPERVISED LEARNING
2.9 STOCHASTIC GRADIENT DESCENT
2.10 CHALLENGES MOTIVATING DEEP LEARNING
2.11 DEEP FEED FORWARD NETWORKS: LEARNING XOR
2.12 GRADIENT BASED LEARNING
2.13 HIDDEN UNITS
2.14 ARCHITECTURE DESIGN
2.15 BACK-PROPAGATION AND OTHER DIFFERENTIATION ALGORITHMS
9 PRACTICE QUIZ
10 ASSIGNMENTS
11 PART A QUESTIONS & ANSWERS (2 MARKS QUESTIONS)
12 PART B QUESTIONS
13 SUPPORTIVE ONLINE CERTIFICATION COURSES
14 REAL TIME APPLICATIONS
15 CONTENTS BEYOND THE SYLLABUS
16 PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
17 MINI PROJECT SUGGESTION


1. Course Objectives
The objectives of this course are:
1. To demonstrate the major technology trends driving Deep Learning.
2. To build, train and apply fully connected neural networks.
3. To implement efficient neural networks.
4. To analyze the key parameters and hyperparameters in a neural network’s architecture.
5. To apply concepts of Deep Learning to solve real-world problems.

2. Prerequisites
This course is intended for senior undergraduate and junior graduate students
who have a proper understanding of
• Python Programming Language
• Calculus
• Linear Algebra
• Probability Theory
Although it would be helpful, knowledge about classical machine learning is NOT
required.
3. Syllabus
UNIT I

Machine Learning: Basics and Underfitting, Hyperparameters and Validation Sets, Estimators, Bias and Variance, Maximum Likelihood, Bayesian Statistics, Supervised and Unsupervised Learning, Stochastic Gradient Descent, Challenges Motivating Deep Learning.

Deep Feedforward Networks: Learning XOR, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and other Differentiation Algorithms.

4. Course outcomes
1. Demonstrate the mathematical foundation of neural networks.
2. Describe the machine learning basics.
3. Differentiate architectures of deep neural networks.
4. Build a convolutional neural network.
5. Build and train RNNs and LSTMs.


5. CO-PO / PSO Mapping

Machine Tools | PO1 | PO2 | PO3 | PO4 | PO5 | PO6 | PO7 | PO8 | PO9 | PO10 | PO11 | PO12 | PSO1 | PSO2
CO1 3 2

CO2 3 2

CO3 3 3 2 2 3 2 2

CO4 3 3 2 2 3 2 2

CO5

6. Lesson Plan
Lecture No. | Week | Topics to be covered | References
1 | 1 | Introduction to Machine Learning: Basics and Underfitting, Hyperparameters and Validation Sets | T1
2 | 1 | Estimators, Bias and Variance, Maximum Likelihood, Bayesian Statistics | T1, R1
3 | 1 | Supervised and Unsupervised Learning, Stochastic Gradient Descent | T1, R1
4 | 1 | Challenges Motivating Deep Learning | T1, R1
5 | 2 | Deep Feedforward Networks: Learning XOR, Gradient-Based Learning | T1, R1
6 | 2 | Hidden Units | T1, R1
7 | 2 | Architecture Design | T1, R1
8 | 2 | Back-Propagation and other Differentiation Algorithms | T1, R1

7. Activity Based Learning


1. The DL course is associated with a laboratory: different open-ended problem statements are given to each student, who carries out the experiments using the Google Colab tool. You will learn the foundations of Deep Learning, understand how to build neural networks, and learn how to lead successful machine learning projects. You will learn about convolutional networks, RNNs, LSTMs, etc.

2. You will work on case studies from healthcare, autonomous driving, sign language reading, music generation, and natural language processing. You will master not only the theory, but also see how it is applied in industry.

8. Lecture Notes
INTRODUCTION TO MACHINE LEARNING
Introduction: Machine learning is essentially a form of applied statistics
with increased emphasis on the use of computers to statistically estimate
complicated functions and a decreased emphasis on proving confidence intervals
around these functions; we therefore present the two central approaches to
statistics: frequentist estimators and Bayesian inference. Most machine learning
algorithms can be divided into the categories of supervised learning and
unsupervised learning; we describe these categories and give some examples of
simple learning algorithms from each category. Most deep learning algorithms are
based on an optimization algorithm called stochastic gradient descent.

2.1 BASICS AND UNDER FITTING

The central challenge in machine learning is that we must perform well on new,
previously unseen inputs—not just those on which our model was trained. The ability
to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a training
set, we can compute some error measure on the training set called the training
error, and we reduce this training error. So far, what we have described is simply an
optimization problem. What separates machine learning from optimization is that
we want the generalization error, also called the test error, to be low as well.

The factors determining how well a machine learning algorithm will perform
are its ability to:

1. Make the training error small.
2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine
learning: underfitting and overfitting. Underfitting occurs when the model is
not able to obtain a sufficiently low error value on the training set. Overfitting
occurs when the gap between the training error and test error is too large. We
can control whether a model is more likely to overfit or underfit by altering its
capacity.
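
The effect of capacity can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: the quadratic ground truth, the noise level, the dataset sizes and the helper names `make_data` and `mse` are all hypothetical choices, and polynomial degree stands in for model capacity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data-generating process: a quadratic plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.1, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 2, 9):  # low, matching, and high capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree={degree}  train={results[degree][0]:.4f}  "
          f"test={results[degree][1]:.4f}")
```

The degree-1 model cannot reach a low training error (underfitting). Raising the degree can only lower the training error, so a growing gap between the training and test columns is the sign of overfitting to look for.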

2.2 HYPER PARAMETERS AND VALIDATION SETS

Most machine learning algorithms have several settings that we can use to
control the behavior of the learning algorithm. These settings are called
hyperparameters. The values of hyperparameters are not adapted by the
learning algorithm itself (though we can design a nested learning procedure
where one learning algorithm learns the best hyperparameters for another
learning algorithm).

Reasons for hyperparameters

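• Sometimes a setting is chosen to be a hyperparameter because it is too difficult to optimize.

• More frequently, the setting is a hyperparameter because it is not appropriate to learn it on the training set. This applies to all hyperparameters that control model capacity: if learned on the training set, they would always choose the maximum possible capacity, resulting in overfitting.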

Validation Set

• Because such hyperparameters cannot be learned on the training set, we use a validation set: examples that the training algorithm does not observe.
• Test examples should not be used to make choices about the model, including its hyperparameters.
• The training data is split into two disjoint parts:
– The first is used to learn the parameters.
– The other is the validation set, used to estimate the generalization error during or after training, allowing the hyperparameters to be updated accordingly.
– Typically 80% of the training data is used for training and 20% for validation.
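
The 80/20 split described above can be sketched as follows. This is a minimal illustration, assuming the data fit in NumPy arrays; the function name `train_validation_split` and the toy arrays are hypothetical.

```python
import numpy as np

def train_validation_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the examples and split them into two disjoint parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_val = int(round(val_fraction * len(X)))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Toy data: 50 examples with 2 features each (illustrative only).
X = np.arange(100, dtype=float).reshape(50, 2)
y = np.arange(50)

X_tr, y_tr, X_val, y_val = train_validation_split(X, y)
print(X_tr.shape, X_val.shape)
```

Shuffling before splitting avoids a validation set that only contains the last examples in storage order, which matters when the data are sorted.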

Cross-Validation

• When the data set is too small, dividing it into a fixed training set and a fixed test set is problematic if it results in a small test set

• A small test set implies statistical uncertainty around the estimated average test error

• It is then difficult to claim that algorithm A works better than algorithm B for a given task
• k-fold cross-validation addresses this:

– Partition the data into k non-overlapping subsets

– On trial i, the i-th subset of the data is used as the test set

– The rest of the data is used as the training set

k-fold Cross Validation

• Supply of data is limited

• All available data is partitioned into k groups (folds)


• k-1 groups are used for training, and the model is evaluated on the remaining group

• This is repeated for all k choices of the held-out group

• The performance scores from the k runs are averaged
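
The k-fold procedure can be sketched in a few lines. This is an illustrative implementation rather than a reference one: `k_fold_indices`, `cross_validate`, the straight-line model and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def k_fold_indices(n_examples, k, seed=0):
    # Shuffle the example indices and partition them into k disjoint folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_examples), k)

def cross_validate(X, y, k, fit, error):
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]  # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(error(model, X[test_idx], y[test_idx]))
    return np.array(scores)

# Illustrative model: a least-squares straight line.
def fit_line(X, y):
    return np.polyfit(X, y, 1)

def mse(coeffs, X, y):
    return float(np.mean((np.polyval(coeffs, X) - y) ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=60)
y = 2.0 * X + 1.0 + rng.normal(0.0, 0.1, size=60)

scores = cross_validate(X, y, k=5, fit=fit_line, error=mse)
print(scores, scores.mean())
```

Every example is used for testing exactly once and for training k-1 times, which is why the averaged score uses the limited data more fully than a single fixed split.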

Cross validation confidence

• The cross-validation algorithm returns a vector of errors e for the examples in D

– Its mean is the estimated generalization error

– The errors can be used to compute a confidence interval around the mean

• The 95% confidence interval centered on the mean $\hat{\mu}_m$ is

$(\hat{\mu}_m - 1.96\,\mathrm{SE}(\hat{\mu}_m),\ \hat{\mu}_m + 1.96\,\mathrm{SE}(\hat{\mu}_m))$
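
As a worked example of this interval, the sketch below computes it from a small vector of errors. The error values are made up for illustration, and the standard error is taken as the sample standard deviation divided by the square root of the number of errors, which is one common choice.

```python
import numpy as np

def confidence_interval_95(errors):
    errors = np.asarray(errors, dtype=float)
    mean = errors.mean()
    # Standard error of the mean: sample std (ddof=1) over sqrt(m).
    se = errors.std(ddof=1) / np.sqrt(len(errors))
    return mean - 1.96 * se, mean + 1.96 * se

errors = [0.10, 0.12, 0.08, 0.11, 0.09]  # hypothetical per-fold errors
low, high = confidence_interval_95(errors)
print(f"({low:.4f}, {high:.4f})")
```

The factor 1.96 relies on the mean being approximately normally distributed, so with only a handful of errors the interval should be read as a rough guide.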

Estimators

Estimation is a statistical term for finding some estimate of an unknown parameter, given some data. Point estimation is the attempt to provide the single best prediction of some quantity of interest.

Quantity of interest can be:

• A single parameter

• A vector of parameters, e.g., weights in linear regression

• A whole function

Point Estimation

To distinguish estimates of parameters from their true value, a point estimate of a parameter θ is represented by $\hat{\theta}$. Let {x(1), x(2), ..., x(m)} be m independent and identically distributed data points. Then a point estimator is any function of the data:

$\hat{\theta}_m = g(x^{(1)}, \dots, x^{(m)})$

Point estimation can also refer to estimation of the relationship between input and target variables, referred to as function estimation.


Function Estimation: Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x) that describes the approximate relationship between y and x. For example, we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x. In function estimation, we are interested in approximating f with a model or estimate $\hat{f}$. Function estimation is really just the same as estimating a parameter θ; the function estimator $\hat{f}$ is simply a point estimator in function space. Ex: in polynomial regression we are either estimating a parameter w or estimating a function mapping from x to y.

2.3 BIAS AND VARIANCE


Bias and variance measure two different sources of error in an
estimator. Bias measures the expected deviation from the true value
of the function or parameter. Variance, on the other hand, provides a
measure of the deviation from the expected estimator value that any
particular sampling of the data is likely to cause.

Bias

The bias of an estimator is defined as:

$\mathrm{bias}(\hat{\theta}_m) = E(\hat{\theta}_m) - \theta$

where the expectation is over the data (seen as samples from a random
variable) and θ is the true underlying value of θ used to define the
data generating distribution.

An estimator $\hat{\theta}_m$ is said to be unbiased if
$\mathrm{bias}(\hat{\theta}_m) = 0$, which implies that $E(\hat{\theta}_m) = \theta$.

Variance and Standard Error

The variance of an estimator is simply $\mathrm{Var}(\hat{\theta})$, where the
random variable is the training set. Alternatively, the square root of the
variance is called the standard error, denoted $\mathrm{SE}(\hat{\theta})$.
The variance or the standard error of an estimator provides a
measure of how we would expect the estimate we compute from
data to vary as we independently re-sample the dataset from the
underlying data generating process.

Just as we might like an estimator to exhibit low bias, we would
also like it to have relatively low variance.
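
Bias and standard error can be estimated empirically by re-sampling many datasets, as the definitions suggest. The sketch below compares two estimators of a Gaussian variance, one dividing by m and one dividing by m - 1. The distribution, the dataset size and the number of repetitions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0       # variance of the assumed Gaussian data-generating process
m = 5                # examples per dataset
n_datasets = 20000   # number of independently re-sampled datasets

samples = rng.normal(0.0, np.sqrt(true_var), size=(n_datasets, m))

biased = samples.var(axis=1, ddof=0)     # divides by m
unbiased = samples.var(axis=1, ddof=1)   # divides by m - 1

# Bias: average estimate over datasets minus the true value.
bias_biased = biased.mean() - true_var
bias_unbiased = unbiased.mean() - true_var

# Standard error: spread of the estimate across re-sampled datasets.
se_biased = biased.std(ddof=1)
se_unbiased = unbiased.std(ddof=1)

print(f"bias (divide by m):     {bias_biased:+.3f}, SE {se_biased:.3f}")
print(f"bias (divide by m - 1): {bias_unbiased:+.3f}, SE {se_unbiased:.3f}")
```

The estimator that divides by m is biased downward but has the smaller standard error, a small instance of the trade-off between the two sources of error.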

2.4 MAXIMUM LIKELIHOOD

Having discussed the definition of an estimator, let us now discuss a commonly used
estimator.

Maximum Likelihood Estimation can be defined as a method for estimating parameters
(such as the mean or variance) from sample data such that the probability (likelihood)
of obtaining the observed data is maximized.

Consider a set of m examples X = {x(1), . . . , x(m)} drawn independently from the true
but unknown data generating distribution Pdata(x). Let Pmodel(x; θ) be a parametric
family of probability distributions over the same space indexed by θ. In other words,
Pmodel(x; θ) maps any configuration x to a real number estimating the true probability
Pdata(x).
The maximum likelihood estimator for θ is then defined as:

$\theta_{\mathrm{ML}} = \arg\max_{\theta} P_{\text{model}}(X; \theta)$

Since we assumed the examples to be i.i.d., the above equation can be written in the
product form as:

$\theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{m} P_{\text{model}}(x^{(i)}; \theta)$

This product over many probabilities can be inconvenient for a variety of reasons. For
example, it is prone to numerical underflow. Also, to find the maxima/minima of this
function, we can take the derivative of this function
w.r.t. θ and equate it to 0. Since the terms appear in a product, we need to apply the
product rule repeatedly, which is quite cumbersome. To obtain a more convenient but
equivalent optimization problem, we observe that taking the logarithm of the likelihood
transforms the product into a sum. Since log is a strictly increasing function (the
natural log is a monotone transformation), it does not change the arg max and so
does not impact the resulting value of θ.

So we have:

$\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P_{\text{model}}(x^{(i)}; \theta)$
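
The following sketch illustrates both points made above, the numerical underflow of the product form and the use of the log-likelihood, for a Gaussian model with known unit variance. The data, the candidate grid and the function names are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=2000)  # hypothetical i.i.d. sample

def log_gaussian_pdf(x, mu, sigma=1.0):
    # Log density of a Gaussian with mean mu and standard deviation sigma.
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def log_likelihood(mu):
    # Sum of log probabilities: the log of the product form.
    return float(np.sum(log_gaussian_pdf(x, mu)))

# Product form: 2000 densities, each below 0.4, underflow to exactly 0.0.
likelihood_product = float(np.prod(np.exp(log_gaussian_pdf(x, 2.0))))

# Maximize the log-likelihood over a grid of candidate means.
candidates = np.linspace(0.0, 4.0, 401)
mu_hat = candidates[np.argmax([log_likelihood(mu) for mu in candidates])]

print("product of densities:", likelihood_product)
print("log-likelihood at mu=2:", log_likelihood(2.0))
print("grid MLE:", mu_hat, " sample mean:", x.mean())
```

For this model the maximum likelihood estimate of the mean coincides with the sample mean, so the grid search lands on the candidate closest to it.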


2.5 BAYESIAN STATISTICS

Bayesian statistics is a technique that assigns “degrees of belief,” or Bayesian
probabilities, to traditional statistical modeling. In this interpretation of statistics,
probability is calculated as the reasonable expectation of an event occurring based
upon currently known information. In other words, probability is a dynamic
quantity that can change as new information is gathered, rather than a fixed value
based upon frequency or propensity.

While not applicable to every deep learning technique, this statistical approach
affects three key fields of machine learning:

Statistical Inference
- Bayesian inference uses Bayesian probability to summarize evidence for the likelihood
of a prediction.

Statistical Modeling

- Bayesian statistics helps some models by classifying and specifying the prior
distributions of any unknown parameters.

Experiment Design

- By including the concept of “prior belief influence,” this technique uses
sequential analysis to factor in the outcome of earlier experiments when
designing new ones. These “beliefs” are updated through the prior and posterior
distributions.

While most machine learning models try to predict outcomes from large
datasets, the Bayesian approach is helpful for several classes of problems that
aren’t easily solved with other probability models. In particular:

- Databases with few data points for reference

- Models with strong prior intuitions from pre-existing observations

- Data with high levels of uncertainty, or when it’s necessary to
quantify the level of uncertainty across an entire model or compare
different models

- When a model generates a null hypothesis but it’s necessary to claim
something about the likelihood of the alternative hypothesis
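
A small worked example may help show how a belief is updated as data arrive. The sketch below uses a Beta prior over the success probability of a coin-like experiment, a standard conjugate choice that is an assumption here and not taken from the notes. The function names and the observation sequence are hypothetical.

```python
def update_beta(alpha, beta, observations):
    """Update a Beta(alpha, beta) belief with a sequence of 0/1 outcomes."""
    for outcome in observations:
        if outcome == 1:
            alpha += 1   # one more observed success
        else:
            beta += 1    # one more observed failure
    return alpha, beta

def beta_mean(alpha, beta):
    # Mean of a Beta(alpha, beta) distribution.
    return alpha / (alpha + beta)

prior = (1, 1)  # uniform prior: no initial preference
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]  # 7 successes, 3 failures

posterior = update_beta(*prior, data)
print("posterior:", posterior, "mean:", round(beta_mean(*posterior), 4))

# Updating in two stages gives the same posterior: yesterday's posterior
# acts as today's prior.
stage_one = update_beta(*prior, data[:5])
stage_two = update_beta(*stage_one, data[5:])
print("two-stage posterior:", stage_two)
```

With few data points the prior visibly pulls the estimate (0.667 instead of the raw frequency 0.7), which is the behavior that makes the approach useful for small datasets.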

Frequentist Statistics vs Bayesian Statistics

S.NO | Bayesian inference | Frequentist inference
1 | It uses probabilities for both hypotheses and data. | It doesn’t use or render probabilities of a hypothesis, i.e. no prior or posterior.
2 | It relies on the prior and likelihood of observed data. | It only counts on the likelihood for both observed and unobserved data.
3 | It demands an individual to learn or make a subjective prior. | It never seeks a prior.
4 | It had dominated statistical practice earlier than the 20th century. | It had dominated statistical practice at the time of the 20th century.

2.6 SUPERVISED AND UNSUPERVISED LEARNING

SUPERVISED LEARNING

Supervised learning is a machine learning approach that’s defined by
its use of labeled datasets. These datasets are designed to train or
“supervise” algorithms into classifying data or predicting outcomes
accurately. Using labeled inputs and outputs, the model can measure its
accuracy and learn over time.

Supervised learning can be separated into two types of problems
when data mining: classification and regression:


• Classification problems use an algorithm to accurately assign test data
into specific categories, such as separating apples from oranges. Or, in
the real world, supervised learning algorithms can be used to classify
spam in a separate folder from your inbox. Linear classifiers, support
vector machines, decision trees and random forests are all common
types of classification algorithms.
• Regression is another type of supervised learning method that uses an algorithm
to understand the relationship between dependent and independent variables.
Regression models are helpful for predicting numerical values based on different
data points, such as sales revenue projections for a given business. Some popular
regression algorithms are linear regression, polynomial regression and logistic
regression (which, despite its name, is typically used for classification).

UNSUPERVISED LEARNING
Unsupervised learning uses machine learning algorithms to analyze and
cluster unlabeled data sets. These algorithms discover hidden patterns in data
without the need for human intervention (hence, they are “unsupervised”).

Unsupervised learning models are used for three main tasks : clustering,
association and dimensionality reduction:

• Clustering is a data mining technique for grouping unlabeled data based on their
similarities or differences. For example, K-means clustering algorithms assign
similar data points into groups, where the K value is the number of groups and
therefore sets the granularity of the grouping. This technique is helpful for
market segmentation, image compression, etc.
• Association is another type of unsupervised learning method that uses different
rules to find relationships between variables in a given dataset. These methods are
frequently used for market basket analysis and recommendation engines, along
the lines of “Customers Who Bought This Item Also Bought” recommendations.
• Dimensionality reduction is a learning technique used when the number of features
(or dimensions) in a given dataset is too high. It reduces the number of data
inputs to a manageable size while also preserving the data integrity. Often, this
technique is used in the preprocessing data stage, such as when autoencoders
remove noise from visual data to improve picture quality.
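
To make the clustering task concrete, the sketch below implements a bare-bones K-means loop on two synthetic groups of points. It is a simplified illustration under stated assumptions (fixed number of iterations, random initial centers taken from the data, the hypothetical function name `k_means`), not a production implementation.

```python
import numpy as np

def k_means(points, k, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point to its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

# Two well-separated synthetic groups; no labels are given to the algorithm.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

centers, labels = k_means(points, k=2)
print(centers)
```

Here K is set to 2 because the data were generated with two groups; in practice K is a hyperparameter that has to be chosen.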
The main difference between supervised and unsupervised learning: Labeled data

The main distinction between the two approaches is the use of labeled
datasets. To put it simply, supervised learning uses labeled input and output data,
while an unsupervised learning algorithm does not.

In supervised learning, the algorithm “learns” from the training dataset by
iteratively making predictions on the data and adjusting for the correct answer.
While supervised learning models tend to be more accurate than unsupervised
learning models, they require upfront human intervention to label the data
appropriately. For example, a supervised learning model can predict how long
your commute will be based on the time of day, weather conditions and so on.
But first, you’ll have to train it to know that rainy weather extends the driving
time.

Unsupervised learning models, in contrast, work on their own to discover
the inherent structure of unlabeled data. Note that they still require some human
intervention for validating output variables. For example, an unsupervised
learning model can identify that online shoppers often purchase groups of
products at the same time. However, a data analyst would need to validate that
it makes sense for a recommendation engine to group baby clothes with an order
of diapers, applesauce and sippy cups.

Other key differences between supervised and unsupervised learning

• Goals: In supervised learning, the goal is to predict outcomes for new data. You
know up front the type of results to expect. With an unsupervised learning
algorithm, the goal is to get insights from large volumes of new data. The machine
learning itself determines what is different or interesting from the dataset.
• Applications: Supervised learning models are ideal for spam detection,
sentiment analysis, weather forecasting and pricing predictions, among other
things. In contrast, unsupervised learning is a great fit for anomaly detection,
recommendation engines, customer personas and medical imaging.
• Complexity: Supervised learning is a comparatively simple method for machine
learning, typically implemented with tools such as R or Python. In
unsupervised learning, you need powerful tools for working with large amounts of
unclassified data. Unsupervised learning models are computationally complex
because they need a large training set to produce intended outcomes.
• Drawbacks: Supervised learning models can be time-consuming to train, and
the labels for input and output variables require expertise. Meanwhile,
unsupervised learning methods can have wildly inaccurate results unless you have
human intervention to validate the output variables.

2.8 STOCHASTIC GRADIENT DESCENT

Gradient Descent in Brief

• Gradient Descent is a generic optimization algorithm capable of finding optimal
solutions to a wide range of problems.
• The general idea is to tweak parameters iteratively in order to minimize the cost
function.
• An important parameter of Gradient Descent (GD) is the size of the steps,
determined by the learning rate hyperparameter. If the learning rate is too small,
the algorithm will have to go through many iterations to converge, which
will take a long time; if it is too high, we may overshoot the optimal value.
Types of Gradient Descent:
• Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent

The word ‘stochastic’ means a system or process linked with a random
probability. Hence, in Stochastic Gradient Descent, samples are selected
randomly for each iteration instead of using the whole data set. In Gradient Descent,
the term “batch” denotes the number of samples from the dataset that are used
for calculating the gradient in each iteration. In typical Gradient Descent
optimization, like Batch Gradient Descent, the batch is taken to be the whole
dataset. Although using the whole dataset is really useful for getting to the
minima in a less noisy and less random manner, a problem arises when the
dataset gets big.
Suppose you have a million samples in your dataset. If you use a typical
Gradient Descent optimization technique, you will have to use all of the one
million samples to complete one iteration, and this has to be done for every
iteration until the minima are reached. Hence, it becomes computationally very
expensive to perform.
This problem is solved by Stochastic Gradient Descent. SGD uses only a
single sample, i.e., a batch size of one, to perform each iteration. The dataset is
randomly shuffled and a sample is selected for performing the iteration.

SGD algorithm:
* In SGD, we compute the gradient of the cost function for a single example at
each iteration, instead of the sum of the gradients of the cost function over all the
examples.
* Since only one sample from the dataset is chosen at random for each
iteration, the path taken by the algorithm to reach the minima is usually noisier
than that of the typical Gradient Descent algorithm. The noisy path matters little
as long as we reach the minima, and we do so with a significantly shorter
training time.

[Figure: path taken by Batch Gradient Descent]

[Figure: path taken by Stochastic Gradient Descent]

One thing to be noted is that, as SGD is generally noisier than typical Gradient
Descent, it usually takes a higher number of iterations to reach the minima,
because of the randomness in its descent. Even so, each iteration uses a single
sample rather than the whole dataset, so SGD is still computationally much less
expensive overall than typical Gradient Descent. Hence, in
most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a
learning algorithm.
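
The difference between the two update rules can be seen on a small linear regression problem. The sketch below is an illustration under stated assumptions: the synthetic data, the learning rates, the iteration counts and the function names are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(-1.0, 1.0, size=n)
y = 3.0 * X + 0.5 + rng.normal(0.0, 0.1, size=n)  # assumed true w=3, b=0.5

def gradient(w, b, xb, yb):
    # Gradient of the mean squared error over the batch (xb, yb).
    err = w * xb + b - yb
    return 2.0 * np.mean(err * xb), 2.0 * np.mean(err)

def batch_gd(lr=0.1, n_iters=200):
    w = b = 0.0
    for _ in range(n_iters):
        gw, gb = gradient(w, b, X, y)          # uses all n examples
        w, b = w - lr * gw, b - lr * gb
    return w, b

def sgd(lr=0.05, n_epochs=5, seed=1):
    order_rng = np.random.default_rng(seed)
    w = b = 0.0
    for _ in range(n_epochs):
        for i in order_rng.permutation(n):     # shuffle, then one example per step
            gw, gb = gradient(w, b, X[i:i + 1], y[i:i + 1])
            w, b = w - lr * gw, b - lr * gb
    return w, b

w_batch, b_batch = batch_gd()
w_sgd, b_sgd = sgd()
print("batch GD:", w_batch, b_batch)
print("SGD:     ", w_sgd, b_sgd)
```

Batch gradient descent touches all 1000 examples for each of its 200 updates, while SGD makes 5000 updates that each touch a single example, so SGD does far less gradient work in total and ends near the same solution, with a little residual noise.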

CHALLENGES MOTIVATING DEEP LEARNING


Topics in “Motivations”

• Shortcomings of conventional ML
1. The curse of dimensionality
2. Local constancy and smoothness regularization
3. Manifold learning

Curse of Dimensionality
• The number of possible distinct configurations of a set of variables increases
exponentially with the number of variables
– This poses a statistical challenge
• Ex: with one variable we track 10 regions of interest
– We need to track 100 regions with two variables
– 1000 regions with three variables
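
The growth in the example above follows directly from the exponent. The short sketch below reproduces the numbers and adds one assumed scenario, a fixed budget of 1000 training examples, to show how quickly the regions outnumber the data.

```python
def n_regions(values_per_variable, n_variables):
    # Each variable is divided into the same number of distinguishable values.
    return values_per_variable ** n_variables

n_examples = 1000  # hypothetical training set size

for d in (1, 2, 3, 10):
    regions = n_regions(10, d)
    # At best, each example lands in a different region.
    coverage = min(1.0, n_examples / regions)
    print(f"{d} variable(s): {regions} regions, "
          f"at most {coverage:.6%} of regions contain an example")
```

With ten variables, 1000 examples can cover at most one region in ten million, so most configurations are never observed during training.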


Local Constancy & Smoothness Regularization


• Prior beliefs
– To generalize well ML algorithms need prior beliefs
• Form of probability distributions over parameters
• Influencing the function itself, while parameters are influenced only
indirectly
• Algorithms biased towards preferring a class of functions
– These biases may not be expressed in terms of a probability distribution
• Most widely used prior is smoothness
– Also called local constancy prior
– States that the function we learn should not change very much within a
small region

Manifold Learning
• An important concept underlying many ideas in machine learning
• A manifold is a connected region
– Mathematically it is a set of points, each associated with a neighborhood
– From any given point, it locally appears to be a Euclidean space
• E.g., we experience the world as a 2-D plane while it is a spherical manifold in 3-D
space

DEEP FEED FORWARD NETWORKS: LEARNING XOR

A Feed Forward Neural Network is an artificial neural network in which the
connections between nodes do not form a cycle. It is the polar opposite of a
Recurrent Neural Network, in which some routes are cycled.
The feed-forward model is the basic type of neural network because the
input is only processed in one direction. The data always flows in one
direction and never backwards.

The XOR Problem

The XOR, or “exclusive or”, problem is a classic problem in ANN research. It is
the problem of using a neural network to predict the outputs of XOR logic
gates given two binary inputs. An XOR function should return a true value if
the two inputs are not equal and a false value if they are equal. All possible
inputs and predicted outputs are shown in figure 1.

XOR is a classification problem and one for which the expected outputs are
known in advance. It is therefore appropriate to use a supervised learning
approach.

On the surface, XOR appears to be a very simple problem. However, Minsky
and Papert (1969) showed that this was a big problem for neural network
architectures of the 1960s, known as perceptrons.

Perceptrons

Like all ANNs, the perceptron is composed of a network of units, which are
analogous to biological neurons. A unit can receive an input from other units.
On doing so, it takes the sum of all values received and decides whether it is
going to forward a signal on to other units to which it is connected. This is
called activation. The activation function uses some means or other to reduce
the sum of input values to a 1 or a 0 (or a value very close to a 1 or 0) in
order to represent activation or lack thereof. Another form of unit, known as
a bias unit, always activates, typically sending a hard coded 1 to all units to
which it is connected.

Perceptrons include a single layer of input units — including one bias unit —
and a single output unit (see figure 2). Here a bias unit is depicted by a
dashed circle, while other units are shown as blue circles. There are two non-
bias input units representing the two binary input values for XOR. Any number
of input units can be included.


The perceptron is a type of feed-forward network, which means the process of
generating an output, known as forward propagation, flows in one direction
from the input layer to the output layer. There are no connections between units in
the input layer. Instead, all units in the input layer are connected directly to the
output unit. A simplified explanation of the forward propagation process is that the
input values X1 and X2, along with the bias value of 1, are multiplied by their
respective weights W0..W2, and passed to the output unit. The output unit takes
the sum of those values and employs an activation function, typically the Heaviside
step function, to convert the resulting value to a 0 or 1, thus classifying the input
values as 0 or 1.

It is the setting of the weight variables that gives the network’s author control over
the process of converting input values to an output value. It is the weights that
determine where the classification line, the line that separates data points into
classification groups, is drawn. If all data points on one side of a classification line
are assigned the class of 0, all others are classified as 1.

A limitation of this architecture is that it is only capable of separating data points with
a single line. This is unfortunate because the XOR inputs are not linearly separable.
This is particularly visible if you plot the XOR input values to a graph. As shown in
figure 3, there is no way to separate the 1 and 0 predictions with a single
classification line.
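
The forward pass and the limitation just described can be checked with a short experiment. The sketch below implements the single-layer perceptron with a Heaviside step activation and searches a coarse grid of weights. The grid range and resolution are arbitrary assumptions, and the search is an illustration, not a proof.

```python
import itertools
import numpy as np

def heaviside(z):
    # Step activation: fire (1) when the weighted sum is non-negative.
    return 1 if z >= 0 else 0

def perceptron(x1, x2, w0, w1, w2):
    # w0 multiplies the bias input, which is always 1.
    return heaviside(w0 * 1 + w1 * x1 + w2 * x2)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_TARGETS = [0, 0, 0, 1]
XOR_TARGETS = [0, 1, 1, 0]

def classifies(weights, targets):
    return all(perceptron(x1, x2, *weights) == t
               for (x1, x2), t in zip(INPUTS, targets))

# Try every weight triple (W0, W1, W2) on a coarse grid.
grid = np.linspace(-2.0, 2.0, 21)
and_solutions = [w for w in itertools.product(grid, repeat=3)
                 if classifies(w, AND_TARGETS)]
xor_solutions = [w for w in itertools.product(grid, repeat=3)
                 if classifies(w, XOR_TARGETS)]

print("weight settings that solve AND:", len(and_solutions))
print("weight settings that solve XOR:", len(xor_solutions))
```

Many weight settings separate the AND outputs with a single line, while none on the grid separates the XOR outputs, consistent with XOR not being linearly separable.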


Multilayer Perceptrons

The solution to this problem is to expand beyond the single-layer architecture
by adding an additional layer of units without any direct access to the outside
world, known as a hidden layer. This kind of architecture (shown in Figure 4)
is another feed-forward network known as a multilayer perceptron (MLP).

It is worth noting that an MLP can have any number of units in its input,
hidden and output layers. There can also be any number of hidden layers. The
architecture used here is designed specifically for the XOR problem.


Similar to the classic perceptron, forward propagation begins with the input
values and bias unit from the input layer being multiplied by their respective
weights. In this case, however, there is a weight for each combination of input
(including the input layer’s bias unit) and hidden unit (excluding the hidden
layer’s bias unit). The products of the input layer values and their respective
weights are passed as input to the non-bias units in the hidden layer. Each
non-bias hidden unit applies an activation function (usually the classic sigmoid
function in the case of the XOR problem) to squash the sum of its input
values down to a value between 0 and 1 (usually a value very close
to either 0 or 1). The outputs of each hidden layer unit, including the bias unit,
are then multiplied by another set of respective weights and passed to an
output unit. The output unit also passes the sum of its input values through an
activation function (again, the sigmoid function is appropriate here) to
return an output value between 0 and 1. This is the predicted output.

This architecture, while more complex than that of the classic perceptron
network, is capable of achieving non-linear separation. Thus, with the right set
of weight values, it can provide the necessary separation to accurately classify
the XOR inputs.
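
To illustrate what "the right set of weight values" can look like, here is a small sketch of a 2-2-1 sigmoid network that classifies the XOR inputs. The weights are hypothetical values picked by hand for this example, not learned and not taken from the course material: one hidden unit behaves like OR, the other like NAND, and the output unit combines them like AND.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical hand-picked weights; each row is (bias weight, weight for x1, weight for x2).
HIDDEN_WEIGHTS = [
    (-10.0, 20.0, 20.0),   # hidden unit 1 behaves like OR
    (30.0, -20.0, -20.0),  # hidden unit 2 behaves like NAND
]
# (bias weight, weight for hidden unit 1, weight for hidden unit 2): behaves like AND.
OUTPUT_WEIGHTS = (-30.0, 20.0, 20.0)


def mlp_forward(x1, x2):
    """Forward propagation: inputs -> sigmoid hidden layer -> sigmoid output."""
    hidden = [sigmoid(b + w1 * x1 + w2 * x2) for b, w1, w2 in HIDDEN_WEIGHTS]
    b, v1, v2 = OUTPUT_WEIGHTS
    return sigmoid(b + v1 * hidden[0] + v2 * hidden[1])


# Round each prediction to the nearest class label.
xor_predictions = [
    round(mlp_forward(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]
]
```

Because the weights are large, every sigmoid output lands very close to 0 or 1, which matches the squashing behaviour described above.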


GRADIENT BASED LEARNING


Gradient Descent is an optimization algorithm used for minimizing the cost
function in various machine learning algorithms. It is basically used for
updating the parameters of the learning model.

Types of Gradient Descent:

Batch Gradient Descent: This variant processes all the training examples in
every iteration of gradient descent. If the number of training examples is
large, each iteration is computationally very expensive, so batch gradient
descent is not preferred. Instead, we prefer to use stochastic gradient descent
or mini-batch gradient descent.

Stochastic Gradient Descent: This variant processes one training example per
iteration, so the parameters are updated after every single example. Each
iteration is therefore much faster than an iteration of batch gradient descent.
However, when the number of training examples is large, processing only one
example at a time means the number of iterations is also very large, which can
add overhead for the system.

Mini-Batch Gradient Descent: This variant often works faster than both batch
gradient descent and stochastic gradient descent. Here b examples, where b < m,
are processed per iteration. So even if the number of training examples is
large, they are processed in batches of b training examples at a time. Thus, it
works for large training sets and needs fewer iterations than stochastic
gradient descent.

Variables used:
Let m be the number of training examples.
Let n be the number of features.

Note: if b == m, then mini-batch gradient descent behaves the same as
batch gradient descent.

Algorithm for batch gradient descent :


Let hθ(x) be the hypothesis for linear regression, and let Σ denote the sum
over all training examples from i = 1 to m. Then the cost function is given by:

$$J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$


Repeat {

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

(for every j = 0 … n)

}

Here α is the learning rate and $x_j^{(i)}$ represents the jth feature of the ith
training example. Every update sums over all m examples, so if m is very large
(e.g., 5 million training samples), it can take hours or even days to converge
to the global minimum. That is why batch gradient descent is not recommended
for large datasets: it slows down learning.
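
The update rule above can be sketched in plain Python for linear regression. The toy data, the learning rate, and the iteration count are assumptions chosen so that the example converges quickly; `predict` and `batch_gradient_descent` are illustrative names, not part of any library.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta_j * x_j, with x[0] == 1 as the bias feature."""
    return sum(t * xj for t, xj in zip(theta, x))


def batch_gradient_descent(X, y, learning_rate=0.1, iterations=2000):
    m, num_params = len(X), len(X[0])
    theta = [0.0] * num_params
    for _ in range(iterations):
        # One pass over ALL m examples is needed for a single parameter update.
        errors = [predict(theta, X[i]) - y[i] for i in range(m)]
        # Every theta_j is updated simultaneously from the same error vector.
        theta = [
            theta[j] - (learning_rate / m) * sum(errors[i] * X[i][j] for i in range(m))
            for j in range(num_params)
        ]
    return theta


# Toy data generated from y = 1 + 2x (an assumption for illustration only).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = batch_gradient_descent(X, y)
```

On this toy problem `theta` approaches (1, 2). The cost of each iteration grows linearly with m, which is the reason given above for avoiding this variant on very large datasets.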

Algorithm for stochastic gradient descent:


1) Randomly shuffle the data set so that the parameters can be trained evenly
for each type of data.
2) As mentioned above, take one example into consideration per iteration.

Hence, let $(x^{(i)}, y^{(i)})$ be the ith training example. Then

$$\mathrm{Cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

$$J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(\theta, (x^{(i)}, y^{(i)}))$$

Repeat {

For i = 1 to m {

$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

(for every j = 0 … n, where α is the learning rate)

}

}
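
A minimal sketch of this procedure follows, again for linear regression. The `batch_size` parameter is an addition for illustration: with `batch_size=1` the loop is stochastic gradient descent as described above, and a value b with 1 < b < m gives mini-batch gradient descent. The toy data, learning rate, epoch count, and random seed are all assumptions.

```python
import random


def sgd(X, y, learning_rate=0.05, epochs=500, batch_size=1, seed=0):
    rng = random.Random(seed)
    num_params = len(X[0])
    theta = [0.0] * num_params
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # step 1: visit the examples in a random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad = [0.0] * num_params
            for i in batch:
                error = sum(t * xj for t, xj in zip(theta, X[i])) - y[i]
                for j in range(num_params):
                    grad[j] += error * X[i][j]
            # step 2: update the parameters after each example (or each small batch)
            theta = [
                theta[j] - (learning_rate / len(batch)) * grad[j]
                for j in range(num_params)
            ]
    return theta


# Toy data generated from y = 1 + 2x (an assumption for illustration only).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta_sgd = sgd(X, y)                      # one example per update
theta_minibatch = sgd(X, y, batch_size=2)  # b = 2 examples per update
```

Both runs should end close to (1, 2) on this noise-free data. On noisy data the single-example updates would fluctuate around the minimum rather than settle exactly.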

HIDDEN UNITS

The design of hidden units is an extremely active area of research and does
not yet have many definitive guiding theoretical principles. Rectified linear
units are an excellent default choice of hidden unit.


We discuss the motivations behind the choice of hidden unit. It is usually
impossible to predict in advance which will work best. The design process
consists of trial and error, intuiting that a kind of hidden unit may work well,
and evaluating its performance on a validation set.

Some hidden units are not differentiable at all input points. For example, the
rectified linear function $g(z) = \max\{0, z\}$ is not differentiable at z = 0.
This may seem like it invalidates g for use with a gradient-based learning
algorithm. In practice, gradient descent still performs well enough for these
models to be used for machine learning tasks.

Most hidden units can be described as accepting a vector of inputs x,
computing an affine transformation $z = W^\top x + b$, and then applying
an element-wise nonlinear function g(z). Most hidden units are
distinguished from each other only by the choice of the form of the activation
function g(z).

Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function $g(z) = \max\{0, z\}$.
Rectified linear units are easy to optimize because they are so similar to
linear units.

The only difference from a linear unit is that a rectified linear unit outputs
0 across half its domain.

The derivative is 1 everywhere that the unit is active.

Thus the gradient direction is far more useful for learning than it would be
with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine
transformation: $h = g(W^\top x + b)$.

It is good practice to set all elements of b to a small positive value such as
0.1. This makes it likely that the rectified linear units will be initially
active for most training samples and allow derivatives to pass through.

ReLU vs other activations:

Sigmoid and tanh activation functions are hard to use in networks with many
layers because of the vanishing gradient problem.

ReLU mitigates the vanishing gradient problem, allowing models to learn
faster and perform better.


ReLU is the usual default activation function for MLPs and CNNs.

One drawback of rectified linear units is that they cannot learn via
gradient-based methods on examples for which their activation is zero.

Three generalizations of rectified linear units are based on using a non-zero
slope $\alpha_i$ when $z_i < 0$:
$h_i = g(z, \alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i)$.

Absolute value rectification fixes $\alpha_i = -1$ to obtain g(z) = |z|. It is
used for object recognition from images.

A leaky ReLU fixes $\alpha_i$ to a small value like 0.01.

A parametric ReLU (PReLU) treats $\alpha_i$ as a learnable parameter.
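
The three generalizations share one formula and differ only in how the slope is chosen, which the following sketch makes explicit for a scalar input. The function names are illustrative, not from a library. In a parametric ReLU the `alpha` argument would be a parameter updated during training, which is not shown here.

```python
def generalized_relu(z, alpha):
    """g(z, alpha) = max(0, z) + alpha * min(0, z) for a scalar z."""
    return max(0.0, z) + alpha * min(0.0, z)


def relu(z):
    # Plain ReLU is the special case with zero slope for negative inputs.
    return generalized_relu(z, 0.0)


def absolute_value_rectification(z):
    # alpha = -1 flips the negative part, giving |z|.
    return generalized_relu(z, -1.0)


def leaky_relu(z, alpha=0.01):
    # A small fixed slope keeps a non-zero gradient when z < 0.
    return generalized_relu(z, alpha)
```

For example, an input of -2.0 maps to 0 under `relu`, to 2.0 under `absolute_value_rectification`, and to -0.02 under `leaky_relu`, while positive inputs pass through all three unchanged.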

Logistic Sigmoid and Hyperbolic Tangent

Most neural networks used the logistic sigmoid activation function prior to
rectified linear units.

$g(z) = \sigma(z)$

or the hyperbolic tangent activation function

$g(z) = \tanh(z)$

These activation functions are closely related because
$\tanh(z) = 2\sigma(2z) - 1$.

We have already seen sigmoid units as output units, used to predict the
probability that a binary variable is 1.

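Sigmoidal units saturate across most of their domain.

They saturate to a high value when z is very positive and to a low value when
z is very negative (1 and 0 for the logistic sigmoid).

They are strongly sensitive to their input only when z is near 0.

This widespread saturation makes gradient-based learning difficult.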
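
Both points can be checked numerically. The sketch below compares tanh(z) with 2σ(2z) − 1 at a few sample points and evaluates the sigmoid's slope near zero and in the saturated region. The sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# Largest gap between tanh(z) and 2*sigmoid(2z) - 1 over a few sample points.
samples = [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]
max_identity_gap = max(
    abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in samples
)

# The sigmoid's slope is largest at z = 0 and nearly zero once the unit saturates.
slope_at_zero = sigmoid_derivative(0.0)
slope_when_saturated = sigmoid_derivative(10.0)
```

The gap is at the level of floating-point rounding, the slope at zero is 0.25, and the slope at z = 10 is several thousand times smaller. That tiny slope is what shrinks the gradient when a unit saturates.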

Hyperbolic tangent typically performs better than logistic sigmoid. It
resembles the identity function more closely, in the sense that tanh(0) = 0
while σ(0) = 1/2. Because tanh is similar to the identity function near 0,
training a deep neural network
$\hat{y} = w^\top \tanh(U^\top \tanh(V^\top x))$ resembles training a linear
model $\hat{y} = w^\top U^\top V^\top x$ so long as the activations of the
network can be kept small.

ARCHITECTURE DESIGN

The word architecture refers to the overall structure of the network: how many
units it should have and how these units should be connected to each other.

Generic Neural Architecture

Most neural networks are organized into groups of units called layers. Most
neural network architectures arrange these layers in a chain structure, with
each layer being a function of the layer that preceded it. In this structure, the
first layer is given by

$$h^{(1)} = g^{(1)}\left( W^{(1)\top} x + b^{(1)} \right)$$

the second layer is given by

$$h^{(2)} = g^{(2)}\left( W^{(2)\top} h^{(1)} + b^{(2)} \right)$$

In these chain-based architectures, the main architectural considerations are to
choose the depth of the network and the width of each layer.
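
The chain structure can be sketched as a loop in which each layer consumes the output of the previous one. In this illustrative example the depth is the number of (W, b) pairs and the width of a layer is the length of its bias vector. The weights are arbitrary values, and ReLU is assumed as the activation for every layer.

```python
def relu(z):
    return max(0.0, z)


def dense_layer(W, b, inputs, activation):
    """Compute g(W^T x + b); W[i][j] is the weight from input i to unit j."""
    outputs = []
    for j in range(len(b)):
        z = b[j] + sum(W[i][j] * inputs[i] for i in range(len(inputs)))
        outputs.append(activation(z))
    return outputs


def forward(layers, x):
    """Apply the layers in order, so each layer is a function of the one before it."""
    h = x
    for W, b in layers:
        h = dense_layer(W, b, h, relu)
    return h


# Depth 2, with widths 3 and 1 (weights are arbitrary illustrative values).
layers = [
    ([[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]], [0.0, 0.0, 0.0]),  # W1 has shape 2 x 3
    ([[1.0], [1.0], [1.0]], [0.5]),                            # W2 has shape 3 x 1
]
output = forward(layers, [1.0, 2.0])
```

For the input (1, 2), the first layer yields (2, 1, 0) after rectification and the second layer yields 3.5. Changing the depth means adding or removing entries in `layers`, and changing a width means resizing the matching weight matrix and bias vector.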

Universal Approximation Properties and Depth

A feed-forward network with a single hidden layer containing a finite number
of neurons can approximate continuous functions on compact subsets of
$\mathbb{R}^n$, under mild assumptions on the activation function.


Simple neural networks can represent a wide variety of interesting functions
when given appropriate parameters.

However, this approximation result does not address the algorithmic
learnability of those parameters.
The universal approximation theorem means that regardless of what function
we are trying to learn, we know that a large MLP will be able to represent this
function. However, we are not guaranteed that the training algorithm will be
able to learn that function. Even if the MLP is able to represent the function,
learning can fail for two different reasons.

The optimization algorithm may not be able to find the value of the parameters
that corresponds to the desired function.

The training algorithm might choose the wrong function due to overfitting.

The universal approximation theorem says that there exists a network large
enough to achieve any degree of accuracy we desire, but the theorem does
not say how large this network will be. Some bounds are available on the size
of a single-layer network needed to approximate a broad class of functions.
Unfortunately, in the worst case, an exponential number of hidden units may
be required. This is easiest to see in the binary case: the number of possible
binary functions on vectors $v \in \{0,1\}^n$ is $2^{2^n}$, and selecting one
such function requires $2^n$ bits, which will in general require $O(2^n)$
degrees of freedom.
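
The counting argument can be checked by brute force for small n. The sketch below enumerates every truth table over n binary inputs. The function name is illustrative, and the enumeration is only feasible for very small n precisely because the count grows doubly exponentially.

```python
from itertools import product


def count_binary_functions(n):
    """Count functions {0,1}^n -> {0,1} by enumerating every possible truth table."""
    inputs = list(product([0, 1], repeat=n))            # 2**n input vectors
    truth_tables = product([0, 1], repeat=len(inputs))  # one output bit per input vector
    return sum(1 for _ in truth_tables)


counts = {n: count_binary_functions(n) for n in (1, 2, 3)}
```

Each truth table has 2^n entries, one bit per input vector, so choosing one function takes 2^n bits and there are 2^(2^n) functions in total: 4, 16, and 256 for n = 1, 2, 3.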

A feedforward network with a single hidden layer is sufficient to represent any
function, but the layer may be infeasibly large and may fail to generalize
correctly. Using deeper models can reduce the number of units required and
reduce generalization error.

BACKPROPAGATION AND OTHER DIFFERENTIATION ALGORITHMS

Backpropagation is the essence of neural network training. It is the method of
fine-tuning the weights of a neural network based on the error rate obtained in
the previous epoch (i.e., iteration). Proper tuning of the weights allows you
to reduce error rates and make the model more reliable by improving its
generalization.

Backpropagation in neural networks is short for “backward propagation of
errors.” It is a standard method of training artificial neural networks. This
method helps calculate the gradient of a loss function with respect to all the
weights in the network.


The backpropagation algorithm computes the gradient of the loss function with
respect to each individual weight by the chain rule. It computes the gradient
efficiently, one layer at a time, unlike a naive direct computation. It
computes the gradient, but it does not define how the gradient is used. It
generalizes the computation in the delta rule.

1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually
randomly selected.
3. Calculate the output of every neuron, from the input layer, through the
hidden layers, to the output layer.
4. Calculate the error in the outputs:

Error = Actual Output - Desired Output

5. Travel back from the output layer to the hidden layers to adjust the
weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
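
The steps above can be sketched for a very small network with two inputs, two sigmoid hidden units, and one sigmoid output. Everything specific here is an assumption for illustration: the squared-error loss, the initial weights, the single training example, and the learning rate. The code performs one forward pass, one backward pass using the chain rule, and one weight update.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Step 3: compute hidden activations and the output of a 2-2-1 sigmoid network."""
    W1, b1, W2, b2 = params
    h = [sigmoid(b1[j] + sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    out = sigmoid(b2 + sum(W2[j] * h[j] for j in range(2)))
    return h, out


def loss(params, x, target):
    _, out = forward(params, x)
    return 0.5 * (out - target) ** 2


def backprop(params, x, target):
    """Steps 4 and 5: gradients of the loss for every parameter, via the chain rule."""
    W1, b1, W2, b2 = params
    h, out = forward(params, x)
    # Output layer: (actual - desired) times the sigmoid derivative out * (1 - out).
    delta_out = (out - target) * out * (1.0 - out)
    grad_W2 = [delta_out * h[j] for j in range(2)]
    grad_b2 = delta_out
    # Hidden layer: send delta_out back through W2, then through each sigmoid.
    delta_h = [delta_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(2)]
    grad_W1 = [[delta_h[j] * x[i] for i in range(2)] for j in range(2)]
    grad_b1 = delta_h
    return grad_W1, grad_b1, grad_W2, grad_b2


def update(params, grads, learning_rate):
    """Move every weight a small step against its gradient."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return (
        [[W1[j][i] - learning_rate * gW1[j][i] for i in range(2)] for j in range(2)],
        [b1[j] - learning_rate * gb1[j] for j in range(2)],
        [W2[j] - learning_rate * gW2[j] for j in range(2)],
        b2 - learning_rate * gb2,
    )


# Step 2: arbitrary starting weights (W1, b1, W2, b2), fixed here for reproducibility.
params = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)
x, target = [1.0, 0.0], 1.0
loss_before = loss(params, x, target)
params_after = update(params, backprop(params, x, target), learning_rate=0.5)
loss_after = loss(params_after, x, target)
```

After the single update the loss on this example is lower than before. Step 6 corresponds to repeating the forward pass, backward pass, and update over many examples and epochs.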

The most prominent advantages of backpropagation are:

- Backpropagation is fast, simple and easy to program
- It has no parameters to tune apart from the number of inputs
- It is a flexible method as it does not require prior knowledge about the
network
- It is a standard method that generally works well
- It does not need any special mention of the features of the function to be
learned.

Types of Backpropagation Networks

The two types of backpropagation networks are:

• Static backpropagation
• Recurrent backpropagation

Static backpropagation:
It is a kind of backpropagation network that produces a mapping from a
static input to a static output. It is useful for solving static classification
problems such as optical character recognition.

Recurrent backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value
is reached. After that, the error is computed and propagated backward.

The main difference between the two methods is that the mapping is rapid and
static in static backpropagation, while it is non-static in recurrent
backpropagation.

History of Backpropagation

• In the early 1960s, the basic concepts of continuous backpropagation were
derived in the context of control theory by Henry J. Kelley and
Arthur E. Bryson.
• In 1969, Bryson and Ho gave a multi-stage dynamic system
optimization method.
• In 1974, Werbos stated the possibility of applying this principle in an
artificial neural network.
• In 1982, Hopfield brought his idea of a neural network.
• In 1986, by the effort of David E. Rumelhart, Geoffrey E. Hinton, Ronald
J. Williams, backpropagation gained recognition.
• In 1993, Wan was the first person to win an international pattern
recognition contest with the help of the backpropagation method.

OTHER DIFFERENTIATION ALGORITHMS

Automatic Differentiation

• The deep learning community has been somewhat isolated from the broader
computer science community that works on automatic differentiation
• The back-propagation algorithm is only one approach to automatic
differentiation
• It is a special case of a broader class of techniques called reverse
mode accumulation
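
As a rough illustration of reverse mode accumulation outside a neural network, the sketch below records a small computation graph for scalars and then walks it backward from the output, accumulating derivatives with the chain rule. The `Var` class and `backward` function are hypothetical helpers written for this example, supporting only addition and multiplication; they are not the API of any particular library.

```python
class Var:
    """A scalar value that remembers how it was computed (hypothetical helper class)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        # Post-order traversal: a node is appended only after all of its parents.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Reversed post-order handles each node before the nodes it was computed from.
    for node in reversed(order):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


# f(x, y) = x * y + x, so df/dx = y + 1 and df/dy = x.
x, y = Var(3.0), Var(4.0)
f = x * y + x
backward(f)
```

A single backward sweep yields the derivative with respect to every input at once (here 5 for x and 3 for y). The same property makes the reverse mode a natural fit for networks with many weights and a single scalar loss.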

Computational Complexity

• In general, determining the order of evaluation that results in
the lowest computational cost is a difficult problem
• Finding the optimal sequence of operations to compute the gradient
is NP-complete (Naumann, 2008), in the sense that it may require
simplifying algebraic expressions into their least expensive form

Future differentiation technology

• Backprop is neither the only way nor the optimal way of computing the
gradient, but it is a practical method for deep learning
• In the future, differentiation technology for deep networks may
improve with advances in the broader field of automatic differentiation

Practice Quiz

1. Which of the following CANNOT be achieved by using machine learning?

A) Forecast the outcome variable into the future

B) Accurately predict the outcome using supervised learning algorithms

C) Prove causal relationships between variables

D) Classify respondents into groups based on their response pattern

2. Algorithms is _ oriented Elements of the object model.

A) Procedure-oriented

B) Object-oriented

C) Logic-oriented

D) Rule-oriented

3. Machine Learning is a field of AI consisting of learning algorithms
that ..............

A) At executing some task

B) Over time with experience

C) Improve their performance

D) All of the above

4. Machine learning algorithms build a model based on sample data,
known as .................

A) Training Data

B) Transfer Data

C) Data Training

D) None of the above

5. Machine learning is a subset of ................

A) Deep Learning

B) Artificial Intelligence

C) Data Learning

D) None of the above

6. .................... algorithms enable computers to learn from data, and
even improve themselves, without being explicitly programmed.

A) Deep Learning

B) Machine Learning

C) Artificial Intelligence

D) None of the above

7. What are the three types of Machine Learning?

A) Supervised Learning

B) Unsupervised Learning

C) Reinforcement Learning

D) All of the above

8. Which of the following is not a supervised learning method?

A) PCA

B) Naive Bayesian

C) Linear Regression

D) Decision Tree

9. Which is true for neural networks?

A) It is a set of nodes and connections

B) Each node computes its weighted input

C) Node could be in excited state or non-excited state

D) All of the above
