Unit-3 Notes

Uploaded by

sipik50968
Copyright
© © All Rights Reserved
UNIT III

Introduction to Deep Learning: Historical Trends in Deep Learning, Deep Feedforward
Networks, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation
and Other Differentiation Algorithms

Introduction

Deep learning is a subset of machine learning that uses mathematical
functions to map inputs to outputs. These functions can extract non-redundant
information or patterns from the data, which enables them to form a
relationship between the input and the output. Forming this relationship is
known as learning, and the process of learning is called training.

In traditional computer programming, input and a set of rules are combined
to get the desired output. In machine learning and deep learning, inputs and
outputs are used to derive the rules. These rules, when combined with new
input, yield the desired results.

Modern deep learning models use artificial neural networks, or simply neural
networks, to extract information.

These neural networks are built from simple mathematical functions that can
be stacked on top of each other and arranged in layers, giving them a sense
of depth, hence the term deep learning.

Deep learning can also be thought of as an approach to artificial intelligence: a combination of
hardware and software used to solve tasks that require human intelligence.
Importance of Deep Learning

Deep learning algorithms play a crucial role in determining features
automatically and can handle large amounts of data, whether structured or
unstructured. However, deep learning can be overkill for some tasks, because
these algorithms need access to huge amounts of data to function effectively.
For example, ImageNet, a popular dataset for image recognition, contains more
than 14 million images. It is a highly comprehensive resource that has set a
benchmark for deep learning tools that work with images.

Deep learning algorithms learn about an image, such as those discussed above,
by passing it through each layer of a neural network. The early layers are
sensitive to low-level features of the image, such as edges, and the later
layers combine this information to form more holistic representations. For
example, a middle layer might learn to detect particular parts of an object in
the photograph, while deeper layers learn to detect whole objects such as
dogs, trees, and utensils.

However, for simple tasks that involve less complexity and limited data, deep
learning algorithms often fail to generalize well. This is one of the main
reasons deep learning is not considered as effective as linear or boosted-tree
models on such tasks. Simple models are typically used to work with custom
data, track fraudulent transactions, and handle less complex datasets with
fewer features. There are also cases, such as multiclass classification on
smaller but more structured datasets, where deep learning can be effective but
is usually not preferred.

Why Deep Learning

Applications of Deep Learning:

Computer vision:

In computer vision, deep learning models enable machines to identify and
understand visual data. Some of the main applications of deep learning in
computer vision include:

Object detection and recognition: Deep learning models can be used to identify
and locate objects within images and videos, which supports applications such
as self-driving cars, surveillance, and robotics.

Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications
such as medical imaging, quality control, and image retrieval.

Image segmentation: Deep learning models can be used to segment images into
different regions, making it possible to identify specific features within
images.
Natural language processing (NLP):

In NLP, deep learning models enable machines to understand and generate human
language. Some of the main applications of deep learning in NLP include:

Automatic text generation: Deep learning models can learn from a corpus of
text, and the trained models can then automatically generate new text such as
summaries and essays.

Language translation: Deep learning models can translate text from one
language to another, making it possible to communicate with people from
different linguistic backgrounds.

Sentiment analysis: Deep learning models can analyze the sentiment of a piece
of text, making it possible to determine whether the text is positive, negative, or
neutral. This is used in applications such as customer service, social media
monitoring, and political analysis.

Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion,
voice search, and voice-controlled devices.

Reinforcement learning:

In reinforcement learning, deep learning is used to train agents to take
actions in an environment so as to maximize a reward. Some of the main
applications of deep learning in reinforcement learning include:

Game playing: Deep reinforcement learning models have been able to beat
human experts at games such as Go and chess, and at Atari video games.

Robotics: Deep reinforcement learning models can be used to train robots to
perform complex tasks such as grasping objects, navigation, and manipulation.

Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.
Historical Trends in Deep Learning

Deep learning has seen three waves of development. The first wave started
with cybernetics in the 1940s-1960s, with the development of theories of
biological learning and implementations of the first models, such as the
perceptron, which allowed the training of a single neuron. The second wave
started with the connectionist approach of the 1980-1995 period, which used
back-propagation to train neural networks with one or two hidden layers. The
current, third wave, deep learning, started around 2006.
[Figure: Deep Learning History Timeline]
Deep Feedforward Networks
Introduction
• Deep feedforward neural networks are also known as multilayer perceptrons (MLPs).
• The goal is to approximate a function f*(x) by learning a mapping y = f(x; θ), where θ are
the parameters to be learned by the model.
• The model composes together many different functions, and this composition can be
represented by a directed acyclic graph (DAG).
• The final layer of the model is called the output layer, while the intermediate layers
are called hidden layers.

Learning XOR
The XOR function (“exclusive or”) is an operation on two binary values, x1 and x2. When
exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it
returns 0. The XOR function provides the target function

y = f∗(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning
algorithm will adapt the parameters θ to make f as similar as possible to f∗.

We want our network to perform correctly on the four points X = {[0, 0], [0, 1], [1, 0], [1, 1]}.
We will train the network on all four of these points. The only challenge is to fit the
training set.

We can treat this problem as a regression problem and use a mean squared error loss
function. In practical applications, MSE is usually not an appropriate cost function for
modeling binary data.

Evaluated on our whole training set, the MSE loss function is

$$J(\theta) = \frac{1}{4} \sum_{x \in X} \left( f^*(x) - f(x; \theta) \right)^2$$

Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to
be

$$f(x; w, b) = x^\top w + b$$

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5
everywhere. Why does this happen? A linear model is not able to represent the XOR function. One
way to solve this problem is to use a model that learns a different feature space in which a linear
model is able to represent the solution.
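
The closed-form result for the linear model can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names (`A`, `theta`, `predictions`) are illustrative and not part of the original notes. It appends a column of ones to the inputs so that the bias is solved for together with the weights.

```python
import numpy as np

# The four XOR inputs (one example per row) and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is treated as an extra weight.
A = np.hstack([X, np.ones((4, 1))])

# Normal equations: (A^T A) theta = A^T y, with theta = [w1, w2, b].
theta = np.linalg.solve(A.T @ A, A.T @ y)
w, b = theta[:2], theta[2]

# The fitted linear model predicts the same value for every input.
predictions = A @ theta
mse = np.mean((predictions - y) ** 2)
print("w =", w, "b =", b, "MSE =", mse)
```

The solver returns weights of zero and a bias of one half, so the best linear fit ignores the input entirely and is left with a nonzero error on every example.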

Specifically, we will introduce a very simple feedforward network with one hidden layer
containing two hidden units.

This feedforward network has a vector of hidden units h that are computed by a function
$f^{(1)}(x; W, c)$. The values of these hidden units are then used as the input for a second layer.
The second layer is the output layer of the network. The output layer is still just a linear
regression model, but now it is applied to h rather than to x. The network now contains two
functions chained together: $h = f^{(1)}(x; W, c)$ and $y = f^{(2)}(h; w, b)$, with
the complete model being $f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x))$.

What function should $f^{(1)}$ compute? Linear models have served us well so far, and it may
be tempting to make $f^{(1)}$ linear as well. Unfortunately, if $f^{(1)}$ were linear, then the feedforward
network as a whole would remain a linear function of its input. We must therefore use a nonlinear
function to describe the features. Most neural networks do so using an affine transformation
controlled by learned parameters, followed by a fixed nonlinear function called an activation
function. We use that strategy here, by defining $h = g(W^\top x + c)$,
where W provides the weights of a linear transformation and c the biases.

Because we describe an affine transformation from a vector x to a vector h, an entire vector of bias
parameters is needed. The activation function g is typically chosen to be a function that is applied
element-wise, with $h_i = g(x^\top W_{:,i} + c_i)$. In modern neural networks, the default recommendation is
to use the rectified linear unit, or ReLU, defined by the activation function g(z) = max{0, z}.

We can now specify our complete network as

$$f(x; W, c, w, b) = w^\top \max\{0, W^\top x + c\} + b$$

We can now specify a solution to the XOR problem. Let

$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix},$$

and b = 0.

We can now walk through the way that the model processes a batch of inputs. Let X be the design
matrix containing all four points in the binary input space, with one example per row:

$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$$

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

$$XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$

Next, we add the bias vector c to each row, to obtain

$$\begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output
needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a
function. To finish computing the value of h for each example, we apply the rectified linear
transformation:

$$\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$

This transformation has changed the relationship between the examples. They no longer lie on a
single line. They now lie in a space where a linear model can solve the problem. We finish by
multiplying by the weight vector w:

$$\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$

The neural network has obtained the correct answer for every example in the batch.

In this example, we simply specified the solution, then showed that it obtained zero error. In a real
situation, there might be billions of model parameters and billions of training examples, so one cannot
simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find
parameters that produce very little error.
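
The batch walk-through above can be reproduced in a few lines. This is a minimal sketch, assuming NumPy; the parameter values W, c, w and b used here are those of the standard textbook solution to this example and should be read as an assumption, and the function name `xor_net` is hypothetical.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(0.0, z)

# Assumed hand-picked parameters for the one-hidden-layer XOR network.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # first-layer weights
c = np.array([0.0, -1.0])    # first-layer biases
w = np.array([1.0, -2.0])    # output-layer weights
b = 0.0                      # output-layer bias

# Design matrix with one example per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def xor_net(X):
    """Compute f(x) = w^T max{0, W^T x + c} + b for every row of X."""
    H = relu(X @ W + c)      # hidden representation, one row per example
    return H @ w + b

outputs = xor_net(X)
print(outputs)
```

Each row of `X @ W + c` lies on a single line before the rectification; after it, the rows are linearly separable and the output layer reproduces XOR exactly.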

Gradient-Based Learning
As with other machine learning models, to apply gradient-based learning we must choose a
cost function, and we must choose how to represent the output of the model. The largest
difference between simple machine learning models and neural networks is that the
nonlinearity of a neural network causes most interesting loss functions to become
non-convex. This means that neural networks are usually trained by using iterative,
gradient-based optimizers that merely drive the cost function to a very low value, rather
than the exact linear equation solvers used to train linear regression models or the convex
optimization algorithms used for logistic regression or SVMs.
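
To make the idea of an iterative, gradient-based optimizer concrete, here is a minimal sketch of plain gradient descent, assuming NumPy. It is applied to a linear model on the XOR data only because that problem is convex and the answer is easy to check; the starting point, the learning rate of 0.1 and the step count are arbitrary assumptions. For a neural network the same loop would be used, but it would only drive the cost to a low value rather than to a guaranteed minimum.

```python
import numpy as np

# XOR inputs and targets, reused here as a tiny regression problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def mse_and_grads(w, b):
    """Return the MSE of the linear model X @ w + b and its gradients."""
    residual = X @ w + b - y
    loss = np.mean(residual ** 2)
    grad_w = 2.0 * X.T @ residual / len(y)
    grad_b = 2.0 * np.mean(residual)
    return loss, grad_w, grad_b

w = np.array([1.0, -1.0])   # arbitrary starting point (assumption)
b = 0.0
learning_rate = 0.1          # assumed step size

for step in range(5000):
    loss, grad_w, grad_b = mse_and_grads(w, b)
    w -= learning_rate * grad_w   # move against the gradient
    b -= learning_rate * grad_b

final_loss, _, _ = mse_and_grads(w, b)
print("w =", w, "b =", b, "loss =", final_loss)
```

The loop only ever uses the local gradient, which is why the same procedure carries over to non-convex neural network losses where no closed-form solver exists.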

Cost Functions
A cost function is an important measure of how well a machine learning model
performs on a given dataset. It calculates the difference between the expected value and
the predicted value and represents it as a single real number.

Types of Cost Function

1. Regression Cost Functions

• Mean Error
• Mean Squared Error
• Mean Absolute Error

2. Binary Classification Cost Functions

3. Multi-class Classification Cost Functions
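
The three regression cost functions listed above can be sketched as follows, assuming NumPy. The function names and the toy values are illustrative, and the error is taken as expected minus predicted, which is one common sign convention rather than the only one.

```python
import numpy as np

def mean_error(y_true, y_pred):
    """Average signed error; positive and negative errors can cancel."""
    return np.mean(y_true - y_pred)

def mean_squared_error(y_true, y_pred):
    """Average squared error; penalizes large errors more heavily."""
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    """Average absolute error; errors cannot cancel."""
    return np.mean(np.abs(y_true - y_pred))

# Toy expected and predicted values (made up for illustration).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

me = mean_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(me, mse, mae)
```

Each function reduces the whole set of prediction errors to a single real number, which is what makes it usable as a training objective.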

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the
principle of maximum likelihood. This means we use the cross-entropy between the training
data and the model's predictions as the cost function.

Sometimes, rather than predicting a complete probability distribution over y, we merely
predict some statistic of y conditioned on x. Specialized loss functions allow us to train a
predictor of these estimates.

The total cost function used to train a neural network will often combine one of the primary
cost functions described here with a regularization term.

Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost
function is simply the negative log-likelihood, equivalently described as the cross-entropy
between the training data and the model distribution. This cost function is given by:

$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$$

The specific form of the cost function changes from model to model, depending on the
specific form of $\log p_{\text{model}}$.

An advantage of this approach of deriving the cost function from maximum likelihood is that
it removes the burden of designing cost functions for each model. Specifying a model p(y | x)
automatically determines a cost function log p(y | x).
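
As an illustration of how a model choice fixes the cost function, the sketch below assumes a Bernoulli model for a binary label, in which case the negative log-likelihood is the familiar binary cross-entropy. It assumes NumPy; the function name `bernoulli_nll` and the example probabilities are hypothetical.

```python
import numpy as np

def bernoulli_nll(y, p):
    """Average negative log-likelihood of binary labels y when the model
    assigns probability p to the event y = 1 (binary cross-entropy)."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    log_likelihood = y * np.log(p) + (1.0 - y) * np.log(1.0 - p)
    return -np.mean(log_likelihood)

labels = [1, 0, 1]

# Predicted probabilities of y = 1 from two made-up models.
nll = bernoulli_nll(labels, [0.9, 0.2, 0.8])
nll_confident = bernoulli_nll(labels, [0.99, 0.01, 0.99])
print(nll, nll_confident)
```

A model that puts more probability on the observed labels gets a lower cost, so minimizing this quantity is the same as maximizing the likelihood of the training data.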

Hidden Units
This section covers how to choose the type of hidden unit to use in the hidden layers of the
model. The design of hidden units is an extremely active area of research and does not yet
have many definitive guiding theoretical principles. Rectified linear units are an excellent
default choice of hidden unit.

We discuss the motivations behind each choice of hidden unit. It is usually impossible to
predict in advance which will work best. The design process consists of trial and error:
intuiting that a kind of hidden unit may work well, and then evaluating its performance on a
validation set.

Some hidden units are not differentiable at all input points. For example, the rectified
linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it
invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent
still performs well enough for these models to be used for machine learning tasks.

Most hidden units can be described as accepting a vector of inputs x, computing an affine
transformation $z = W^\top x + b$, and then applying an element-wise nonlinear function g(z).
Most hidden units are distinguished from each other only by the choice of the form of the
activation function g(z).

Rectified Linear Units and Their Generalizations (ReLU)


Rectified linear units use the activation function g(z) = max{0, z}.

Rectified linear units are easy to optimize because they are so similar to linear units.
The only difference from a linear unit is that a rectified linear unit outputs 0 across
half its domain. The derivative is 1 everywhere that the unit is active. Thus the gradient
direction is far more useful for learning than it would be with activation functions that
introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:
$h = g(W^\top x + b)$.

It is good practice to set all elements of b to a small positive value such as 0.1. This
makes it likely that the rectified linear units will be initially active for most training
samples and will allow derivatives to pass through.
ReLU vs other activations:

• Sigmoid and tanh activation functions are difficult to use in networks with many
layers because of the vanishing gradient problem.
• ReLU mitigates the vanishing gradient problem, which often allows models to learn
faster and perform better.
• ReLU is the default activation function for MLPs and CNNs.

One drawback of rectified linear units is that they cannot learn via gradient-based methods
on examples for which their activation is zero.

Three generalizations of rectified linear units are based on using a non-zero slope $\alpha_i$ when
$z_i < 0$: $h_i = g(z, \alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i)$.
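
One common way to write a rectifier with a non-zero slope α for negative inputs is g(z, α) = max(0, z) + α min(0, z). The sketch below assumes this form and NumPy; the function name `generalized_relu` is hypothetical, and the two special cases shown (a small fixed slope, often called leaky ReLU, and a slope of −1, which gives the absolute value) are named here as illustrations rather than taken from the notes.

```python
import numpy as np

def generalized_relu(z, alpha):
    """max(0, z) + alpha * min(0, z), applied element-wise.
    alpha is the slope used where z is negative."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])

plain = generalized_relu(z, 0.0)      # ordinary ReLU
leaky = generalized_relu(z, 0.01)     # small fixed negative-side slope
absolute = generalized_relu(z, -1.0)  # slope -1 gives |z|
print(plain, leaky, absolute)
```

Because the negative side now has a non-zero slope, such units still receive a gradient on examples where an ordinary rectified linear unit would be inactive.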
Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectified linear units, most neural networks used the
logistic sigmoid activation function
g(z) = σ(z)
or the hyperbolic tangent activation function
g(z) = tanh(z).
These activation functions are closely related because
tanh(z) = 2σ(2z) − 1.
Sigmoid units are also used as output units, to predict the probability
that a binary variable is 1.
Sigmoidal units saturate across most of their domain:
• The logistic sigmoid saturates to 1 when z is very positive and to 0 when z is very negative.
• They are strongly sensitive to their input only when z is near 0.
• Saturation makes gradient-based learning difficult.
The hyperbolic tangent typically performs better than the logistic sigmoid. It resembles the
identity function more closely. Because tanh is similar to the identity function near 0,
training a deep neural network $\hat{y} = w^\top \tanh(U^\top \tanh(V^\top x))$ resembles training a
linear model $\hat{y} = w^\top U^\top V^\top x$ so long as the activations of the network can be kept
small.
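
The relation tanh(z) = 2σ(2z) − 1 and the effect of saturation on the gradient can both be checked numerically. This is a minimal sketch assuming NumPy; the sample points and the helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Check tanh(z) = 2 * sigmoid(2z) - 1 on a grid of points.
z = np.linspace(-5.0, 5.0, 11)
identity_holds = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

# The gradient is largest near 0 and almost vanishes once the unit saturates.
grad_near_zero = sigmoid_grad(0.0)
grad_saturated = sigmoid_grad(10.0)
print(identity_holds, grad_near_zero, grad_saturated)
```

The gradient at a saturated input is several orders of magnitude smaller than at zero, which is what makes gradient-based learning slow for saturated sigmoidal units.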
Architecture Design
The word architecture refers to the overall structure of the network: how many units it should
have and how these units should be connected to each other.
Generic Neural Architectures
Most neural networks are organized into groups of units called layers. Most neural network
architectures arrange these layers in a chain structure, with each layer being a function of the
layer that preceded it. In this structure, the first layer is given by
$$h^{(1)} = g^{(1)}\left(W^{(1)\top} x + b^{(1)}\right)$$
the second layer is given by
$$h^{(2)} = g^{(2)}\left(W^{(2)\top} h^{(1)} + b^{(2)}\right)$$
and so on. In these chain-based architectures, the main architectural considerations are to
choose the depth of the network and the width of each layer.
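
A chain-structured network can be sketched as a loop over layers, where the list of layer widths fixes both the depth and the width of each layer. This is a minimal sketch assuming NumPy; the helper names `init_layers` and `forward`, the random initialization and the choice of ReLU for every layer (including the last) are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def init_layers(widths, seed=0):
    """Create one (W, b) pair per layer. widths[0] is the input size and
    each later entry is the width of one layer."""
    rng = np.random.default_rng(seed)
    layers = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(scale=0.1, size=(n_in, n_out))
        b = np.zeros(n_out)
        layers.append((W, b))
    return layers

def forward(x, layers, g=relu):
    """Apply h = g(W^T h_prev + b) layer by layer, starting from x."""
    h = x
    for W, b in layers:
        h = g(W.T @ h + b)
    return h

widths = [3, 4, 4, 2]            # input of size 3, then layers of width 4, 4, 2
layers = init_layers(widths)
depth = len(layers)              # number of layers in the chain
output = forward(np.array([1.0, 2.0, 3.0]), layers)
print(depth, output.shape)
```

Changing the length of `widths` changes the depth, and changing its entries changes the width of individual layers, which are the two main design choices in this kind of architecture.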

Universal Approximation Properties and Depth

A feedforward network with a single hidden layer containing a finite number of neurons can
approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the
activation function.
• Simple neural networks can represent a wide variety of interesting functions when
given appropriate parameters.
• However, the theorem does not touch upon the algorithmic learnability of those parameters.

The universal approximation theorem means that regardless of what function we are trying to
learn, we know that a large MLP will be able to represent this function. However, we are not
guaranteed that the training algorithm will be able to learn that function. Even if the MLP is
able to represent the function, learning can fail for two different reasons:
• The optimization algorithm may not be able to find the values of the parameters that
correspond to the desired function.
• The training algorithm might choose the wrong function due to overfitting.
The universal approximation theorem says that there exists a network large enough to
achieve any degree of accuracy we desire, but the theorem does not say how large this
network will be. Some bounds are known on the size of a single-layer network needed to
approximate a broad class of functions. Unfortunately, in the worst case, an
exponential number of hidden units may be required. This is easiest to see in the binary
case: the number of possible binary functions on vectors $v \in \{0,1\}^n$ is $2^{2^n}$, and
selecting one such function requires $2^n$ bits, which will in general require $O(2^n)$
degrees of freedom.
A feedforward network with a single hidden layer is sufficient to represent any function, but
the layer may be infeasibly large and may fail to learn and generalize correctly. In many
circumstances, using deeper models can reduce the number of units required to represent the
desired function and can reduce the generalization error.
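
The counting argument for the binary case can be made concrete with a few lines of plain Python. The function names below are illustrative. A binary function on n bits is fully described by its output on each of the 2^n possible inputs, so there are 2^(2^n) such functions and picking one takes 2^n bits.

```python
def bits_to_select(n):
    """Number of bits needed to specify one binary function of n binary
    inputs: one output bit for each of the 2**n possible inputs."""
    return 2 ** n

def num_binary_functions(n):
    """Number of distinct functions from {0,1}^n to {0,1}."""
    return 2 ** bits_to_select(n)

for n in range(1, 6):
    print(n, bits_to_select(n), num_binary_functions(n))
```

Even for n = 5 there are already more than four billion possible functions, which is why a shallow network may need exponentially many hidden units in the worst case.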
