Unit 2 - ANN
• Psychological studies:
• How do animals learn, forget, recognize and perform various types of tasks?
• McCulloch and Pitts introduced the first mathematical model of a single neuron, which has been widely applied in subsequent work.
Biological Motivation
• The study of artificial neural networks (ANNs) has been inspired by the
observation that biological learning systems are built of very complex webs of
interconnected neurons.
• A biological neuron collects inputs and produces an output if the sum of the inputs exceeds an internal threshold value.
Neuron:
[Figure: a biological neuron, labeled with dendrites, cell body, nucleus, axon, and synapse]
Properties of Neural Networks
Examples:
1. Speech recognition
2. Image classification
3. Financial prediction, etc.
Neuron
The biological neuron structure is used to represent a node in a neural network.
Artificial Neuron
NEURAL NETWORK REPRESENTATIONS
• A prototypical example of ANN learning is provided by Pomerleau's (1993)
system ALVINN, which uses a learned ANN to steer an autonomous vehicle
driving at normal speeds on public highways.
• The input to the neural network is a 30x32 grid of pixel intensities obtained from
a forward-pointed camera mounted on the vehicle.
The learning problem is to determine a weight vector that causes the perceptron to produce the correct +1 or -1 output for each of the given training examples.
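As a concrete illustration, the following is a minimal Python sketch of the standard perceptron training rule, $w_i \leftarrow w_i + \eta\,(t - o)\,x_i$, which searches for such a weight vector; the learning rate, epoch count, and toy data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def perceptron_output(w, x):
    """Thresholded unit: +1 if w . x > 0, else -1 (x[0] is a constant bias input)."""
    return 1 if np.dot(w, x) > 0 else -1

def train_perceptron(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    w = np.zeros(len(examples[0][0]))
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(w, x)
            w += eta * (t - o) * np.asarray(x, dtype=float)
    return w

# Toy linearly separable data (logical AND): ([bias, x1, x2], target)
data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w = train_perceptron(data)
print([perceptron_output(w, x) for x, _ in data])  # expect [-1, -1, -1, 1]
```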
• If the training examples are not linearly separable, the delta rule converges toward
a best-fit approximation to the target concept.
• The key idea behind the delta rule is to use gradient descent to search the
hypothesis space of possible weight vectors to find the weights that best fit the
training examples.
To understand the delta training rule, consider the task of training an unthresholded perceptron, that is, a linear unit for which the output o is given by

$$o(\vec{x}) = \vec{w} \cdot \vec{x} \qquad (1)$$
To derive a weight learning rule for linear units, begin by specifying a measure for the training error of a hypothesis (weight vector), relative to the training examples:

$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad (2)$$

Where,
• D is the set of training examples,
• $t_d$ is the target output for training example d,
• $o_d$ is the output of the linear unit for training example d,
• $E(\vec{w})$ is simply half the squared difference between the target output $t_d$ and the linear unit output $o_d$, summed over all training examples.
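A minimal Python sketch of this error measure (the toy weight vector and examples are illustrative assumptions):

```python
import numpy as np

def linear_output(w, x):
    """Unthresholded linear unit: o = w . x (Equation 1)."""
    return np.dot(w, x)

def training_error(w, examples):
    """E(w) = 1/2 * sum over d in D of (t_d - o_d)^2 (Equation 2)."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

# Illustrative training set: (input vector, target output) pairs
examples = [(np.array([1.0, 2.0]), 1.0), (np.array([1.0, -1.0]), -1.0)]
print(training_error(np.array([0.5, 0.25]), examples))  # 0.78125
```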
Visualizing the Hypothesis Space
How can we calculate the direction of steepest descent along the error surface?
The direction of steepest descent can be found by computing the derivative of E with respect to each component of the vector $\vec{w}$. This vector derivative is called the gradient of E with respect to $\vec{w}$, written as

$$\nabla E(\vec{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]$$
• Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is

$$\vec{w} \leftarrow \vec{w} + \Delta\vec{w}, \quad \text{where} \quad \Delta\vec{w} = -\eta\, \nabla E(\vec{w})$$

• Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search.
• The negative sign is present because we want to move the weight vector in the direction that decreases E.
• This training rule can also be written in its component form

$$w_i \leftarrow w_i + \Delta w_i, \quad \text{where} \quad \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$$

• To calculate the gradient at each step, the vector of $\frac{\partial E}{\partial w_i}$ derivatives that form the gradient can be obtained by differentiating E from Equation (2), as

$$\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)\,(-x_{id}) \qquad (6)$$

where $x_{id}$ denotes the single input component $x_i$ for training example d. Substituting back into the training rule gives the weight update

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id} \qquad (7)$$
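For completeness, the differentiation that yields Equation (6) is the standard chain-rule computation, using $o_d = \vec{w} \cdot \vec{x}_d$ from Equation (1):

$$\begin{aligned}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
   = \frac{1}{2} \sum_{d \in D} 2\,(t_d - o_d)\, \frac{\partial}{\partial w_i}\bigl(t_d - \vec{w} \cdot \vec{x}_d\bigr) \\
  &= \sum_{d \in D} (t_d - o_d)\,(-x_{id})
\end{aligned}$$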
GRADIENT DESCENT algorithm for training a linear unit
To summarize, the gradient descent algorithm for training linear units is as follows:
•Pick an initial random weight vector.
•Apply the linear unit to all training examples, then compute Δwi for each weight
according to Equation (7).
•Update each weight wi by adding Δwi, then repeat this process
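Below is a minimal Python sketch of this batch gradient descent loop for a single linear unit; the learning rate, epoch count, and toy data are illustrative assumptions.

```python
import numpy as np

def gradient_descent(examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit: each pass accumulates
    Equation (7), Delta_w_i = eta * sum_d (t_d - o_d) * x_id, over ALL
    of D before updating the weights."""
    w = np.random.uniform(-0.05, 0.05, len(examples[0][0]))  # small random init
    for _ in range(epochs):
        delta_w = np.zeros_like(w)
        for x, t in examples:
            o = np.dot(w, x)              # linear unit output for example d
            delta_w += eta * (t - o) * x  # accumulate the summed update
        w += delta_w                      # one weight update per pass through D
    return w

# Illustrative data: targets generated by t = 2*x1 - x2 (x[0] is a bias input)
examples = [(np.array([1.0, 1.0, 0.0]), 2.0),
            (np.array([1.0, 0.0, 1.0]), -1.0),
            (np.array([1.0, 2.0, 1.0]), 3.0),
            (np.array([1.0, 1.0, 2.0]), 0.0)]
print(gradient_descent(examples))  # approaches [0, 2, -1]
```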
Features of Gradient Descent Algorithm
Because the error surface for a linear unit contains only a single global minimum, gradient descent will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, given that a sufficiently small learning rate η is used.
An important variation on gradient descent is incremental, or stochastic, gradient descent, which updates the weights after each individual training example according to the delta rule

$$\Delta w_i = \eta\,(t - o)\,x_i$$

where t, o, and $x_i$ are the target value, unit output, and ith input for the training example in question.
One way to view this stochastic gradient descent is to consider a distinct error function $E_d(\vec{w})$ for each individual training example d, as follows:

$$E_d(\vec{w}) = \frac{1}{2}\,(t_d - o_d)^2$$

where $t_d$ and $o_d$ are the target value and the unit output value for training example d.
• Stochastic gradient descent iterates over the training examples d in D, at each iteration altering the weights according to the gradient with respect to $E_d(\vec{w})$.
• The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to the original error function $E(\vec{w})$.
• By making the value of η sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely.
The key differences between standard gradient descent and stochastic gradient
descent are
•In standard gradient descent, the error is summed over all examples before
updating weights, whereas in stochastic gradient descent weights are updated upon
examining each training example.
•Summing over multiple examples in standard gradient descent requires more
computation per weight update step. On the other hand, because it uses the true
gradient, standard gradient descent is often used with a larger step size per weight
update than stochastic gradient descent.
• In cases where there are multiple local minima with respect to $E(\vec{w})$, stochastic gradient descent can sometimes avoid falling into these local minima, because it uses the various $\nabla E_d(\vec{w})$ rather than $\nabla E(\vec{w})$ to guide its search. A sketch of the stochastic schedule, for comparison with the batch sketch above, follows this list.
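A minimal Python sketch of the stochastic schedule, using the same linear unit and toy data as the batch version above (learning rate and epoch count are again illustrative):

```python
import numpy as np

def stochastic_gradient_descent(examples, eta=0.05, epochs=500):
    """Stochastic (incremental) gradient descent for a linear unit:
    the delta rule Delta_w_i = eta * (t - o) * x_i is applied after EACH
    example, rather than summing over all of D before updating."""
    w = np.random.uniform(-0.05, 0.05, len(examples[0][0]))
    for _ in range(epochs):
        for x, t in examples:
            o = np.dot(w, x)        # output for this single example
            w += eta * (t - o) * x  # immediate per-example update
    return w

examples = [(np.array([1.0, 1.0, 0.0]), 2.0),
            (np.array([1.0, 0.0, 1.0]), -1.0),
            (np.array([1.0, 2.0, 1.0]), 3.0),
            (np.array([1.0, 1.0, 2.0]), 0.0)]
print(stochastic_gradient_descent(examples))  # also approaches [0, 2, -1]
```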
MULTILAYER NETWORKS AND THE BACKPROPAGATION
ALGORITHM
• Sigmoid unit: a unit very much like a perceptron, but based on a smoothed, differentiable threshold function.
• The sigmoid unit first computes a linear combination of its inputs, then applies a
threshold to the result. In the case of the sigmoid unit, however, the threshold
output is a continuous function of its input.
• More precisely, the sigmoid unit computes its output o as

$$o = \sigma(\vec{w} \cdot \vec{x}), \quad \text{where} \quad \sigma(y) = \frac{1}{1 + e^{-y}}$$

• σ is often called the sigmoid or logistic function. Its derivative is easily expressed in terms of its output: $\frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))$.
• The BACKPROPAGATION algorithm learns the weights for a multilayer network by minimizing the squared error between the network output values and the target values, summed over all output units and all training examples:

$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$

where,
• outputs - is the set of output units in the network
• $t_{kd}$ and $o_{kd}$ - the target and output values associated with the kth output unit
• d - training example
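A small Python sketch of the sigmoid function and the derivative identity that the backpropagation derivation relies on:

```python
import numpy as np

def sigmoid(y):
    """Logistic (sigmoid) function: sigma(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_derivative(y):
    """d(sigma)/dy = sigma(y) * (1 - sigma(y)) -- the identity that makes
    the gradient of a sigmoid unit's output easy to compute."""
    s = sigmoid(y)
    return s * (1.0 - s)

# A sigmoid unit's output for weights w and input x: o = sigma(w . x)
w, x = np.array([0.5, -0.3]), np.array([1.0, 2.0])
print(sigmoid(np.dot(w, x)))  # sigma(-0.1), approximately 0.475
```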
How does Backpropagation work?
• Deriving the stochastic gradient descent rule: Stochastic gradient descent involves
iterating through the training examples one at a time, for each training example d
descending the gradient of the error Ed with respect to this single example
• For each training example d, every weight $w_{ji}$ is updated by adding to it $\Delta w_{ji}$, where

$$\Delta w_{ji} = -\eta\, \frac{\partial E_d}{\partial w_{ji}}$$

and $E_d$ is the error on training example d, summed over all output units in the network:

$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$

Here outputs is the set of output units in the network, $t_k$ is the target value of unit k for training example d, and $o_k$ is the output of unit k given training example d.
The derivation of the stochastic gradient descent rule is conceptually straightforward, but requires keeping track of a number of subscripts and variables. The resulting per-example updates are sketched below.
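To make the bookkeeping concrete, here is a minimal Python sketch of stochastic-gradient backpropagation for a network with a single hidden layer of sigmoid units and one sigmoid output. The network size, learning rate, epoch count, seed, and XOR data are illustrative assumptions; like any gradient method, it can occasionally settle in a local minimum, in which case a different seed or more epochs may be needed.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def train_backprop(examples, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Stochastic gradient descent for a two-layer sigmoid network: for each
    example, forward-propagate, compute the error terms delta, then update
    every weight by Delta_w_ji = eta * delta_j * x_ji."""
    rng = np.random.default_rng(seed)
    n_in = len(examples[0][0])
    W_h = rng.uniform(-0.5, 0.5, (n_hidden, n_in + 1))  # hidden weights (+bias)
    W_o = rng.uniform(-0.5, 0.5, n_hidden + 1)          # output weights (+bias)
    for _ in range(epochs):
        for x, t in examples:
            xb = np.append(1.0, x)      # input with constant bias term
            h = sigmoid(W_h @ xb)       # hidden-layer outputs
            hb = np.append(1.0, h)
            o = sigmoid(W_o @ hb)       # network output
            delta_o = o * (1 - o) * (t - o)            # output error term
            delta_h = h * (1 - h) * W_o[1:] * delta_o  # hidden error terms
            W_o += eta * delta_o * hb                  # per-example updates
            W_h += eta * np.outer(delta_h, xb)
    return W_h, W_o

# Illustrative task: XOR, which no single linear unit can represent
data = [(np.array([0.0, 0.0]), 0), (np.array([0.0, 1.0]), 1),
        (np.array([1.0, 0.0]), 1), (np.array([1.0, 1.0]), 0)]
W_h, W_o = train_backprop(data)
for x, t in data:
    o = sigmoid(W_o @ np.append(1.0, sigmoid(W_h @ np.append(1.0, x))))
    print(x, t, round(float(o), 2))  # outputs should approach 0, 1, 1, 0
```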