
Artificial Neural Network:

Artificial neural networks (ANNs) provide a general, practical method for learning real-valued, discrete-valued, and
vector-valued functions from examples. Algorithms such as BACKPROPAGATION use gradient descent to tune
network parameters to best fit a training set of input-output pairs. ANN learning is robust to errors in the training
data and has been successfully applied to problems such as interpreting visual scenes, speech recognition, and
learning robot control strategies.

Biological Motivation

Biological learning systems are built of very complex webs of interconnected neurons. In rough
analogy, artificial neural networks are built out of a densely interconnected set of simple units, where
each unit takes a number of real-valued inputs (possibly the outputs of other units) and produces a
single real-valued output (which may become the input to many other units).

To develop a feel for this analogy, let us consider a few facts from neurobiology. The human brain, for example, is
estimated to contain a densely interconnected network of approximately 10^11 neurons, each connected, on average,
to 10^4 others. Neuron activity is typically excited or inhibited through connections to other neurons. The fastest
neuron switching times are known to be on the order of 10^-3 seconds, quite slow compared to computer switching
speeds. Yet humans are able to make surprisingly complex decisions, surprisingly quickly. For example, it requires
approximately 10^-1 seconds to visually recognize your mother. Notice that the sequence of neuron firings that can
take place during this 1/10 second interval cannot possibly be longer than a few hundred steps, given the switching
speed of single neurons (0.1 s divided by roughly 10^-3 s per firing allows only on the order of 100 sequential steps).

One motivation for ANN systems is to capture this kind of highly parallel computation based on distributed
representations. Most ANN software runs on sequential machines emulating distributed processes, although faster
versions of the algorithms have also been implemented on highly parallel machines and on specialized hardware
designed specifically for ANN applications.

Inconsistencies between ANNs and biological systems: the ANNs we consider here have individual units that output a
single constant value, whereas biological neurons output a complex time series of spikes.

NEURAL NETWORK REPRESENTATIONS:

A prototypical example of ANN learning is provided by Pomerleau's (1993) system ALVINN, which uses a learned
ANN to steer an autonomous vehicle driving at normal speeds on public highways. The input to the neural network
is a 30 x 32 grid of pixel intensities obtained from a forward-pointed camera mounted on the vehicle. The network
output is the direction in which the vehicle is steered. The ANN is trained to mimic the observed steering commands
of a human driving the vehicle for approximately 5 minutes. ALVINN has used its learned networks to successfully
drive at speeds up to 70 miles per hour and for distances of 90 miles on public highways (driving in the left lane of a
divided public highway, with other vehicles present).
The BACKPROPAGATION algorithm is the most commonly used ANN learning technique. It is appropriate for
problems with the following characteristics:

1. Instances are represented by many attribute-value pairs. The target function to be learned is defined
over instances that can be described by a vector of predefined features, such as the pixel values in the
ALVINN example. These input attributes may be highly correlated or independent of one another. Input
values can be any real values.
2. The target function output may be discrete-valued, real-valued, or a vector of several real- or
discrete-valued attributes. For example, in the ALVINN system the output is a vector of 30 attributes,
each corresponding to a recommendation regarding the steering direction. The value of each output is some
real number between 0 and 1, which in this case corresponds to the confidence in predicting the
corresponding steering direction.
3. The training examples may contain errors. ANN learning methods are quite robust to noise in the
training data.
4. Long training times are acceptable. Network training algorithms typically require longer training times
than, say, decision tree learning algorithms. Training times can range from a few seconds to many hours,
depending on factors such as the number of weights in the network, the number of training examples
considered, and the settings of various learning algorithm parameters.
5. Fast evaluation of the learned target function may be required. Although ANN learning times are
relatively long, evaluating the learned network, in order to apply it to a subsequent instance, is typically
very fast. For example, ALVINN applies its neural network several times per second to continually update
its steering command as the vehicle drives forward.
6. The ability of humans to understand the learned target function is not important. The weights learned
by neural networks are often difficult for humans to interpret.

Representational Power of Perceptrons:

The perceptron can be viewed as representing a hyperplane decision surface in the n-dimensional space of instances (i.e., points).
The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs a -1 for instances lying on
the other side. The equation for this decision hyperplane is w · x = 0. Of course, some sets of positive and negative
examples cannot be separated by any hyperplane. Those that can be separated are called linearly separable sets of
examples.
A single perceptron can be used to represent many boolean functions. For example, if we assume boolean values of
1 (true) and -1 (false), then one way to use a two-input perceptron to implement the AND function is to set the
weights w_0 = -0.8 and w_1 = w_2 = 0.5. This perceptron can be made to represent the OR function instead by altering the
threshold to w_0 = -0.3. In fact, AND and OR can be viewed as special cases of m-of-n functions: that is, functions
where at least m of the n inputs to the perceptron must be true. The OR function corresponds to m = 1 and the AND
function to m = n.
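
As a quick check, here is a minimal Python sketch of such a two-input perceptron. Note that the thresholds above separate the cases correctly when the inputs are encoded as 0 and 1; the unit's output uses the 1/-1 convention:

```python
# A minimal sketch of a two-input perceptron implementing AND and OR,
# assuming a 0/1 input encoding; outputs follow the 1 (true) / -1 (false)
# convention from the text.

def perceptron(weights, bias, inputs):
    """Output 1 if the weighted sum exceeds the threshold, else -1."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else -1

AND = lambda x1, x2: perceptron([0.5, 0.5], -0.8, [x1, x2])  # w0 = -0.8
OR  = lambda x1, x2: perceptron([0.5, 0.5], -0.3, [x1, x2])  # w0 = -0.3

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
```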

Perceptrons can represent all of the primitive boolean functions AND, OR, NAND, and NOR. Unfortunately,
however, some boolean functions cannot be represented by a single perceptron, such as the XOR function, whose
value is 1 if and only if x_1 != x_2. However, every boolean function can be represented by some network of perceptrons
only two levels deep, in which the inputs are fed to multiple units, and the outputs of these units are then input to a
second, final stage.
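
For instance, XOR can be decomposed as (x_1 OR x_2) AND (x_1 NAND x_2), which maps directly onto a two-level network. A minimal sketch follows; the particular NAND weights and the 0/1 input encoding are illustrative choices, and the output-level AND unit is weighted for the 1/-1 values the hidden units emit:

```python
# A sketch of a two-level perceptron network computing XOR as
# (x1 OR x2) AND (x1 NAND x2). Inputs are encoded as 0/1; each unit
# outputs 1 or -1.

def unit(bias, weights, inputs):
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > 0 else -1

def xor(x1, x2):
    h_or   = unit(-0.3, [0.5, 0.5], [x1, x2])     # fires unless both inputs are 0
    h_nand = unit(0.8, [-0.5, -0.5], [x1, x2])    # fires unless both inputs are 1
    return unit(-0.8, [0.5, 0.5], [h_or, h_nand])  # AND over the 1/-1 hidden outputs

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "XOR:", xor(x1, x2))
```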

The Perceptron Training Rule:

The precise learning problem is to determine a weight vector that causes the perceptron to produce the correct ±1
output for each of the given training examples. Several algorithms are known to solve this learning problem. Here
we consider two: the perceptron rule and the delta rule. These two algorithms are guaranteed to converge to
somewhat different acceptable hypotheses, under somewhat different conditions. They are important to ANNs
because they provide the basis for learning networks of many units. One way to learn an acceptable weight vector is
to begin with random weights, then iteratively apply the perceptron to each training example, modifying the
perceptron weights whenever it misclassifies an example.

Weights are modified at each step according to the perceptron training rule, which revises the weight w_i associated
with input x_i according to the rule:

w_i ← w_i + Δw_i, where Δw_i = η(t - o)x_i

Here t is the target output for the current training example, o is the output generated by the perceptron, and η is a
positive constant called the learning rate. The role of the learning rate is to moderate the degree to which weights are
changed at each step. It is usually set to some small value (e.g., 0.1).
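
A minimal sketch of this rule in Python, assuming training data given as (inputs, target) pairs with targets in {1, -1}; the dataset and parameter values are illustrative:

```python
# Perceptron training rule: start with small random weights and repeatedly
# apply w_i <- w_i + eta * (t - o) * x_i; the update is zero whenever the
# example is classified correctly (t == o).
import random

def train_perceptron(examples, eta=0.1, epochs=100):
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # w[0] is the bias
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            w[0] += eta * (t - o)                 # bias input is always 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w

# Example: learn the linearly separable AND function (0/1 inputs, 1/-1 targets).
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(data))
```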

Gradient Descent and the Delta Rule:

The delta rule is best understood by considering the task of training an unthresholded perceptron, that is, a linear
unit whose output is o = w · x. To derive a weight learning rule, we begin by specifying a measure of the training
error of a weight vector:

E(w) = (1/2) Σ_{d ∈ D} (t_d - o_d)^2

where D is the set of training examples, t_d is the target output for training example d, and o_d is the output of the
linear unit for training example d. By this definition, E is simply half the squared difference between the target
output t_d and the unit output o_d, summed over all training examples. Here we characterize E as a function of the
weight vector, because the linear unit output o depends on this weight vector.

Derivation of Gradient Descent:

Gradient descent searches the space of possible weight vectors by starting with an arbitrary initial weight vector,
then repeatedly altering it in the direction that produces the steepest descent along the error surface, i.e., in the
direction opposite the gradient of E with respect to the weights:

∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]

The training rule is therefore w_i ← w_i + Δw_i, with Δw_i = -η ∂E/∂w_i. Differentiating the definition of E above
gives ∂E/∂w_i = Σ_{d ∈ D} (t_d - o_d)(-x_id), so the gradient descent weight update becomes

Δw_i = η Σ_{d ∈ D} (t_d - o_d) x_id

where x_id denotes the single input component x_i for training example d.

Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector
with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently
small learning rate is used.
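
A hedged sketch of this batch update in Python: the training loop below accumulates the gradient contribution (t_d - o_d)x_id over all of D before changing the weights. The toy dataset and learning rate are illustrative assumptions:

```python
# Batch gradient descent for a linear unit: one weight update per full pass
# over the training set D, using Delta w_i = eta * sum_d (t_d - o_d) * x_id.

def train_linear_unit(examples, eta=0.05, epochs=500):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # w[0] is the bias weight (its input is always 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, t in examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))  # linear output
            delta[0] += eta * (t - o)
            for i, xi in enumerate(x):
                delta[i + 1] += eta * (t - o) * xi
        w = [wi + dwi for wi, dwi in zip(w, delta)]  # single update per epoch
    return w

# Fit the target t = 1 + 2*x1 - x2 from a few noise-free examples.
data = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2), ((2, 1), 4)]
print(train_linear_unit(data))  # should approach [1, 2, -1]
```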

STOCHASTIC APPROXIMATION TO GRADIENT DESCENT:

Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or
infinite hypothesis space that can be applied whenever (1) the hypothesis space contains continuously parameterized
hypotheses (e.g., the weights in a linear unit), and (2) the error can be differentiated with respect to these hypothesis
parameters. The key practical difficulties in applying gradient descent are (1) converging to a local minimum can
sometimes be quite slow (i.e., it can require many thousands of gradient descent steps), and (2) if there are multiple
local minima in the error surface, then there is no guarantee that the procedure will find the global minimum.

One common variation on gradient descent intended to alleviate these difficulties is called incremental gradient
descent, or alternatively stochastic gradient descent. Whereas the gradient descent training rule presented in
Equation (4.7) computes weight updates after summing over all the training examples in D, the idea behind
stochastic gradient descent is to approximate this gradient descent search by updating weights incrementally,
following the calculation of the error for each individual example.
The key differences between standard gradient descent and stochastic gradient descent are:

1. In standard gradient descent, the error is summed over all examples before updating weights, whereas in
stochastic gradient descent weights are updated upon examining each training example.
2. Summing over multiple examples in standard gradient descent requires more computation per weight
update step. On the other hand, because it uses the true gradient, standard gradient descent is often used
with a larger step size per weight update than stochastic gradient descent.
3. In cases where there are multiple local minima with respect to E, stochastic gradient descent can
sometimes avoid falling into these local minima because it uses the various per-example errors E_d rather
than E to guide its search.
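
To make the contrast with the batch rule concrete, here is the same linear unit trained with the stochastic update Δw_i = η(t_d - o_d)x_id applied after each individual example; the dataset and parameter values are again illustrative:

```python
# Stochastic (incremental) gradient descent for a linear unit: the weights
# are updated immediately after computing the error for each example d,
# rather than after summing over all of D.

def train_linear_unit_sgd(examples, eta=0.05, epochs=500):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # w[0] is the bias weight
    for _ in range(epochs):
        for x, t in examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            w[0] += eta * (t - o)          # update now, one example at a time
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w

data = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2), ((2, 1), 4)]
print(train_linear_unit_sgd(data))  # also approaches [1, 2, -1]
```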

MULTILAYER NETWORKS AND THE BACKPROPAGATION ALGORITHM

The multilayer networks learned by the BACKPROPAGATION algorithm are capable of expressing a rich
variety of nonlinear decision surfaces.

A Differentiable Threshold Unit: We need a unit whose output is a nonlinear function of its inputs, but
whose output is also a differentiable function of its inputs. One solution is the sigmoid unit, a unit very
much like a perceptron, but based on a smoothed, differentiable threshold function. Like the perceptron, the
sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result; in the
case of the sigmoid unit, however, the threshold output is a continuous function of its input. More precisely,
the sigmoid unit computes its output as

o = σ(w · x), where σ(y) = 1 / (1 + e^(-y))

σ is often called the sigmoid function or, alternatively, the logistic function. Note its output ranges between 0
and 1, increasing monotonically with its input.
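
A small sketch of a sigmoid unit in Python; the example weights are illustrative. The comment also notes the derivative σ'(y) = σ(y)(1 - σ(y)), which is what makes this unit convenient for gradient-based learning:

```python
# A sigmoid unit: a linear combination of the inputs passed through the
# logistic function sigma(y) = 1 / (1 + e^(-y)). Its derivative has the
# convenient form sigma'(y) = sigma(y) * (1 - sigma(y)), which
# BACKPROPAGATION exploits when computing gradients.
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(weights, bias, inputs):
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(net)  # smooth output in (0, 1), unlike the perceptron's hard 1/-1

print(sigmoid_unit([0.5, 0.5], -0.8, [1.0, 1.0]))  # ~0.55, just above the boundary
```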
