Unit 3 - Ann

Uploaded by

esmritypoudel

Machine Learning

Unit 3
Artificial Neural Networks

By
Dr. G. Sunitha
Professor & BoS Chairperson
Department of CSE

Department of Computer Science and Engineering

Sree Sainath Nagar, A. Rangampet, Tirupati – 517 102

1
Artificial Neural Networks
❖ A biologically motivated approach, with similarity to biological neural networks.

2
Artificial Neural Networks . . .
[Figure: network diagram with neurons and connections labelled]

❖ An ANN consists of a set of highly interconnected processing elements such that output of a processing
element is connected through weights as input to other processing elements or itself.
❖ Each neuron takes a number of real-valued inputs (possibly the outputs of other units) and produces a single
real-valued output (which may become the input to many other units).

❖ The neuron operates as a mathematical processor performing specific mathematical operations on its inputs to
generate an output.

❖ ANN is a parallel distributed information processing structure where highly parallel processes are distributed
over many neurons.

❖ Neural network learning methods provide a robust approach to approximating real-valued, discrete-valued, and
vector-valued target functions.

❖ ANNs are strongly interconnected systems of neurons. Each neuron has simple behavior, but when connected they can
solve complex problems. The network can be adjusted further to improve its performance.

4
Artificial Neural Networks . . .

Multidisciplinary View of NNs

5
Artificial Neural Networks . . .
❖ Advantages of NNs
• Adaptive Learning – ability to learn starting with given training data
• Self-Organization – create its own organization or representation of information based on training data
• Real-time Operation – parallel computation
• Fault Tolerance – using redundant information coding
• Robust – to noisy input data

6
Neural Network Learning
❖ The inputs Xi are fed simultaneously into the input layer.
❖ The weighted outputs of these units are fed into the hidden layer.
❖ The weighted outputs of the last hidden layer are the inputs to the units of the output layer.
❖ The errors are propagated back from the output layer to the hidden layers.
❖ Weights and bias values are updated.
❖ The process is repeated until the convergence criterion is met.
❖ Once the neural network model is ready, it can be used for prediction.

7
Neural Network Learning - training sample 1 [ 0.1, 0.08, 0.5 = 0.2 , 0.1 ]

[Figure: four panels (1 to 4) tracing training sample 1 through the network, ending with the "Adjust weights" step]

8
Artificial Neural Networks . . .
❖ Input: Classification data
It contains classification attributes.

❖ All data must be normalized.
Neural networks work with data in the range (0, 1) or (-1, 1).
Ex: Normalization techniques
[1] Max-Min normalization
[2] Decimal Scaling normalization

❖ All categorical data must be converted into numerical data.
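
As a rough illustration of the two normalization techniques named above, the sketch below implements them in plain Python. The function names, the handling of a constant column, and the sample values are illustrative assumptions and do not come from the slides.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Max-Min normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant column maps to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def decimal_scaling_normalize(values):
    """Decimal scaling: divide by the smallest power of 10 that makes every |v| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


ages = [20, 35, 50]                             # hypothetical attribute values
print(min_max_normalize(ages))                  # rescaled into [0, 1]
print(min_max_normalize(ages, -1.0, 1.0))       # rescaled into [-1, 1]
print(decimal_scaling_normalize([-986, 917]))   # every value divided by 10**3
```

Max-Min normalization pins the smallest and largest observed values to the ends of the target range, while decimal scaling only moves the decimal point, so the relative spacing of the values is unchanged.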

9
Artificial Neural Networks . . .
❖ The fundamental processing element of a neural network is a neuron, which
1. Receives inputs from other sources
2. Combines them in some way (Summing Function)
3. Performs a generally nonlinear operation on the result (Activation Function)
4. Outputs the final result

❖ A Neural Network is a set of connected Input/Output units, where each connection has a weight associated with it.
❖ Neural Network learning is also called connectionist learning due to the connections between units.
❖ A Neural Network learns by adjusting the weights through a feedback process.

10
Function of a neuron
❖ Here x1 and x2 are normalized attribute values of the data.
❖ y is the output of the neuron, i.e., the class label.
❖ Summing Function ∑: x1 and x2 are multiplied by their respective weight values.
❖ Given that w1 = 0.5 and w2 = 0.5, say the value of x1 is 0.3 and the value of x2 is 0.8.
So, weighted sum = ∑WiXi = w1 × x1 + w2 × x2 = 0.5 × 0.3 + 0.5 × 0.8 = 0.55
❖ Activation Function φ: The neuron receives the weighted sum ∑ as input and calculates the output as a function
of that input, as follows:
Ex: y = φ(x), where
φ(x) = 0 { when x < 0.5 }
φ(x) = 1 { when x >= 0.5 }
For our example, x (the weighted sum) is 0.55, so y = 1. That means the corresponding input attribute values are
classified in class 1.
If for other input values x = 0.39, then φ(x) = 0, so we conclude that those input values are classified to class 0.

11
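
The worked example above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the threshold of 0.5 is the one used in the example.

```python
def weighted_sum(weights, inputs):
    """Summing function: sum of each input multiplied by its weight."""
    return sum(w * x for w, x in zip(weights, inputs))


def threshold_activation(x, threshold=0.5):
    """Activation function from the example: 1 if x >= threshold, else 0."""
    return 1 if x >= threshold else 0


weights = [0.5, 0.5]   # w1, w2
inputs = [0.3, 0.8]    # x1, x2
net = weighted_sum(weights, inputs)
y = threshold_activation(net)
print(round(net, 2), y)             # weighted sum is about 0.55, so the class is 1
print(threshold_activation(0.39))   # a weighted sum of 0.39 falls in class 0
```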
Bias as extra input to neuron

[Figure: neuron with input values x1 … xm, weights W1 … Wm, and bias b entering the summing function ∑ to give X; the activation function F(X) produces the output y (the class)]

12
Bias of a Neuron
❖ We need the bias value to be added to the weighted sum ∑wixi so that the decision boundary can be shifted away
from the origin.
v = ∑wixi + b, where b is the bias

[Figure: parallel lines x1 - x2 = -1, x1 - x2 = 0, and x1 - x2 = 1 in the (x1, x2) plane]

13
Neuron with Activation
❖ The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm
2. An adder function (linear combiner) for computing the weighted sum of the inputs
3. An activation function, which defines the output of the node given an input or set of inputs:
y = F(X)
The function symbolizes the activation of the neuron represented by the unit. When F is the logistic (sigmoid)
function, it is also called a squashing function.

14
Activation Functions
❖ Summing/Integration Function ∑ - receives multiple inputs and converts them into the net input for the neuron.
❖ Activation Function F - used to make the o/p of neurons bounded, i.e., the actual o/p of the neuron is conditioned
and is thus controllable.
❖ Identity function - linear function
F(X) = X for all X (o/p of neuron is same as i/p)
❖ Binary step function - mostly used in single-layer NNs to convert o/p to binary.
F(X) = 0 if X < θ
F(X) = 1 if X ≥ θ
where θ is the threshold value.
❖ Bipolar step function -
F(X) = -1 if X < θ
F(X) = 1 if X ≥ θ

15
Activation Functions . . .
❖ Binary/Logistic/Unipolar Sigmoid function -
$F(X) = \frac{1}{1 + e^{-\lambda X}}$, where λ is the steepness parameter; F(X) ranges from 0 to 1.
❖ Bipolar Sigmoid function -
$F(X) = \frac{2}{1 + e^{-\lambda X}} - 1$, where λ is the steepness parameter; F(X) ranges from -1 to 1.
❖ Ramp function -
F(X) = 0 if X < 0
F(X) = X if 0 ≤ X ≤ 1
F(X) = 1 if X > 1
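
The activation functions listed on these slides can be written down directly. The following is a minimal sketch in plain Python; the function and parameter names (`theta`, `lam`) are illustrative choices for θ and λ, and the default values are assumptions.

```python
import math


def identity(x):
    return x


def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0


def bipolar_step(x, theta=0.0):
    return 1 if x >= theta else -1


def binary_sigmoid(x, lam=1.0):
    """Logistic sigmoid with steepness lam; output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-lam * x))


def bipolar_sigmoid(x, lam=1.0):
    """Bipolar sigmoid with steepness lam; output lies between -1 and 1."""
    return 2.0 / (1.0 + math.exp(-lam * x)) - 1.0


def ramp(x):
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x


for x in (-2.0, 0.0, 0.5, 2.0):
    print(x, binary_step(x), bipolar_step(x),
          round(binary_sigmoid(x), 3), round(bipolar_sigmoid(x), 3), ramp(x))
```

A larger steepness value makes both sigmoids approach the corresponding step function around X = 0.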

16
Weight Matrix
❖ Connection Matrix or Weight Matrix - The weights of all connections in the NN can be represented as a matrix.
❖ Assuming n neurons (n rows in the matrix) and m connections from each neuron (m columns in the matrix),
the weight matrix has size n × m.
❖ Once the NN is trained, the model takes the form of the weights and the structure.
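
To make the matrix view concrete, here is a small sketch that stores the weights as a list of rows and uses them to compute net inputs. It assumes that the m connections of each neuron lead to m units in the next layer; the numbers and the name `propagate` are made up for illustration.

```python
def propagate(outputs, weight_matrix):
    """Net input of each target unit j: sum over source neurons i of outputs[i] * W[i][j]."""
    n = len(weight_matrix)       # number of source neurons (rows)
    m = len(weight_matrix[0])    # connections per neuron (columns)
    return [sum(outputs[i] * weight_matrix[i][j] for i in range(n))
            for j in range(m)]


# Hypothetical weights: 3 neurons (rows), 2 connections from each (columns).
W = [
    [0.2, -0.5],
    [0.4, 0.1],
    [-0.3, 0.7],
]
source_outputs = [1.0, 0.5, -1.0]
net_inputs = propagate(source_outputs, W)
print(net_inputs)   # one net input per target unit
```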

17
Neural Network Learning
❖ NN learning is of 3 types
• Supervised
• Unsupervised
• Reinforcement

❖ NN learns through the training process, and adapts by making adjustments to itself.
• Parameter Learning - adjusts connection weights
• Structure Learning - adjusts the network structure (no. of neurons, geometry of connections)

18
Basic Models of ANNs
❖ The models of ANNs are specified by 3 basic entities -
• Synaptic interconnections
• Training or learning rules (for updating and adjusting connection weights)
• Activation functions

19
Supervised Learning
❖ Each i/p vector has a target vector (desired o/p).
❖ NN learns under the supervision of the target vector.
[Figure: supervised learning example in which the prediction by the NN is 4]

Unsupervised Learning
❖ NN learns by itself.
❖ Groups are unknown.
❖ NN assigns i/p data of similar type to a single group.
20
Reinforcement Learning
❖ Similar to supervised learning.
❖ In supervised learning, the target vector (desired output) is known exactly for each i/p vector.
❖ In reinforcement learning, the feedback only indicates how correct the output is (e.g., X% correct); the exact
target vector is not supplied.
❖ The feedback available is only evaluative, not instructive.
❖ Also called critic learning.

21
Neural Network Architectures

The arrangement of neurons into layers and the geometry of their interconnections within/between layers is called
the neural network architecture.
• Single-layer Feedforward
• Multi-layer Feedforward

22
Single-Layer Feedforward NN
❖ Input layer and output layer are directly connected.
❖ Connection structure is varied to form various NN architectures.
❖ Feedforward Network - Flow of signals is in one direction only, from the i/p layer to the o/p layer (via hidden
layers, if there are any), i.e., a connection always connects a neuron of layer Xi with a neuron in the next layer Xi+1.

23
Types of Neural Network Learning
❖ Supervised Learning Networks:
• Perceptron networks

24
Perceptron Networks
❖ A perceptron is a very simple learning machine.
❖ Single layer Feedforward NN.
❖ Perceptron is an algorithm for supervised learning.
❖ A Perceptron
• Takes a vector of real-valued inputs
• Calculates a linear combination of the inputs
• Outputs 1 if the result is > ϴ, and -1 otherwise,
where ϴ is a threshold value. Here ϴ = 0, so the output is 1 when W · X > 0.

25
Single Classification Perceptron Network

[Figure: single perceptron with output Y = -1 or 1]

A binary classifier is a function which can classify input into one of two classes.
Source: https://en.wikipedia.org/wiki/Perceptron

26
Multi-Class Classification Perceptron Network

Perceptrons Y1, Y2, . . . Ym

27
Perceptron Learning
1) Begin with random weights.
2) Repeat for each training sample in the training dataset (one pass over the whole dataset is an epoch;
processing one training sample is an iteration):
a) Apply the training sample to the Perceptron network
b) Modify the weights if the training sample is misclassified (Perceptron Training Rule)

28
Perceptron Learning Algorithm

❖ ƞ is the learning rate of the perceptron. The learning rate is between 0 and 1: an ƞ value nearer to 1 gives fast
learning, nearer to 0 gives slow learning.
❖ Step Activation Function -
Bipolar step function - mostly used in single-layer NNs to generate bipolar output.
F(X) = 1 if X > 0
F(X) = -1 otherwise

❖ X is the net input to Y.
❖ Y is the output of the perceptron (output neuron).
❖ Wi is the weight of link i.
❖ Here, threshold value = 0
❖ D is the training dataset comprising M input vectors, where each i/p vector is in the form (s : t),
s = (x1, x2, …, xn), and t is the target output.

29
Perceptron Learning Algorithm for Single-output Classes
1. Initialize all the weights Wi, bias value b, and learning rate ƞ. Weights may be initialized to small random
values. Initialize △Wi = 0.
2. Repeat the following steps until the convergence criterion is met.
For each input vector Pk = (X1, X2, . . . Xn, t) in training set D,
a) Activation Function: Calculate the actual output Y (Y is -1 or 1)
$X = \sum_{i=1}^{n} W_i X_i + b$
$Y = F(X) = \begin{cases} 1 & \text{if } X > 0 \\ -1 & \text{otherwise} \end{cases}$
b) Perceptron Training Rule: If Y ≠ t, update the weights Wi and bias b
Wi(new) = Wi(old) + △Wi, where △Wi = ƞ · t · Xi
b(new) = b(old) + ƞ · t

30
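
The algorithm above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions: weights start at zero rather than small random values (so the run is repeatable), the stopping condition is "no weight change in one epoch", and `max_epochs` is an added safety cap. The bipolar AND data is used only as a small test case.

```python
def step(x):
    """Bipolar step activation with threshold 0."""
    return 1 if x > 0 else -1


def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Train a single-output perceptron with the perceptron training rule.

    samples: list of (inputs, target) pairs with targets in {-1, 1}.
    Returns (weights, bias, epochs_used).
    """
    n = len(samples[0][0])
    weights = [0.0] * n   # assumption: zeros instead of small random values
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        changed = False
        for inputs, target in samples:
            net = sum(w * x for w, x in zip(weights, inputs)) + bias
            if step(net) != target:
                # Update only on a misclassified sample: Wi += eta * t * Xi, b += eta * t
                weights = [w + eta * target * x for w, x in zip(weights, inputs)]
                bias += eta * target
                changed = True
        if not changed:   # stopping condition: no weight change in one epoch
            return weights, bias, epoch
    return weights, bias, max_epochs


and_data = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
weights, bias, epochs = train_perceptron(and_data)
print(weights, bias, epochs)
for inputs, target in and_data:
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    print(inputs, target, step(net))
```

With these settings the weights change three times in the first epoch and not at all in the second, so training stops after two epochs.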
Representational Power of Perceptron networks
❖ A Perceptron can be viewed as representing a hyperplane decision surface in the n-dimensional space
of training data.
❖ If the problem dataset D is linearly separable, then the perceptron is guaranteed to converge: the
successive adjustments made to the weights and bias values lead the learning rule to an optimal
or near-optimal solution in a finite number of steps.
❖ When to stop repeating the learning process: when the convergence criterion is met.
❖ The convergence criterion is specified in the form of a stopping condition.
Ex: When there is no change of weights in one epoch.
❖ The perceptron will never converge if the problem dataset D is not linearly separable. In this case, no
"approximate" solution is gradually approached under the standard learning algorithm; instead,
learning fails completely.
❖ Many Boolean functions can be represented using a perceptron network. Ex: AND, OR, NAND, NOR,
etc.
❖ Some Boolean functions cannot be represented using a perceptron network. Ex: XOR
31
Perceptron Networks - Problem on solving AND Problem

32
Perceptron Networks - Problem on solving AND Problem . . .

[Figure: worked computation of net input X, activation f(X), and output Y for the AND problem]

33
Perceptron Networks - Problem on solving AND Problem . . .

34
Perceptron Networks - Problem on solving AND Problem . . .

35
Perceptron Networks Features
❖ It is a greedy, local algorithm.
❖ The perceptron learning algorithm selects a search direction in weight space according to the incorrect
classification of the last tested vector and does not make use of global information about the shape of the
error function. This can lead to an exponential number of updates of the weight vector.
❖ Perceptron training stops as soon as all training patterns are classified correctly. It may be a better idea to
have a learning procedure that could continue to improve its weights even after the classifications are correct.
❖ It never converges if the input problem is linearly inseparable.

36
Gradient Descent and Delta Rule
❖ The Perceptron training rule finds a successful weight vector when the training data are linearly separable, but it
can fail to converge if the training data are linearly inseparable.
❖ Another training rule, called the Delta Rule, overcomes this difficulty.
❖ If the training data are linearly inseparable, the Delta Rule converges towards a best-fit approximation to
the target concept.
❖ The key idea behind the Delta Rule is to use Gradient Descent to search the hypothesis space of possible weight
vectors to find the weights that best fit the training data.
❖ The delta: t – Y
❖ Learning algorithm: Same as Perceptron learning, except that weight updates are done using the Delta Rule
Wi(new) = Wi(old) + △Wi, where △Wi = ƞ · (t – Y) · Xi
b(new) = b(old) + ƞ · (t – Y)

❖ This rule is the basis of the training rule used in the Backpropagation Algorithm.
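
To see how the Delta Rule update relates to the Perceptron Training Rule, the sketch below applies both to one misclassified sample. The function names and the numbers are illustrative assumptions. With bipolar targets, a misclassified sample has t – Y = 2t, so for that sample the delta update equals a perceptron update with twice the learning rate.

```python
def perceptron_update(weights, bias, inputs, target, eta):
    """Perceptron rule (applied only to a misclassified sample)."""
    new_weights = [w + eta * target * x for w, x in zip(weights, inputs)]
    return new_weights, bias + eta * target


def delta_update(weights, bias, inputs, target, output, eta):
    """Delta rule: the step is proportional to the delta (t - Y)."""
    delta = target - output
    new_weights = [w + eta * delta * x for w, x in zip(weights, inputs)]
    return new_weights, bias + eta * delta


weights, bias = [0.2, -0.1], 0.0
inputs, target, output = [1, -1], 1, -1   # hypothetical misclassified sample

delta_result = delta_update(weights, bias, inputs, target, output, eta=0.1)
perceptron_result = perceptron_update(weights, bias, inputs, target, eta=0.2)
print(delta_result)
print(perceptron_result)

# When the output already equals the target, the delta is 0 and nothing changes.
print(delta_update(weights, bias, inputs, target, 1, eta=0.1))
```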

37
Visualizing the Hypothesis Space

38
Visualizing the Hypothesis Space . . .
❖ To understand the gradient descent algorithm, it is helpful to visualize the entire hypothesis space of
possible weight vectors and their associated E values, as illustrated in the figure.
❖ Here the axes W0 and W1 represent possible values for the two weights of a simple linear unit. The W0 ,
W1 plane therefore represents the entire hypothesis space.
❖ The vertical axis indicates the error E relative to some fixed set of training examples. The error surface
shown in the figure thus summarizes the desirability of every weight vector in the hypothesis space (we
desire a hypothesis with minimum error).
❖ Given the way in which we chose to define E, for linear units this error surface must always be parabolic
with a single global minimum. The specific parabola will depend, of course, on the particular set of
training examples.
❖ Gradient descent search determines a weight vector that minimizes E by starting with an arbitrary
initial weight vector, then repeatedly modifying it in small steps.
❖ At each step, the weight vector is altered in the direction that produces the steepest descent along the
error surface depicted in the figure. This process continues until the global minimum error is reached.
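
The descent along the error surface can be traced numerically for a linear unit with two weights W0 and W1. This is a sketch under assumptions: E is taken to be half the sum of squared errors over the training examples, the unit computes W0 + W1·x, and the data, learning rate, and step count are invented for illustration.

```python
def error(w0, w1, data):
    """E(W0, W1): half the sum of squared errors of the linear unit w0 + w1 * x."""
    return 0.5 * sum((t - (w0 + w1 * x)) ** 2 for x, t in data)


def gradient(w0, w1, data):
    """Partial derivatives of E with respect to w0 and w1."""
    g0 = -sum(t - (w0 + w1 * x) for x, t in data)
    g1 = -sum((t - (w0 + w1 * x)) * x for x, t in data)
    return g0, g1


# Hypothetical training examples generated by t = 1 + 2x, so the minimum is at (1, 2).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w0, w1 = 0.0, 0.0      # arbitrary initial weight vector
eta = 0.05             # small step size
initial_error = error(w0, w1, data)
for _ in range(2000):
    g0, g1 = gradient(w0, w1, data)
    w0 -= eta * g0     # move against the gradient (steepest descent)
    w1 -= eta * g1
final_error = error(w0, w1, data)
print(initial_error, final_error, round(w0, 4), round(w1, 4))
```

Because the error surface of a linear unit has a single global minimum, the weights approach the same point regardless of where the search starts, provided the step size is small enough.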

39
Derivation of Gradient Descent Rule

Eq2

Substituting Eq2 in Eq1

Eq1
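
The equations on this slide are not reproduced in the text. As a reference, a standard derivation for a linear unit is sketched below, assuming the error E is half the sum of squared differences between target and output over the training set D, and that $X_{id}$ denotes input i of training example d (both are notational assumptions). No attempt is made to say which line corresponds to Eq1 or Eq2.

```latex
\begin{align*}
E(W) &= \frac{1}{2} \sum_{d \in D} (t_d - Y_d)^2,
  \qquad Y_d = \sum_{i=1}^{n} W_i X_{id} + b \\
\frac{\partial E}{\partial W_i}
  &= \frac{1}{2} \sum_{d \in D} 2\,(t_d - Y_d)\,
     \frac{\partial}{\partial W_i}\,(t_d - Y_d)
   = -\sum_{d \in D} (t_d - Y_d)\, X_{id} \\
\Delta W_i &= -\eta\, \frac{\partial E}{\partial W_i}
   = \eta \sum_{d \in D} (t_d - Y_d)\, X_{id}
\end{align*}
```

The last line has the same form as the Delta Rule update △Wi = ƞ · (t – Y) · Xi, summed over the training examples.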
40
Gradient Descent and Delta Rule . . .
❖ The delta rule changes the weight of a neural connection so as to minimize the difference between net input to
the output unit “Yin” and the target value “t”.
❖ The aim is to minimize the error over all training patterns. However, this is accomplished by reducing the error
for each pattern, one at a time.
❖ Weight corrections can also be accumulated over a number of training patterns for batch updating.
❖ Gradient Descent can be applied whenever
• Hypothesis space contains many different types of continuously parameterized hypotheses.
• Error can be differentiated with respect to these hypothesis parameters.
❖ Key practical difficulties are
• Converging to a local minimum can sometimes be very slow (it can require many thousands of steps)
• If there are multiple local minima in the error surface, then there is no guarantee that the procedure will
find the global minimum.

41
Stochastic Approximation of Gradient Descent Algorithm
(Incremental Updating of Weights)

1. Initialize all the weights Wi, bias value b, and learning rate ƞ. Weights may be initialized to small random
values. Initialize △Wi = 0.
2. Repeat the following steps until the convergence criterion is met.
For each input vector Pk = (X1, X2, . . . Xn, t) in training set D,
a) Activation Function: Calculate the actual output Y (Y is -1 or 1)
$X = \sum_{i=1}^{n} W_i X_i + b$
$Y = F(X) = \begin{cases} 1 & \text{if } X > 0 \\ -1 & \text{otherwise} \end{cases}$
b) Delta Training Rule: Calculate the error E = (t – Y) and update the weights Wi and bias b
Wi(new) = Wi(old) + △Wi, where △Wi = ƞ · E · Xi
b(new) = b(old) + ƞ · E

42
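
A minimal sketch of incremental (stochastic) updating follows. It makes several assumptions that should be read as such: the error is measured against the net input of a linear unit (the delta-rule slide describes minimizing the difference between the net input Yin and t), weights start at zero, and a fixed number of epochs replaces the convergence test. The thresholded output is used only for the final classification.

```python
def train_incremental(samples, eta=0.05, epochs=200):
    """Delta rule with a weight update after every training sample."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            net = sum(w * x for w, x in zip(weights, inputs)) + bias
            err = target - net   # assumption: error taken on the net input
            weights = [w + eta * err * x for w, x in zip(weights, inputs)]
            bias += eta * err
    return weights, bias


def classify(weights, bias, inputs):
    """Bipolar step applied to the net input."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if net > 0 else -1


and_data = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
weights, bias = train_incremental(and_data)
print([round(w, 2) for w in weights], round(bias, 2))
print([classify(weights, bias, x) for x, _ in and_data])
```

With a constant learning rate the weights do not settle exactly; they keep moving slightly around the least-squares solution (0.5, 0.5, -0.5 for this data) as each sample pulls them in its own direction.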
Gradient Descent Algorithm (Batch Updating of Weights)
1. Initialize all the weights Wi, bias value b, and learning rate ƞ. Weights may be initialized to small random
values.
2. Repeat the following steps until the convergence criterion is met.
Initialize △Wi = 0 and △b = 0 at the start of each pass over D.
For each input vector Pk = (X1, X2, . . . Xn, t) in training set D,
a) Activation Function: Calculate the actual output Y (Y is -1 or 1)
$X = \sum_{i=1}^{n} W_i X_i + b$
$Y = F(X) = \begin{cases} 1 & \text{if } X > 0 \\ -1 & \text{otherwise} \end{cases}$
b) Delta Training Rule: Calculate the error E = (t – Y) and accumulate the corrections
△Wi = △Wi + ƞ · E · Xi
△b = △b + ƞ · E
3. At the end of each pass over D, update the weights Wi and bias b
Wi(new) = Wi(old) + △Wi
b(new) = b(old) + △b

43
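
Batch updating can be sketched as follows. The same assumptions as for any illustrative delta-rule code apply here and are spelled out: the error is measured on the net input of a linear unit, weights start at zero, the accumulated corrections are reset at the start of every pass, and a fixed number of epochs stands in for the convergence test.

```python
def train_batch(samples, eta=0.1, epochs=100):
    """Delta rule with corrections accumulated over a full pass before updating."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        delta_w = [0.0] * n   # accumulated weight corrections for this pass
        delta_b = 0.0
        for inputs, target in samples:
            net = sum(w * x for w, x in zip(weights, inputs)) + bias
            err = target - net   # assumption: error taken on the net input
            delta_w = [dw + eta * err * x for dw, x in zip(delta_w, inputs)]
            delta_b += eta * err
        # Apply the accumulated corrections once per pass.
        weights = [w + dw for w, dw in zip(weights, delta_w)]
        bias += delta_b
    return weights, bias


and_data = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
weights, bias = train_batch(and_data)
print([round(w, 4) for w in weights], round(bias, 4))
```

For this data the batch procedure settles on weights of about 0.5 and a bias of about -0.5, which is the least-squares fit of a linear unit to the bipolar AND targets.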
Multi-Layer Feedforward NN
❖ Input layer and output layer are connected through 1 or more hidden layers.
❖ The more hidden layers there are, the more complex the NN.
❖ Fully connected NN - Every neuron in layer Xi is connected to every neuron in layer Xi+1.
❖ Feedforward Network - Flow of signals can be in one direction only, from i/p layer to o/p layer via hidden layers.

44
Multilayer Neural Network
❖ The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic
biological basis, or as output units.
❖ A network containing two hidden layers is called a three-layer neural network, and so on (the input layer is not
counted).
❖ INPUT: records without the class attribute, with normalized attribute values.
❖ INPUT VECTOR: X = { x1, x2, …, xn }, where n is the number of (non-class) attributes.
❖ INPUT LAYER – there are as many nodes as non-class attributes, i.e., as the length of the input vector.
❖ HIDDEN LAYER(S) – the number of hidden layers and the number of nodes in each hidden layer depend on the
implementation.
❖ OUTPUT LAYER – corresponds to the class attribute. There are as many nodes as classes (values of the class
attribute).
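
The layer structure described above can be sketched as a forward pass through a small fully connected network. Everything specific here is an assumption for illustration: three input attributes, one hidden layer of two units, two output nodes (one per class), sigmoid activations, and hand-picked weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """weights[j][i] is the weight of the connection from input i to unit j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def network_forward(inputs, layers):
    """Feed the input vector through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 3 input nodes, 2 hidden nodes, 2 output nodes.
hidden_layer = ([[0.2, -0.3, 0.4], [0.1, -0.5, 0.2]], [-0.4, 0.2])
output_layer = ([[-0.3, -0.2], [0.3, 0.1]], [0.1, -0.1])

x = [1.0, 0.0, 1.0]   # normalized, non-class attribute values
scores = network_forward(x, [hidden_layer, output_layer])
predicted_class = scores.index(max(scores))
print([round(s, 3) for s in scores], predicted_class)
```

The predicted class is taken to be the output node with the largest activation, which is one common convention when there is one output node per class.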

45
Why do we need Multi-layer network
Single-layer nets have limited representational power.

[Figure: three plots on X/Y axes with + and - regions, labelled "Linearly separable", "Linearly separable", and "Linearly inseparable"]

❖ Linearly inseparable: solution?
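
One way to see why a hidden layer helps is to build XOR, which no single-layer net can represent, from two layers of step units. The weights below are hand-picked for illustration rather than learned, and binary (0/1) inputs are assumed.

```python
def step(x):
    """Binary step with threshold 0."""
    return 1 if x > 0 else 0


def xor_network(x1, x2):
    # Hidden layer: one unit computes OR, the other computes AND.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output unit fires when OR is on and AND is off.
    return step(h_or - h_and - 0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```

Each hidden unit draws one line in the input plane. The output unit combines the two half-planes into the region between the lines, which a single line cannot describe.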

46
Separability in Classification

47
Appropriate Problems for Neural Network Learning
❖ ANN learning is well-suited to problems in which the training data corresponds to noisy, complex data. Ex:
Sensor Data, Video Surveillance Data, Audio Data from Microphones etc.
❖ It is also applicable to problems for which more symbolic representations are often used.
❖ Instances are represented by many attribute-value pairs. The target function to be learned is defined
over instances that can be described by a vector of predefined features, such as the pixel values in image data.
These input attributes may be highly correlated or independent of one another. Input values can be any real
values.
❖ The target function output may be discrete-valued, real-valued, or a vector of several real- or
discrete-valued attributes.
❖ The training examples may contain errors. ANN learning methods are quite robust to noise in the training
data.
❖ Long training times are acceptable. Network training algorithms typically require longer training times than,
say, decision tree learning algorithms. Training times can range from a few seconds to many hours, depending
on factors such as the number of weights in the network, the number of training examples considered, and the
settings of various learning algorithm parameters.

48
Appropriate Problems for Neural Network Learning
❖ Fast evaluation of the learned target function may be required. Although ANN learning times are
relatively long, evaluating the learned network, in order to apply it to a subsequent instance, is typically very
fast.
❖ The ability of humans to understand the learned target function is not important. The weights
learned by neural networks are often difficult for humans to interpret. Learned neural networks are less easily
communicated to humans than learned rules.

49
