ML Module 4
Bayes' Theorem is a fundamental concept in probability theory and forms the foundation of Bayesian
learning in machine learning. It allows you to update the probability of a hypothesis (or event) based on
new evidence.
At its core, Bayes' Theorem relates current knowledge or belief about an event (the prior probability) to
new data or evidence (the likelihood) to produce an updated belief (the posterior probability).
P(H | D) = P(D | H) * P(H) / P(D)
Where:
• P(H | D) is the posterior probability: the probability of the hypothesis H being true given the data D.
• P(D | H) is the likelihood: the probability of observing the data D given that hypothesis H is true.
• P(H) is the prior probability: the initial belief about the hypothesis H before any data is observed.
• P(D) is the marginal likelihood or evidence: the total probability of the data under all possible
hypotheses. This acts as a normalizing constant to ensure that the posterior is a valid probability
distribution.
1. Prior Probability (P(H)):
o This is the initial belief about the hypothesis before any data is observed.
o Example: In a medical test scenario, it could be the prior probability of a person having a
disease before considering the test results (e.g., based on the general population statistics).
2. Likelihood (P(D | H)):
o This is the probability of observing the data, assuming the hypothesis is true. It expresses how
likely it is to see the given data under the assumption of the hypothesis.
o Example: The likelihood would be the probability of getting a positive test result assuming the
person has the disease.
3. Evidence (P(D)):
o This is the total probability of the data across all hypotheses. It serves to normalize the
posterior probability so that it sums to 1.
o Example: The probability of getting a positive test result across all people, whether they have
the disease or not.
4. Posterior Probability (P(H | D)):
o This is the updated belief about the hypothesis after considering the new data (the evidence).
o Example: The posterior would give the probability of a person having the disease after
considering both the prior knowledge and the test results.
• Before you collect any data, you have a prior belief about a hypothesis (e.g., the probability of a patient
having a disease).
• After seeing new data (e.g., the result of a medical test), you update your belief about the hypothesis to
reflect this new evidence.
Bayes’ Theorem lets you do this systematically, ensuring that your updated belief (posterior) is
proportional to the prior belief and the likelihood of observing the new data.
2. Likelihood (P(D | H)):
o This is the probability of getting a positive test result if the person has the disease. Suppose the
test correctly identifies the disease 95% of the time, so P(D | H) = 0.95.
3. Evidence (P(D)):
o This is the total probability of a positive test result in the population. It includes both people
who have the disease and those who do not.
4. Posterior Probability (P(H | D)):
o After receiving a positive test result, we want to calculate the probability that the person
actually has the disease.
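The calculation can be sketched in Python as below. This is a minimal illustration: the likelihood P(D | H) = 0.95 is taken from the example above, while the prior P(H) = 0.01 and the false-positive rate P(D | not H) = 0.05 are assumed values chosen only to make the example concrete.

# A minimal sketch of the medical-test example using Bayes' Theorem.
# P(D|H) = 0.95 comes from the text; the prior P(H) = 0.01 and the
# false-positive rate P(D|not H) = 0.05 are illustrative assumptions.

def posterior(prior, likelihood, false_positive_rate):
    """Return P(H | D) for a positive test result D."""
    # Evidence P(D): total probability of a positive result over both hypotheses
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p_disease_given_positive = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p_disease_given_positive, 3))  # ~0.161: still low, because the disease is rare

Even with a highly accurate test, the posterior stays small when the prior is small, which is exactly the point Bayes' Theorem makes explicit.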
Chapter 10
Artificial Neural Networks
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial intelligence
modelled after the brain.
An artificial neural network is a computational network inspired by the biological neural networks
that make up the structure of the human brain.
Just as the human brain has neurons interconnected with each other, artificial neural networks also have
neurons that are linked to each other in various layers of the network. These neurons are known as
nodes.
• Dendrites are tree-like networks made of nerve fibre connected to the cell body.
• An Axon is a single, long connection extending from the cell body and carrying signals from the
neuron. The end of the axon splits into fine strands. Each strand terminates in a small bulb-like
organ called a synapse. It is through the synapse that the neuron passes its signals to other nearby
neurons. The receiving ends of these synapses on the nearby neurons can be found both on the
dendrites and on the cell body. There are approximately 10^4 synapses per neuron in the human
body. An electric impulse is passed between the synapse and the dendrites. It is a chemical process
which results in an increase/decrease in the electric potential inside the body of the receiving cell. If
the electric potential reaches a threshold value, the receiving cell fires and a pulse/action potential of
fixed strength and duration is sent through the axon to the synaptic junctions of the cell. After that,
the cell has to wait for a period called the refractory period.
ARTIFICIAL NEURONS:
Artificial neurons are like biological neurons that are linked to each other in various layers of the
network. These neurons are known as nodes.
A node or a neuron can receive one or more inputs and process them. Artificial neurons are
connected by connection links to other neurons. Each connection link is associated with a synaptic
weight. The structure of a single neuron is shown below:
Fig: McCulloch-Pitts Neuron mathematical model.
Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), and passes
the output through a cable-like structure to other connected neurons (axon to synapse to
another neuron's dendrite).
OR
Working:
The received inputs are computed as a weighted sum which is given to the activation function,
and if the sum exceeds the threshold value the neuron gets fired. The neuron is the basic
processing unit that receives a set of inputs x1, x2, x3, ..., xn and their associated weights
w1, w2, w3, ..., wn. The summation function computes the weighted sum of the inputs
received by the neuron.
Sum = Σ xi * wi
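A minimal sketch of this computation in Python; the inputs, weights and threshold below are illustrative assumptions, not values from the notes.

# A single neuron: weighted sum followed by a threshold (step) activation.
# The inputs, weights and threshold theta are illustrative assumptions.

def neuron_output(inputs, weights, theta):
    # Weighted sum of the inputs: Sum = sum(x_i * w_i)
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Binary step activation: fire (1) only if the sum reaches the threshold
    return 1 if weighted_sum >= theta else 0

print(neuron_output(inputs=[1, 0, 1], weights=[0.5, -0.2, 0.3], theta=0.6))  # -> 1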
Activation functions:
• To make the work more efficient and to obtain an exact output, some force or activation is
given. Likewise, an activation function is applied over the net input to calculate the output of an ANN.
• Information processing of a processing element has two major parts: input and output.
1. Identity function (linear function): The output is the same as the input, i.e. the weighted sum.
The function is useful when we do not apply any threshold. The output value ranges
between -∞ and +∞.
2. Binary step function: This function can be defined as
f(x) = 1 if x ≥ θ
       0 if x < θ
Where θ represents the threshold value. It is used in single layer nets to convert
the net input to an output that is binary (0 or 1).
3. Bipolar step function: This function can be defined as
f(x) = 1 if x ≥ θ
      -1 if x < θ
Where θ represents the threshold value. It is used in single layer nets to convert
the net input to an output that is bipolar (+1 or -1).
4. Sigmoid function: It is used in back propagation nets.
Two types:
a) Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar
sigmoid function. It is defined as
f(x) = 1 / (1 + e^(-λx))
where λ represents the steepness parameter. The range of the sigmoid function is 0
to 1.
b) Bipolar sigmoid function: This function is defined as
f(x) = (1 - e^(-λx)) / (1 + e^(-λx))
where λ represents the steepness parameter. The range of this function is between -1
and +1.
6. Tanh function: The hyperbolic tangent (Tanh) function is very similar to the sigmoid
activation function, and even has the same S-shape, with the difference of an output range of -1 to
1. In Tanh, the larger the input (more positive), the closer the output value will be to 1.0,
whereas the smaller the input (more negative), the closer the output will be to -1.0.
7. ReLU Function
ReLU stands for Rectified Linear Unit and is defined as f(x) = max(0, x).
Although it gives an impression of a linear function, ReLU has a derivative function and
allows for backpropagation while simultaneously being computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same
time. A neuron will be deactivated only if the output of the linear transformation is less than 0.
8. Softmax function: Softmax is an activation function that scales numbers/logits into
probabilities. The output of a Softmax is a vector (say v) with probabilities of each
possible outcome. The probabilities in vector v sum to one over all possible outcomes or
classes.
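A minimal sketch of these activation functions in Python/NumPy; the steepness parameter λ defaults to 1 here as an assumption, since the notes do not fix a value.

import numpy as np

def binary_step(x, theta=0.0):
    return np.where(x >= theta, 1, 0)          # output 0 or 1

def bipolar_step(x, theta=0.0):
    return np.where(x >= theta, 1, -1)         # output +1 or -1

def binary_sigmoid(x, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * x))      # range (0, 1)

def bipolar_sigmoid(x, lam=1.0):
    return (1.0 - np.exp(-lam * x)) / (1.0 + np.exp(-lam * x))  # range (-1, 1)

def relu(x):
    return np.maximum(0, x)                    # f(x) = max(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))                  # subtract max for numerical stability
    return e / e.sum()                         # probabilities summing to 1

print(softmax(np.array([2.0, 1.0, 0.1])))      # e.g. [0.659 0.242 0.099]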
• Knowledge is acquired by the network from its environment through a learning process.
PERCEPTRON AND LEARNING THEORY
• The perceptron makes its predictions based on a linear predictor function combining a set of
weights with the feature vector.
• One type of ANN system is based on a unit called a perceptron.
OR
• The perceptron can represent all the primitive Boolean functions AND, OR, NAND and NOR.
• Some Boolean functions cannot be represented.
– E.g. the XOR function.
Major components of a perceptron
• Input
• Weight
• Bias
• Weighted summation
• Step/activation function
• Output
WORKING:
• Feed the features of the model that is to be trained as input to the first layer. All
weights and inputs will be multiplied, and the multiplied results of each weight and input will be
added up. The bias value will be added to shift the output function. This value is
presented to the activation function (the type of activation function will depend on the need).
The value received after the last step is the output value.
The activation function is a binary step function which outputs a value 1 if f(x) is above the
threshold value Θ and a 0 if f(x) is below the threshold value Θ. Then the output of a neuron
is:
o = 1 if Σ wi * xi ≥ Θ, otherwise 0
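A minimal sketch of a perceptron with the perceptron learning rule in Python; the training data, learning rate and initial weights below are illustrative assumptions, not values from the notes.

import numpy as np

class Perceptron:
    """Single perceptron: weighted sum + bias followed by a binary step."""
    def __init__(self, n_inputs, lr=0.4):
        self.w = np.zeros(n_inputs)   # weights
        self.b = 0.0                  # bias
        self.lr = lr                  # learning rate

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b >= 0 else 0

    def train(self, X, y, epochs=20):
        for _ in range(epochs):
            for xi, target in zip(X, y):
                error = target - self.predict(xi)   # perceptron learning rule
                self.w += self.lr * error * xi
                self.b += self.lr * error

# Example: learning the AND function (illustrative data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(n_inputs=2)
p.train(X, y)
print([p.predict(xi) for xi in X])   # expected: [0, 0, 0, 1]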
PROBLEM:
Design a 2-layer network of perceptrons to implement the NAND gate. Assume your own weights and
biases in the range [-0.5, 0.5]. Use a learning rate of 0.4.
Solution:
Fig: Network with inputs X1, X2, bias input X0, hidden unit X3 and output unit X4, connected by
weights w13, w23, w34 and biases Θ3, Θ4.
Truth table:
X1  X2  AND  NOT(AND) = NAND
0   0   0    1
0   1   0    1
1   0   0    1
1   1   1    0
ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in the input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input Layer   Ij   Oj
X1            0    0
X2            1    1
2. Calculate net inputs and outputs in the hidden and output layer as shown in Table 4.
Table 4: Net Input and Output Calculation in Hidden and Output Layer
Unit j   Net Input Ij                           Net Output Oj
X3       I3 = X1*W13 + X2*W23 + X0*Θ3 = -0.2    O3 = 1 / (1 + e^(-I3)) = 1 / (1 + e^(0.2)) = 0.450
Unit k   Net Input Ik                            Net Output Ok
X4       I4 = O3*W34 + X0*Θ4                     O4 = 1 / (1 + e^(-I4))
         = (0.450 * 0.3) + 1(-0.3) = -0.165      = 1 / (1 + e^(0.165)) = 0.458
3. Calculate Error
Error = Odesired - Oestimated
      = 1 - 0.458
Error = 0.542
Step 2: BACKWARD PROPAGATION
For the output layer unit k:
Error4 = O4(1 - O4)(Odesired - O4)
       = 0.458(1 - 0.458)(1 - 0.458)
       = 0.134
For each hidden layer unit j:
Error3 = O3(1 - O3) * Error4 * W34
       = 0.450(1 - 0.450)(0.134)(0.3)
       = 0.0099
Update the weights and biases (learning rate η = 0.4):
W34 = W34 + η * Error4 * O3 = 0.3 + 0.4(0.134)(0.450) = 0.324
Θ4 = Θ4 + η * Error4 = -0.3 + 0.4(0.134) = -0.246
W23 = W23 + η * Error3 * O2 and Θ3 = Θ3 + η * Error3, giving the values -0.396 and 0.203 used in
Iteration 2; W13 is unchanged since X1 = 0.
ITERATION 2:
Step 1: FORWARD PROPAGATION
Unit j   Net Input Ij                                  Net Output Oj
X3       I3 = X1*W13 + X2*W23 + X0*Θ3                  O3 = 1 / (1 + e^(-I3))
         = 0(0.1) + 1(-0.396) + 1(0.203) = -0.193      = 1 / (1 + e^(0.193)) = 0.451
Unit k   Net Input Ik                                  Net Output Ok
X4       I4 = O3*W34 + X0*Θ4                           O4 = 1 / (1 + e^(-I4))
         = (0.451 * 0.324) + 1(-0.246) = -0.099        = 1 / (1 + e^(0.099)) = 0.475
2. Calculate Error
Error = Odesired - Oestimated
      = 1 - 0.475
Error = 0.525

Iteration   Error
1           0.542
2           0.525
Difference = 0.542 - 0.525 = 0.017
In iteration 2 the error is reduced to 0.525. This process will continue until the desired output
is achieved.
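A minimal Python sketch of the same forward/backward computation (sigmoid units, one hidden unit X3 feeding one output unit X4, learning rate 0.4). The initial values W13 = 0.1, W34 = 0.3 and Θ4 = -0.3 appear in the worked example above; W23 = -0.4 and Θ3 = 0.2 are assumed starting values for illustration.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# W13, W34 and theta4 are taken from the worked example;
# W23 = -0.4 and theta3 = 0.2 are assumed starting values.
W13, W23, theta3 = 0.1, -0.4, 0.2
W34, theta4 = 0.3, -0.3
eta = 0.4                      # learning rate

x1, x2, target = 0, 1, 1       # training sample (X1=0, X2=1, NAND output = 1)

for iteration in range(2):
    # Forward propagation
    I3 = x1 * W13 + x2 * W23 + theta3
    O3 = sigmoid(I3)
    I4 = O3 * W34 + theta4
    O4 = sigmoid(I4)
    print(f"Iteration {iteration + 1}: output = {O4:.3f}, error = {target - O4:.3f}")

    # Backward propagation (error terms as in the tables above)
    err4 = O4 * (1 - O4) * (target - O4)
    err3 = O3 * (1 - O3) * err4 * W34

    # Weight and bias updates
    W34 += eta * err4 * O3
    theta4 += eta * err4
    W13 += eta * err3 * x1
    W23 += eta * err3 * x2
    theta3 += eta * err3

Running this sketch reproduces the outputs of the two iterations above (approximately 0.458 and 0.475).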
How does a Multi-Layer Perceptron solve the XOR problem? Design an MLP with back
propagation to implement the XOR Boolean function.
Solution:
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0
Fig: MLP with inputs X1, X2, bias input X0 = 1, hidden units X3 and X4, and output unit X5. Initial
weights: W13 = -0.2, W14 = 0.4, W23 = 0.2, W24 = -0.3, W35 = 0.2, W45 = -0.3; biases Θ3 = 0.4,
Θ4 = 0.1, Θ5 = -0.3. Learning rate α = 0.8.
Iteration 1:
Step 1: FORWARD PROPAGATION
For the training sample X1 = 1, X2 = 0 (target output 1), forward propagation with the initial weights
gives O3 = 0.549, O4 = 0.622 and O5 = 0.407, which are used in the error calculations below.
Table 11: Error Calculation for each unit in the Output layer and Hidden layer
For the Output Layer, Errork
Unit k
X5   Error5 = O5(1 - O5)(Odesired - O5)
            = 0.407 * (1 - 0.407) * (1 - 0.407)
            = 0.143
For the Hidden Layer, Errorj
Unit j
X4   Error4 = O4(1 - O4) Σk Errork Wjk = O4(1 - O4) Error5 W45
            = 0.622 * (1 - 0.622) * (-0.3) * 0.143
            = -0.010
X3   Error3 = O3(1 - O3) Σk Errork Wjk = O3(1 - O3) Error5 W35
            = 0.549 * (1 - 0.549) * 0.143 * 0.2
            = 0.007
Table 12: Weight Updation (α = 0.8)
Wij    Wij = Wij + α * Errorj * Oi             New Weight
W23    W23 = W23 + 0.8 * Error3 * O2           0.2
       = 0.2 + 0.8 * 0.007 * 0
W24    W24 = W24 + 0.8 * Error4 * O2           -0.3
       = -0.3 + 0.8 * (-0.010) * 0
W35    W35 = W35 + 0.8 * Error5 * O3           0.154
Δθj = α * Errorj
θj = θj + Δθj
Table 13: Bias Updation
θj    θj = θj + α * Errorj           New Bias
θ3    θ3 = θ3 + α * Error3           0.405
      = 0.4 + 0.8 * 0.007
θ4    θ4 = θ4 + α * Error4           0.092
      = 0.1 + 0.8 * (-0.01)
θ5    θ5 = θ5 + α * Error5           -0.185
      = -0.3 + 0.8 * 0.143
Iteration 2
Now with the updated weights and biases:
1. Calculate Input and Output in the Input Layer shown in Table 14.
Table 14: Net Input and Output Calculation
Input Layer   Ij   Oj
X1            1    1
X2            0    0
2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 15.
Table 15: Net Input and Output Calculation in the Hidden Layer and Output Layer
Unit j   Net Input Ij                                          Output Oj
X3       I3 = X1*W13 + X2*W23 + X0*θ3                          O3 = 1 / (1 + e^(-I3)) = 1 / (1 + e^(-0.211)) = 0.552
         = 1*(-0.194) + 0*(0.2) + 1*(0.405) = 0.211
X4       I4 = X1*W14 + X2*W24 + X0*θ4                          O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(-0.484)) = 0.618
         = 1*(0.392) + 0*(-0.3) + 1*(0.092) = 0.484
X5       I5 = O3*W35 + O4*W45 + X0*θ5                          O5 = 1 / (1 + e^(-I5)) = 1 / (1 + e^(0.282)) = 0.429
         = 0.552*(0.154) + 0.618*(-0.288) + 1*(-0.185) = -0.282
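The Table 15 forward pass can be reproduced with a few lines of Python. The weight and bias values are taken from the tables above; W45 = -0.288 is the updated value used in Table 15 (its update row is not shown above).

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Updated weights and biases after Iteration 1 (from Tables 12, 13 and 15)
W13, W14, W23, W24 = -0.194, 0.392, 0.2, -0.3
W35, W45 = 0.154, -0.288
t3, t4, t5 = 0.405, 0.092, -0.185

x1, x2 = 1, 0                                  # training sample

O3 = sigmoid(x1 * W13 + x2 * W23 + t3)         # -> ~0.553
O4 = sigmoid(x1 * W14 + x2 * W24 + t4)         # -> ~0.619
O5 = sigmoid(O3 * W35 + O4 * W45 + t5)         # -> ~0.43 (Table 15 gives 0.429)
print(round(O3, 3), round(O4, 3), round(O5, 3))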
Consider a network architecture with 4 input units and 2 output units. Consider four training
samples, each a vector of length 4.
Training samples:
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial Weight matrix:
Unit 1: [0.2  0.8  0.5  0.1]
Unit 2: [0.3  0.5  0.4  0.6]
Identify an algorithm that can learn without supervision. How would you cluster the samples as
expected?
Solution:
Use Self Organizing Feature Map (SOFM).
Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix:
Unit 1: [0.2  0.8  0.5  0.1]
Unit 2: [0.3  0.5  0.4  0.6]
Unit 1 wins.
Update the weights of the winning unit:
New Unit 1 weights = [0.2 0.8 0.5 0.2] + 0.6 ([1 1 1 0] - [0.2 0.8 0.5 0.2])
                   = [0.2 0.8 0.5 0.2] + 0.6 [0.8 0.2 0.5 -0.2]
                   = [0.2 0.8 0.5 0.2] + [0.48 0.12 0.30 -0.12]
                   = [0.68 0.92 0.80 0.08]
Iteration 2:
Training Sample X2: (0, 0, 1, 1)
Unit 2 wins. Update the weights of the winning unit:
New Unit 2 weights = [0.3 0.5 0.4 0.6] + 0.6 ([0 0 1 1] - [0.3 0.5 0.4 0.6])
                   = [0.3 0.5 0.4 0.6] + [-0.18 -0.30 0.36 0.24]
                   = [0.12 0.2 0.76 0.84]
Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix:
Unit 1: [0.68  0.92  0.80  0.08]
Unit 2: [0.65  0.08  0.30  0.94]
Compute the Euclidean distance between X4: (0, 0, 1, 0) and Unit 1 weights:
d1 = (0.68 - 0)^2 + (0.92 - 0)^2 + (0.80 - 1)^2 + (0.08 - 0)^2
   = 1.36
Compute the Euclidean distance between X4: (0, 0, 1, 0) and Unit 2 weights:
d2 = (0.65 - 0)^2 + (0.08 - 0)^2 + (0.3 - 1)^2 + (0.94 - 0)^2
   = 1.8025
Since d1 < d2, Unit 1 wins.
This process is continued for many epochs until the feature map no longer changes.
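A minimal sketch of this winner-take-all (SOFM-style) update in Python, using the training samples and initial weights from the problem statement; a single pass over the data is shown and neighbourhood effects are omitted for simplicity.

import numpy as np

# Training samples and initial weights from the problem statement
samples = np.array([[1, 1, 1, 0],
                    [0, 0, 1, 1],
                    [1, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
weights = np.array([[0.2, 0.8, 0.5, 0.1],    # Unit 1
                    [0.3, 0.5, 0.4, 0.6]])   # Unit 2
eta = 0.6                                    # learning rate

for x in samples:
    # Winner = unit whose weight vector is closest to the input (squared Euclidean distance)
    distances = np.sum((weights - x) ** 2, axis=1)
    winner = np.argmin(distances)
    # Move only the winning unit's weights towards the input
    weights[winner] += eta * (x - weights[winner])
    print(f"sample {x} -> Unit {winner + 1} wins, new weights {np.round(weights[winner], 2)}")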
Learning Rules
Learning in a NN is performed by adjusting the network weights in order to minimize the
difference between the desired and estimated output.
• The delta (difference) is measured as an error function, also called a cost function.
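One simple illustration of this idea is the delta (Widrow-Hoff) rule, where each weight is adjusted in proportion to the error between the desired and estimated output. A minimal sketch; the learning rate, inputs, weights and outputs below are illustrative assumptions.

# Delta rule: w_i <- w_i + eta * (t - o) * x_i
# eta, x, w, t and o below are illustrative values, not from the notes.
eta = 0.1
x = [1.0, 0.5]          # inputs
w = [0.2, -0.3]         # current weights
t, o = 1.0, 0.4         # desired output t and estimated output o

w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
print(w)                # -> approximately [0.26, -0.27]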
TYPES OF ANN
1. Feed Forward Neural Network
2. Fully connected Neural Network
3. Multilayer Perceptron
4. Feedback Neural Network
Feed Forward Neural Network:
A feed-forward neural network in its simplest form is a single-layer perceptron. A sequence of inputs enters the
layer and is multiplied by the weights in this model. The weighted input values are then summed together to form a total.
If the sum of the values is above a predetermined threshold, which is normally set at zero, the output
value is usually 1, and if the sum is below the threshold, the output value is usually -1.
The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification.
The model may or may not contain a hidden layer, and there is no backpropagation.
Based on the number of hidden layers, feed-forward networks are further classified into single-layered
and multi-layered feed-forward networks.
Fully Connected Neural Network:
• A fully connected neural network consists of a series of fully connected layers that connect
every neuron in one layer to every neuron in the next layer.
• The major advantage of fully connected networks is that they are "structure agnostic", i.e. no
special assumptions need to be made about the input.
Multilayer Perceptron:
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output
layer with a single node for each output, and it can have any number of hidden layers, where each
hidden layer can have any number of nodes.
Information flows forward through the network, while the error signals used for weight adjustment
are propagated backward during training (backpropagation).
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation
function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid
formula f(x) = 1 / (1 + e^(-x)).
Feedback Neural Network:
It allows feedback loops in the network. Feedback networks are dynamic and powerful in nature.
Limitations of ANN
Challenges of Artificial Neural Networks