Module 5 AIML Notes
Artificial Neural Networks, Clustering
Just as the human brain has neurons interconnected with one another, artificial neural networks also have neurons that are linked to each other in various layers of the network.
• Dendrites are tree-like networks made of nerve fibre connected to the cell body.
• An axon is a single, long connection extending from the cell body that carries signals away from the neuron. The end of the axon splits into fine strands, and each strand terminates in a small bulb-like organ called a synapse. It is through the synapse that the neuron passes its signals to other nearby neurons. The receiving ends of these synapses on the nearby neurons can be found both on the dendrites and on the cell body. There are approximately 10^4 synapses per neuron in the human body. Electric impulses are passed between the synapse and the dendrites. This is a chemical process which results in an increase or decrease in the electric potential inside the body of the receiving cell. If the electric potential reaches a threshold value, the receiving cell fires and a pulse (action potential) of fixed strength and duration is sent through the axon to the synaptic junctions of the cell. After firing, the cell has to wait for a period called the refractory period.
ARTIFICIAL NEURONS:
Artificial neurons are modelled on biological neurons and are linked to each other in various layers of the network. These neurons are known as nodes.
A node or neuron can receive one or more inputs and process them. Artificial neurons are connected to other neurons by connection links, and each connection link is associated with a synaptic weight. The structure of a single neuron is shown below:
Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), and passes the output through a cable-like structure to other connected neurons (axon to synapse to the other neuron's dendrite).
The received inputs are computed as a weighted sum, which is given to the activation function; if the sum exceeds the threshold value the neuron fires. The neuron is the basic processing unit that receives a set of inputs x1, x2, x3, ..., xn and their associated weights w1, w2, w3, ..., wn. The summation function computes the weighted sum of the inputs received by the neuron:
Sum = Σ_i x_i · w_i
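As a minimal illustrative sketch (not from the notes; the input values, weights, and threshold below are made up), this weighted-sum-and-fire behaviour can be written as:

```python
import numpy as np

def neuron_output(x, w, theta):
    """Fire (output 1) only if the weighted sum of the inputs reaches the threshold theta."""
    weighted_sum = np.dot(x, w)           # Sum = sum_i x_i * w_i
    return 1 if weighted_sum >= theta else 0

# Example: three inputs with their weights and a threshold of 0.5
print(neuron_output(np.array([1, 0, 1]), np.array([0.4, 0.3, 0.2]), 0.5))   # -> 1 (0.6 >= 0.5)
```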
Activation functions:
1. Identity (linear) function: The output is the same as the input, i.e., the weighted sum. This function is useful when we do not apply any threshold. The output value ranges between −∞ and +∞.
2. Binary step function: This function can be defined as
f(x) = 1 if x ≥ θ
f(x) = 0 if x < θ
where θ represents the threshold value. It is used in single-layer nets to convert the net input to an output that is binary (0 or 1).
7. ReLU Function: ReLU stands for Rectified Linear Unit.
Although it gives the impression of a linear function, ReLU has a derivative and allows backpropagation while remaining computationally efficient. The main catch is that the ReLU function does not activate all the neurons at the same time: a neuron is deactivated only if the output of the linear transformation is less than 0.
8. Softmax function: Softmax is an activation function that scales numbers (logits) into probabilities. The output of a softmax is a vector (say v) containing the probabilities of each possible outcome, and the probabilities in v sum to one over all possible outcomes or classes.
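The activation functions discussed above can be summarised in a short NumPy sketch (illustrative only; the test vectors are made up, and the softmax subtracts the maximum logit purely for numerical stability):

```python
import numpy as np

def binary_step(x, theta=0.0):
    """Outputs 1 where the net input reaches the threshold, else 0."""
    return np.where(x >= theta, 1, 0)

def relu(x):
    """Rectified Linear Unit: passes positive values through and zeroes out negatives."""
    return np.maximum(0, x)

def softmax(logits):
    """Scales a vector of logits into probabilities that sum to one."""
    exps = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return exps / np.sum(exps)

print(binary_step(np.array([-1.0, 0.2])))    # [0 1]
print(relu(np.array([-2.0, 3.0])))           # [0. 3.]
print(softmax(np.array([2.0, 1.0, 0.1])))    # three probabilities summing to 1
```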
Components of a perceptron
• A perceptron, the basic unit of a neural network, comprises essential components that
collaborate in information processing.
• Input Features: The perceptron takes multiple input features, each input feature represents a
characteristic or attribute of the input data.
• Weights: Each input feature is associated with a weight, determining the significance of
each input feature in influencing the perceptron’s output. During training, these weights are
adjusted to learn the optimal values.
• Summation Function: The perceptron calculates the weighted sum of its inputs using the
summation function. The summation function combines the inputs with their respective
weights to produce a weighted sum.
• Activation Function: The weighted sum is then passed through an activation function, which takes the summed value as input, compares it with the threshold, and provides the output as 0 or 1.
• Output: The final output of the perceptron is determined by the activation function's result. For example, in binary classification problems, the output might represent a predicted class (0 or 1).
• Bias: A bias term is often included in the perceptron model. The bias allows the model to
make adjustments that are independent of the input. It is an additional parameter that is
learned during training.
• Learning Algorithm (Weight Update Rule): During training, the perceptron learns by
adjusting its weights and bias based on a learning algorithm. A common approach is the
perceptron learning algorithm, which updates weights based on the difference between the
predicted output and the true output.
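A minimal sketch of the perceptron learning rule described in the list above; the learning rate, epoch count, and the AND-function training data are illustrative assumptions, not part of the notes:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    """Perceptron learning: nudge weights and bias by (target - prediction) for each sample."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else 0   # step activation
            error = target - pred
            w += lr * error * xi                         # weight update rule
            b += lr * error                              # bias update
    return w, b

# Learn the AND function (linearly separable, so the perceptron converges)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])   # [0, 0, 0, 1]
```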
Learning Rules
Delta Learning Rule and Gradient Descent
Developed by Widrow and Hoff, the delta rule is one of the most common learning rules. It is a supervised learning rule.
The delta rule is derived from the gradient descent method (back-propagation). It can also be applied to data that is not linearly separable, and it is called the continuous perceptron learning rule.
It updates the connection weights using the difference between the target and the output value, and it is the least mean square (LMS) learning algorithm.
The delta difference is measured by an error function, also called the cost function.
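An illustrative sketch of the delta (LMS) rule for a single linear unit, minimising the squared-error cost with per-sample gradient steps; the data and learning rate below are made up for the example:

```python
import numpy as np

def delta_rule(X, t, lr=0.05, epochs=200):
    """Delta / LMS rule: w <- w + lr * (target - output) * x for a continuous (linear) output."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            y = np.dot(w, xi)           # continuous output (no hard threshold)
            w += lr * (ti - y) * xi     # gradient step on the squared-error cost
    return w

# Fit the mapping t = 2*x + 1; the bias is folded in as a constant input of 1
X = np.array([[x, 1.0] for x in [0.0, 1.0, 2.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
print(delta_rule(X, t))   # approaches [2.0, 1.0]
```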
A Feed-Forward Neural Network in its simplest form is a single-layer perceptron. A sequence of inputs enters the layer and is multiplied by the weights. The weighted input values are then summed together to form a total. If the sum of the values is more than a predetermined threshold, which is normally set at zero, the output value is usually 1, and if the sum is less than the threshold, the output value is usually -1. The single-layer perceptron is a popular feed-forward neural network model that is frequently used for classification. The model may or may not contain a hidden layer, and there is no backpropagation. Based on the number of hidden layers, feed-forward networks are further classified into single-layered and multi-layered feed-forward networks.
Fully connected Neural Network:
A fully connected neural network consists of a series of fully connected layers that connect every neuron in one layer to every neuron in the next layer.
The major advantage of fully connected networks is that they are "structure agnostic", i.e. no special assumptions need to be made about the input.
Multilayer Perceptron:
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output layer with a single node for each output, and it can have any number of hidden layers, each with any number of nodes. Information flows forward through the network, while the errors used for weight adjustment are propagated backward during training (backpropagation). Every node in the multi-layer perceptron uses a sigmoid activation function, which takes real values as input and converts them to numbers between 0 and 1 using the sigmoid formula.
Feedback Neural Network:
Feedback networks, also known as recurrent neural networks or interactive neural networks, are deep learning models in which information can also flow in the backward direction; the network allows feedback loops. Feedback networks are dynamic in nature and powerful, and can become quite complicated at some stage of execution; neuronal connections can be made in any way. RNNs can process input sequences of different lengths by using their internal state, which can represent a form of memory. They can therefore be used for applications like speech recognition or handwriting recognition.
A Multilayer Perceptron has input and output layers, and one or more hidden layers with many
neurons stacked together. And while in the Perceptron the neuron must have an activation function
that imposes a threshold, like ReLU or sigmoid, neurons in a Multilayer Perceptron can use any
arbitrary activation function.
Multilayer Perceptron falls under the category of feedforward algorithms, because inputs are combined
with the initial weights in a weighted sum and subjected to the activation function, just like in the
Perceptron. But the difference is that each linear combination is propagated to the next layer.
Each layer feeds the next one with the result of its computation, its internal representation of the data. This goes all the way through the hidden layers to the output layer.
If the algorithm only computed the weighted sums in each neuron, propagated results to the output
layer, and stopped there, it wouldn’t be able to learn the weights that minimize the cost function. If the
algorithm only computed one iteration, there would be no actual learning.
Backpropagation
Backpropagation is the learning mechanism that allows the Multilayer Perceptron to iteratively adjust
the weights in the network, with the goal of minimizing the cost function.
There is one hard requirement for backpropagation to work properly. The function that combines
inputs and weights in a neuron, for instance the weighted sum, and the threshold function, for instance
ReLU, must be differentiable. These functions must have a bounded derivative, because Gradient
Descent is typically the optimization function used in MultiLayer Perceptron.
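As an illustrative sketch only (the network size, learning rate, iteration count, and XOR data are assumptions, not the notes' prescribed setup), the chain of weighted sum, differentiable sigmoid, and gradient step looks like this for one hidden layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets (illustrative)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)          # input  -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)          # hidden -> output
lr = 0.5

for _ in range(5000):
    # Forward pass: weighted sums followed by the differentiable sigmoid
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # Backward pass: propagate the error using the sigmoid derivative o * (1 - o)
    dY = (Y - T) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)

print(Y.round(2).ravel())   # typically close to [0, 1, 1, 0]; the result depends on the random initialisation
```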
Radial Basis Function Neural Network
These networks have a fundamentally different architecture from most neural network architectures. Most architectures consist of many layers and introduce nonlinearity by repeatedly applying nonlinear activation functions.
RBF network on the other hand only consists of an input layer, a single hidden layer, and an output
layer.
The input layer is not a computation layer; it just receives the input data and feeds it into the special hidden layer of the RBF network. The computation that happens inside the hidden layer is very different from that of most neural networks, and this is where the power of the RBF network comes from. The output layer performs the prediction task, such as classification or regression.
RBF Neural networks are conceptually similar to K-Nearest Neighbor (k-NN) models.
It is useful for interpolation, function approximation, time series prediction, and classification.
Different approaches are followed to determine the centres of the hidden-layer RBF neurons (a short sketch follows this list), including:
Random selection of fixed cluster centers
Self-organized selection of centers using k-means clustering
Supervised selection of centers
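A compact, assumption-laden illustration of how such a network might be assembled: Gaussian RBF activations around k-means centres, followed by a linear least-squares output layer. The toy data, the number of centres, and the Gaussian width are all made up:

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centers, sigma=1.0):
    """Gaussian RBF activation of every sample with respect to every centre."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy 1-D regression: approximate y = sin(x) on [0, 6]
X = np.linspace(0, 6, 60).reshape(-1, 1)
y = np.sin(X).ravel()

centers = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X).cluster_centers_
Phi = rbf_features(X, centers, sigma=0.7)        # hidden-layer activations
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # linear output layer
print(round(np.abs(Phi @ w - y).max(), 3))       # small approximation error
```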
Self-Organizing Map (SOM)
A SOM does not learn by backpropagation with Stochastic Gradient Descent (SGD); it uses competitive learning to adjust the weights of its neurons. Artificial neural networks often utilise competitive learning models to classify input without the use of labelled data.
Used: in dimension reduction, to reduce the data by creating a spatially organised representation; it also helps to discover the correlation between data.
Self-organizing maps have two layers: the first is the input layer and the second is the output layer or feature map.
A SOM does not have an activation function in its neurons; the weights are passed directly to the output layer without further processing.
A sample input vector is chosen at random, and the mapped weight vectors are examined to determine which weight vector most accurately represents the chosen sample. Each weight vector has neighbouring weights that are close to it. The chosen (winning) weight is allowed to become more like the randomly chosen sample vector. This encourages the map to grow and take on new forms. In a 2D feature space, the units typically form hexagonal or square grids. This entire process is repeated a large number of times, often more than 1,000 iterations.
Each node is examined to determine which weights are most similar to the input vector; the winning node is called the Best Matching Unit (BMU).
The neighbourhood of the Best Matching Unit is then determined. Over time, the number of neighbours tends to decline.
The winning weight then evolves to become more like the sample vector, and the surrounding neighbours change in the same direction. The closer a node is to the BMU, the more its weights change, and the farther it is from the BMU, the less it changes. This process is repeated for N iterations.
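The winner-takes-most update described in the steps above can be sketched as follows; this simplified version updates only the best matching unit (a full SOM would also pull in its neighbours with a shrinking radius), and the data, map size, and learning rate are illustrative:

```python
import numpy as np

def train_som(X, n_units=2, lr=0.6, epochs=10, seed=0):
    """Competitive learning: move only the best matching unit towards each sample."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_units, X.shape[1]))                 # one weight vector per map unit
    for _ in range(epochs):
        for x in X:
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best matching unit
            W[bmu] += lr * (x - W[bmu])                   # pull the winner towards x
    return W

X = np.array([[1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
print(train_som(X))   # each row of the map ends up summarising one group of inputs
```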
Advantages and Disadvantages of ANN
Clustering
Clustering is an unsupervised learning strategy to group the given set of data points into a number of
groups or clusters.
Arranging the data into a reasonable number of clusters helps to extract underlying patterns in the data
and transform the raw data into meaningful knowledge. Example application areas include the
following:
Pattern recognition
Image segmentation
Profiling users or customers
Categorization of objects into a number of categories or groups
Detection of outliers or noise in a pool of data items
Clusters are represented by centroids. Example: if the input points are (3,3), (2,6) and (7,9), the centroid is ((3+2+7)/3, (3+6+9)/3) = (4,6). Clusters should not overlap, and every cluster should represent only one class.
Clustering algorithms need a measure to find the similarity or dissimilarity among the objects to
group them. Similarity and Dissimilarity are collectively known as proximity measures. This is
used by a number of data mining techniques, such as clustering, nearest neighbour classification,
and anomaly detection.
Distance measures are known as dissimilarity measures, as they indicate how one object differs from another. Measures like cosine similarity indicate the similarity among objects.
Distance measures and similarity measures are two sides of the same coin: more distance indicates less similarity, and vice versa.
If a distance measure satisfies the conditions of non-negativity, symmetry, and the triangle inequality, it is called a metric.
1. Quantitative variables
A) Euclidean distance: This is one of the most important and common distance measures. It is also called the L2 norm.
Advantage: the distance does not change with the addition of new objects.
Disadvantage:
i) If the unit of measurement changes, the resulting Euclidean or squared Euclidean distance changes drastically.
C) Chebyshev distance: Also known as the maximum value distance, this is the maximum of the absolute differences between the coordinates of a pair of objects. This distance is also called the supremum distance, Lmax, or the L∞ norm.
D) Minkowski distance: In general, all the above distance measures can be generalised as
d(x, y) = ( Σ_i |x_i − y_i|^q )^(1/q)
Here q is the parameter. When q is 1, the distance measure is the city block (Manhattan) distance; when q is 2, it is the Euclidean distance; and when q tends to infinity, it is the Chebyshev distance.
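These quantitative distance measures follow directly from the definitions above; a short illustrative sketch with made-up points:

```python
import numpy as np

def minkowski(x, y, q):
    """L_q distance: q=1 gives city-block, q=2 Euclidean, q -> infinity Chebyshev."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x, y = np.array([2, 3, 4]), np.array([1, 5, 6])
print(minkowski(x, y, 1))        # Manhattan / city-block distance = 5
print(minkowski(x, y, 2))        # Euclidean distance = 3
print(np.max(np.abs(x - y)))     # Chebyshev (L-infinity) distance = 2
```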
Binary attributes: Binary attributes have only two values. The distance measures discussed above cannot be applied to find the distance between objects that have binary attributes. To find the distance among objects with binary attributes, a contingency table is used.
Hamming Distance: Hamming distance is a metric for comparing two binary data strings. While
comparing two binary strings of equal length, Hamming distance is the number of bit positions in
which the two bits are different. It is used for error detection or error correction when data is
transmitted over computer networks.
Example
Suppose there are two strings 1101 1001 and 1001 1101.
11011001 ⊕ 10011101 = 01000100. Since this contains two 1s, the Hamming distance is d(11011001, 10011101) = 2.
Cosine Similarity
Cosine similarity is a metric used to measure how similar the documents are
irrespective of their size.
It measures the cosine of the angle between two vectors projected in a multi-dimensional space.
The cosine similarity is advantageous because even if two similar documents are far apart by the Euclidean distance (due to the size of the documents), chances are they may still be oriented closer together.
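Both cosine similarity and the Hamming distance from the previous subsection reduce to one-liners; an illustrative sketch using the vectors from the examples in this module:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    return int(np.sum(np.array(a) != np.array(b)))

print(round(cosine_similarity(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 0])), 2))   # ~0.41
print(hamming([1, 1, 0, 1, 1, 0, 0, 1], [1, 0, 0, 1, 1, 1, 0, 1]))                   # 2
```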
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group the
unlabeled datasets into a cluster and also known as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.
In single linkage hierarchical clustering, the distance between two clusters is defined as the
shortest distance between two points in each cluster. For example, the distance between clusters
“A” and “B” to the left is equal to the length of the line between their two closest points.
Average Linkage : In average linkage hierarchical clustering, the distance between two clusters is
defined as the average distance between each point in one cluster to every point in the other cluster.
For example, the distance between clusters "A" and "B" is equal to the average length of the arrows connecting the points of one cluster to the points of the other.
Mean-shift algorithm basically assigns the datapoints to the clusters iteratively by shifting points
towards the highest density of datapoints i.e. cluster centroid.
The difference between the K-Means algorithm and Mean-Shift is that the latter does not need the number of clusters to be specified in advance, because the number of clusters is determined by the algorithm from the data.
Advantages:
No model assumptions
Suitable for all non-convex shapes
Only one parameter of the window, that is bandwidth is required
Robust to noise
No issues of local minima or premature termination
Disadvantages
Selecting the bandwidth is a challenging task: if it is too large, many clusters are missed, and if it is too small, many points are left out and convergence becomes a problem.
The number of clusters cannot be specified, so the user has no control over this parameter.
Partitioning Clustering
This type of clustering divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the number
of pre-defined groups. The cluster center is created in such a way that the distance between the
data points of one cluster is minimum as compared to another cluster centroid.
SSE = Σ_{i=1}^{k} Σ_{x ∈ C_i} dist(c_i, x)²
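The SSE criterion can be evaluated directly for any assignment of points to clusters; a short sketch (the data and the k = 2 assignment are illustrative):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum, over clusters, of squared distances from each point to its cluster centroid."""
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]], dtype=float)
labels = np.array([0, 0, 1, 1])                                      # an example k=2 assignment
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(centroids)                     # [[ 5.   6.5] [14.   7. ]]
print(sse(X, labels, centroids))     # total within-cluster scatter
```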
A cluster is a dense region of points that is separated from other regions of high density by low-density regions.
Density-based methods are used when the clusters are irregular or intertwined, and when noise and outliers are present.
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region
of high point density, separated from other such clusters by contiguous regions of low point
density.
There are three types of points after the DBSCAN clustering is complete:
Core — This is a point that has at least m points within distance n from itself.
Border — This is a point that has at least one Core point at a distance n.
Noise — This is a point that is neither a Core nor a Border. And it has less than m points
within distance n from itself.
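A hedged usage sketch with scikit-learn (the eps and min_samples values and the toy data are illustrative, not prescribed by the notes): DBSCAN assigns cluster ids to core and border points and labels noise points with -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should come out as noise
X = np.array([[1, 1], [1.2, 1.1], [0.9, 0.8],
              [8, 8], [8.1, 7.9], [7.9, 8.2],
              [4, 20]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; -1 marks the noise point
```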
Grid-Based Approaches
A grid-based clustering method takes a space-driven approach by partitioning the embedding space into cells independently of the distribution of the input objects.
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the
object space into a finite number of cells that form a grid structure on which all of the
operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically independent
of the number of data objects, yet dependent on only the number of cells.
Subspace Clustering
CLIQUE is a density-based and grid-based subspace clustering algorithm, useful for finding
clustering in subspace.
Concept of Dense cell
CLIQUE partitions each dimension into several overlapping intervals and thereby divides the data space into cells. The algorithm then determines whether each cell is dense or sparse; a cell is considered dense if its density exceeds a threshold value. Density is defined as the ratio of the number of points to the volume of the region. In one pass, the algorithm finds the number of cells, the number of points in each cell, and so on, and then combines the dense cells. For that, the algorithm uses contiguous intervals and the set of dense cells.
MONOTONICITY Property
CLIQUE uses the anti-monotonicity (Apriori) property: all subsets of a frequent itemset are frequent and, conversely, if a subset is infrequent then its supersets are also infrequent.
Two popular probability model-based clustering methods are Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs). Other than these, two related approaches are:
1. Fuzzy Clustering
2. EM algorithm
Fuzzy Clustering :
Fuzzy Clustering is a type of clustering algorithm in machine learning that allows a data point to belong
to more than one cluster with different degrees of membership. Unlike traditional clustering algorithms,
such as k-means or hierarchical clustering, which assign each data point to a single cluster, fuzzy
clustering assigns a membership degree between 0 and 1 for each data point for each cluster.
Consider two clusters ci and cj; an element x can belong to both clusters. The strength of the association of an object with a cluster is given as wij, and the value of wij lies between 0 and 1. The membership weights of an object, added over all clusters, sum to 1.
Expectation Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a statistical method used for estimating
parameters in statistical models when you have incomplete or missing data. It's commonly used
in unsupervised machine learning tasks such as clustering and Gaussian Mixture Model (GMM)
fitting.
Given a mix of distributions, data can be generated by randomly picking a distribution and
generating the point. Gaussian distribution is a bell shaped curve.
1. Initialization: Start with initial estimates of the model parameters. These initial values can be
random or based on some prior knowledge.
2. E-step (Expectation):
In this step, you compute the expected values (expectation) of the latent (unobserved)
variables given the observed data and the current parameter estimates.
This involves calculating the posterior probabilities or likelihoods of the missing data or
latent variables.
Essentially, you're estimating how likely each possible value of the latent variable is,
given the current model parameters.
3. M-step (Maximization):
In this step, you update the model parameters to maximize the expected log-likelihood
found in the E-step.
This involves finding the parameters that make the observed data most likely given the
estimated values of the latent variables.
The M-step involves solving an optimization problem to find the new parameter values.
4. Iteration:
Repeat the E-step and M-step alternately until convergence criteria are met. Common
convergence criteria include a maximum number of iterations, a small change in
parameter values, or a small change in the likelihood.
5. Termination:
Once the EM algorithm converges, you have estimates of the model parameters that
maximize the likelihood of the observed data.
6. Result:
The final parameter estimates can be used for various purposes, such as clustering,
density estimation, or imputing missing data.
The EM algorithm is widely used in various fields, including machine learning, image
processing, and bioinformatics.
One of its notable applications is in Gaussian Mixture Models (GMMs), where it's used to
estimate the means and covariances of Gaussian distributions that are mixed to model
complex data distributions.
It's important to note that the EM algorithm can sometimes get stuck in local optima, so the
choice of initial parameter values can affect the results. To mitigate this, you may run the
algorithm multiple times with different initializations and select the best result.
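A hedged usage sketch of EM fitting a Gaussian Mixture Model with scikit-learn; the synthetic data and the choice of two components are illustrative, and n_init restarts the algorithm from several initialisations, as suggested above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians with means 0 and 5
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
print(gmm.means_.ravel())            # close to the true means 0 and 5
print(gmm.predict([[0.2], [4.8]]))   # soft memberships turned into hard labels
```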
Here, α and β are parameters. The Dunn index is a useful measure that can combine both cohesion and separation.
Silhouette Coefficient
This metric measures how well each data point fits into its assigned cluster and ranges from -1 to
1. A high silhouette coefficient indicates that the data points are well-clustered, while a low
coefficient indicates that the data points may be assigned to the wrong cluster.
--
Hierarchical Clustering Algorithm
Use the dataset below, apply hierarchical methods, and show the dendrogram.
SNo. X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
5. 20 8
Solution
The Euclidean distance between every pair of objects is computed and is shown in the following table.

Objects   1     2     3      4      5
1         -     5     9      9.85   17.26
2               -     5.83   9.49   13
3                     -      5.66   8.94
4                            -      4.12
5                                   -
The minimum distance is 4.12. Therefore, the items 1 and 4 are clustered together. The resultant
table is given as shown in the following Table.
Table After Iteration 1
Clusters   {1,4}   2      3      5
{1,4}      -       5      5.66   4.12
2                  -      5.83   13
3                         -      8.94
5                                -
The distances between the group {1,4} and the items 2, 3 and 5 are computed using the single-linkage (minimum) formula.
Thus, the distance between {1,4} and {2} is:
min{ d(1,2), d(4,2) } = min{5, 9.49} = 5
The distance between {1,4} and {3} is:
min{ d(1,3), d(4,3) } = min{9, 5.66} = 5.66
The distance between {1,4} and {5} is:
min{ d(1,5), d(4,5) } = min{17.26, 4.12} = 4.12
The minimum distance in the above table is 4.12. Therefore, {1,4} and {5} are combined. This results in the following table.

Clusters   {1,4,5}   2      3
{1,4,5}    -         5      5.66
2                    -      5.83
3                           -

The minimum is 5. Therefore {1,4,5} and {2} are combined, and finally the result is combined with {3}.
Therefore, the order of clustering is {1,4}, then {5}, then {2}, and finally {3}.
Complete Linkage or MAX or Clique
Here, from the first iteration table the minimum is taken and {1,4} is combined. Then the maximum (complete-link) distances are computed:

Clusters   3      5
3          -      8.94
5                 -

So the minimum is 8.94; therefore {3,5} is combined, leaving the clusters {1,4}, {3,5}, and {2}.
The minimum is then 9.49, so {1,4} and {2} are combined. The order of clustering is {1,4}, then {1,4} and {2}, and then {3,5}.
Hint: The same procedure is used for the average-link algorithm, where the average distance over all pairs of points across the clusters is used to form the clusters.
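The single, complete, and average linkage merges for this dataset can be reproduced with SciPy (a hedged sketch; distances are recomputed from the raw points, so small rounding differences from the hand-worked tables are expected):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9], [20, 8]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)     # each row: [cluster i, cluster j, merge distance, size]
    print(method)
    print(np.round(Z, 2))

# dendrogram(Z) can be drawn with matplotlib to show the tree of merges
```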
K-means clustering problems
Consider the data shown in the following table. Use the k-means algorithm with k = 2 and show the result.
Table Sample Data
SNO X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
Solution
Let us assume the seed points are (3,5) and (16,9). This is shown in the following table
as starting clusters.
Table Initial Cluster Table
Cluster 1 Cluster 2
(3,5) (16,9)
Iteration 1: Compare all the data points (samples) with the centroids and assign each point to the nearest centroid.
Take the sample object 2 and compare it with the two centroids as follows:
Dist(2, centroid 1) = √((7 − 3)² + (8 − 5)²) = √(16 + 9) = √25 = 5
Dist(2, centroid 2) = √((7 − 16)² + (8 − 9)²) = √(81 + 1) = √82 ≈ 9.06
Object 2 is closer to the centroid of cluster 1 and hence is assigned to cluster 1. This is shown in the table below. For object 3:
Dist(3, centroid 1) = √((12 − 3)² + (5 − 5)²) = √81 = 9
Dist(3, centroid 2) = √((12 − 16)² + (5 − 9)²) = √(16 + 16) = √32 ≈ 5.66
Object 3 is closer to the centroid of cluster 2 and hence is assigned to cluster 2. After iteration 1 the clusters are:

Cluster 1        Cluster 2
(3,5), (7,8)     (12,5), (16,9)
The new centroids are cluster 1: ((3+7)/2, (5+8)/2) = (5, 6.5) and cluster 2: ((12+16)/2, (5+9)/2) = (14, 7).
Iteration 2: take the sample object 2, (7,8), and compare it with the new centroids:
Dist(2, centroid 1) = √((7 − 5)² + (8 − 6.5)²) = √6.25 = 2.5
Dist(2, centroid 2) = √((7 − 14)² + (8 − 7)²) = √(49 + 1) = √50 ≈ 7.07
Object 2 is closer to the centroid of cluster 1 and hence remains in the same cluster. Take the sample object 3 and compute again.
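The remaining assignments can be checked with a short script; this is an illustrative sketch that simply repeats the assign-and-recompute loop the notes started, using the same seed points:

```python
import numpy as np

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]], dtype=float)
centroids = np.array([[3, 5], [16, 9]], dtype=float)         # initial seed points

for _ in range(5):                                           # a few iterations suffice here
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                            # assign each point to the nearest centroid
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(labels)       # [0 0 1 1]: clusters {(3,5),(7,8)} and {(12,5),(16,9)}
print(centroids)    # [[ 5.   6.5] [14.   7. ]]
```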
Solution:
(Network diagram: inputs X1 and X2 with a bias unit X0, hidden unit X3 and output unit X4, connected through the weights w13, w23 and w34 and the bias weights θ3 and θ4, implementing the AND NOT function.)

X1   X2   X3 (X1 AND X2)   X4 (NOT X3)
0    0    0                1
0    1    0                1
1    0    0                1
1    1    1                0
ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input layer unit   Net input Ij   Output Oj
X1                 0              0
X2                 1              1
2. Calculate net inputs and outputs in hidden and output layer as shown in Table 4.
Table 4: Net Input and Output Calculation in Hidden and Output layer
Unit j   Net input Ij                                        Output Oj
X3       (computed from X1, X2 and the initial weights)      O3 = 0.450
X4       I4 = O3·w34 + X0·θ4                                 O4 = 1/(1 + e^(−I4))
         = (0.450 × 0.3) + 1 × (−0.3) = −0.165               = 1/(1 + e^(0.165)) = 0.458
3. Calculate the error:
Error = Desired output − Calculated output = 1 − 0.458 = 0.542
ITERATION 2:
Step 1: FORWARD PROPAGATION
I4 = O3·w34 + X0·θ4 = (0.451 × 0.324) + 1 × (−0.246) = −0.099
O4 = 1/(1 + e^(−I4)) = 1/(1 + e^(0.099)) = 0.475
2. Calculate the error:
Error = Desired output − Calculated output = 1 − 0.475 = 0.525
Iteration   Error
1           0.542
2           0.525
Reduction in error = 0.542 − 0.525 = 0.017
In iteration 2 the error is reduced to 0.525. This process continues until the desired output is achieved.
How does a Multi-Layer Perceptron solve the XOR problem? Design an MLP with backpropagation to implement the XOR Boolean function.
Solution:
X1 X2 Y
0 0 1
0 1 0
1 0 0
1 1 1
(Network diagram: inputs X1 and X2 with bias X0, hidden units X3 and X4, and output unit X5. The initial weights and bias values marked on the diagram are 0.1, −0.3, −0.2, 0.4, 0.4, 0.2, 0.2, −0.3 and −0.3.)
Table 11: Error calculation for each unit in the output layer and hidden layer

For the output layer, Error_k:
Unit X5: Error5 = O5 (1 − O5)(T − O5), with target T = 1
         = 0.407 × (1 − 0.407) × (1 − 0.407) = 0.143

For the hidden layer, Error_j = Oj (1 − Oj) Σ_k Error_k w_jk:
Unit X4: Error4 = O4 (1 − O4) Error5 w45 = 0.622 × (1 − 0.622) × (−0.3) × 0.143 = −0.010
Unit X3: Error3 = O3 (1 − O3) Error5 w35 = 0.549 × (1 − 0.549) × 0.143 × 0.2 = −0.007
2. Calculate the net input and output in the hidden layer and output layer, as shown in Table 15.
Table 15: Net input and output calculation in the hidden layer and output layer

Unit X3: I3 = X1·W13 + X2·W23 + X0·θ3 = 1 × (−0.194) + 0 × 0.2 + 1 × 0.405 = 0.211
         O3 = 1/(1 + e^(−I3)) = 1/(1 + e^(−0.211)) = 0.552
Unit X4: I4 = X1·W14 + X2·W24 + X0·θ4 = 1 × 0.392 + 0 × (−0.3) + 1 × 0.092 = 0.484
         O4 = 1/(1 + e^(−I4)) = 1/(1 + e^(−0.484)) = 0.618
Unit X5: I5 = O3·W35 + O4·W45 + X0·θ5 = 0.552 × 0.154 + 0.618 × (−0.288) + 1 × (−0.185) = −0.282
         O5 = 1/(1 + e^(−I5)) = 1/(1 + e^(0.282)) = 0.429
Consider a network architecture with 4 input units and 2 output units, and four training samples, each a vector of length 4.
Training samples
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial weight matrix:
Unit 1: [0.2  0.8  0.5  0.1]
Unit 2: [0.3  0.5  0.4  0.6]
Identify an algorithm that can learn without supervision. How would you cluster the samples as expected?
Solution:
Use Self Organizing Feature Map (SOFM)
Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix:
Unit 1: [0.2  0.8  0.5  0.1]
Unit 2: [0.3  0.5  0.4  0.6]
Iteration 3:
Training Sample X3: (1, 0, 0, 1)
Weight matrix:
Unit 1: [0.68  0.92  0.80  0.08]
Unit 2: [0.12  0.2   0.76  0.84]
Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix:
Unit 1: [0.68  0.92  0.80  0.08]
Unit 2: [0.65  0.08  0.3   0.94]
This process is continued for many epochs until the feature map doesn’t change.
The smaller the angle, the higher the cosine similarity.
Consider two documents P1 and P2:
◦ If the distance between them is more, they are less similar.
◦ If the distance between them is less, they are more similar.
1. Consider the following data and calculate the Euclidean, Manhattan and Chebyshev distances.
a. (2 3 4) and (1 5 6)
Solution
Euclidean distance = √((2 − 1)² + (3 − 5)² + (4 − 6)²) = √(1 + 4 + 4) = √9 = 3
Manhattan distance = |2 − 1| + |3 − 5| + |4 − 6| = 1 + 2 + 2 = 5
Chebyshev distance = max(1, 2, 2) = 2
b. (2 2 9) and (7 8 9)
Solution
Euclidean distance = √((2 − 7)² + (2 − 8)² + (9 − 9)²) = √(25 + 36 + 0) = √61 ≈ 7.81
Manhattan distance = |2 − 7| + |2 − 8| + |9 − 9| = 5 + 6 + 0 = 11
Chebyshev distance = max(5, 6, 0) = 6
2. Find the cosine similarity, SMC and Jaccard coefficients for the following binary data:
a. (1 0 1 1) and (1 1 0 0)
Solution
1 0 1 1
1 1 0 0
Here a (0-0 matches) = 0, b (0-1 mismatches) = 1, c (1-0 mismatches) = 2, d (1-1 matches) = 1.
SMC = (a + d)/(a + b + c + d) = 1/4 = 0.25
Jaccard coefficient = d/(b + c + d) = 1/4 = 0.25
Cosine similarity = (1×1 + 0×1 + 1×0 + 1×0)/(√3 × √2) = 1/√6 ≈ 0.41
b. (1 0 0 0 1) and (1 0 0 0 0 1)
Solution
The two vectors have different lengths, so they cannot be compared directly. Assuming the second vector is (1 1 0 0 0), compare (1 0 0 0 1) and (1 1 0 0 0):
1 0 0 0 1
1 1 0 0 0
Here a (0-0 matches) = 2, b (0-1 mismatches) = 1, c (1-0 mismatches) = 1, d (1-1 matches) = 1.
SMC = (a + d)/(a + b + c + d) = 3/5 = 0.6
Jaccard coefficient = d/(b + c + d) = 1/3 ≈ 0.33
Cosine similarity = 1/(√2 × √2) = 0.5
Solution
It differs in two positions; therefore Hamming distance is 2
b. (1 1 1 0 0) and (0 0 1 1 1)
Solution
It differs in four positions; therefore, Hamming distance is 4
Find the distance between (yellow, red, green) and (red, green, yellow), treating the colours as ordinal values with ranks yellow = 1, red = 2, green = 3.
Solution
For ordinal attributes, the distance between two values is |rank1 − rank2| / (n − 1), where n is the number of distinct values (here n = 3).
Distance between (yellow, red) = |1 − 2| / 2 = 0.5
Distance between (red, green) = |2 − 3| / 2 = 0.5
Distance between (green, yellow) = |3 − 1| / 2 = 1
Therefore, the distance between (yellow, red, green) and (red, green, yellow) is (0.5, 0.5, 1).
b. (bread, butter, milk) and (milk, sandwich, tea)
Solution
Assuming the ranking bread = 1, butter = 2, milk = 3, sandwich = 4, tea = 5 (n = 5):
Distance between (bread, milk) = |1 − 3| / (5 − 1) = 2/4 = 1/2
Distance between (butter, sandwich) = |2 − 4| / (5 − 1) = 2/4 = 1/2
Distance between (milk, tea) = |3 − 5| / (5 − 1) = 2/4 = 1/2
Therefore, the distance between (bread, butter, milk) and (milk, sandwich, tea) is (1/2, 1/2, 1/2).
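The ordinal-attribute distance |rank1 − rank2| / (n − 1) used in both solutions can be wrapped in a small helper; the function name and the rank dictionaries below are the assumptions used above:

```python
def ordinal_distance(a, b, ranks):
    """Distance between two ordinal values: rank difference divided by the rank range (n - 1)."""
    n = len(ranks)
    return abs(ranks[a] - ranks[b]) / (n - 1)

colours = {"yellow": 1, "red": 2, "green": 3}
print([ordinal_distance(a, b, colours)
       for a, b in [("yellow", "red"), ("red", "green"), ("green", "yellow")]])   # [0.5, 0.5, 1.0]

foods = {"bread": 1, "butter": 2, "milk": 3, "sandwich": 4, "tea": 5}
print([ordinal_distance(a, b, foods)
       for a, b in [("bread", "milk"), ("butter", "sandwich"), ("milk", "tea")]])  # [0.5, 0.5, 0.5]
```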