
Module 5 AIML Notes

The document discusses artificial neural networks and their structure and components. It describes how artificial neurons are modeled after biological neurons and connected in layers. It also explains key concepts like activation functions, perceptrons, weights, and how neural networks learn through adjusting weights.


Module 5: Artificial Neural Networks, Clustering


The term "artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modelled after the brain.

An artificial neural network is a computational network based on the biological neural
networks that make up the structure of the human brain.

Just as the human brain has neurons interconnected with each other, artificial neural
networks have neurons that are linked to each other in the various layers of the network.

These neurons are known as nodes.

The biological neuron consists of four main parts:

• dendrites: nerve fibres carrying electrical signals to the cell.
• cell body: computes a non-linear function of its inputs.
• axon: a single long fibre that carries the electrical signal from the cell body to other neurons.
• synapse: the point of contact between the axon of one cell and the dendrite of
another, regulating a chemical connection whose strength affects the input to the cell.

Dendrites are tree-like networks of nerve fibres connected to the cell body.

An axon is a single, long connection extending from the cell body and carrying signals
from the neuron. The end of the axon splits into fine strands. Each strand terminates in a
small bulb-like organ called a synapse. It is through the synapse that the neuron passes its
signals to other nearby neurons. The receiving ends of these synapses on the nearby
neurons can be found both on the dendrites and on the cell body. There are approximately
10^4 synapses per neuron in the human body. An electric impulse is passed between synapse
and dendrites. It is a chemical process which results in an increase or decrease in the
electric potential inside the body of the receiving cell. If the electric potential reaches a
threshold value, the receiving cell fires, and a pulse (action potential) of fixed strength
and duration is sent through the axon to the synaptic junction of the cell. After that, the cell
has to wait for a period called the refractory period.

ARTIFICIAL NEURONS:

Artificial neurons are like biological neurons in that they are linked to each other in the
various layers of the network. These neurons are known as nodes.

A node or neuron can receive one or more inputs and process them. Artificial neurons are
connected to other neurons by connection links. Each connection link is associated with a
synaptic weight. The structure of a single neuron is shown below:

Simple Model of an ANN

The first mathematical model of a biological neuron was designed by McCulloch and Pitts
in 1943. It includes 2 steps:

1. It receives weighted inputs from other neurons.


2. It operates with a threshold function or activation function.

Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), passes
the output through a cable like structure to other connected neurons (axon to synapse to
other neuron’s dendrite).

The received inputs are combined into a weighted sum, which is given to the activation function, and if
the sum exceeds the threshold value the neuron fires. The neuron is the basic processing unit that
receives a set of inputs x1, x2, x3, …, xn and their associated weights w1, w2, w3, …, wn. The
summation function computes the weighted sum of the inputs received by the neuron.

$\text{Sum} = \sum_{i=1}^{n} x_i w_i$
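
As an illustration of the two steps above, here is a minimal sketch of a threshold neuron. The function names, the AND-gate weights and the threshold of 2 are assumptions chosen for the example, and the neuron is taken to fire when the sum is greater than or equal to the threshold.

```python
def weighted_sum(inputs, weights):
    """Return the sum of x_i * w_i over all inputs."""
    return sum(x * w for x, w in zip(inputs, weights))


def threshold_neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum reaches the threshold, else 0."""
    return 1 if weighted_sum(inputs, weights) >= threshold else 0


# Hypothetical example: a two-input neuron that behaves like a logical AND.
and_weights = [1, 1]
and_threshold = 2
outputs = [threshold_neuron([a, b], and_weights, and_threshold)
           for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) produces a weighted sum of 2, so only that input makes the neuron fire.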

Activation functions:

An activation function is a function added to an artificial neural network in order to
help the network learn complex patterns in the data. By analogy with the neurons in our
brains, the activation function decides at the end what is to be fired to the next neuron.
Typical activation functions can be linear or non-linear. Linear functions are useful when
the input values can be classified into one of two groups and are generally used in binary
perceptrons. Non-linear functions are continuous functions that map the input into a range
such as (0,1) or (-1,1). These functions are useful in learning high-dimensional or complex
data such as audio, video and images.

Activation functions:

1. Identity function or linear function: It is a linear function defined as
$f(x) = x$ for all $x$.

The output is the same as the input, i.e. the weighted sum. The function is useful when we
do not apply any threshold. The output value ranges between –∞ and +∞.

2. Binary step function: This function is defined as
$f(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ 0 & \text{if } x < \theta \end{cases}$
where θ represents the threshold value. It is used in single-layer nets to convert the net
input to an output that is binary (0 or 1).

3. Bipolar step function: This function is defined as
$f(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ -1 & \text{if } x < \theta \end{cases}$
where θ represents the threshold value. It is used in single-layer nets to convert the net
input to an output that is bipolar (+1 or -1).

4. Sigmoid function: It is used in backpropagation nets. There are two types:
a) Binary sigmoid function: It is also termed the logistic sigmoid function or
unipolar sigmoid function. It is defined as

$f(x) = \frac{1}{1 + e^{-\lambda x}}$

where λ represents the steepness parameter. The range of this sigmoid function is 0 to 1.
b) Bipolar sigmoid function: This function is defined as

$f(x) = \frac{2}{1 + e^{-\lambda x}} - 1 = \frac{1 - e^{-\lambda x}}{1 + e^{-\lambda x}}$

where λ represents the steepness parameter and the sigmoid range is between -1 and +1.
5. Ramp function: The ramp function is defined as:

It is a linear function whose upper and lower limits are fixed.

6. Tanh (hyperbolic tangent) function: The tanh function is very similar to the
sigmoid/logistic activation function and has the same S-shape, the difference being its
output range of -1 to 1. In tanh, the larger (more positive) the input, the closer the output
is to 1.0, whereas the smaller (more negative) the input, the closer the output is to -1.0.

7. ReLU function: ReLU stands for Rectified Linear Unit. It is defined as $f(x) = \max(0, x)$.

Although it gives the impression of a linear function, ReLU has a derivative and
allows for backpropagation while remaining computationally efficient. The main point
is that the ReLU function does not activate all the neurons at the same time: a neuron
is deactivated only if the output of its linear transformation is less than 0.

8. Softmax function: Softmax is an activation function that scales
numbers/logits into probabilities. The output of softmax is a vector (say v)
with the probabilities of each possible outcome. The probabilities in vector v sum
to one over all possible outcomes or classes.
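
The activation functions listed above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and default parameter values are assumptions, the sigmoid formulas are the standard logistic forms with steepness parameter `lam`, and the ramp function is left out because its limits are not stated here.

```python
import math


def identity(x):
    return x


def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0


def bipolar_step(x, theta=0.0):
    return 1 if x >= theta else -1


def binary_sigmoid(x, lam=1.0):
    # Logistic sigmoid; output lies in (0, 1).
    return 1.0 / (1.0 + math.exp(-lam * x))


def bipolar_sigmoid(x, lam=1.0):
    # Rescaled logistic sigmoid; output lies in (-1, 1).
    return 2.0 / (1.0 + math.exp(-lam * x)) - 1.0


def relu(x):
    return max(0.0, x)


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(binary_sigmoid(0.0))       # 0.5
print(relu(-3.0), relu(2.5))     # 0.0 2.5
print(softmax([1.0, 2.0, 3.0]))  # three probabilities that sum to 1
```

With `lam=2.0` the bipolar sigmoid coincides with `math.tanh`, which is one way to see why the two curves share the same shape and range.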

Artificial Neural Network Structure


Artificial neural networks are computational models inspired by the human brain:
• They are massively parallel, distributed systems made up of simple processing units (neurons).
• Synaptic connection strengths among neurons are used to store the acquired knowledge.
• Knowledge is acquired by the network from its environment through a learning process.

The neural network is constructed from 3 types of layers:
• Input layer: the initial data for the neural network.
• Hidden layers: intermediate layers between the input and output layers, where the
computation is done.
• Output layer: produces the result for the given inputs.
PERCEPTRON AND LEARNING THEORY
• The perceptron is also a simplified model of a biological neuron.
• The perceptron is an algorithm for supervised learning of binary classifiers. It is a type of
linear classifier, i.e. a classification algorithm that makes all of its predictions based on a
linear predictor function combining a set of weights with the feature vector.
• One type of ANN system is based on a unit called a perceptron.

Components of a perceptron

• A perceptron, the basic unit of a neural network, comprises essential components that
collaborate in information processing.

• Input Features: The perceptron takes multiple input features, each input feature represents a
characteristic or attribute of the input data.

• Weights: Each input feature is associated with a weight, determining the significance of
each input feature in influencing the perceptron’s output. During training, these weights are
adjusted to learn the optimal values.

• Summation Function: The perceptron calculates the weighted sum of its inputs using the
summation function. The summation function combines the inputs with their respective
weights to produce a weighted sum.

• Activation Function: The weighted sum is then passed through an activation function, which
takes the summed value as input, compares it with the threshold, and provides the output as 0
or 1.

• Output: The final output of the perceptron, is determined by the activation function’s result.
For example, in binary classification problems, the output might represent a predicted class
(0 or 1).

• Bias: A bias term is often included in the perceptron model. The bias allows the model to
make adjustments that are independent of the input. It is an additional parameter that is
learned during training.

• Learning Algorithm (Weight Update Rule): During training, the perceptron learns by
adjusting its weights and bias based on a learning algorithm. A common approach is the
perceptron learning algorithm, which updates weights based on the difference between the
predicted output and the true output.
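
The components above can be put together in a short sketch of the perceptron learning algorithm. The learning rate, the number of epochs, the zero initial weights and the OR-gate training data are all assumptions made for this example, and the activation is a step function with threshold 0 applied to the weighted sum plus bias.

```python
def predict(weights, bias, x):
    """Step activation applied to the weighted sum plus bias."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total >= 0 else 0


def train_perceptron(samples, labels, lr=0.1, epochs=20):
    """Adjust weights and bias by the error (target minus prediction)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Hypothetical training set: the logical OR function.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)  # [0, 1, 1, 1]
```

Because OR is linearly separable, the updates stop changing the weights after a few epochs and every training example is classified correctly.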

Learning Rules

Learning in a neural network is performed by adjusting the network weights in order to
minimize the difference between the desired and estimated output.

Delta Learning Rule and Gradient Descent

Developed by Widrow and Hoff, the delta rule is one of the most common learning rules.
It is a supervised learning rule.
The delta rule is derived from the gradient descent method, which is also the basis of backpropagation.
It can be applied even when the training examples are not linearly separable, in which case it
converges toward a best-fit approximation. It is also called the continuous perceptron learning rule.
It updates the connection weights in proportion to the difference between the target and the
output value. It is the least mean square (LMS) learning algorithm.

The delta difference is measured by an error function, also called the cost function.
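
A minimal sketch of the delta rule for a single linear unit is given below. The training data (generated from the hypothetical target function t = 2*x1 - x2), the learning rate and the number of epochs are assumptions chosen so that the example converges; each update moves the weights in proportion to the error (target minus output).

```python
def delta_rule_epoch(weights, samples, targets, lr):
    """One pass over the data, updating after every sample (LMS rule)."""
    for x, t in zip(samples, targets):
        output = sum(w * xi for w, xi in zip(weights, x))
        error = t - output
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


def mean_squared_error(weights, samples, targets):
    """Cost function: average squared difference between target and output."""
    errors = [t - sum(w * xi for w, xi in zip(weights, x))
              for x, t in zip(samples, targets)]
    return sum(e * e for e in errors) / len(errors)


# Hypothetical data generated from t = 2*x1 - 1*x2.
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [2.0, -1.0, 1.0, 3.0]

weights = [0.0, 0.0]
for _ in range(200):
    weights = delta_rule_epoch(weights, samples, targets, lr=0.1)

print(weights)  # close to [2.0, -1.0]
print(mean_squared_error(weights, samples, targets))  # close to 0
```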

Types of Artificial Neural Network


1. Feed Forward Neural Network
2. Fully connected Neural Network
3. Multilayer Perceptron
4. Feedback Neural Network

Feed Forward Neural Network:

The simplest feed-forward neural network is a single-layer perceptron. A set of inputs enters the layer
and is multiplied by the weights. The weighted input values are then summed to form a total. If the
sum is above a predetermined threshold, which is normally set at zero, the output value is usually 1,
and if the sum is below the threshold, the output value is usually -1. The single-layer perceptron is a
popular feed-forward neural network model that is frequently used for classification. A feed-forward
network may or may not contain hidden layers, and it has no feedback connections: information flows
only from input to output. Based on the number of layers, feed-forward networks are further classified
into single-layered and multilayered feed-forward networks.

Fully connected Neural Network:

• A fully connected neural network consists of a series of fully connected layers that connect
every neuron in one layer to every neuron in the next layer.

• The major advantage of fully connected networks is that they are "structure agnostic", i.e.
no special assumptions need to be made about the input.

Multilayer Perceptron:

A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output
layer with a single node for each output, and any number of hidden layers, each of which can have
any number of nodes. Information flows forward from the input layer to the output layer, while the
error is propagated backward during training. The weights are adjusted via backpropagation. In the
classic multi-layer perceptron, every node uses a sigmoid activation function, which takes real values
as input and converts them to numbers between 0 and 1 using the sigmoid formula.

Feedback Neural Network:

Feedback networks, also known as recurrent neural networks (RNNs) or interactive neural networks, are
deep learning models in which information can also flow in the backward direction, because the
network allows feedback loops. Feedback networks are dynamic in nature and powerful, and they can
become quite complicated during execution. Neuronal connections can be made in any way. RNNs can
process input sequences of different lengths by using their internal state, which can represent a form
of memory.

They can therefore be used for applications like speech recognition or handwriting recognition.

Learning in a multi layer perceptron

A Multilayer Perceptron has input and output layers, and one or more hidden layers with many
neurons stacked together. And while in the Perceptron the neuron must have an activation function
that imposes a threshold, like ReLU or sigmoid, neurons in a Multilayer Perceptron can use any
arbitrary activation function.

Multilayer Perceptron falls under the category of feedforward algorithms, because inputs are combined
with the initial weights in a weighted sum and subjected to the activation function, just like in the
Perceptron. But the difference is that each linear combination is propagated to the next layer.

Each layer is feeding the next one with the result of their computation, their internal representation of
the data. This goes all the way through the hidden layers to the output layer.

If the algorithm only computed the weighted sums in each neuron, propagated results to the output
layer, and stopped there, it wouldn’t be able to learn the weights that minimize the cost function. If the
algorithm only computed one iteration, there would be no actual learning.

This is where Backpropagation comes into play.

Backpropagation

Backpropagation is the learning mechanism that allows the Multilayer Perceptron to iteratively adjust
the weights in the network, with the goal of minimizing the cost function.

There is one hard requirement for backpropagation to work properly. The function that combines
inputs and weights in a neuron, for instance the weighted sum, and the activation function, for instance
ReLU, must be differentiable (ReLU is differentiable everywhere except at 0, where a subgradient is
used in practice). These functions must have a bounded derivative, because gradient descent is
typically the optimization algorithm used in the Multilayer Perceptron.
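
To make the idea concrete, the following sketch runs backpropagation on a tiny network with two inputs, two sigmoid hidden neurons and one sigmoid output, using a squared-error cost. The network size, initial weights, training example and learning rate are all assumptions for illustration; a real Multilayer Perceptron would use many examples and usually a numerical library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return the hidden activations and the network output."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def loss(params, x, target):
    """Squared-error cost for a single training example."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2


def backprop(params, x, target):
    """Gradients of the cost with respect to W1, b1, W2 and b2 (chain rule)."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, x)
    delta_out = (output - target) * output * (1.0 - output)
    grad_W2 = [delta_out * h for h in hidden]
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out


def gradient_step(params, grads, lr):
    """Move every parameter a small step against its gradient."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    new_W1 = [[w - lr * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, gW1)]
    new_b1 = [b - lr * g for b, g in zip(b1, gb1)]
    new_W2 = [w - lr * g for w, g in zip(W2, gW2)]
    return new_W1, new_b1, new_W2, b2 - lr * gb2


# Hypothetical starting weights and a single training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
x, target = [1.0, 2.0], 1.0

loss_before = loss(params, x, target)
for _ in range(100):
    params = gradient_step(params, backprop(params, x, target), lr=0.5)
loss_after = loss(params, x, target)
print(loss_before, loss_after)  # the cost is lower after training
```

Each iteration is one forward pass followed by one backward pass, which is why a single iteration alone would not amount to learning.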

Radial Basis Function Neural Network

These networks have a fundamentally different architecture from most neural networks. Most
neural network architectures consist of many layers and introduce nonlinearity by repeatedly applying
nonlinear activation functions.

An RBF network, on the other hand, consists only of an input layer, a single hidden layer, and an output
layer.

The input layer is not a computation layer; it just receives the input data and feeds it into the special
hidden layer of the RBF network. The computation that happens inside the hidden layer is very
different from that of most neural networks, and this is where the power of the RBF network comes from.
The output layer performs the prediction task, such as classification or regression.

RBF Neural networks are conceptually similar to K-Nearest Neighbor (k-NN) models.

It is useful for interpolation, function approximation, time series prediction and classification.

RBF Neural Network


RBF neural networks are generally trained to determine the following parameters:
• The number of neurons in the hidden layer
• The center of each hidden layer RBF neuron
• The radius or variance of each RBF neuron
• The weights assigned from the hidden layer to the output layer for the summation function

Different approaches are followed to determine the centers for the hidden layer RBF neurons, including:
• Random selection of fixed cluster centers
• Self-organized selection of centers using k-means clustering
• Supervised selection of centers
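
The role of the centers, radii and output weights can be illustrated with a small forward pass. This sketch assumes a Gaussian basis function, which is a common choice but is not stated in these notes; the centers, radii and weights below are hypothetical values rather than trained ones.

```python
import math


def rbf_activation(x, center, sigma):
    """Gaussian response: 1 at the center, decaying with distance from it."""
    squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def rbf_network_output(x, centers, sigmas, weights, bias=0.0):
    """Weighted sum of the hidden-layer RBF responses."""
    hidden = [rbf_activation(x, c, s) for c, s in zip(centers, sigmas)]
    return sum(w * h for w, h in zip(weights, hidden)) + bias


# Hypothetical network with two hidden RBF neurons.
centers = [(0.0, 0.0), (4.0, 4.0)]
sigmas = [1.0, 1.0]
weights = [1.0, -1.0]

print(rbf_network_output((0.0, 0.0), centers, sigmas, weights))  # near +1
print(rbf_network_output((4.0, 4.0), centers, sigmas, weights))  # near -1
```

An input close to a center activates mainly that neuron, which is the sense in which RBF networks resemble nearest-neighbor models.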

Self-organizing Feature Map


SOM is trained using unsupervised learning.

SOM does not learn by backpropagation with Stochastic Gradient Descent (SGD); it uses
competitive learning to adjust the weights of its neurons. Artificial neural networks often use
competitive learning models to classify input without labeled data.

Uses: dimensionality reduction, where the data is reduced to a spatially organized representation;
it also helps to discover correlations in the data.

Self-organizing maps have two layers: the first is the input layer and the second is the
output layer, or feature map.

SOM neurons have no activation function; the weights are passed directly to the output layer.

Network Architecture and operations

It consists of 2 layers:
1. Input layer
2. Output layer
There is no hidden layer.
The mapping process of a Self-Organizing Map starts with the initialization of the weight vectors.

A sample vector is then chosen at random, and the map is searched to determine which weight vector
best represents that sample. Each weight vector has neighboring weights that are close to it on the map.
The chosen weight is allowed to become more like the random sample vector, and its neighbors are
adjusted likewise. This allows the map to grow and take on new forms. In a 2D feature space, the nodes
typically form a hexagonal or square grid. This entire process is repeated many times, typically more
than 1,000.

To put it simply, learning takes place in the following way:

• Each node is examined to determine how closely its weights match the input vector.
The node whose weights are most similar is called the Best Matching Unit (BMU).

• The neighborhood of the BMU is then determined. Over time, the number of neighbors
tends to decrease.

• The weights of the BMU are adjusted to become more like the sample vector, and the weights
of its neighbors are adjusted in the same direction. The closer a node is to the BMU, the more
its weights change; the farther away it is, the less they change.

• These steps are repeated for N iterations.
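
One training step of this procedure can be sketched as follows. The sketch assumes a one-dimensional map (nodes arranged in a line), a Gaussian neighborhood function, and hypothetical values for the learning rate and radius; in practice both are reduced over the iterations.

```python
import math


def best_matching_unit(weights, sample):
    """Index of the node whose weight vector is closest to the sample."""
    def squared_distance(w):
        return sum((wi - si) ** 2 for wi, si in zip(w, sample))
    return min(range(len(weights)), key=lambda i: squared_distance(weights[i]))


def som_update(weights, sample, lr, radius):
    """Pull the BMU and its map neighbors toward the sample."""
    bmu = best_matching_unit(weights, sample)
    updated = []
    for i, w in enumerate(weights):
        # Influence is 1 at the BMU and decays with distance on the map.
        influence = math.exp(-((i - bmu) ** 2) / (2.0 * radius ** 2))
        updated.append([wi + lr * influence * (si - wi)
                        for wi, si in zip(w, sample)])
    return updated, bmu


# Hypothetical map of three nodes with two-dimensional weight vectors.
som_weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
new_weights, bmu = som_update(som_weights, [0.9, 1.2], lr=0.5, radius=1.0)
print(bmu)          # 1
print(new_weights)  # node 1 moves the most, its neighbors move less
```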
Advantages and Disadvantages of ANN
Clustering
Clustering is an unsupervised learning strategy to group the given set of data points into a number of
groups or clusters.

Arranging the data into a reasonable number of clusters helps to extract underlying patterns in the data
and transform the raw data into meaningful knowledge. Example application areas include the
following:

• Pattern recognition
• Image segmentation
• Profiling users or customers
• Categorization of objects into a number of categories or groups
• Detection of outliers or noise in a pool of data items

Clusters are represented by centroids. Example: if the input points are (3,3), (2,6) and (7,9), then

Centroid: ((3+2+7)/3, (3+6+9)/3) = (4,6). The clusters should not overlap, and every cluster should
represent only one class.
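
The centroid calculation in the example is simply the coordinate-wise mean of the points, as this short sketch shows (the function name is an assumption):

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


print(centroid([(3, 3), (2, 6), (7, 9)]))  # (4.0, 6.0)
```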

Difference between classification and clustering


Challenges of Clustering Algorithms

1. Collection of data with higher dimensions.


2. Designing a proximity measure is another challenge.
3. The curse of dimensionality
PROXIMITY MEASURES

Clustering algorithms need a measure to find the similarity or dissimilarity among the objects to
group them. Similarity and Dissimilarity are collectively known as proximity measures. This is
used by a number of data mining techniques, such as clustering, nearest neighbour classification,
and anomaly detection.

Distance measures are known as dissimilarity measures, as these indicate how one object is
different from another. Measures like cosine similarity indicate the similarity among objects.
Distance measures and similarity measures are two sides of the same coin, as more distance
indicates less similarity and vice-versa.

If a distance measure satisfies the standard conditions (non-negativity, symmetry, and the
triangle inequality), it is called a metric.

Some of proximity measures:

1. Quantitative variables

A) Euclidean distance: It is one of the most important and common distance measures. It is also
called the L2 norm.

Advantage: The distance between two objects does not change when new objects are added.

Disadvantages:

i) If the unit of measurement changes, the resulting Euclidean or squared Euclidean distance
changes drastically.

ii) Computational cost is relatively high, because it involves squaring and a square root.

B) City block distance: Also known as Manhattan distance or the L1 norm.

C) Chebyshev distance: Also known as maximum value distance. This is the maximum absolute
difference between the coordinates of a pair of objects. It is also called the supremum
distance, Lmax, or L∞ norm.

D) Minkowski distance: In general, all the above distance measures can be generalized as:

$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$

Here q is the parameter. When the value of q is 1, the distance measure is called city block distance.
When the value of q is 2, the distance measure is called Euclidean distance. When q is infinity,
this is the Chebyshev distance.
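
The relationship between these distances can be checked with a short sketch. The function names and the two sample points are assumptions for the example.

```python
import math


def manhattan(a, b):
    """City block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chebyshev(a, b):
    """Chebyshev (L-infinity) distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, q):
    """General form; q=1 gives city block and q=2 gives Euclidean distance."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)


p, r = (0, 0), (3, 4)
print(manhattan(p, r))     # 7
print(euclidean(p, r))     # 5.0
print(chebyshev(p, r))     # 4
print(minkowski(p, r, 1))  # 7.0
```

As q grows, the Minkowski distance between these two points approaches the Chebyshev value of 4.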

Binary attributes: Binary attributes have only two values. The distance measures discussed above
cannot be applied to find the distance between objects that have binary attributes. For finding the
distance among objects with binary attributes, a contingency table is used.
Hamming Distance: Hamming distance is a metric for comparing two binary data strings. While
comparing two binary strings of equal length, Hamming distance is the number of bit positions in
which the two bits are different. It is used for error detection or error correction when data is
transmitted over computer networks.

Example

Suppose there are two strings 1101 1001 and 1001 1101.

11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance,
d(11011001, 10011101) = 2.
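
The same example can be reproduced in code. This sketch treats the inputs as strings of '0' and '1' characters; the function name is an assumption.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


first, second = "11011001", "10011101"
xor_bits = format(int(first, 2) ^ int(second, 2), "08b")
print(xor_bits)                         # 01000100
print(hamming_distance(first, second))  # 2
```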
Cosine Similarity
Cosine similarity is a measure of how similar documents are, irrespective of their size.
It measures the cosine of the angle between two vectors projected in a multi-dimensional
space.
Cosine similarity is advantageous because even if two similar documents are far apart
by Euclidean distance (due to the size of the documents), they may still be oriented
close together.

The smaller the angle, the higher the cosine similarity.

Consider 2 documents P1 and P2.
◦ If the angular distance between them is larger, they are less similar.
◦ If it is smaller, they are more similar.
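
A minimal sketch of the calculation follows. The two documents are represented by hypothetical term-count vectors; P2 is P1 with every count doubled, so the documents differ in size but point in the same direction.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


p1 = [1, 2, 0]  # hypothetical term counts for document P1
p2 = [2, 4, 0]  # the same proportions in a document twice as long
p3 = [0, 0, 5]  # a document sharing no terms with P1

print(cosine_similarity(p1, p2))  # approximately 1: same orientation
print(cosine_similarity(p1, p3))  # 0.0: no terms in common
```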
21CS54 Artificial Intelligence and Machine Learning

Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
unlabeled data into clusters. It is also known as hierarchical cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative clustering is a bottom-up approach, in which the algorithm starts by
treating each data point as a single cluster and merges clusters until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down
approach.

Dr.SNK,VS,AG 2023-24 Page 37



Types of linkages in Hierarchical clustering

Single Linkage or MIN algorithm

In single linkage hierarchical clustering, the distance between two clusters is defined as the
shortest distance between two points in each cluster. For example, the distance between clusters
“A” and “B” to the left is equal to the length of the line between their two closest points.


Complete Linkage or MAX or Clique


In complete linkage hierarchical clustering, the distance between two clusters is defined as
the longest distance between two points in each cluster. For example, the distance between
clusters “A” and “B” is equal to the length of the arrow between their two furthest points.


Average Linkage: In average linkage hierarchical clustering, the distance between two clusters is
defined as the average distance from each point in one cluster to every point in the other cluster.
For example, the distance between clusters "A" and "B" is equal to the average length of the arrows
connecting the points of one cluster to the points of the other.
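
The three linkage definitions differ only in how the pairwise distances between two clusters are summarized, as the following sketch shows. The cluster contents are hypothetical and Euclidean distance is assumed.

```python
import math


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance between every point of A and every point of B."""
    return [math.dist(p, q) for p in cluster_a for q in cluster_b]


def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


A = [(0, 0), (1, 0)]
B = [(3, 0), (5, 0)]
print(single_linkage(A, B))    # 2.0
print(complete_linkage(A, B))  # 5.0
print(average_linkage(A, B))   # 3.5
```

An agglomerative algorithm would repeatedly merge the pair of clusters with the smallest value of whichever linkage is chosen.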


Mean Shift Algorithm

The mean-shift algorithm assigns the data points to clusters iteratively by shifting points
towards the highest density of data points, i.e. the cluster centroid.

The difference between the K-Means algorithm and Mean-Shift is that the latter does not need the
number of clusters to be specified in advance, because the number of clusters is determined by the
algorithm from the data.
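
A minimal one-dimensional sketch of the idea is shown below. It assumes a flat (uniform) kernel, meaning that each point is shifted to the plain average of the data points within the bandwidth; the data, bandwidth and tolerance are hypothetical.

```python
def mean_shift_1d(points, bandwidth, iterations=50):
    """Shift every point toward the mean of the data inside its window."""
    modes = list(points)
    for _ in range(iterations):
        shifted = []
        for m in modes:
            window = [p for p in points if abs(p - m) <= bandwidth]
            shifted.append(sum(window) / len(window))
        modes = shifted
    return modes


def group_modes(modes, tol=1e-3):
    """Points whose modes coincide (within tol) belong to the same cluster."""
    centers, labels = [], []
    for m in modes:
        for index, c in enumerate(centers):
            if abs(m - c) <= tol:
                labels.append(index)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, labels = group_modes(mean_shift_1d(data, bandwidth=1.0))
print(len(centers))  # 2 clusters found without specifying the number
print(labels)        # [0, 0, 0, 1, 1, 1]
```

The number of clusters is never passed in: it falls out of how many distinct density peaks the shifted points converge to.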


Advantages:
• No model assumptions
• Suitable for non-convex cluster shapes
• Only one parameter, the window bandwidth, is required
• Robust to noise
• No issues of local minima or premature termination

Disadvantages:
• Selecting the bandwidth is a challenging task. If it is too large, many clusters are missed. If it is
too small, many points are missed and convergence becomes a problem.
• The number of clusters cannot be specified, so the user has no control over this parameter.



PARTITIONAL CLUSTERING ALGORITHMS

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitional clustering is the K-Means
clustering algorithm.

In this type, the dataset is divided into a set of k groups, where k is the pre-defined number
of groups. The cluster centers are chosen so that each data point is closer to the centroid of its
own cluster than to the centroid of any other cluster.

K-means can be viewed as a greedy algorithm, as it partitions n samples into k clusters so as to
minimize the sum of squared errors (SSE). SSE is a measure of error that gives the sum of the
squared Euclidean distances of each data point to its closest centroid.

$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \text{dist}(c_i, x)^2$

Here $c_i$ is the centroid of the i-th cluster $C_i$ and $x$ is a sample data point in that cluster.
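
A minimal sketch of K-means and the SSE calculation is given below. The data points, the initial centroids and the fixed number of iterations are assumptions for the example; an empty cluster simply keeps its previous centroid.

```python
import math


def assign_clusters(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)  # keep an empty cluster's centroid
    return new_centroids


def sse(points, labels, centroids):
    """Sum of squared Euclidean distances of points to their centroids."""
    return sum(math.dist(p, centroids[label]) ** 2
               for p, label in zip(points, labels))


def k_means(points, centroids, iterations=10):
    for _ in range(iterations):
        labels = assign_clusters(points, centroids)
        centroids = update_centroids(points, labels, centroids)
    return centroids, assign_clusters(points, centroids)


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_labels = k_means(points, [(1, 1), (8, 8)])
print(final_labels)  # [0, 0, 0, 1, 1, 1]
print(sse(points, final_labels, final_centroids))  # about 2.667
```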
PROBLEM
Density-Based Clustering

A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region
of high point density, separated from other such clusters by contiguous regions of low point
density.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm
for density-based clustering. It can discover clusters of different shapes and sizes in a large
amount of data that contains noise and outliers.
The DBSCAN algorithm uses two parameters:
minPts: The minimum number of points (a threshold) clustered together for a region to be
considered dense.
eps (ε): A distance measure that will be used to locate the points in the neighborhood of any
point.
These parameters can be understood if we explore two concepts called Density Reachability
and Density Connectivity.
Reachability in terms of density establishes a point to be reachable from another if it lies within
a particular distance (eps) from it. Connectivity, on the other hand, involves a transitivity based
chaining-approach to determine whether points are located in a particular cluster. For example,
p and q points could be connected if p->r->s->t->q, where a->b means b is in the neighborhood
of a.

There are three types of points after the DBSCAN clustering is complete:

• Core: a point that has at least minPts points within distance eps of itself.
• Border: a point that is not a Core point but has at least one Core point within distance eps.
• Noise: a point that is neither a Core nor a Border point. It has fewer than minPts points
within distance eps of itself.
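
The three point types can be computed directly from their definitions, as in this sketch. It covers only the classification step, not the full DBSCAN cluster expansion, and it assumes that a point counts as its own neighbor; the data and parameter values are hypothetical.

```python
import math


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    # Neighborhood of each point: indices of all points within eps (itself included).
    neighbours = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                  for p in points]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")
        else:
            labels.append("noise")
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
point_types = classify_points(points, eps=1.0, min_pts=3)
print(point_types)
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here (2, 1) has only one neighbor besides itself, but that neighbor (1, 1) is a core point, so it is a border point; (5, 5) has no neighbors at all and is noise.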
Grid-Based Approaches
A grid-based clustering method takes a space-driven approach by partitioning the embedding
space into cells independent of the distribution of the input objects.

The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the
object space into a finite number of cells that form a grid structure on which all of the
operations for clustering are performed.

The main advantage of the approach is its fast processing time, which is typically independent
of the number of data objects, yet dependent on only the number of cells.

Subspace Clustering
CLIQUE is a density-based and grid-based subspace clustering algorithm, useful for finding
clusters in subspaces.
Concept of Dense cell
CLIQUE partitions each dimension into several non-overlapping intervals, which divides the
space into cells. The algorithm then determines whether each cell is dense or sparse. A cell is
considered dense if its density exceeds a threshold value.

Density is defined as the ratio of the number of points to the volume of the region. In one pass, the
algorithm finds the number of cells, the number of points in each, and so on, and then combines the
dense cells. For this, the algorithm uses contiguous intervals and a set of dense cells.

MONOTONICITY Property

CLIQUE uses the anti-monotonicity property, as in the Apriori algorithm. It means that all the
subsets of a frequent itemset are frequent. Similarly, if a subset is infrequent, then all of its
supersets are infrequent.

Algorithm works in 2 stages:


PROBABILITY MODEL BASED METHODS
Probability model-based methods in clustering are a class of techniques that use statistical models to
represent the underlying probability distributions of data points in a dataset.
These methods are used to group similar data points together into clusters based on their likelihood of
belonging to a particular cluster according to the assumed probability distribution.

Two popular probability model-based clustering methods are Gaussian Mixture Models (GMMs) and
Hidden Markov Models (HMMs). Other models in this family include:

1. Fuzzy Clustering
2. EM algorithm
Fuzzy Clustering :
Fuzzy Clustering is a type of clustering algorithm in machine learning that allows a data point to belong
to more than one cluster with different degrees of membership. Unlike traditional clustering algorithms,
such as k-means or hierarchical clustering, which assign each data point to a single cluster, fuzzy
clustering assigns a membership degree between 0 and 1 for each data point for each cluster.

Consider two clusters ci and cj. An element x can belong to both clusters. The strength of
the association of an object i with cluster j is given as wij. The value of wij lies between 0
and 1. The weights of an object across all clusters sum to 1.
Expectation Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a statistical method used for estimating
parameters in statistical models when you have incomplete or missing data. It's commonly used
in unsupervised machine learning tasks such as clustering and Gaussian Mixture Model (GMM)
fitting.

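Given a mixture of distributions, a data point can be generated by randomly picking one of the
distributions and then generating the point from it. The Gaussian distribution is a bell-shaped curve.

The probability density function of a Gaussian distribution with mean $\mu$ and standard
deviation $\sigma$ is given by:

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$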


The EM algorithm iteratively optimizes a likelihood function in two steps: the E-step
(Expectation) and the M-step (Maximization).
Here's a high-level overview of how the EM algorithm works:

1. Initialization: Start with initial estimates of the model parameters. These initial values can be
random or based on some prior knowledge.
2. E-step (Expectation):
   • In this step, you compute the expected values (expectation) of the latent (unobserved)
     variables given the observed data and the current parameter estimates.
   • This involves calculating the posterior probabilities or likelihoods of the missing data or
     latent variables.
   • Essentially, you're estimating how likely each possible value of the latent variable is,
     given the current model parameters.
3. M-step (Maximization):
   • In this step, you update the model parameters to maximize the expected log-likelihood
     found in the E-step.
   • This involves finding the parameters that make the observed data most likely given the
     estimated values of the latent variables.
   • The M-step involves solving an optimization problem to find the new parameter values.
4. Iteration:
   • Repeat the E-step and M-step alternately until convergence criteria are met. Common
     convergence criteria include a maximum number of iterations, a small change in
     parameter values, or a small change in the likelihood.
5. Termination:
   • Once the EM algorithm converges, you have estimates of the model parameters that
     locally maximize the likelihood of the observed data.
6. Result:
   • The final parameter estimates can be used for various purposes, such as clustering,
     density estimation, or imputing missing data.

The EM algorithm is widely used in various fields, including machine learning, image
processing, and bioinformatics.

One of its notable applications is in Gaussian Mixture Models (GMMs), where it's used to
estimate the means and covariances of Gaussian distributions that are mixed to model
complex data distributions.
It's important to note that the EM algorithm can sometimes get stuck in local optima, so the
choice of initial parameter values can affect the results. To mitigate this, you may run the
algorithm multiple times with different initializations and select the best result.
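
The E-step and M-step described above can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration rather than a reference implementation: the function names, the toy data, the starting values and the fixed number of iterations are all assumptions chosen for the example.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, mus, variances, weights, iters=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given estimates."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(iters):
        # E-step: posterior probability (responsibility) of each component for each point.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate means, variances and mixing weights from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor keeps the variance away from zero
            weights[j] = nj / len(data)
    return mus, variances, weights

# Toy data: two well-separated groups around 1 and 5 (hypothetical values).
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mus, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(mus, variances, weights)
```

Because the result depends on the starting values, a different initialization of `mus` may converge to a different local optimum, which is why several restarts are often tried.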

CLUSTER EVALUATION METHODS


Evaluation of clustering algorithms is a difficult task, as domain knowledge is absent most of the time.
So, validation of clustering algorithms is more difficult than validation of classification
algorithms.
Evaluation of Clustering
1. Internal
2. External
3. Relative
Cohesion and separation

Here, N – No. of cluster,


C – set of centroids
Xi – centroid
Mj – samples.

Here, x – centroid of the entire dataset


Xi – centroid of the cluster
Ci – size of the cluster
DUNN Index
This metric measures the ratio between the distance between the clusters and the distance within
the clusters. A high Dunn index indicates that the clusters are well-separated and distinct.
DUNN index is calculated as:

Here, α and β are parameters. The DUNN index is a useful measure that can combine both cohesion and
separation.

Silhouette Coefficient
This metric measures how well each data point fits into its assigned cluster and ranges from -1 to
1. A high silhouette coefficient indicates that the data points are well-clustered, while a low
coefficient indicates that the data points may be assigned to the wrong cluster.
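
A small sketch of the silhouette computation follows. It assumes the usual definition s = (b - a) / max(a, b), where a is the mean distance from a point to the other points of its own cluster and b is the smallest mean distance to the points of any other cluster. The function name, the sample points and the labels are illustrative assumptions.

```python
import math

def silhouette_scores(points, labels):
    """Silhouette coefficient of every point, using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Group the distances from p to every other point by that point's cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)  # convention used here for singletons or a single cluster
            continue
        a = sum(own) / len(own)                                   # cohesion term
        b = min(sum(d) / len(d) for d in by_cluster.values())     # separation term
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

points = [(3, 5), (7, 8), (12, 5), (16, 9)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
print(scores, sum(scores) / len(scores))
```

Averaging the per-point scores gives a single value for the whole clustering.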

--
Hierarchical Clustering Algorithm
Use dataset and apply hierarchical methods. Show the dendrogram.

SNo. X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
5. 20 8

Table Sample Data

Solution

The proximity among the objects is computed using the Euclidean distance and is shown in the
following Table 134.57.

Table 134.57: Proximity Matrix

Objects   1      2      3      4      5
1         -      5      9      13.60  17.26
2                -      5.83   9.06   13
3                       -      5.66   8.54
4                              -      4.12
5                                     -

The minimum distance is 4.12. Therefore, the items 4 and 5 are clustered together. The resultant
table is given as shown in the following Table.
Table After Iteration 1

Clusters  {4,5}  1      2      3
{4,5}     -      13.60  9.06   5.66
1                -      5      9
2                       -      5.83
3                              -

The distances between the group {4,5} and the items 1, 2 and 3 are computed using the single
linkage (minimum) formula.
Thus, the distance between {4,5} and {1} is:
Minimum {d(4,1), d(5,1)} = Minimum {13.60, 17.26} = 13.60
The distance between {4,5} and {2} is given as:
Minimum {d(4,2), d(5,2)} = Minimum {9.06, 13} = 9.06
The distance between {4,5} and {3} is given as:
Minimum {d(4,3), d(5,3)} = Minimum {5.66, 8.54} = 5.66

The minimum distance of the above table is 5. Therefore, {1} and {2} are combined. This
results in the following Table.

Table After Iteration 2

Clusters  {4,5}  {1,2}  3
{4,5}     -      9.06   5.66
{1,2}            -      5.83
3                       -

Thus, the distance between {1,2} and {4,5} is:

Minimum {d(1,4), d(1,5), d(2,4), d(2,5)} = Minimum {13.60, 17.26, 9.06, 13} = 9.06

The distance between {1,2} and {3} is:

Minimum {d(1,3), d(2,3)} = Minimum {9, 5.83} = 5.83

The minimum is 5.66. Therefore, {4,5} and {3} are combined. Finally, {3,4,5} is combined with
{1,2} at a distance of 9.06.

Therefore, the order of clustering is {4,5}, then {1,2}, then {3,4,5} and finally {1,2,3,4,5}.
Complete Linkage or MAX or Clique
Here, from the proximity matrix, the minimum is taken and {4,5} is combined. Then the maximum
is computed as follows.

The distance between {4,5} and {1} is:
Max {d(4,1), d(5,1)} = Max {13.60, 17.26} = 17.26
The distance between {4,5} and {2} is given as:
Max {d(4,2), d(5,2)} = Max {9.06, 13} = 13
The distance between {4,5} and {3} is given as:
Max {d(4,3), d(5,3)} = Max {5.66, 8.54} = 8.54

This results in the following Table.

Clusters  {4,5}  1      2      3
{4,5}     -      17.26  13     8.54
1                -      5      9
2                       -      5.83
3                              -

So, the minimum is 5. Therefore, {1,2} is combined. This is shown in the following Table.

Clusters  {4,5}  {1,2}  3
{4,5}     -      17.26  8.54
{1,2}            -      9
3                       -

The minimum is 8.54. Therefore, {3} is combined with {4,5}. Finally, {1,2} and {3,4,5} are combined
at a distance of 17.26. The order of clustering is {4,5}, {1,2}, {3,4,5} and finally {1,2,3,4,5}.
Hint: The same is used for the average link algorithm, where the average distance of all pairs of points across
the clusters is used to form clusters.
K-means clustering problems
Consider the following data shown in Table 143.125. Use k-means algorithm with k
= 2 and show the result.
Table Sample Data
SNO X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9

Solution

Let us assume the seed points are (3,5) and (16,9). This is shown in the following table
as starting clusters.
Table Initial Cluster Table

Cluster 1 Cluster 2
(3,5) (16,9)

Centroid (3,5) Centroid (16,9)

Iteration 1: Compare all the data points or samples with the centroids and assign each to the
nearest centroid.

Take the sample object 2, (7,8), and compare it with the two centroids as follows:

Dist(2, centroid 1) = $\sqrt{(7-3)^2 + (8-5)^2} = \sqrt{16 + 9} = \sqrt{25} = 5$
Dist(2, centroid 2) = $\sqrt{(7-16)^2 + (8-9)^2} = \sqrt{81 + 1} = \sqrt{82} = 9.06$
Object 2 is closer to the centroid of cluster 1 and hence it is assigned to cluster 1.
For the object 3, (12,5):

Dist(3, centroid 1) = $\sqrt{(12-3)^2 + (5-5)^2} = \sqrt{81} = 9$
Dist(3, centroid 2) = $\sqrt{(12-16)^2 + (5-9)^2} = \sqrt{16 + 16} = \sqrt{32} = 5.66$

Object 3 is closer to the centroid of cluster 2 and hence it is assigned to cluster 2.

This is shown in the following Table.

Table Cluster Table After Iteration 1

Cluster 1                       Cluster 2
(3,5)                           (12,5)
(7,8)                           (16,9)

Centroid (10/2,13/2)=(5,6.5)    Centroid (28/2,14/2)=(14,7)

The second iteration is started again. For the object 2, (7,8), compute again:

Dist(2, centroid 1) = $\sqrt{(7-5)^2 + (8-6.5)^2} = \sqrt{4 + 2.25} = \sqrt{6.25} = 2.5$
Dist(2, centroid 2) = $\sqrt{(7-14)^2 + (8-7)^2} = \sqrt{49 + 1} = \sqrt{50} = 7.07$

Object 2 is closer to the centroid of cluster 1 and hence remains in the same cluster. Take the
sample object 3, (12,5), and compute again:

Dist(3, centroid 1) = $\sqrt{(12-5)^2 + (5-6.5)^2} = \sqrt{49 + 2.25} = \sqrt{51.25} = 7.16$

Dist(3, centroid 2) = $\sqrt{(12-14)^2 + (5-7)^2} = \sqrt{4 + 4} = \sqrt{8} = 2.83$

Object 3 is closer to the centroid of cluster 2 and remains in the same cluster. Objects 1 and 4
likewise remain closest to their own centroids, so no assignment changes and the algorithm stops.

Therefore, the resultant clusters are
{(3,5), (7,8)} and {(12,5), (16,9)}.
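
The iterations of this example can be reproduced with a compact k-means sketch. It uses the same seed points, (3,5) and (16,9); the function name `kmeans` and the stopping rule (stop when the centroids no longer change) are assumptions of this example.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Plain k-means: assign each point to the nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of the cluster (kept unchanged if the cluster is empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

points = [(3, 5), (7, 8), (12, 5), (16, 9)]
clusters, centroids = kmeans(points, [(3, 5), (16, 9)])
print(clusters)
print(centroids)
```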
Design a two-layer network of perceptrons to implement the NAND gate. Assume your own weights
and biases in the range [-0.5, 0.5]. Use a learning rate of 0.4.

Solution:

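[Figure: inputs X1 and X2 feed the hidden unit X3 (AND) through the weights w13 and w23; X3 feeds
the output unit X4 (NOT) through the weight w34; the bias input X0 feeds X3 and X4 through θ3 and θ4.]

Figure 1 Two Layer Network for NAND gate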

Table 1: Weights and Biases

$x_1$   $x_2$   $O_{desired}$   $w_{13}$   $w_{23}$   $w_{34}$   $\theta_3$   $\theta_4$   $x_0$
0       1       1               0.1        -0.4       0.3        0.2          -0.3         1


Table 2: Truth Table of NAND Gate
$x_1$   $x_2$   $x_1$ AND $x_2$   NAND = NOT($x_1$ AND $x_2$)
0       0       0                 1
0       1       0                 1
1       0       0                 1
1       1       1                 0

ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input Layer   $I_j$   $O_j$
$x_1$         0       0
$x_2$         1       1

2. Calculate net inputs and outputs in hidden and output layer as shown in Table 4. Values are
truncated to three decimal places.
Table 4: Net Input and Output Calculation in Hidden and Output layer

Unit j   Net Input $I_j$                                   Output $O_j$
$x_3$    $I_3 = x_1 w_{13} + x_2 w_{23} + x_0 \theta_3$    $O_3 = \frac{1}{1 + e^{-I_3}}$
         $= 0(0.1) + 1(-0.4) + 1(0.2)$                     $= \frac{1}{1 + e^{-(-0.2)}}$
         $= -0.2$                                          $= 0.450$
$x_4$    $I_4 = O_3 w_{34} + x_0 \theta_4$                 $O_4 = \frac{1}{1 + e^{-I_4}}$
         $= (0.450 * 0.3) + 1(-0.3)$                       $= \frac{1}{1 + e^{-(-0.165)}}$
         $= -0.165$                                        $= 0.458$

3. Calculate Error
$Error = O_{desired} - O_{estimated}$
$= 1 - 0.458$
$Error = 0.542$

Step 2: BACKWARD PROPAGATION

1. For each unit k in the output layer
$Error_k = O_k * (1 - O_k) * (O_{desired} - O_k)$

For each unit j in the hidden layer
$Error_j = O_j * (1 - O_j) * \sum_k (Error_k * w_{jk})$

Table 5: Error Calculation

For each output layer unit k     $Error_k$
$x_4$                            $Error_4 = O_4 * (1 - O_4) * (O_{desired} - O_4)$
                                 $= 0.458 (1 - 0.458)(1 - 0.458)$
                                 $= 0.134$

For each hidden layer unit j     $Error_j$
$x_3$                            $Error_3 = O_3 * (1 - O_3) * \sum_k (Error_k * w_{3k})$
                                 $= 0.450 * (1 - 0.450) * 0.134 * 0.3$
                                 $= 0.0099$

2. Update Weights and biases

Table 6: Weight and Bias Calculation

$w_{ij}$     $w_{ij} = w_{ij} + (\eta * Error_j * O_i)$        New Weight
$w_{13}$     $w_{13} = w_{13} + (0.4 * Error_3 * O_1)$         0.1
             $= 0.1 + (0.4 * 0.0099 * 0)$
$w_{23}$     $w_{23} = w_{23} + (0.4 * Error_3 * O_2)$         -0.396
             $= -0.4 + (0.4 * 0.0099 * 1)$
$w_{34}$     $w_{34} = w_{34} + (0.4 * Error_4 * O_3)$         0.324
             $= 0.3 + (0.4 * 0.134 * 0.450)$

$\theta_j$   $\theta_j = \theta_j + (\eta * Error_j)$          New Bias
$\theta_3$   $\theta_3 = \theta_3 + (0.4 * Error_3)$           0.203
             $= 0.2 + (0.4 * 0.0099)$
$\theta_4$   $\theta_4 = \theta_4 + (0.4 * Error_4)$           -0.246
             $= -0.3 + (0.4 * 0.134)$

ITERATION 2:
Step 1: FORWARD PROPAGATION

1. Calculate net inputs and outputs in hidden and output layer

Table 7: Inputs and Outputs in Hidden and Output layer

Unit j   Net Input $I_j$                                   Output $O_j$
$x_3$    $I_3 = x_1 w_{13} + x_2 w_{23} + x_0 \theta_3$    $O_3 = \frac{1}{1 + e^{-I_3}}$
         $= 0(0.1) + 1(-0.396) + 1(0.203)$                 $= \frac{1}{1 + e^{-(-0.193)}}$
         $= -0.193$                                        $= 0.451$
$x_4$    $I_4 = O_3 w_{34} + x_0 \theta_4$                 $O_4 = \frac{1}{1 + e^{-I_4}}$
         $= (0.451 * 0.324) + 1(-0.246)$                   $= \frac{1}{1 + e^{-(-0.099)}}$
         $= -0.099$                                        $= 0.475$

2. Calculate Error
$Error = O_{desired} - O_{estimated}$
$= 1 - 0.475$
$Error = 0.525$

ITERATION   ERROR
1           0.542
2           0.525

Reduction in error = 0.542 - 0.525 = 0.017

In iteration 2 the error is reduced to 0.525. This process continues until the desired output
is achieved.
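
The two iterations above can be reproduced with a few lines of code. The sketch below hard-codes the single training sample (x1 = 0, x2 = 1, desired output 1) and the initial weights and biases of this example; the function names are illustrative. It keeps full precision, so its values differ from the hand calculation in the third decimal place.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_nand_sample(iterations=2, lr=0.4):
    """Repeat forward and backward propagation on the single sample used in the example."""
    x1, x2, target = 0, 1, 1
    w13, w23, w34, th3, th4 = 0.1, -0.4, 0.3, 0.2, -0.3
    outputs = []
    for _ in range(iterations):
        # Forward propagation (the bias input x0 is 1)
        o3 = sigmoid(x1 * w13 + x2 * w23 + th3)
        o4 = sigmoid(o3 * w34 + th4)
        outputs.append(o4)
        # Backward propagation of the error
        err4 = o4 * (1 - o4) * (target - o4)
        err3 = o3 * (1 - o3) * err4 * w34
        # Weight and bias updates
        w13 += lr * err3 * x1
        w23 += lr * err3 * x2
        w34 += lr * err4 * o3
        th3 += lr * err3
        th4 += lr * err4
    return outputs

for i, o4 in enumerate(train_nand_sample(2), start=1):
    print(f"iteration {i}: output = {o4:.4f}, error = {1 - o4:.4f}")
```

Increasing `iterations` shows the output moving steadily towards the desired value of 1 for this sample.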
How does a Multi-Layer Perceptron solve the XOR problem? Design an MLP with back
propagation to implement the XOR Boolean function.
Solution:

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0

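[Figure: inputs X1 and X2 connect to the hidden units X3 and X4, which connect to the output unit X5;
the bias input X0 connects to X3, X4 and X5. The weights and biases are listed in Table 8.]

Figure 2: Multi Layer Perceptron for XOR

Learning rate: α = 0.8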


Table 8: Weights and Biases
X1   X2   W13    W14   W23   W24    W35   W45    θ3    θ4    θ5
1    0    -0.2   0.4   0.2   -0.3   0.2   -0.3   0.4   0.1   -0.3

Step 1: Forward Propagation

1. Calculate Input and Output in the Input Layer shown in Table 9.
Table 9: Net Input and Output Calculation
Input Layer   Ij   Oj
X1            1    1
X2            0    0
2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 10.
Table 10: Unit j at Hidden Layer and Output Layer – Net Input and Output Calculation
Unit j   Net Input Ij                                        Output Oj
X3       I3 = X1*W13 + X2*W23 + X0*θ3                        $O_3 = \frac{1}{1 + e^{-I_3}} = \frac{1}{1 + e^{-0.2}} = 0.549$
         I3 = 1*-0.2 + 0*0.2 + 1*0.4 = 0.2
X4       I4 = X1*W14 + X2*W24 + X0*θ4                        $O_4 = \frac{1}{1 + e^{-I_4}} = \frac{1}{1 + e^{-0.5}} = 0.622$
         I4 = 1*0.4 + 0*-0.3 + 1*0.1 = 0.5
X5       I5 = O3*W35 + O4*W45 + X0*θ5                        $O_5 = \frac{1}{1 + e^{-I_5}} = \frac{1}{1 + e^{0.376}} = 0.407$
         I5 = 0.549*0.2 + 0.622*-0.3 + 1*-0.3 = -0.376

3. Calculate Error = Odesired – OEstimated

So the error for this network is,
Error = Odesired – O5 = 1 – 0.407 = 0.593

Step 2: Backward Propagation

1. Calculate Error at each node as shown in Table 11.
For each unit k in the output layer, calculate
Error k = Ok (1-Ok) (Odesired – Ok)
For each unit j in the hidden layer, calculate
Error j = Oj (1-Oj) $\sum_k$ Error k * Wjk

Table 11: Error Calculation for each unit in the Output layer and Hidden layer
For Output Layer Unit k    Error k
X5                         Error 5 = O5 (1-O5) (1 – O5)
                           = 0.407 * (1-0.407) * (1-0.407)
                           = 0.143
For Hidden Layer Unit j    Error j
X4                         Error 4 = O4 (1-O4) $\sum_k$ Error k * W4k = O4 (1-O4) * Error 5 * W45
                           = 0.622 * (1-0.622) * 0.143 * -0.3
                           = -0.010
X3                         Error 3 = O3 (1-O3) $\sum_k$ Error k * W3k = O3 (1-O3) * Error 5 * W35
                           = 0.549 * (1-0.549) * 0.143 * 0.2
                           = 0.007

2. Update weight using the below formula,

Learning rate α = 0.8
∆Wij = α * Error j * Oi
Wij = Wij + ∆Wij
The updated weights and biases are shown in Table 12 and Table 13.
Table 12: Weight Updation
Wij    Wij = Wij + α * Error j * Oi            New Weight
W13    W13 = W13 + 0.8 * Error 3 * O1          -0.194
       = -0.2 + 0.8 * 0.007 * 1
W14    W14 = W14 + 0.8 * Error 4 * O1          0.392
       = 0.4 + 0.8 * -0.01 * 1
W23    W23 = W23 + 0.8 * Error 3 * O2          0.2
       = 0.2 + 0.8 * 0.007 * 0
W24    W24 = W24 + 0.8 * Error 4 * O2          -0.3
       = -0.3 + 0.8 * -0.01 * 0
W35    W35 = W35 + 0.8 * Error 5 * O3          0.262
       = 0.2 + 0.8 * 0.143 * 0.549
W45    W45 = W45 + 0.8 * Error 5 * O4          -0.228
       = -0.3 + 0.8 * 0.143 * 0.622

Update bias using the below formula,

∆θj = α * Error j
θj = θj + ∆θj
Table 13: Bias Updation
θj    θj = θj + α * Error j         New Bias
θ3    θ3 = θ3 + α * Error 3         0.405
      = 0.4 + 0.8 * 0.007
θ4    θ4 = θ4 + α * Error 4         0.092
      = 0.1 + 0.8 * -0.01
θ5    θ5 = θ5 + α * Error 5         -0.185
      = -0.3 + 0.8 * 0.143
Iteration 2
Now with the updated weights and biases,
1. Calculate Input and Output in the Input Layer shown in Table 14.
Table 14: Net Input and Output Calculation
Input Layer   Ij   Oj
X1            1    1
X2            0    0

2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 15.
Table 15: Net Input and Output Calculation in the Hidden Layer and Output Layer
Unit j   Net Input Ij                                          Output Oj
X3       I3 = X1*W13 + X2*W23 + X0*θ3                          $O_3 = \frac{1}{1 + e^{-I_3}} = \frac{1}{1 + e^{-0.211}} = 0.552$
         I3 = 1*-0.194 + 0*0.2 + 1*0.405 = 0.211
X4       I4 = X1*W14 + X2*W24 + X0*θ4                          $O_4 = \frac{1}{1 + e^{-I_4}} = \frac{1}{1 + e^{-0.484}} = 0.618$
         I4 = 1*0.392 + 0*-0.3 + 1*0.092 = 0.484
X5       I5 = O3*W35 + O4*W45 + X0*θ5                          $O_5 = \frac{1}{1 + e^{-I_5}} = \frac{1}{1 + e^{0.181}} = 0.454$
         I5 = 0.552*0.262 + 0.618*-0.228 + 1*-0.185 = -0.181

The output we receive in the network at node 5 is 0.454.

Error = 1 - 0.454 = 0.546
Comparing the error in the previous iteration with the error in the current iteration shows that the
network has learnt: the error is reduced by 0.593 – 0.546 = 0.047.
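
A short sketch of one backpropagation step for this 2-2-1 network is given below, using the initial weights and biases of Table 8 and the training sample X1 = 1, X2 = 0 with desired output 1. The dictionary keys and the function name `backprop_step` are illustrative choices. The code keeps full precision, so the printed values can differ slightly in the third decimal place from hand calculations that truncate intermediate results.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, target, w, theta, lr=0.8):
    """One forward and backward pass; updates w and theta in place and returns O5."""
    # Forward propagation (the bias input X0 is 1)
    o3 = sigmoid(x[0] * w["13"] + x[1] * w["23"] + theta[3])
    o4 = sigmoid(x[0] * w["14"] + x[1] * w["24"] + theta[4])
    o5 = sigmoid(o3 * w["35"] + o4 * w["45"] + theta[5])
    # Errors at the output unit and at the hidden units
    e5 = o5 * (1 - o5) * (target - o5)
    e4 = o4 * (1 - o4) * e5 * w["45"]
    e3 = o3 * (1 - o3) * e5 * w["35"]
    # Weight updates: Wij = Wij + lr * Error_j * O_i
    w["13"] += lr * e3 * x[0]
    w["23"] += lr * e3 * x[1]
    w["14"] += lr * e4 * x[0]
    w["24"] += lr * e4 * x[1]
    w["35"] += lr * e5 * o3
    w["45"] += lr * e5 * o4
    # Bias updates: theta_j = theta_j + lr * Error_j
    theta[3] += lr * e3
    theta[4] += lr * e4
    theta[5] += lr * e5
    return o5

w = {"13": -0.2, "14": 0.4, "23": 0.2, "24": -0.3, "35": 0.2, "45": -0.3}
theta = {3: 0.4, 4: 0.1, 5: -0.3}
for iteration in (1, 2):
    o5 = backprop_step((1, 0), 1, w, theta)
    print(f"iteration {iteration}: O5 = {o5:.4f}, error = {1 - o5:.4f}")
```

Training on a single sample only illustrates the mechanics; learning XOR itself would require cycling through all four rows of the truth table for many epochs.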

Consider a network architecture with 4 input units and 2 output units. Consider four training
samples, each a vector of length 4.
Training samples
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial Weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]
Identify an algorithm that can learn without supervision. How are the samples clustered?

Solution:
Use Self Organizing Feature Map (SOFM)

Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]

Compute the squared Euclidean distance between X1: (1, 1, 1, 0) and Unit 1 weights.

$d^2 = (0.2 - 1)^2 + (0.8 - 1)^2 + (0.5 - 1)^2 + (0.1 - 0)^2 = 0.94$

Compute the squared Euclidean distance between X1: (1, 1, 1, 0) and Unit 2 weights.

$d^2 = (0.3 - 1)^2 + (0.5 - 1)^2 + (0.4 - 1)^2 + (0.6 - 0)^2 = 1.46$

Unit 1 wins
Update the weights of the winning unit
New Unit 1 weights = [0.2 0.8 0.5 0.1] + 0.6 ([1 1 1 0] - [0.2 0.8 0.5 0.1])
= [0.2 0.8 0.5 0.1] + 0.6 [0.8 0.2 0.5 -0.1]
= [0.2 0.8 0.5 0.1] + [0.48 0.12 0.30 -0.06]
= [0.68 0.92 0.80 0.04]
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.3 0.5 0.4 0.6]
Iteration 2:
Training Sample X2: (0, 0, 1, 1)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.3 0.5 0.4 0.6]
Compute the squared Euclidean distance between X2: (0, 0, 1, 1) and Unit 1 weights.

$d^2 = (0.68 - 0)^2 + (0.92 - 0)^2 + (0.80 - 1)^2 + (0.04 - 1)^2 = 2.2704$

Compute the squared Euclidean distance between X2: (0, 0, 1, 1) and Unit 2 weights.

$d^2 = (0.3 - 0)^2 + (0.5 - 0)^2 + (0.4 - 1)^2 + (0.6 - 1)^2 = 0.86$

Unit 2 wins
Update the weights of the winning unit
New Unit 2 weights = [0.3 0.5 0.4 0.6] + 0.6 ([0 0 1 1] - [0.3 0.5 0.4 0.6])
= [0.3 0.5 0.4 0.6] + 0.6 [-0.3 -0.5 0.6 0.4]
= [0.3 0.5 0.4 0.6] + [-0.18 -0.30 0.36 0.24]
= [0.12 0.2 0.76 0.84]
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.12 0.2 0.76 0.84]

Iteration 3:
Training Sample X3: (1, 0, 0, 1)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.12 0.2 0.76 0.84]

Compute the squared Euclidean distance between X3: (1, 0, 0, 1) and Unit 1 weights.

$d^2 = (0.68 - 1)^2 + (0.92 - 0)^2 + (0.80 - 0)^2 + (0.04 - 1)^2 = 2.51$

Compute the squared Euclidean distance between X3: (1, 0, 0, 1) and Unit 2 weights.

$d^2 = (0.12 - 1)^2 + (0.2 - 0)^2 + (0.76 - 0)^2 + (0.84 - 1)^2 = 1.42$

Unit 2 wins
Update the weights of the winning unit
New Unit 2 weights = [0.12 0.2 0.76 0.84] + 0.6 ([1 0 0 1] - [0.12 0.2 0.76 0.84])
= [0.12 0.2 0.76 0.84] + 0.6 [0.88 -0.2 -0.76 0.16]
= [0.12 0.2 0.76 0.84] + [0.53 -0.12 -0.46 0.10]
= [0.65 0.08 0.30 0.94]
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.65 0.08 0.30 0.94]

Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.65 0.08 0.30 0.94]

Compute the squared Euclidean distance between X4: (0, 0, 1, 0) and Unit 1 weights.

$d^2 = (0.68 - 0)^2 + (0.92 - 0)^2 + (0.80 - 1)^2 + (0.04 - 0)^2 = 1.35$

Compute the squared Euclidean distance between X4: (0, 0, 1, 0) and Unit 2 weights.

$d^2 = (0.65 - 0)^2 + (0.08 - 0)^2 + (0.30 - 1)^2 + (0.94 - 0)^2 = 1.8025$

Unit 1 wins
Update the weights of the winning unit
New Unit 1 weights = [0.68 0.92 0.80 0.04] + 0.6 ([0 0 1 0] - [0.68 0.92 0.80 0.04])
= [0.68 0.92 0.80 0.04] + 0.6 [-0.68 -0.92 0.20 -0.04]
= [0.68 0.92 0.80 0.04] + [-0.408 -0.552 0.12 -0.024]
= [0.27 0.37 0.92 0.02]
Unit 1: [0.27 0.37 0.92 0.02]
Unit 2: [0.65 0.08 0.30 0.94]

The best matching unit for each of the samples is:

X1: (1, 1, 1, 0) → Unit 1
X2: (0, 0, 1, 1) → Unit 2
X3: (1, 0, 0, 1) → Unit 2
X4: (0, 0, 1, 0) → Unit 1

This process is continued for many epochs until the feature map doesn’t change.
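
The winner-take-all update used in these four iterations can be sketched as follows. This simplified version updates only the winning unit, with no neighbourhood function and a constant learning rate, which matches the worked example; the function name `sofm_epoch` is an illustrative assumption.

```python
def sofm_epoch(samples, weights, lr=0.6):
    """Present each sample once; move the closest unit's weights towards the sample."""
    weights = [list(w) for w in weights]
    winners = []
    for x in samples:
        # Squared Euclidean distance from the sample to every unit's weight vector
        d2 = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
        win = d2.index(min(d2))
        winners.append(win)
        # w_new = w_old + lr * (x - w_old), applied to the winning unit only
        weights[win] = [wi + lr * (xi - wi) for wi, xi in zip(weights[win], x)]
    return winners, weights

samples = [(1, 1, 1, 0), (0, 0, 1, 1), (1, 0, 0, 1), (0, 0, 1, 0)]
initial = [[0.2, 0.8, 0.5, 0.1], [0.3, 0.5, 0.4, 0.6]]
winners, weights = sofm_epoch(samples, initial)
print([w + 1 for w in winners])  # unit numbers, counted from 1
print(weights)
```

Calling the function again with the returned weights runs a further epoch; in practice the learning rate is usually decreased over time.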
documents are far apart by the Euclidean distance (due to the size of the
document), chances are they may still be oriented closer together.
The smaller the angle, higher the cosine similarity.
Consider 2 documents P1 and P2.
◦ If distance is more, then less similar.
◦ If distance is less, then more similar.

1. Consider the following data and calculate the Euclidean, Manhattan and
Chebyshev distances.

a. (2 3 4) and (1 5 6)

Solution

Euclidean distance = $\sqrt{(2-1)^2 + (3-5)^2 + (4-6)^2} = \sqrt{1 + 4 + 4} = \sqrt{9} = 3$

Manhattan distance = $|2-1| + |3-5| + |4-6| = 1 + 2 + 2 = 5$

Chebyshev distance = $\max\{|2-1|, |3-5|, |4-6|\} = \max\{1, 2, 2\} = 2$

b. (2 2 9) and (7 8 9)

Euclidean distance = $\sqrt{(2-7)^2 + (2-8)^2 + (9-9)^2} = \sqrt{25 + 36 + 0} = \sqrt{61} = 7.81$

Manhattan distance = $|2-7| + |2-8| + |9-9| = 5 + 6 + 0 = 11$

Chebyshev distance = $\max\{|2-7|, |2-8|, |9-9|\} = \max\{5, 6, 0\} = 6$
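
The three distances can be checked with a few one-line functions. The function names are illustrative and the vectors are assumed to have equal length.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

for x, y in (((2, 3, 4), (1, 5, 6)), ((2, 2, 9), (7, 8, 9))):
    print(x, y, round(euclidean(x, y), 2), manhattan(x, y), chebyshev(x, y))
```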

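2. Find the cosine similarity, SMC and Jaccard coefficient for the following binary
data:

a. (1 0 1 1) and (1 1 0 0)

Solution
x = 1 0 1 1
y = 1 1 0 0

Let a be the number of positions where both are 0, b the number where x is 0 and y is 1, c the
number where x is 1 and y is 0, and d the number where both are 1.
a = 0, b = 1, c = 2, d = 1

SMC = $\frac{a + d}{a + b + c + d} = \frac{1}{4} = 0.25$

Jaccard Coefficient = $\frac{d}{b + c + d} = \frac{1}{4} = 0.25$

Cosine Similarity = $\frac{1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0 + 1 \cdot 0}{\sqrt{1^2 + 0^2 + 1^2 + 1^2}\,\sqrt{1^2 + 1^2 + 0^2 + 0^2}} = \frac{1}{\sqrt{3}\sqrt{2}} = 0.408$

b. (1 0 0 0 1) and (1 0 0 0 0 1)

Solution
No match: the two vectors have different lengths, so the measures cannot be computed.

(1 0 0 0 1) and (1 1 0 0 0)

x = 1 0 0 0 1
y = 1 1 0 0 0

a = 2, b = 1, c = 1, d = 1

SMC = $\frac{a + d}{a + b + c + d} = \frac{3}{5} = 0.6$

Jaccard Coefficient = $\frac{d}{b + c + d} = \frac{1}{3} = 0.33$

Cosine Similarity = $\frac{1 \cdot 1 + 0 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 1 \cdot 0}{\sqrt{1^2 + 0^2 + 0^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 0^2 + 0^2 + 0^2}} = \frac{1}{\sqrt{2}\sqrt{2}} = 0.5$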
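
These similarity measures can be computed directly from the match counts. In the sketch below, a counts the 0-0 positions, b the 0-1 positions, c the 1-0 positions and d the 1-1 positions; the function names are illustrative, and vectors of unequal length are rejected.

```python
import math

def match_counts(x, y):
    """Return (a, b, c, d): counts of 0-0, 0-1, 1-0 and 1-1 positions."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    return (pairs.count((0, 0)), pairs.count((0, 1)),
            pairs.count((1, 0)), pairs.count((1, 1)))

def smc(x, y):
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, d = match_counts(x, y)
    return d / (b + c + d)

def cosine_similarity(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y)))

for x, y in (((1, 0, 1, 1), (1, 1, 0, 0)), ((1, 0, 0, 0, 1), (1, 1, 0, 0, 0))):
    print(smc(x, y), round(jaccard(x, y), 2), round(cosine_similarity(x, y), 3))
```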
3. Find Hamming distance for the following binary data:
a. (1 1 1) and (1 0 0)

Solution
It differs in two positions; therefore Hamming distance is 2
b. (1 1 1 0 0) and (0 0 1 1 1)
Solution
It differs in four positions; therefore, Hamming distance is 4
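
The Hamming distance is a count of the positions in which two equal-length vectors differ, which can be written as a one-line sum. The function name is an illustrative choice.

```python
def hamming_distance(x, y):
    """Number of positions in which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))

print(hamming_distance((1, 1, 1), (1, 0, 0)))
print(hamming_distance((1, 1, 1, 0, 0), (0, 0, 1, 1, 1)))
```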

4. Find the distance between:

a. Employee ID: 1000 and 1001
Solution
For a nominal attribute, the distance is 0 if the two values are equal and 1 otherwise.
The two IDs are not equal. Therefore, the distance is 1

b. Employee name – John & John and John & Joan

Solution
The distance between John and John is 0
The distance between John and Joan is 1

5. Find the distance between:

a. (Yellow, red, green) and (red, green, yellow)

Solution

Yellow = 1, red = 2, Green = 3

There are n = 3 ordered values, so each difference in position is divided by n - 1 = 2.

Therefore, the distance between (yellow, red) = $\frac{|1 - 2|}{3 - 1} = \frac{1}{2} = 0.5$

Distance between (red, green) = $\frac{|2 - 3|}{3 - 1} = \frac{1}{2} = 0.5$

Distance between (green, yellow) = $\frac{|3 - 1|}{3 - 1} = \frac{2}{2} = 1$

Therefore, distance between (Yellow, red, green) and (red, green, yellow) is (0.5, 0.5, 1).
b. (bread, butter, milk) and (milk, sandwich, Tea)

Solution

Bread = 1, Butter = 2, Milk = 3, Sandwich = 4, Tea = 5

The distance between (bread, milk) = $\frac{|1 - 3|}{5 - 1} = \frac{2}{4} = \frac{1}{2}$

The distance between (butter, sandwich) = $\frac{|2 - 4|}{5 - 1} = \frac{2}{4} = \frac{1}{2}$

The distance between (Milk, Tea) = $\frac{|3 - 5|}{5 - 1} = \frac{2}{4} = \frac{1}{2}$

Therefore, the distance between (bread, butter, milk) and (milk, sandwich, Tea) =
$\left(\frac{1}{2}, \frac{1}{2}, \frac{1}{2}\right)$
