
Pattern Recognition Part-1

Introduction
Pattern Recognition - the scientific process of classifying objects into a number of categories or classes.

Objects - the images, signals, or measurements that need to be classified; the objects used for classification are called patterns.

For example - speech recognition, handwriting classification, fault detection in machinery, diagnosis, etc.

Types of Approaches:

Statistical - based on statistical analysis.

Syntactic - based on structural analysis

Features - measurable properties of an object. Features can be crucial for classifying and
differentiating between classes.

Numerical Features - continuous values (e.g. weight, height).

Categorical Features - discrete categories (e.g. colour, type), often converted into numerical formats for machine learning.

Feature Vector - an 'n'-dimensional vector encapsulating the numerical features of an object, processed by machine learning models.

Input Representation - feature vectors convert raw data into numerical formats that
machine learning models can use effectively.

Dimensionality Reduction - techniques like Principal Component Analysis (PCA) reduce dataset dimensions while keeping the essential information.

Classification & Clustering - Algorithms like K-nearest neighbors (KNN) and Support
Vector Machines (SVM) use feature vectors to classify group data.

Classifiers - algorithms that categorize data points using feature vectors; various classifiers use these vectors to make predictions (a small nearest-neighbour sketch follows this list).

Nearest Neighbor classification - classifies data points by the majority class of their
nearest neighbors in feature space.

Neural Networks - learn complex patterns by processing feature vectors across multiple layers.

Statistical Techniques - Bayesian classifiers use probability distributions over feature vectors for predictions.
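As a small, hedged illustration of these ideas (not from the original notes; the data values and function name are invented for the example), the sketch below builds feature vectors for a toy dataset and classifies a new point with a single-nearest-neighbour rule:

```python
# Minimal nearest-neighbour sketch on made-up feature vectors.
import numpy as np

# Feature vectors: [weight (kg), height (cm)] for two classes 0 and 1.
X = np.array([[50, 160], [55, 165], [80, 180], [85, 185]], dtype=float)
y = np.array([0, 0, 1, 1])

def nearest_neighbor_predict(x_new, X, y):
    """Classify x_new by the label of its nearest neighbour in feature space."""
    distances = np.linalg.norm(X - x_new, axis=1)  # Euclidean distances
    return y[np.argmin(distances)]

print(nearest_neighbor_predict(np.array([52, 162]), X, y))  # expected: 0
```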

Phases in Pattern Recognition System



Sensing - collects raw data from various sources, converting it into a usable format.

Segmentation - isolates individual objects or regions of interest within the data.

Feature Extraction - Identifies and extracts relevant characteristics from the segmented
data.

Classification - the extracted features are used to classify data into categories; this involves training a model to learn the patterns and relationships between features and classes.

Clustering - when the data has no predefined labels, clustering techniques are used to group similar data points together based on their feature similarities.

Regression - regression techniques are used when the goal is to predict continuous numerical values; the model learns the relationship between the features and the target variable.

Types of Pattern Recognition


Statistical PR -

relies on probability and statistical methods to learn from examples

suitable for numerical values / data

ex:- Bayesian classification - Bayes theorem

Discriminant Analysis

Hidden Markov Models (HMM) - sequential data [Speech / Time]

Structural PR -

Deals with complex patterns

Focuses on relationships between elements

Template Matching - Compare pattern with a template to find matches

Syntactic PR - represents patterns as strings of symbols and uses grammar rules to recognize them.

Graph Matching - compares graphs representing patterns to find similarities

Neural Network Based PR -

powerful for complex patterns and can learn from examples and adapt to new data.

Artificial Neural Network (ANNs)

Convolutional Neural Network (CNNs)



Recurrent Neural Network (RNNs)

Fuzzy PR -

Handles uncertainty and imprecision

Allows partial membership

Decision Tree
Decision Tree which has a hierarchical structure made up of root, branches, internal and
leaf nodes, is a non-parametric, supervised learning approach.

Role in Pattern Recognition:

Feature Selection - decision trees can help identify the most informative features in a dataset, aiding in feature engineering and dimensionality reduction.

Classification - by partitioning the feature space into decision regions, decision trees can effectively classify patterns into categories.

Interpretability - DTs are inherently interpretable, allowing for easier understanding of the decision-making process.

Decision Tree Terminologies -

Root Node - initial node at beginning of the Decision Tree, where entire population
starts dividing based on various features / conditions.

Decision Nodes - nodes that result from splitting the root node; they represent intermediate decisions / conditions.

Leaf Nodes - Nodes where further splitting is not possible.

Branch / Subtree - A subsection of the entire decision tree

Pruning - the process of removing branches / nodes in a decision tree to simplify it, prevent overfitting, or improve generalization.

Splitting - the process of dividing a dataset into subsets on the basis of the value of an attribute.

Edge - connection between nodes representing possible outcomes of a test.

Assumptions when building decision tree models


Feature Independence - Features are assumed to be independent from each other.

Binary Splits - Decision Trees use binary splits to divide data at each node into two
subsets.

Homogeneity - DTs aim to create homogeneous subgroups in each node, i.e. each leaf
node should ideally contain similar instances regarding the target variable.

Top Down Greedy Approach - DTs are created using greedy approach where each split is
chosen to maximize information gain / minimize the impurity at the current node.

Overfitting - DTs are prone to overfitting when the tree is allowed to grow too deep: the model learns the training data too well, capturing noise rather than the underlying patterns.

Consequences and their prevention:

Poor Generalization → Pruning
Reduced Accuracy → Feature Selection
Increased Complexity → Ensemble methods

No outliers - DTs are sensitive to outliers, and extreme values can influence their
construction. To handle outliers effectively pre processing / robust methods may be
needed.

Sensitive to Sample Size - small datasets may lead to overfitting and larger datasets may
result in complex trees. The sample size and tree depth should be balanced.

Advantages of DT:

easy to understand

do not make assumptions about the underlying data distribution

handle both categorical and numerical data

handle missing data by creating branches for missing values

Disadvantages of DT:

overfitting

small changes in the dataset can lead to a change in the structure of the DT

can introduce bias if the data is imbalanced, as the algorithm tends to favour majority classes

lack of global optimality - DTs are greedy algorithms that make a decision at each node based on the local best choice, which can prevent them from finding the globally optimal solution

Quantifying the Decision Trees


Entropy - measures the randomness or uncertainty in a dataset

Entropy(S) = − Σ_{i=1}^{m} p_i · log2(p_i)

where S is the dataset and p_i is the proportion of instances belonging to class i.

A higher entropy value indicates greater uncertainty, while a lower entropy value
indicates a more homogeneous dataset. (lower the better)

Gini Impurity - (Gini Index) - measures the probability of a randomly chosen element being
incorrectly classified.

Gini(S) = 1 − Σ_{i=1}^{n} (p_i)^2

A higher Gini impurity value indicates greater uncertainty, while a lower impurity value
indicates more homogeneous dataset. (lower the better)

Information Gain - measures the reduction in entropy achieved by splitting on a particular attribute.



Information Gain(S, A) = Entropy(S) − Σ_{v=1}^{V} (|S_v| / |S|) · Entropy(S_v)

where S is the dataset, A is the attribute, and S_v is the subset of S that has value v for attribute A.

A higher information gain value indicates a more informative attribute for splitting.

Gain Ratio - adjusts information gain by taking into account the intrinsic information of the attribute (the split info).

Split Info(T, A) = − Σ_{i=1}^{v} (|A_i| / |T|) · log2(|A_i| / |T|)

Gain Ratio = Information Gain(T, A) / Split Info(T, A)

Helps to overcome the bias of the information gain towards attributes with many values
by normalizing it

Normalizing Information Gain -

An adaptation of the standard information gain measure, used to address some of its limitations, particularly the bias towards attributes with a large number of distinct values.

Normalized information gain adjusts information gain by normalizing it against the maximum possible information gain (the entropy) for a given attribute. This provides a more balanced comparison between attributes.

Normalized Information Gain(A) = Information Gain(A) / Entropy(S)
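The measures above can be sanity-checked with a few lines of Python. This is a hedged sketch (the function names are my own, not from the notes); it reproduces the Outlook split of the play-tennis example worked out in the CART section below.

```python
# Impurity measures for decision trees: entropy, Gini, information gain, gain ratio.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

def gain_ratio(parent, subsets):
    n = len(parent)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets if s)
    return information_gain(parent, subsets) / split_info if split_info else 0.0

# Play-tennis 'Outlook' split: Sunny (2 Yes / 3 No), Overcast (4/0), Rain (3/2).
parent = ["Y"] * 9 + ["N"] * 5
subsets = [["Y"] * 2 + ["N"] * 3, ["Y"] * 4, ["Y"] * 3 + ["N"] * 2]
print(round(entropy(parent), 3))                    # ~0.940
print(round(gini(parent), 3))
print(round(information_gain(parent, subsets), 3))  # ~0.247
print(round(gain_ratio(parent, subsets), 3))
```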

CART Algorithm
Classification and Regression Tree

variant of decision tree algorithm

can handle both classification and regression

supervised learning algorithm that learns from labelled data to predict unseen data.

always create a binary tree.

predefined stopping condition

step size should be limited.

How CART Works?


The CART algorithm works by recursively partitioning the input data into subsets based on the values of the features, ultimately creating a decision tree that can be used for classification and regression tasks.

Initial Split - choose the best feature and value to split the data



Recursive Partitioning - repeat splitting for each subset.

Stopping criteria - maximum depth of a tree, a minimum number of instances in each leaf
node.

Tree Pruning - Remove unnecessary branches to prevent overfitting

Prediction - follow the tree to leaf node to get the predicted class or value.

Steps
Calculate Gini Impurity for the entire dataset. This is the impurity of the root node.

For each input variable, calculate Gini Impurity for all possible split points. The split point
that results in minimum Gini impurity is chosen.

Data Input is split into two subsets based on the chosen split point and new node is created
for each subset.

Step 2 and 3 are repeated for each node until a stopping condition is met.

resulting tree is the decision tree.
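A hedged scikit-learn sketch of these steps is shown below, applied to the play-tennis table used in the worked example that follows (it assumes scikit-learn and pandas are available; because sklearn's CART implementation one-hot-encodes the categories and always splits binarily, the exact split structure may differ from the hand-worked multi-way tree, but the criterion is the same Gini impurity).

```python
# CART-style decision tree on the play-tennis data (sketch, not the notes' own code).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Decision":    ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="Decision"))   # one-hot encode categorical features
y = data["Decision"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0)  # CART uses Gini impurity
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))          # text view of the splits
```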

Advantages of CART:

simple and intuitive algorithm that is easy to understand

handles both numerical and categorical data

handles missing values by imputing them with surrogate values

can handle multi-class classification problems by using an extension called multi-class CART

Disadvantages of CART:

tends to overfit, especially if the tree is grown too deep

greedy algorithm (may not be optimal)

bias towards attributes with many categories

unstable - small changes in the dataset can affect the tree structure

Day Outlook Temperature Humidity Wind Decision

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No



Outlook Yes No No of instances

Sunny 2 3 5

Overcast 4 0 4

Rain 3 2 5

total number of instances = 14


Gini(Outlook) = (5/14)·Gini(Sunny) + (4/14)·Gini(Overcast) + (5/14)·Gini(Rain)   (weighted by subset size)

Gini(Sunny) = 1 − P(yes)^2 − P(no)^2 = 1 − (2/5)^2 − (3/5)^2 = 0.48
Gini(Overcast) = 1 − (4/4)^2 − (0/4)^2 = 0
Gini(Rain) = 1 − (3/5)^2 − (2/5)^2 = 0.48

Gini(Outlook) = (5/14)·0.48 + (4/14)·0 + (5/14)·0.48 = 0.342

Temperature Yes No No of instances

Hot 2 2 4

Cool 3 1 4

Mild 4 2 6

total number of instances = 14

Gini(Temp) = (4/14)·Gini(Hot) + (4/14)·Gini(Cool) + (6/14)·Gini(Mild)
Gini(Hot) = 1 − (2/4)^2 − (2/4)^2 = 0.5
Gini(Cool) = 1 − (3/4)^2 − (1/4)^2 = 0.375
Gini(Mild) = 1 − (4/6)^2 − (2/6)^2 = 0.44

Gini(Temp) = (4/14)·0.5 + (4/14)·0.375 + (6/14)·0.44 = 0.439

Humidity Yes No No of instances

High 3 4 7

Normal 6 1 7



total number of instances = 14
Gini(Humidity) = (7/14)·Gini(High) + (7/14)·Gini(Normal)
Gini(High) = 1 − (3/7)^2 − (4/7)^2 = 0.489
Gini(Normal) = 1 − (6/7)^2 − (1/7)^2 = 0.244

Gini(Humidity) = (7/14)·0.489 + (7/14)·0.244 = 0.366

Wind Yes No No of instances

Weak 6 2 8

Strong 3 3 6

total number of instances = 14


Gini(Wind) = (8/14)·Gini(Weak) + (6/14)·Gini(Strong)
Gini(Weak) = 1 − (6/8)^2 − (2/8)^2 = 0.375
Gini(Strong) = 1 − (3/6)^2 − (3/6)^2 = 0.5

Gini(Wind) = (8/14)·0.375 + (6/14)·0.5 = 0.428

Comparing the Gini values of Outlook, Temperature, Humidity, and Wind, the Outlook attribute should be chosen because it has the minimum impurity, so the root node is Outlook.

Now we have to take the Outlook node and split them into attributes and the new table is

Outlook = Sunny

Temperature Humidity Wind Decision

Hot High Weak No

Hot High Strong No

Mild High Weak No

Cool Normal Weak Yes

Mild Normal Strong Yes



Now we have to repeat the steps we did before for finding the next layer of nodes for each
branch

Temperature Yes No No of instances

Hot 0 2 2

Cool 1 0 1

Mild 1 1 2

total number of instances = 5


Gini(Temp) = (2/5)·Gini(Hot) + (1/5)·Gini(Cool) + (2/5)·Gini(Mild)
Gini(Hot) = 1 − (0/2)^2 − (2/2)^2 = 0
Gini(Cool) = 1 − (1/1)^2 − (0/1)^2 = 0
Gini(Mild) = 1 − (1/2)^2 − (1/2)^2 = 0.5

Gini(Temp) = (2/5)·0 + (1/5)·0 + (2/5)·0.5 = 0.2

Humidity Yes No No of instances

High 0 3 3

Normal 2 0 2

total number of instances = 5


Gini(Humidity) = (3/5)·Gini(High) + (2/5)·Gini(Normal)
Gini(High) = 1 − (0/3)^2 − (3/3)^2 = 0
Gini(Normal) = 1 − (2/2)^2 − (0/2)^2 = 0

Gini(Humidity) = (3/5)·0 + (2/5)·0 = 0

Wind Yes No No of instances

Weak 1 2 3

Strong 1 1 2



total number of instances = 5

Gini(Wind) = (3/5)·Gini(Weak) + (2/5)·Gini(Strong)
Gini(Weak) = 1 − (1/3)^2 − (2/3)^2 = 0.44
Gini(Strong) = 1 − (1/2)^2 − (1/2)^2 = 0.5

Gini(Wind) = (3/5)·0.44 + (2/5)·0.5 = 0.464

After comparing the attributes for choosing the right node for the sunny branch, i.e.
Temperature, Humidity and Wind the node will be Humidity since it has the minimum Gini
Impurity

Now the Decision Tree Will be like:

Now we look at the Overcast branch and the table will be

Temperature Humidity Wind Decision

Hot High Weak Yes

Cool Normal Strong Yes

Mild High Strong Yes

Hot Normal Weak Yes

Here all the instances have decision Yes, so it is a pure node.

💡 If every instance in a branch has the same label (all Yes or all No), it is a pure node.

Now the updated Decision Tree will be like:



Outlook = Rain

Temperature Humidity Wind Decision

Mild High Weak Yes

Cool Normal Weak Yes

Cool Normal Strong No

Mild Normal Weak Yes

Mild High Strong No

Temperature Yes No No of instances

Mild 2 1 3

Cool 1 1 2

total number of instances = 5


Gini(Temp) = (3/5)·Gini(Mild) + (2/5)·Gini(Cool)
Gini(Mild) = 1 − (2/3)^2 − (1/3)^2 = 0.44
Gini(Cool) = 1 − (1/2)^2 − (1/2)^2 = 0.5

Gini(Temp) = (3/5)·0.44 + (2/5)·0.5 = 0.464

Humidity Yes No No of instances

High 1 1 2

Normal 2 1 3



total number of instances = 5

Gini(Humidity) = (2/5)·Gini(High) + (3/5)·Gini(Normal)
Gini(High) = 1 − (1/2)^2 − (1/2)^2 = 0.5
Gini(Normal) = 1 − (2/3)^2 − (1/3)^2 = 0.44

Gini(Humidity) = (2/5)·0.5 + (3/5)·0.44 = 0.464

Wind Yes No No of instances

Weak 3 0 3

Strong 0 2 2

total number of instances = 5

Gini(Wind) = (3/5)·Gini(Weak) + (2/5)·Gini(Strong)
Gini(Weak) = 1 − (3/3)^2 − (0/3)^2 = 0
Gini(Strong) = 1 − (0/2)^2 − (2/2)^2 = 0

Gini(Wind) = (3/5)·0 + (2/5)·0 = 0

After comparing the attributes for choosing the right node for the rain branch, i.e. Temperature,
Humidity and Wind the node will be Wind since it has the minimum Gini Impurity

Now the final decision tree will be like:

ID3 Algorithm
Iterative Dichotomiser 3

primarily used for classification tasks

How ID3 works?



Input Data - ID3 requires a dataset with labelled instances where each instance has a set of
features and a corresponding class label.

Calculate Entropy - For each attribute ID3 calculates the entropy, which measures the
impurity or disorder in the dataset

Information Gain - For each attribute, ID3 computes the information gain which measures
how much uncertainty in the class label is reduced after splitting the dataset based on the
attribute.

Select the best attribute - Choose the attribute with the highest info gain. This attribute is
the best choice for splitting the dataset at this node.

Create branches - divide the dataset into subsets based on the selected attributes values,
creating branches in the decision tree.

Stopping Criteria - for each subset check if all instances belong to the same class (leaf
node), no attributes are left to split on (leaf node), the subset is empty (prune this branch)

Repeat steps 2 to 6 for each subset, treating each as a new dataset. Once the tree is fully grown, assign class labels to the leaf nodes based on the majority class in that subset.
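The loop above can be written very compactly. The following is a hedged, minimal sketch of an ID3-style learner (my own illustrative implementation, assuming each example is a dict of categorical features plus a target key); it returns the tree as nested dictionaries.

```python
# Minimal ID3-style sketch: pick the attribute with highest information gain, recurse.
import math
from collections import Counter

def entropy(rows, target):
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def info_gain(rows, attr, target):
    n, gain = len(rows), entropy(rows, target)
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        gain -= len(subset) / n * entropy(subset, target)
    return gain

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:            # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]   # majority class label
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best], target)
                   for v in set(r[best] for r in rows)}}
```

Calling id3(rows, ["Outlook", "Temperature", "Humidity", "Wind"], "Decision") on the table below would be expected to pick Outlook at the root, matching the hand calculation.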

Day Outlook Temperature Humidity Wind Decision

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Entropy(Outlook)

Entropy(S) = − Σ_{i=1}^{m} p_i · log2(p_i)

E(S) = − (9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.409 + 0.530 = 0.939 ≈ 0.94

Entropy(Sunny) = − (2/5)·log2(2/5) − (3/5)·log2(3/5) = 0.528 + 0.442 = 0.97
Entropy(Rain) = − (3/5)·log2(3/5) − (2/5)·log2(2/5) = 0.442 + 0.528 = 0.97
Entropy(Overcast) = − (4/4)·log2(4/4) − (0/4)·log2(0/4) = 0

Gain(Outlook) = 0.94 − [ (5/14)·0.97 + (5/14)·0.97 + (4/14)·0 ] = 0.247

Temperature

Entropy(Hot) = − (2/4)·log2(2/4) − (2/4)·log2(2/4) = 1
Entropy(Mild) = − (4/6)·log2(4/6) − (2/6)·log2(2/6) = 0.389 + 0.528 = 0.917
Entropy(Cool) = − (3/4)·log2(3/4) − (1/4)·log2(1/4) = 0.311 + 0.5 = 0.811

Gain(Temperature) = 0.94 − [ (4/14)·1 + (6/14)·0.917 + (4/14)·0.811 ] = 0.03

Humidity

Entropy(High) = − (3/7)·log2(3/7) − (4/7)·log2(4/7) = 0.523 + 0.461 = 0.984
Entropy(Normal) = − (6/7)·log2(6/7) − (1/7)·log2(1/7) = 0.190 + 0.401 = 0.591

Gain(Humidity) = 0.94 − [ (7/14)·0.984 + (7/14)·0.591 ] = 0.15

Wind

Entropy(Weak) = − (6/8)·log2(6/8) − (2/8)·log2(2/8) = 0.311 + 0.5 = 0.811
Entropy(Strong) = − (3/6)·log2(3/6) − (3/6)·log2(3/6) = 1

Gain(Wind) = 0.94 − [ (8/14)·0.811 + (6/14)·1 ] = 0.05



After the first iteration we will select outlook as the root node since it has the largest gain

Sunny

Temperature Humidity Wind Decision

Hot High Weak No

Hot High Strong No

Mild High Weak No

Cool Normal Weak Yes

Mild Normal Strong Yes

Entropy(Sunny) = − (2/5)·log2(2/5) − (3/5)·log2(3/5) = 0.528 + 0.442 = 0.97

Temperature

Entropy(Hot) = − (0/2)·log2(0/2) − (2/2)·log2(2/2) = 0
Entropy(Mild) = − (1/2)·log2(1/2) − (1/2)·log2(1/2) = 1
Entropy(Cool) = − (1/1)·log2(1/1) − (0/1)·log2(0/1) = 0

Gain(Temperature) = 0.97 − [ (2/5)·0 + (2/5)·1 + (1/5)·0 ] = 0.97 − 0.4 = 0.57

Humidity

Entropy(High) = − (0/3)·log2(0/3) − (3/3)·log2(3/3) = 0
Entropy(Normal) = − (2/2)·log2(2/2) − (0/2)·log2(0/2) = 0

Gain(Humidity) = 0.97 − [ (3/5)·0 + (2/5)·0 ] = 0.97 − 0 = 0.97

Wind



Entropy(Weak) = − (1/3)·log2(1/3) − (2/3)·log2(2/3) = 0.528 + 0.389 = 0.917
Entropy(Strong) = − (1/2)·log2(1/2) − (1/2)·log2(1/2) = 1

Gain(Wind) = 0.97 − [ (3/5)·0.917 + (2/5)·1 ] = 0.02

After Second Iteration we found that Humidity is the best node for Outlook=sunny and
Outlook=overcast is a pure node so we have formed two sub branches

Outlook = Rain

Temperature Humidity Wind Decision

Mild High Weak Yes

Cool Normal Weak Yes

Cool Normal Strong No

Mild Normal Weak Yes

Mild High Strong No

For the Rain branch, Entropy(Rain) = − (3/5)·log2(3/5) − (2/5)·log2(2/5) = 0.97 (3 Yes, 2 No).

Temperature

Entropy(Mild) = − (2/3)·log2(2/3) − (1/3)·log2(1/3) = 0.389 + 0.528 = 0.917
Entropy(Cool) = − (1/2)·log2(1/2) − (1/2)·log2(1/2) = 1

Gain(Temp) = 0.97 − [ (3/5)·0.917 + (2/5)·1 ] = 0.02

Humidity



Entropy(High) = − (1/2)·log2(1/2) − (1/2)·log2(1/2) = 1
Entropy(Normal) = − (2/3)·log2(2/3) − (1/3)·log2(1/3) = 0.918

Gain(Humidity) = 0.97 − [ (2/5)·1 + (3/5)·0.918 ] = 0.02

Wind

Entropy(Strong) = − (0/2)·log2(0/2) − (2/2)·log2(2/2) = 0
Entropy(Weak) = − (3/3)·log2(3/3) − (0/3)·log2(0/3) = 0

Gain(Wind) = 0.97 − [ (2/5)·0 + (3/5)·0 ] = 0.97

After the third iteration we find that Wind is the best attribute for the Rain branch, and the final decision tree will look like this:

Disadvantages of ID3
overfitting

bias towards attributes with more values

limited to categorical data

prone to irrelevant features

handling missing data - ID3 does not deal with the missing values

greedy approach

C4.5 Algorithm
extension of ID3 algorithm which is designed to improve ID3’s shortcomings

improved flexibility and accuracy

Advantages



Improved accuracy

Handling Different Types of data - both continuous and categorical data

Resilience to noisy data - pruning helps preventing overfitting

Flexibility - handles missing values.

Disadvantages
Computational Complexity - can be slower compared to simpler algorithms, especially with
larger dataset

memory consumption - although more efficient than ID3, C4.5 uses significant memory and
processing power.

Steps
Compute entropy information of whole dataset based on target attribute

Entropy Info(T) = − Σ_{i=1}^{m} p_i · log2(p_i)

next for each attribute in training set compute the entropy information

Entropy Info(T, A) = Σ_{i=1}^{v} (|A_i| / |T|) · Entropy(A_i)

Next, calculate the information gain:

Information Gain(A) = Entropy Info(T) − Entropy Info(T, A)

Split info and gain ratio are given by:

Split Info(T, A) = − Σ_{i=1}^{v} (|A_i| / |T|) · log2(|A_i| / |T|)

Gain Ratio(A) = Information Gain(A) / Split Info(T, A)

choose the attribute for which gain ratio is maximum as best split attribute

best split attribute will be the root node

root nodes are branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute.

recursively apply same operations for the subsets until you get the leaf nodes.

CGPA Interactiveness Practical Knowledge Communicative Skills Job Offer

≥9 Yes v good good Yes

≥8 No good moderate Yes

≥9 No avg poor No


<8 No avg good No

≥8 Yes good moderate Yes

≥9 Yes good moderate Yes

<8 Yes good poor No

≥9 No v good good Yes

≥8 Yes good good Yes

≥8 Yes avg good Yes

Entropy(T = Job Offer) = − (7/10)·log2(7/10) − (3/10)·log2(3/10) = 0.360 + 0.521 = 0.881

CGPA

Entropy Info(T, CGPA) = Σ_{i=1}^{v} (|A_i| / |T|) · Entropy(A_i)

= (4/10)·[ −(3/4)·log2(3/4) − (1/4)·log2(1/4) ]
+ (4/10)·[ −(4/4)·log2(4/4) − (0/4)·log2(0/4) ]
+ (2/10)·[ −(0/2)·log2(0/2) − (2/2)·log2(2/2) ]
= 0.324

Info Gain(CGPA) = Entropy(T) − Entropy Info(T, CGPA) = 0.881 − 0.324 = 0.557

Split Info(T, CGPA) = − (4/10)·log2(4/10) − (4/10)·log2(4/10) − (2/10)·log2(2/10) = 0.528 + 0.528 + 0.464 = 1.52

Gain Ratio(CGPA) = Info Gain / Split Info = 0.557 / 1.52 = 0.366

Interactiveness



Entropy Info(T, Interactiveness) = (6/10)·[ −(5/6)·log2(5/6) − (1/6)·log2(1/6) ]
+ (4/10)·[ −(2/4)·log2(2/4) − (2/4)·log2(2/4) ]
= 0.789

Info Gain(Interactiveness) = 0.881 − 0.789 = 0.092

Split Info(T, Interactiveness) = − (6/10)·log2(6/10) − (4/10)·log2(4/10) = 0.442 + 0.528 = 0.97

Gain Ratio(Interactiveness) = 0.092 / 0.97 = 0.0948

Practical Knowledge

Entropy Info(T, Practical Knowledge) = (2/10)·[ −(2/2)·log2(2/2) − (0/2)·log2(0/2) ]
+ (5/10)·[ −(4/5)·log2(4/5) − (1/5)·log2(1/5) ]
+ (3/10)·[ −(1/3)·log2(1/3) − (2/3)·log2(2/3) ]
= 0.6356

Info Gain(Practical Knowledge) = 0.881 − 0.6356 = 0.2454

Split Info(T, Practical Knowledge) = − (2/10)·log2(2/10) − (5/10)·log2(5/10) − (3/10)·log2(3/10) = 0.464 + 0.5 + 0.521 = 1.485

Gain Ratio(Practical Knowledge) = 0.2454 / 1.485 = 0.165

Communication Skills



Entropy Info(T, Communication) = (5/10)·[ −(4/5)·log2(4/5) − (1/5)·log2(1/5) ]
+ (3/10)·[ −(3/3)·log2(3/3) − (0/3)·log2(0/3) ]
+ (2/10)·[ −(0/2)·log2(0/2) − (2/2)·log2(2/2) ]
= 0.3605

Info Gain(Communication) = 0.881 − 0.3605 = 0.5205

Split Info(T, Communication) = − (5/10)·log2(5/10) − (3/10)·log2(3/10) − (2/10)·log2(2/10) = 0.5 + 0.521 + 0.464 = 1.485

Gain Ratio(Communication) = 0.5205 / 1.485 = 0.350

The root node is CGPA since it has the max gain ratio

CGPA(≥9)

Interactiveness Practical Knowledge Communicative Skills Job Offer

Yes v good good Yes

No avg poor No

Yes good moderate Yes

No v good good Yes

Entropy(T = Job Offer) = − (3/4)·log2(3/4) − (1/4)·log2(1/4) = 0.311 + 0.5 = 0.811

Interactiveness



Entropy Info(T, Interactiveness) = (2/4)·[ −(2/2)·log2(2/2) − (0/2)·log2(0/2) ]
+ (2/4)·[ −(1/2)·log2(1/2) − (1/2)·log2(1/2) ]
= 0.5

Info Gain(Interactiveness) = 0.811 − 0.5 = 0.311

Split Info(T, Interactiveness) = − (2/4)·log2(2/4) − (2/4)·log2(2/4) = 1

Gain Ratio(Interactiveness) = 0.311 / 1 = 0.311

Practical Knowledge

Entropy Info(T, Practical Knowledge) = (2/4)·[ −(2/2)·log2(2/2) − (0/2)·log2(0/2) ]
+ (1/4)·[ −(0/1)·log2(0/1) − (1/1)·log2(1/1) ]
+ (1/4)·[ −(1/1)·log2(1/1) − (0/1)·log2(0/1) ]
= 0

Info Gain(Practical Knowledge) = 0.811 − 0 = 0.811

Split Info(T, Practical Knowledge) = − (2/4)·log2(2/4) − (1/4)·log2(1/4) − (1/4)·log2(1/4) = 1.5

Gain Ratio(Practical Knowledge) = 0.811 / 1.5 = 0.540

Communication



Entropy Info(T, Communication) = (2/4)·[ −(2/2)·log2(2/2) − (0/2)·log2(0/2) ]
+ (1/4)·[ −(0/1)·log2(0/1) − (1/1)·log2(1/1) ]
+ (1/4)·[ −(1/1)·log2(1/1) − (0/1)·log2(0/1) ]
= 0

Info Gain(Communication) = 0.811 − 0 = 0.811

Split Info(T, Communication) = − (2/4)·log2(2/4) − (1/4)·log2(1/4) − (1/4)·log2(1/4) = 1.5

Gain Ratio(Communication) = 0.811 / 1.5 = 0.540

💡 If two attributes have the same maximum gain ratio, choose either one of them.

Now we have chosen the Communication Skills attribute as the node for CGPA ≥ 9, while CGPA ≥ 8 and CGPA < 8 are pure nodes, so the final tree will look like:

Random Forest
robust tree based learning technique in Machine Learning

operates by constructing multiple decision trees during the training phase.

each tree is built using a random subset of the dataset and considers a random subset of features at each partition.

Working Mechanism



1. Training Phase

a. Tree Construction - multiple decision trees are created using different subset of training
data.

b. Feature Selection - each tree uses random subset of features to make decision at each
node.

2. Prediction Phase

a. Aggregation - for classification tasks, the algorithm aggregates results from all trees
through majority voting. For regression tasks, it averages the result of all trees.
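A hedged scikit-learn sketch of this train-then-aggregate workflow (the dataset here is synthetic, generated only for illustration):

```python
# Random Forest: many trees on bootstrap samples, random feature subsets per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each built on a bootstrap sample
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))  # prediction = majority vote of trees
```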

Advantages
Reduction in Overfitting - by averaging multiple trees, the risk of overfitting is significantly
reduced.

High accuracy - random forests are known for their high accuracy in predictions (achieved
through algorithm’s unique approach by constructing multiple decision trees during training
phase)

Handling missing data - the algorithm can estimate missing data efficiently.

Scalability - It runs efficiently on larger datasets.

Disadvantages
Computational resources - requires more computational resources compared to simpler
algorithms like decision trees.

Time Consumption - takes more time to train compared to a single decision tree.

Complexity - the model can become less intuitive with an extensive collection of decision
trees, making it harder to interpret.

Bagging
Bootstrap Aggregation

Bagging involves creating different training subsets from the original data with replacement

Process

Bootstrapping - original data is arranged into bootstrap samples

Training models - individual models are trained on these samples.

Aggregation - results from these individual models are combined using majority voting
to generate the final output.
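A hedged sketch of this process with scikit-learn (synthetic data; note the base-estimator argument is named estimator in recent scikit-learn versions and base_estimator in older ones):

```python
# Bagging: bootstrap samples of the training data, one base tree per sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # model trained on each bootstrap sample
    n_estimators=25,
    bootstrap=True,                      # sample the training set with replacement
    random_state=1,
)
bagging.fit(X, y)
print(bagging.predict(X[:5]))            # final output combined by majority voting
```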

Boosting
Boosting combines weak learners into strong learners by creating sequential models. Each
model attempts to correct the errors made by the previous ones.

Weak Learner - machine learning model which performs only slightly better than random
guessing (non-zero predictive power)



Process

Sequential Training - models are trained sequentially with each new model focusing on
the errors of previous models.

weighted voting - instances misclassified by previous models are given more weight
and final prediction is made by weighted voting

for example: AdaBoost, XGBoost
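A hedged AdaBoost sketch in the same style (again on synthetic data, with decision stumps as the weak learners; the estimator argument naming caveat above applies here too):

```python
# Boosting: sequential weak learners, misclassified points get more weight each round.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=2)

boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=50,
    random_state=2,
)
boost.fit(X, y)
print(boost.predict(X[:5]))   # prediction by weighted vote of the weak learners
```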



Pattern Recognition Part-2
Bayesian Decision Theory
Bayesian Decision theory is a fundamental statistical approach to the problem of pattern classification. It is considered
as the ideal pattern classifier and often used as the benchmark for other algorithms because its decision rule
automatically minimizes its loss function.

Formal Framework for making decisions in the presence of uncertainty

combines principle of probability theory with utility theory.

Bayes theorem states that

P(A|B) = P(B|A) · P(A) / P(B)

Where A = state of the nature or the class of an entry

and B = input feature vector

P (A∣B)= called the posterior probability, it is the probability of the predicted class to be A for a given entry of feature
(B). [A is true after we see the evidence B]

P (B∣A)= class-conditional probability density function for feature. We call it likelihood of A with respect to B, a term
chosen to indicate that, other things being equal, the category (or class) for which it is large is more “likely” to be the
true category. [Probability of seeing the evidence B if A is true].

P (A)= A priori probability (or simply prior) of class A. It is usually pre-determined and depends on the external factors.
It means how probable the occurrence of class A out of all the classes. [ What we believe about A before seeing
evidence B].

P (B)= called the evidence, it is merely a scaling factor that guarantees that the posterior probabilities sum to one. It is
also called Marginal Likelihood [total probability of observing the evidence B across all possible propositions].
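A tiny numeric illustration of the theorem (the numbers are invented for the example): suppose P(A) = 0.3, P(B|A) = 0.8 and P(B|not A) = 0.2.

```python
# Bayes theorem on made-up numbers: posterior = likelihood * prior / evidence.
p_A = 0.3                     # prior P(A)
p_B_given_A = 0.8             # likelihood P(B|A)
p_B_given_notA = 0.2          # likelihood P(B|not A)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)  # evidence P(B), by total probability
p_A_given_B = p_B_given_A * p_A / p_B                 # posterior P(A|B)
print(round(p_A_given_B, 3))  # 0.632
```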

Proof
Let ω1, ω2 be the two classes to which our patterns belong.

The probability density function p(x|ωi) is sometimes referred to as the likelihood function of ωi with respect to x, where x is a random variable / feature vector and ωi is a specific class.

The pdf is used in classification problems to understand how the data points x are distributed for each class ωi.

Bayes rule:

P(ωi|x) = P(x|ωi) · P(ωi) / P(x)

where

P(x) = Σ_{i=1}^{2} P(x|ωi) · P(ωi)

The Bayes classification rule can be stated as:

If P(ω1|x) > P(ω2|x), x is classified to ω1
If P(ω1|x) < P(ω2|x), x is classified to ω2



Minimizing the Classification Error Probability

Let R1 be the region of feature space in favour of ω1, and R2 the region of feature space in favour of ω2.

Then an error is made if x ∈ R1 although it belongs to ω2, or if x ∈ R2 although it belongs to ω1:

Pe = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)

where P(·, ·) is the joint probability of the two events.

Pe = P(x ∈ R2|ω1)·P(ω1) + P(x ∈ R1|ω2)·P(ω2)
   = P(ω1) ∫_{R2} P(x|ω1) dx + P(ω2) ∫_{R1} P(x|ω2) dx

or, using Bayes rule,

Pe = ∫_{R2} P(ω1|x)·P(x) dx + ∫_{R1} P(ω2|x)·P(x) dx

It is now easy to see that the error is minimized if the regions R1 and R2 of the feature space are chosen so that

R1: P(ω1|x) > P(ω2|x)
R2: P(ω2|x) > P(ω1|x)

Indeed, since the union of the regions R1, R2 covers all the space, from the definition of a pdf we have

∫_{R1} P(ω1|x)·P(x) dx + ∫_{R2} P(ω1|x)·P(x) dx = P(ω1)

Combining the previous two equations gives

Pe = P(ω1) − ∫_{R1} ( P(ω1|x) − P(ω2|x) )·P(x) dx

This shows that the probability of error is minimized if R1 is the region of space in which P(ω1|x) > P(ω2|x); R2 then becomes the region where the reverse is true.

Bayesian Decision Theory in Parametric Generative Models


Parametric Generative Methods - goal is to model the data distributions for tasks like classification, clustering, and
density function. These methods assume that the data is generated from a known probability distribution with a set of
parameters and aim to infer these parameters to perform PR tasks.

Key Concepts

Generative Model - parametric generative methods explicitly model the joint probability distribution P(X,Y), where
X represents the features and Y the class labels.

Once P(X, Y) is estimated, it can be used to derive:

Posterior probability: P(Y|X) = P(X, Y) / P(X), for classification tasks.

Likelihood: P(X|Y), useful for evaluating the fit of the model.



Bayes Rule -

these methods often leverage Bayes’ theorem for classification

P(Y|X) = P(X|Y) · P(Y) / P(X)

Bayes decision theory’s decision rule - on the basis of Bayes’ theorem if the posterior probability is maximum
then the data point should be assigned to that class.

Parametric Model - assume that the data is generated from a distribution characterized by a set of parameters θ.

Parameters are estimated using methods like Maximum Likelihood Estimation (MLE) or Maximum a Posteriori
(MAP).

Bayesian Parameter Estimation

In Bayesian Estimation, the parameters of the model θare treated as random variables with their own probability
distributions.

The goal is to estimate the posterior distribution of the parameters P(θ|D), where D is the observed data (features).

P(θ|D) = P(D|θ) · P(θ) / P(D)

P(D|θ) - likelihood (probability of observing the data D given the parameters θ)
P(θ) - prior (encodes prior beliefs about the parameters before observing the data)
P(θ|D) - posterior (the updated belief about the parameters after observing the data)
P(D) - evidence (the marginal likelihood of the data, often a normalizing constant)

Suppose there are m classes in total (e.g. m = 2), giving datasets D1, D2, ..., Dm.

For each class there is a set of feature vectors; each class has n samples, e.g. D1 = {x_1, x_2, x_3, ..., x_n}, and the individual feature values need not be the same.

Let the parameters to be estimated be θ = {μ, σ}, i.e. the mean and variance of the class-conditional distribution (this is the parameter estimation).

Let us take some assumptions:

1st assumption: D_i gives no information about D_j (the class datasets are independent); if there are multiple parameters, θ_i = {θ_i1, θ_i2, θ_i3, ..., θ_in}.

2nd assumption: the samples within a class are independent of each other.

We have to find a distribution which fits our data D_i: for example, take the mean of the data points as an estimate of μ, and more generally choose the parameter values under which the observed data points have the maximum probability - that is the parameter estimate, the maximum over θ.

P(D_i|θ_i) = Π_{k=1}^{n_i} P(x_k|θ_i), but evaluating this product directly is not practical: the individual probabilities lie between 0 and 1, so the product becomes indistinguishably close to zero, which can lead to underflow in computer systems. Taking the natural log turns the product into a sum:

ln P(D_i|θ_i) = Σ_{k=1}^{n_i} ln P(x_k|θ_i)

Maximum Likelihood Method for Bayesian parameter estimation

i.e. take the value of the parameter that gives the maximum probability to the data:

θ_ML = argmax_θ P(D_i|θ_i)
     = argmax_θ ln P(D_i|θ_i)
⟹ θ_ML = argmax_θ Σ_{k=1}^{n_i} ln P(x_k|θ_i)
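A hedged numpy sketch of this maximum-likelihood step for a Gaussian class-conditional density (the samples are randomly generated just to show the mechanics; for a Gaussian, the maximizing parameters are the sample mean and the biased sample standard deviation):

```python
# MLE of Gaussian parameters via the log-likelihood (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
D_i = rng.normal(loc=5.0, scale=2.0, size=200)   # samples of one class (synthetic)

def log_likelihood(data, mu, sigma):
    # sum of ln P(x | theta), avoiding the underflow of the raw product
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

mu_ml = D_i.mean()      # closed-form ML estimate of the mean
sigma_ml = D_i.std()    # ML estimate of the standard deviation (divides by n, not n-1)
print(mu_ml, sigma_ml, log_likelihood(D_i, mu_ml, sigma_ml))
```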

The Curse of Dimensionality


It is a common problem in machine learning: the complexity of the model increases with the number of features, and it becomes more difficult to find a solution. High-dimensional data can also lead to overfitting.
There are two main approaches to dimensionality reduction:

1. Feature selection

2. Feature Extraction

Dimensionality Reduction
It is the process of taking data in a high dimensional space and mapping it into a new space whose dimensionality is
much smaller.

It is the technique used to reduce the number of features in a dataset while retaining as much of the important
information as possible.

Reasons:

High dimensional data impose computational challenges



High dimensionality may lead to poor generalization abilities of learning algorithm.

Can be used for interpretability of the data, for finding meaningful structure of data, and for illustration purposes.

Feature Selection
selecting a subset of the original features that are most relevant to the problem at hand.

The goal is to reduce the dimensionality of the dataset while retaining the most important features

Feature Extraction
Involves creating new features by combining or transforming the original features

The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space.

PCA is a popular technique that projects the original features onto a lower dimensional space while preserving as much
of the variance as possible.

Advantages
helps in data compression, and hence reduced storage space.

reduces computation time.

also helps remove redundant features

improved visualization

overfitting prevention

improved performance

Disadvantages
may lead to some amount of data loss

PCA tends to find linear correlations between variables, which is sometimes undesirable.

the reduced dimensions may not be easily interpretable, and it may be difficult to understand the relationship between
the original features and the reduced dimensions.

Principle Component Analysis (PCA)


One of the technique to handle curse of dimensionality in machine learning

statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. These new transformed features are called the Principal Components.

PCA can be used for both classification and regression.

PCA algo is based on some mathematical concepts like:

Variance and Covariance

Eigen values and Eigen Vectors

Terminologies
Dimensionality - number of features or variables present in the given dataset (number of columns present in the
dataset)

Correlation - how strongly two variables are related to each other (ranges from -1 to +1) : -1 if two variables are inversely
proportional to each other, +1 if two variables are directly proportional to each other.

Orthogonal - variables are not correlated to each other, correlation between them will be zero.

Eigenvectors - given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.

Covariance matrix - A matrix containing covariance between the pair of variables is called the Covariance matrix.

Algorithm
1. Dataset Representation

features eg1 eg2 ………………. egN



X1 X11 X12 ………………. X1N

X2 X21 X22 ………………. X2N

. . . . .
. . . . .
. . . . .

Xn Xn1 Xn2 ………………. XnN

2. Compute the mean of each variable

x̄_i = (1/N) · (x_i1 + x_i2 + .... + x_iN)
3. Calculate the covariance matrix

a. Find the covariance of all ordered pairs (X_i, X_j). If there are n features you can create n^2 pairs, e.g. for features (x, y): (x,x), (x,y), (y,x), (y,y).

Cov(X_i, X_j) = (1/(N−1)) · Σ_{k=1}^{N} (X_ik − x̄_i)(X_jk − x̄_j)

b. Construct the n×n matrix S, called the covariance matrix:

S = [ cov(x_1, x_1)  cov(x_1, x_2)  ...  cov(x_1, x_n) ]
    [ ...                                              ]
    [ cov(x_n, x_1)  cov(x_n, x_2)  ...  cov(x_n, x_n) ]

4. Calculate the eigenvalues and normalized eigenvectors

a. To find the eigenvalues, solve det(S − λI) = 0; this gives n roots λ_1 > λ_2 > ... > λ_n, the eigenvalues.

b. The corresponding eigenvector is a vector U = (u_1, u_2, ..., u_n)^T, an n×1 matrix such that (S − λI)·U = 0.

c. Normalize the eigenvector:

e_i = U_i / ‖U_i‖, where ‖U_i‖ = sqrt(u_1^2 + u_2^2 + .... + u_n^2)

💡 The unit eigenvector corresponding to the largest eigenvalue is the first principal component.

5. Derive the new dataset

feature  eg1  eg2  ...  egN
PC1      P11  P12  ...  P1N
PC2      P21  P22  ...  P2N
...
PCn      Pn1  Pn2  ...  PnN

such that

P_ij = e_i^T · ( X_1j − x̄_1, X_2j − x̄_2, ..., X_nj − x̄_n )^T

Given the following data use PCA to reduce the dimension from two to one
feature eg1 eg2 eg3 eg4

x 4 8 13 7

y 11 4 5 14

Solution
Step 1. Data Representation

feature eg1 eg2 eg3 eg4

x 4 8 13 7

y 11 4 5 14

(n) No of features = 2
(N) no of samples = 4
Step 2. Find the mean of the two features x and y

x̄ = (1/4)·(4 + 8 + 13 + 7) = 8
ȳ = (1/4)·(11 + 4 + 5 + 14) = 8.5

Step 3. Computation of the covariance matrix

a. Find the covariance of all ordered pairs:

cov(X_i, X_j) = (1/(N−1)) · Σ_{k=1}^{N} (X_ik − x̄_i)(X_jk − x̄_j)

cov(x, x) = (1/3)·[(4−8)^2 + (8−8)^2 + (13−8)^2 + (7−8)^2] = (1/3)·(16 + 0 + 25 + 1) = 42/3 = 14

cov(x, y) = (1/3)·[(4−8)(11−8.5) + (8−8)(4−8.5) + (13−8)(5−8.5) + (7−8)(14−8.5)]
          = (1/3)·(−10 + 0 − 17.5 − 5.5) = −33/3 = −11

cov(y, x) = cov(x, y) = −11

cov(y, y) = (1/3)·[(11−8.5)^2 + (4−8.5)^2 + (5−8.5)^2 + (14−8.5)^2] = (1/3)·(6.25 + 20.25 + 12.25 + 30.25) = 69/3 = 23

b. Construct the covariance matrix (n×n = 2×2):

S = [ cov(x,x)  cov(x,y) ] = [  14  −11 ]
    [ cov(y,x)  cov(y,y) ]   [ −11   23 ]

Step 4. Compute the eigenvalues, eigenvectors, and normalized eigenvectors.

1. Eigenvalues: det(S − λI) = 0

S − λI = [ 14−λ   −11  ]
         [ −11   23−λ ]

(14 − λ)(23 − λ) − (−11)(−11) = 0
λ² − 37λ + 201 = 0
λ = 30.3849, 6.6151

λ1 = 30.3849 ⟹ first principal component
λ2 = 6.6151

2. Eigenvector of λ1: (S − λ1·I)·U1 = 0

[ 14−λ1   −11  ] [ u1 ]   [ 0 ]
[ −11    23−λ1 ] [ u2 ] = [ 0 ]

(14 − λ1)·u1 − 11·u2 = 0   → (1)
−11·u1 + (23 − λ1)·u2 = 0  → (2)

Taking the first equation: u1 / 11 = u2 / (14 − λ1) = t

When t = 1: u1 = 11, u2 = 14 − λ1

U1 = [ 11, 14 − 30.3849 ]^T = [ 11, −16.384 ]^T

3. Eigenvector of λ2: (S − λ2·I)·U2 = 0

(14 − λ2)·u1 − 11·u2 = 0   → (1)
−11·u1 + (23 − λ2)·u2 = 0  → (2)

Taking the first equation: u1 / 11 = u2 / (14 − λ2) = t

When t = 1: u1 = 11, u2 = 14 − λ2

U2 = [ 11, 14 − 6.6151 ]^T = [ 11, 7.3849 ]^T

4. Normalize the eigenvector U1:

‖U1‖ = sqrt(11^2 + (−16.384)^2) = 19.7341

e1 = [ 11/19.7341, −16.384/19.7341 ]^T = [ 0.5574, −0.8302 ]^T

5. Normalize the eigenvector U2:

‖U2‖ = sqrt(11^2 + 7.3849^2) = 13.249

e2 = [ 11/13.249, 7.3849/13.249 ]^T = [ 0.8302, 0.5574 ]^T

Step 5. Derive the new dataset

P11 = [0.5574  −0.8302] · [ (4 − 8), (11 − 8.5) ]^T = [0.5574  −0.8302] · [ −4, 2.5 ]^T = −4.3051
P12 = [0.5574  −0.8302] · [ (8 − 8), (4 − 8.5) ]^T = [0.5574  −0.8302] · [ 0, −4.5 ]^T = 3.7359
P13 = [0.5574  −0.8302] · [ (13 − 8), (5 − 8.5) ]^T = [0.5574  −0.8302] · [ 5, −3.5 ]^T = 5.6927
P14 = [0.5574  −0.8302] · [ (7 − 8), (14 − 8.5) ]^T = [0.5574  −0.8302] · [ −1, 5.5 ]^T = −5.1235

eg1 eg2 eg3 eg4

X1 4 8 13 7

X2 11 4 5 14

First PC -4.3051 3.7359 5.6927 -5.1235
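The worked example can be cross-checked with numpy (a hedged sketch; note the sign of a principal component is arbitrary, so the projection may come out with its sign flipped):

```python
# Reproduce the PCA worked example with numpy.
import numpy as np

X = np.array([[4, 8, 13, 7],     # feature x
              [11, 4, 5, 14]],   # feature y
             dtype=float)

S = np.cov(X)                          # covariance matrix, here [[14, -11], [-11, 23]]
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
e1 = eigvecs[:, np.argmax(eigvals)]    # unit eigenvector of the largest eigenvalue

centered = X - X.mean(axis=1, keepdims=True)
first_pc = e1 @ centered               # projections, ~ +/-[-4.305, 3.736, 5.693, -5.124]
print(np.round(first_pc, 4))
```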

Support Vector Machines (SVM)


Support Vector Machine or SVM is one of the most popular Supervised Learning Algorithms , which is used for
classification and regression problems.

however, primarily used for classification problems in machine learning.

The goal is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.

This best decision boundary is called a hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset which means if there are 2 features
then the hyperplane will be a straight line.

And if there are 3 features, then hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum distance between the data
points.

The datapoints lying near to hyperplane are called support vectors.

The perpendicular distance between the support vectors to the hyperplane is called margin.

If the datapoint’s margin is minimum it has a chance to be placed into another class.

SVM is an optimization problem.

If the data points can be separated perfectly by a single hyperplane with no misclassifications, the margin is called a hard margin; if some violations have to be tolerated, it is called a soft margin.

Linear SVM - Linear SVM is used for linearly separable data which means if a dataset can be classified into two classes
by using a single straight line, then such data is termed as linearly separable data, and classifier is used called as
Linear SVM classifier.

Non-linear SVM - Non-linear SVM is used for non-linearly separated data, which means if the dataset cannot be
classified by using a straight line, then such data is termed as non-linear data and classifier used is called as Non-
linear SVM classifier.



Mathematical Formulation of SVM problem
The SVM problem is the problem of finding the equation of the SVM given linearly separable data of two classes (+ve and −ve).

Suppose the dataset has N points (x1, y1), (x2, y2), ..., (xN, yN), where x1, x2, ..., xN are the feature vectors and y1, y2, ..., yN are the class labels, with yi = +1 or −1.

The hyperplane is of the form ω·x + b = 0
Negative samples can be expressed as ω·x + b < 0 for y = −1
Positive samples can be expressed as ω·x + b > 0 for y = +1

Two parallel separating hyperplanes can be described as

ω·x + b = −1
ω·x + b = +1

The aim is to find the best (optimal) hyperplane. The best separating hyperplane lies halfway between the two parallel hyperplanes; in other words, we need to maximize the margin.

The distance of a point x = (x1, x2, ..., xn) from a hyperplane α0 + α·x = 0 (general form), i.e. b + ω·x = 0, is

|α0 + α·x| / ‖α‖, i.e. |b + ω·x| / ‖ω‖

So, to find the perpendicular distance from the origin to the two parallel hyperplanes, do the following:



Let the distance between the origin and the first hyperplane be P, and the distance between the origin and the second hyperplane be Q; then the distance between the two hyperplanes is D = Q − P.

P ⟹ ω·x + b = −1 ⟹ ω·x + b + 1 = 0
Q ⟹ ω·x + b = +1 ⟹ ω·x + b − 1 = 0

Distance from origin to P = |b + 1| / ‖ω‖
Distance from origin to Q = |b − 1| / ‖ω‖

Now the distance between the two parallel hyperplanes:

D = Q − P = |b − 1|/‖ω‖ − |b + 1|/‖ω‖ = |b − 1 − b − 1| / ‖ω‖ = |−2| / ‖ω‖

D = 2 / ‖ω‖

We need to maximize D; to maximize D, minimize ‖ω‖:

max 2/‖ω‖ ⟹ min ‖ω‖ ⟹ min ‖ω‖² / 2

Final Definition

We can define the SVM problem as an optimization problem:

Find a vector ω and a number b which minimize ‖ω‖² / 2
subject to y_i(ω·x_i − b) ≥ 1 for i = 1, ..., N

The solution to the SVM problem is a classifier known as the SVM classifier.

Let ω = ω* and b = b* be a solution of the SVM problem, and let x be an unclassified data instance.



⟹ Assign the class label +1 to x if ω*·x − b* > 0
⟹ Assign the class label −1 to x if ω*·x − b* < 0

SVM Classifier
Using SVM algorithm find the hyperplane with maximum margin for the following data

x1 x2 class

2 1 +1

4 3 -1

N = 2
x1 = (2, 1), y1 = +1
x2 = (4, 3), y2 = −1

f(x) = ω·x − b

Reasons for using this form (it keeps the formulation and the interpretation of the boundaries simple):
1. b can be +ve or −ve
2. consistency with the classification boundaries
3. it makes the optimization problem easier

Find ω and b.

α = (α1, α2), since there are two examples, subject to the conditions:
1. the first example belongs to +1 ⟹ +α1
2. the second example belongs to −1 ⟹ −α2
α1 − α2 = 0, which means α1 = α2
α1 > 0 and α2 > 0

Therefore the (dual) objective is

φ(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j (x_i · x_j)

= (α1 + α2) − (1/2)[ α1·α1·y1·y1·(x1·x1) + α1·α2·y1·y2·(x1·x2) + α2·α1·y2·y1·(x2·x1) + α2·α2·y2·y2·(x2·x2) ]

= (α1 + α2) − (1/2)[ α1²·(1)(1)·(2·2 + 1·1) + α1α2·(1)(−1)·(2·4 + 1·3) + α2α1·(−1)(1)·(4·2 + 3·1) + α2²·(−1)(−1)·(4·4 + 3·3) ]

= (α1 + α2) − (1/2)[ 5α1² − 22α1α2 + 25α2² ]

Find the values of α1 and α2 which maximize this expression. Assume α1 = α2:

φ(α) = 2α1 − (1/2)[ 5α1² − 22α1² + 25α1² ] = 2α1 − 4α1²

For φ to be maximum we must have

dφ/dα1 = 2 − 8α1 = 0  ⟹  α1 = 1/4  (and α2 = 1/4)

Step 2. Compute the weight vector:

ω = Σ_{i=1}^{N} α_i y_i x_i = α1·y1·x1 + α2·y2·x2
  = (1/4)(+1)(2, 1) + (1/4)(−1)(4, 3)
  = (1/4)(−2, −2)
ω = (−1/2, −1/2)

b = (1/2)·[ min_{i: yi=+1}(ω·x_i) + max_{i: yi=−1}(ω·x_i) ]
  = (1/2)·[ (ω·x1) + (ω·x2) ]
  = (1/2)·[ (−1/2·2 − 1/2·1) + (−1/2·4 − 1/2·3) ]
  = (1/2)·(−10/2)
b = −5/2

The SVM Classifier:

f(x) = ω·x − b
     = (−1/2, −1/2)·(x_1, x_2) − (−5/2)
     = −(1/2)x_1 − (1/2)x_2 + 5/2
     = −(1/2)(x_1 + x_2 − 5)

The equation of the maximum-margin hyperplane is f(x) = 0:

−(1/2)(x_1 + x_2 − 5) = 0
x_1 + x_2 − 5 = 0

As a check, this hyperplane passes through the midpoint of the two support vectors: mean = ((2+4)/2, (1+3)/2) = (3, 2).
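As a hedged cross-check (not part of the original worked solution), a linear SVM fitted with scikit-learn on the same two points should recover an equivalent hyperplane; note that sklearn uses the convention f(x) = w·x + intercept, so its intercept corresponds to −b in the notation above.

```python
# Cross-check the hand-worked SVM with scikit-learn's linear SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 1.0], [4.0, 3.0]])
y = np.array([+1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard-margin SVM
clf.fit(X, y)

print(clf.coef_)       # expected ~ [[-0.5, -0.5]]
print(clf.intercept_)  # expected ~ [2.5], i.e. the hyperplane x1 + x2 - 5 = 0
```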

Fisher Linear Discriminant Analysis


Dimensionality Reduction Technique

Commonly used for supervised classification problems.

used to project the features in higher dimension space into a lower dimension space.

Two criteria are used by LDA to create a new axis



Maximize the distance between means of two classes

Minimize the variation within each class

Algorithm Steps
1. Compute the class means:

   μ_1 = (1/N_1) Σ_{x∈ω1} x   (and similarly μ_2)

2. Derive the covariance matrix of each class:

   S_1 = (1/(N_1 − 1)) Σ_{x∈ω1} (x − μ_1)(x − μ_1)^T   (and similarly S_2)

3. Compute the within-class scatter matrix:

   S_w = S_1 + S_2

4. Compute the between-class scatter matrix:

   S_β = (μ_1 − μ_2)(μ_1 − μ_2)^T

5. Compute the eigenvalues and eigenvectors from the within-class (S_w) and between-class (S_β) scatter matrices:

   S_w^{−1} S_β ω = λω

6. Sort the eigenvalues and select the top k values.

7. Find the eigenvectors corresponding to the top k eigenvalues.

8. Obtain the LDA projection by taking the dot product of the eigenvectors and the original data:

   (S_w^{−1} S_β − λI) (ω_1, ω_2)^T = 0

Compute the Linear Discriminant projection for the following 2D dataset:

Samples for class ω1: X1 = (x1, x2) = {(4, 2), (2, 4), (2, 3), (3, 6), (4, 4)}
Samples for class ω2: X2 = (x1, x2) = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}

Solution

The class means are:

μ_1 = (1/5)·[ (4,2) + (2,4) + (2,3) + (3,6) + (4,4) ] = (3, 3.8)
μ_2 = (1/5)·[ (9,10) + (6,8) + (9,5) + (8,7) + (10,8) ] = (8.4, 7.6)

Covariance matrix of the first class:

S_1 = (1/(N−1)) · Σ_{x∈ω1} (x − μ_1)(x − μ_1)^T, summed over the samples (4,2), (2,4), (2,3), (3,6), (4,4)

S_1 = [  1     −0.25 ]
      [ −0.25   2.2  ]

Covariance matrix of the second class:

S_2 = (1/(N−1)) · Σ_{x∈ω2} (x − μ_2)(x − μ_2)^T, summed over the samples (9,10), (6,8), (9,5), (8,7), (10,8)

S_2 = [  2.3   −0.05 ]
      [ −0.05   3.3  ]

Within-class scatter matrix:

S_w = S_1 + S_2 = [  1     −0.25 ] + [  2.3   −0.05 ] = [  3.3  −0.3 ]
                  [ −0.25   2.2  ]   [ −0.05   3.3  ]   [ −0.3   5.5 ]

Between-class scatter matrix:

S_β = (μ_1 − μ_2)(μ_1 − μ_2)^T = [ −5.4 ] · [ −5.4  −3.8 ] = [ 29.16  20.52 ]
                                 [ −3.8 ]                    [ 20.52  14.44 ]

Find the eigenvalues:

S_w^{−1} S_β ω = λω  ⟹  det(S_w^{−1} S_β − λI) = 0

S_w^{−1} = Adj(S_w) / |S_w| = [ 0.3045  0.0166 ]
                              [ 0.0166  0.1827 ]

S_w^{−1} S_β = [ 0.3045  0.0166 ] [ 29.16  20.52 ] = [ 9.2213  6.489  ]
               [ 0.0166  0.1827 ] [ 20.52  14.44 ]   [ 4.2339  2.9794 ]

det(S_w^{−1} S_β − λI) = 0
⟹ λ² − 12.2007λ = 0
⟹ λ(λ − 12.2007) = 0
⟹ λ_1 = 0, λ_2 = 12.2007



Find the eigenvector from

$$
\left(S_W^{-1} S_B - \lambda I\right) \begin{pmatrix} \omega_1 \\ \omega_2 \end{pmatrix} = 0
$$

Alternatively, use the direct method below; with it you do not have to find eigenvalues and eigenvectors, as it gives the linear discriminant directly:

$$
\omega^{*} = S_W^{-1}(\mu_1 - \mu_2)
$$

Using the Direct Method:

$$
\omega^{*} = S_W^{-1}(\mu_1 - \mu_2)
= \begin{pmatrix} 3.3 & -0.3 \\ -0.3 & 5.5 \end{pmatrix}^{-1}
\left[\begin{pmatrix}3\\3.8\end{pmatrix} - \begin{pmatrix}8.4\\7.6\end{pmatrix}\right]
$$

$$
= \begin{pmatrix} 0.3045 & 0.0166 \\ 0.0166 & 0.1827 \end{pmatrix}
\begin{pmatrix} -5.4 \\ -3.8 \end{pmatrix}
= \begin{pmatrix} -1.7074 \\ -0.7839 \end{pmatrix}
$$

Dividing by the norm $\sqrt{1.7074^2 + 0.7839^2} \approx 1.8787$ and flipping the (arbitrary) sign of the direction gives

$$
\omega^{*} = \begin{pmatrix} 0.9088 \\ 0.4173 \end{pmatrix}
$$

Project the data points onto $\omega^{*}$ (an ordinary matrix multiplication / dot product) to obtain the first linear discriminant.
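A minimal NumPy sketch of the direct method for this dataset. The data and formulas come from the example above; the normalization step and variable names are illustrative choices.

```python
import numpy as np

# Dataset from the worked example above
X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # (3, 3.8) and (8.4, 7.6)

# Class covariances (divided by N-1, as in the notes) and within-class scatter
S1 = np.cov(X1, rowvar=False)                  # [[1, -0.25], [-0.25, 2.2]]
S2 = np.cov(X2, rowvar=False)                  # [[2.3, -0.05], [-0.05, 3.3]]
Sw = S1 + S2                                   # [[3.3, -0.3], [-0.3, 5.5]]

# Direct method: w* = Sw^{-1} (mu1 - mu2), then normalise to unit length
w = np.linalg.solve(Sw, mu1 - mu2)
w_unit = w / np.linalg.norm(w)                 # ~[-0.909, -0.417]; sign is arbitrary

# First linear discriminant: project every point onto w*
z1, z2 = X1 @ w_unit, X2 @ w_unit
print(w_unit, z1, z2)
```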

Unsupervised Learning
a type of machine learning in which models are trained on an unlabeled dataset and are allowed to act on that data without any supervision.

It cannot be applied directly to a regression or classification problem, because we have input data but no corresponding output data.



The goal is to find the underlying structure of the dataset, group the data according to similarities, and represent it in a compressed format.

Why do we use unsupervised learning?

It is helpful for finding useful insights from the data.

It is close to how a human learns to think from their own experience, which brings it closer to true AI.

In the real world we do not always have input data with corresponding outputs; to solve such cases we use unsupervised learning.

Types of Unsupervised Learning

Clustering - a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with objects of another group.

Association - an unsupervised learning method used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategies more effective.

Clustering
a method of grouping objects into clusters so that similar objects fall in the same group and dissimilar objects fall in different groups.

After applying a clustering technique, each cluster is assigned a cluster ID.

Applications

Market segmentation

Image Segmentation

Social Network Analysis

Recommendation systems (e.g., Netflix recommends movies and web series based on watch history)

Types of Clustering

Hard Clustering - each input data point either belongs to a cluster completely or not at all.

Soft Clustering - In this, instead of putting each input data point into a separate cluster, a probability or likelihood of
that data point being in those clusters is assigned. (an item can exist in multiple clusters)

Types of Clustering Algorithms

Partitioning Clustering

Density Based Clustering

Hierarchical Clustering

Fuzzy Clustering, etc.

Partitioning Clustering

divides data into non-hierarchical groups

also known as centroid based method

for example, the K-means clustering algorithm

These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid (cluster center) of the clusters.

K-means Clustering



Question

Suppose that the data mining task is to cluster points into three clusters
where the points are

A1 (2,10)

A2 (2,5)

A3 (8,4)

B1 (5,8)

B2 (7,5)

B3 (6,4)

C1 (1,2)

C2 (4,9)

The distance function is Euclidean distance.

Suppose initially we assign A1, B1, and C1 as the centers of the three clusters, respectively.

Answer

Initial centroids:

A1 ⇒ (2, 10)

B1 ⇒ (5, 8)

C1 ⇒ (1, 2)

Iteration 1 - assigning each point to its nearest centroid gives the clusters {A1}, {A3, B1, B2, B3, C2}, {A2, C1}.

New centroids: (2, 10), (6, 6), (1.5, 3.5)

Iteration 2 - reassigning to the new centroids gives the clusters {A1, C2}, {A3, B1, B2, B3}, {A2, C1}.

New centroids: (3, 9.5), (6.5, 5.25), (1.5, 3.5)

Iteration 3 - reassigning again gives the clusters {A1, B1, C2}, {A3, B2, B3}, {A2, C1}.

New centroids: (3.67, 9), (7, 4.33), (1.5, 3.5)

Since the next iteration leaves the assignments unchanged, the algorithm has converged:

1st Cluster ⇒ A1, B1, C2

2nd Cluster ⇒ A3, B2, B3

3rd Cluster ⇒ A2, C1
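A short NumPy sketch of the same K-means run, assuming the initial centroids A1, B1, C1 given in the question; the loop bound and variable names are illustrative, not part of the original exercise.

```python
import numpy as np

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
X = np.array(list(points.values()), dtype=float)
names = list(points.keys())

# Initial centroids A1, B1, C1 as stated in the question
centroids = np.array([points["A1"], points["B1"], points["C1"]], dtype=float)

for _ in range(10):                     # a few iterations are enough here
    # Assign each point to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

for k in range(3):
    print(f"Cluster {k + 1}:", [names[i] for i in range(len(X)) if labels[i] == k])
print(centroids)   # ~[[3.67, 9.0], [7.0, 4.33], [1.5, 3.5]]
```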

Challenges with K means clustering



Finding the optimal k value, especially for noisy data. The appropriate value of k depends on the data structure and the problem being solved; choosing it well matters, since a value that is too small under-clusters the data and a value that is too large over-clusters it.

Clusters of different sizes and densities are handled poorly, because K-means tends to produce clusters of roughly similar spatial extent.

LLE and Perceptrons


Locally Linear Embedding (LLE) is a nonlinear dimensionality reduction technique.

It helps in reducing the dimensionality of data while preserving its local neighborhood structure, making it easier to
identify patterns in high-dimensional datasets

Algorithm

Input Data - Suppose we have a dataset with N data points, each having D dimensions. Let the dataset be represented as a matrix X, where each row is a data point: $X = \{x_1, x_2, \ldots, x_N\}$

Find Nearest Neighbors - For each data point xi , find its K nearest neighbors. This can be done using various

distance metrics like Euclidean Distance.

Compute Reconstruction Weights - For each data point $x_i$, compute the weights $W_{ij}$ that create the best linear combination of its nearest neighbors. The weights are found by minimizing the reconstruction error:

$$
\text{Reconstruction Error} = \sum_i \left\| x_i - \sum_j W_{ij} x_j \right\|^2
$$

Compute Low Dimensional Embedding - Find the low dimensional representation Y that best preserves the local
relationships defined by the weights W . This is done by minimizing the embedding cost function

$$
\phi(Y) = \sum_i \left\| y_i - \sum_j W_{ij} y_j \right\|^2
$$
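A hedged sketch of LLE using scikit-learn's LocallyLinearEmbedding; the swiss-roll data, the choice of 12 neighbors, and the 2 output dimensions are illustrative assumptions, not part of the notes above.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic 3-D data lying on a 2-D manifold (illustrative choice)
X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# K nearest neighbours per point and the target dimensionality are the
# two main knobs of the algorithm described above
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)            # low-dimensional embedding (1000 x 2)

print(Y.shape)
print(lle.reconstruction_error_)    # embedding cost reported by scikit-learn
```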

Advantages of Using LLE over PCA


LLE is a non-linear dimensionality reduction technique whereas PCA is a linear dimensionality reduction technique.

LLE preserves local relationships whereas PCA preserves the global structure of data.

LLE is more robust to noise compared with PCA.

Applications are-

Face recognition

Speech processing

Image segmentation

Perceptron
Perceptron is a linear, supervised machine learning algorithm.

It is used for binary classification.

It is particularly significant in the field of pattern recognition due to its ability to classify data points into different
categories based on their features.

The perceptron is based on the idea of linear separability.

Perceptron Learning Algorithm


Perceptron - a type of artificial neuron used in machine learning for binary classification tasks.

serves as the building block for many complex models.

The main idea behind a perceptron is to classify input data into one of two categories based on its features.

Key components of Perceptron



Inputs

The perceptron receives multiple input values, which can be features of the data. For example, if you are classifying
emails as spam or not spam, the inputs could be features like the number of links, the presence of certain keywords
etc.

Weights

Each input is associated with a weight, which determines the importance of that input in the classification process.
Weights are adjusted during the learning process to improve the model’s accuracy.

Bias

A bias term is added to the weighted sum of the inputs to allow the model to fit the data better. It acts as an additional parameter that helps shift the decision boundary.

Activation Function

The perceptron uses an activation function to determine the output based on the weighted sum of inputs. The most
common activation function for a Perceptron is the step function, which outputs either 0 or 1.

How the Perceptron Works?


1. Weighted Sum Calculation

a. For a given input vector $X = [x_1, x_2, \ldots, x_n]$ and corresponding weights $W = [\omega_1, \omega_2, \ldots, \omega_n]$, the Perceptron calculates the weighted sum:

$$
z = \sum_{i=1}^{n} \omega_i x_i + b
$$

b. Here, $b$ is the bias term.

2. Activation Function

a. The output $y$ of the Perceptron is determined by applying the activation function to the weighted sum:

$$
y = f(z) =
\begin{cases}
1 & \text{if } z \geq 0 \\
0 & \text{if } z < 0
\end{cases}
$$

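The components above fit in a few lines of Python. This is a minimal sketch, assuming a step activation, the classic perceptron update rule, and the AND function as a toy dataset; all of these are illustrative choices, not taken from the notes.

```python
import numpy as np

class Perceptron:
    """Minimal perceptron with a step activation, as described above."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)   # weights, one per input feature
        self.b = 0.0                    # bias term
        self.lr = lr                    # learning rate

    def predict(self, x):
        z = np.dot(self.w, x) + self.b  # weighted sum
        return 1 if z >= 0 else 0       # step activation

    def fit(self, X, y, epochs=20):
        for _ in range(epochs):
            for xi, target in zip(X, y):
                error = target - self.predict(xi)
                # Perceptron update rule: nudge w and b toward the correct side
                self.w += self.lr * error * xi
                self.b += self.lr * error
        return self

# Example: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
p = Perceptron(n_features=2).fit(X, y)
print([p.predict(xi) for xi in X])      # [0, 0, 0, 1]
```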
KNN Algorithm
K Nearest Neighbor Algorithm

a supervised machine learning algorithm

used for classification and regression problems.

It assumes similarity between the new data and the available cases and puts the new case into the category that is most similar to the available categories.

When new data appears, it can be easily classified into a well-suited category using the KNN algorithm.

It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on it.

Why do we need KNN?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories does it belong to?

To solve this type of problem, we need the KNN algorithm.

With the help of KNN we can easily identify the category or class of a particular data point.

How does KNN Works?

1. Select the number K of the neighbors

2. Calculate the Euclidean distance of K number of neighbors

3. Take the K nearest neighbors as per the calculated Euclidean distance

4. Among these k neighbors, count the number of data points in each category

5. Assign the new data point to the category with the maximum number of neighbors (a short sketch of these steps follows below)
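A minimal sketch of these five steps in NumPy; the toy dataset, the categories "A"/"B", and the helper name knn_predict are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Steps 2-3: Euclidean distances to all training points, keep the k nearest
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: count labels among the k neighbours and pick the majority class
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset: two categories A and B
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 6], [6, 7]], dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))  # 'A'
print(knn_predict(X_train, y_train, np.array([6.0, 6.0]), k=3))  # 'B'
```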



How to select the value of K in the KNN Algorithm?

1. There is no particular way to determine the best value for K, so we need to try several values to find the best one. The most commonly preferred value is 5.

2. A very low value of K, such as 1 or 2, can be noisy and makes the model sensitive to outliers.

3. Larger values of K give smoother decision boundaries, but they can blur the distinction between classes and increase computation.

Advantages

simple implementation

robust to noisy training data

more effective if training data is large.

Disadvantages

the value of K must be determined, which can sometimes be complex.

high computation cost (calculating the distance between data points for all the training samples)

Mean Shift Algorithm


non parametric, density-based clustering algorithm that can be used to identify clusters in a dataset.

It is particularly useful for datasets where the clusters have arbitrary shapes and are not well separated by linear
boundaries.

Mode-seeking algorithm

Widely used in image processing and computer vision

Unlike K-means, mean shift does not need the number of clusters specified in advance; the number of clusters is determined from the data.

The main idea is to shift each data point towards the mode (the region of highest density) of the distribution of points within a certain radius.

The algorithm iteratively performs these shifts until the points converge to a local maximum of the density function.

local maxima ⇒ clusters in data

Why?

Identifies Data Structure - it helps uncover the natural structure in data without assuming specific parameters like
the number or shapes of clusters.

Focus on Dense Areas- It prioritizes regions with data points densely packed.

It handles noisy data effectively, minimizing the influence of outliers.

Algorithm

1. Start - Treat each data point as a potential cluster center.

2. Measure Density - define a radius around each point and calculate the average position of the points within that
radius.

3. Shift points - Move each point toward the average position (dense region)

4. Repeat Until Stable - Keep repeating until movement is very small, meaning the points have settled around dense
regions.
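A short sketch using scikit-learn's MeanShift, which follows the same shift-until-stable idea described in the steps above; the synthetic two-blob data and the quantile used for bandwidth estimation are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Illustrative 2-D data: two dense blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.4, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.4, size=(100, 2)),
])

# The bandwidth is the radius used when measuring local density
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)

print("number of clusters found:", len(ms.cluster_centers_))  # expected: 2
print("cluster centers:\n", ms.cluster_centers_)
```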

Kernel Density Estimation (KDE)

non-parametric way to estimate the probability density function of a dataset.

Purpose - it shows the distribution of data points by estimating their density.

How it works:

Place a smooth, bell shaped curve (kernel) over each data point

Add up these curves to create a continuous density function.

The width of the curve (bandwidth) controls the smoothness:

Small bandwidth: detailed but possibly noisy

Large bandwidth: Smoother but may miss details
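A small sketch of KDE with SciPy's gaussian_kde, illustrating the bandwidth trade-off described above; the synthetic data and the two bw_method values are illustrative choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

# 1-D sample drawn from two separated normal distributions (illustrative)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

# gaussian_kde places a Gaussian kernel on each point and sums them;
# bw_method scales the bandwidth (smaller -> more detail, larger -> smoother)
kde_narrow = gaussian_kde(data, bw_method=0.1)
kde_smooth = gaussian_kde(data, bw_method=0.5)

grid = np.linspace(-4, 9, 200)
print(kde_narrow(grid).max(), kde_smooth(grid).max())  # density estimates on a grid
```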



Applications: visualizing distributions, detecting modes, and as part of clustering algorithms.

Applications
Clustering

Use: grouping similar data points into clusters without assuming a fixed number of clusters.

How: each data point moves to its corresponding mode, and the points converging to the same mode form a cluster.

Advantages

no need to predefine the number of clusters

flexible and adaptive to data distribution.

Image Segmentation

Use: Segmenting an image into regions based on color or intensity.

How

treat pixel intensities (or colors) as data points in a high dimensional space.

Apply mean shift to cluster pixels based on their feature similarity.

Assign regions based on cluster membership

Object Tracking

Tracking objects in video frames

How

use mean shift to locate the region of highest feature density (e.g. a color histogram) within a search window

update the search window in subsequent frames to track the object.

Anomaly Detection

Identifying outliers in data

How

dense regions correspond to normal patterns.

Points far from dense regions are flagged as anomalies

Expectation Maximization Algorithm


iterative optimization technique used for finding the maximum likelihood estimates of parameters in probabilistic models,
especially when the data has missing or hidden variables.

Latent Variables - unobserved variables in statistical models that can only be inferred indirectly through their effects on
observable variables.

Likelihood - probability of observing the given data given the parameters of the model. In the EM algo, the goal is to find
the parameters that maximize the likelihood.

Log-Likelihood - the logarithm of the likelihood function, which measures how well the model fits the observed data. (The EM algorithm seeks to maximize this.)

Maximum Likelihood Estimation (MLE) - MLE is a method to estimate the parameters of a statistical model by finding
parameter values that maximize the likelihood function, which measures how well the model explains the observed data.

Convergence - the condition when the EM algorithm has reached a stable solution (the change in log-likelihood or parameter estimates falls below a threshold).

Algorithm
1. Initialization - starts with initial guesses for the parameters of the model

2. E-Step (Expectation) - compute the expected values of the latent variables using the current parameter estimates.

3. M-Step (Maximization) - update the parameters to maximize the likelihood of the observed data given the expectations
from the E-step

4. Iterate - Repeat the E and M steps until the parameters converge.
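A brief sketch of EM in practice via scikit-learn's GaussianMixture, which runs the E/M loop above for a Gaussian mixture model; the synthetic data and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Data generated from two Gaussians; the component labels are the latent variables
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(150, 2)),
    rng.normal(loc=[4, 4], scale=0.8, size=(150, 2)),
])

# GaussianMixture runs EM internally: the E-step computes responsibilities,
# the M-step re-estimates means, covariances and mixing weights
gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0).fit(X)

print("converged:", gmm.converged_)        # convergence flag after EM iterations
print("means:\n", gmm.means_)              # ~[[0, 0], [4, 4]] (order may vary)
print("avg log-likelihood:", gmm.score(X))
labels = gmm.predict(X)                    # cluster assignment for each point
```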

Applications



1. Clustering
Gaussian Mixture Models (GMM): EM is used to cluster data into groups assuming the data is generated from a mixture
of Gaussian distributions.

Automatically determines cluster assignments and estimates their parameters.

2. Image Segmentation
Pixel Classification: Groups pixels based on intensity or color into regions (e.g., separating foreground from
background).

Handles noise by modeling pixel intensity as a mixture of distributions.

3. Dimensionality Reduction
Principal Component Analysis (PCA) Variants: EM helps in probabilistic approaches to PCA, enabling reduction of data
dimensions while accounting for missing data.

Facilitates data visualization and feature extraction in high-dimensional datasets.
