Pattern Recognition
Introduction
Pattern Recognition - the scientific process of classifying objects into a number of categories or
classes.
Objects - images, signals, or measurements that need classification; the objects used for
classification are called patterns.
Types of Approaches:
Features - measurable properties of an object. Features can be crucial for classifying and
differentiating between classes.
Categorical Features - discrete categories (eg. color, type etc.) often converted into
numerical formats for machine learning.
Input Representation - feature vectors convert raw data into numerical formats that
machine learning models can use effectively.
Classification & Clustering - Algorithms like K-nearest neighbors (KNN) and Support
Vector Machines (SVM) use feature vectors to classify and group data.
Classifiers - Algorithms that categorize data points using feature vectors; various classifiers
use these vectors to make predictions.
Nearest Neighbor classification - classifies data points by the majority class of their
nearest neighbors in feature space.
Feature Extraction - Identifies and extracts relevant characteristics from the segmented
data.
Classification - The extracted features are used to classify data into categories; this involves
training a model to learn the patterns and relationships between features and classes.
Clustering - data may not have predefined labels, then these techniques are used to group
similar data points together based on their feature similarities.
Regression - regression techniques are used if the goal is to predict continuous numerical
values; the model learns the relationship between the features and the target variable.
Discriminant Analysis
Structural PR -
powerful for complex patterns and can learn from examples and adapt to new data.
Fuzzy PR -
Decision Tree
A Decision Tree, which has a hierarchical structure made up of a root node, branches, internal
nodes and leaf nodes, is a non-parametric, supervised learning approach.
Feature Selection - Decision trees can help identify the most informative features in a
dataset, aiding in feature engineering and dimensionality reduction.
Classification - By partitioning the feature space into decision regions, decision trees
can effectively classify patterns into categories.
Root Node - the initial node at the beginning of the Decision Tree, where the entire population
starts dividing based on various features / conditions.
Decision Nodes - These result from splitting the root nodes, these nodes represent the
intermediate decisions / conditions.
Binary Splits - Decision Trees use binary splits to divide data at each node into two
subsets.
Homogeneity - DTs aim to create homogeneous subgroups in each node, i.e. each leaf
node should ideally contain similar instances regarding the target variable.
Top Down Greedy Approach - DTs are created using greedy approach where each split is
chosen to maximize information gain / minimize the impurity at the current node.
Overfitting - DTs are prone to overfitting when they capture noise in the data; this occurs
when the tree is allowed to grow too deep and the model learns the training data too well.
Sensitive to Outliers - DTs are sensitive to outliers, and extreme values can influence their
construction. To handle outliers effectively, pre-processing or robust methods may be
needed.
Sensitive to Sample Size - small datasets may lead to overfitting and larger datasets may
result in complex trees. The sample size and tree depth should be balanced.
| Advantages of DT | Disadvantages of DT |
| --- | --- |
| Handle both categorical and numerical data | Can introduce bias if data is imbalanced, as the algorithm tends to favour majority classes |
Entropy - measures the impurity or disorder in the dataset.
$$\text{Entropy}(S) = -\sum_{i=1}^{m} P_i \log_2 P_i$$
A higher entropy value indicates greater uncertainty, while a lower entropy value
indicates a more homogeneous dataset. (lower the better)
Gini Impurity - (Gini Index) - measures the probability of a randomly chosen element being
incorrectly classified.
$$\text{Gini}(S) = 1 - \sum_{i=1}^{n} (P_i)^2$$
A higher Gini impurity value indicates greater uncertainty, while a lower impurity value
indicates more homogeneous dataset. (lower the better)
Information Gain - measures the reduction in entropy achieved by splitting the dataset on an attribute.
$$\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$
where S is the dataset, A is the attribute, and S_v is the subset of S that has value v for
attribute A.
A higher information gain value indicates a more informative attribute for splitting.
Gain Ratio - Adjusts information gain by taking into account the essential information of the
attribute [split info].
$$\text{Split Info}(T, A) = -\sum_{i=1}^{v} \frac{|A_i|}{|T|} \log_2 \frac{|A_i|}{|T|}$$
$$\text{Gain Ratio} = \frac{\text{Information Gain}(T, A)}{\text{Split Info}(T, A)}$$
Helps to overcome the bias of information gain towards attributes with many values
by normalizing it.
$$\text{Normalized Information Gain}(A) = \frac{\text{Information Gain}(A)}{\text{Entropy}(S)}$$
CART Algorithm
Classification and Regression Tree
supervised learning algorithm that learns from labelled data to predict unseen data.
Initial Split - choose the best feature and value to split the data
Stopping criteria - maximum depth of a tree, a minimum number of instances in each leaf
node.
Prediction - follow the tree to leaf node to get the predicted class or value.
Steps
1. Calculate the Gini Impurity for the entire dataset. This is the impurity of the root node.
2. For each input variable, calculate the Gini Impurity for all possible split points. The split point that results in the minimum Gini impurity is chosen.
3. The input data is split into two subsets based on the chosen split point, and a new node is created for each subset.
4. Steps 2 and 3 are repeated for each node until a stopping condition is met.
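A minimal Python sketch of this Gini-based split selection; the tiny dataset, attribute names and values below are illustrative assumptions, not taken from the notes:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, labels, attribute):
    """Weighted Gini impurity obtained by splitting `rows` on a categorical attribute."""
    total = len(rows)
    score = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lbl for row, lbl in zip(rows, labels) if row[attribute] == value]
        score += (len(subset) / total) * gini(subset)
    return score

# Illustrative example: pick the attribute with the minimum weighted Gini impurity.
rows = [{"Outlook": "Sunny", "Wind": "Weak"}, {"Outlook": "Rain", "Wind": "Strong"},
        {"Outlook": "Overcast", "Wind": "Weak"}, {"Outlook": "Sunny", "Wind": "Strong"}]
labels = ["No", "No", "Yes", "No"]
best = min(["Outlook", "Wind"], key=lambda a: weighted_gini(rows, labels, a))
print(best, weighted_gini(rows, labels, best))
```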
| Advantages of CART | Disadvantages of CART |
| --- | --- |
| Can handle both numerical and categorical data | Greedy algorithm (may not be optimal) |
| Can handle multi-class classification problems by using an extension called multi-class CART | Unstable - small changes in the dataset can affect the tree structure |
| Outlook | Yes | No | Total |
| --- | --- | --- | --- |
| Sunny | 2 | 3 | 5 |
| Overcast | 4 | 0 | 4 |
| Rain | 3 | 2 | 5 |

$$\text{Gini(Outlook)} = \frac{5}{14}(0.48) + \frac{4}{14}(0) + \frac{5}{14}(0.48) = 0.342$$

| Temperature | Yes | No | Total |
| --- | --- | --- | --- |
| Hot | 2 | 2 | 4 |
| Cool | 3 | 1 | 4 |
| Mild | 4 | 2 | 6 |

$$\text{Gini(Temp)} = \frac{4}{14}(0.5) + \frac{4}{14}(0.375) + \frac{6}{14}(0.44) = 0.439$$

| Humidity | Yes | No | Total |
| --- | --- | --- | --- |
| High | 3 | 4 | 7 |
| Normal | 6 | 1 | 7 |

$$\text{Gini(Humidity)} = \frac{7}{14}(0.489) + \frac{7}{14}(0.244) = 0.366$$

| Wind | Yes | No | Total |
| --- | --- | --- | --- |
| Weak | 6 | 2 | 8 |
| Strong | 3 | 3 | 6 |

$$\text{Gini(Wind)} = \frac{8}{14}(0.375) + \frac{6}{14}(0.5) = 0.428$$
When comparing the Gini values of Outlook, Temperature, Humidity and Wind, the Outlook
attribute should be chosen because it has the minimum impurity, so the root node is Outlook.
Now we split the Outlook node on its values; the new tables are:
Outlook = Sunny

| Temperature | Yes | No | Total |
| --- | --- | --- | --- |
| Hot | 0 | 2 | 2 |
| Cool | 1 | 0 | 1 |
| Mild | 1 | 1 | 2 |

$$\text{Gini(Temp)} = \frac{2}{5}(0) + \frac{1}{5}(0) + \frac{2}{5}(0.5) = 0.2$$

| Humidity | Yes | No | Total |
| --- | --- | --- | --- |
| High | 0 | 3 | 3 |
| Normal | 2 | 0 | 2 |

$$\text{Gini(Humidity)} = \frac{3}{5}(0) + \frac{2}{5}(0) = 0$$

| Wind | Yes | No | Total |
| --- | --- | --- | --- |
| Weak | 1 | 2 | 3 |
| Strong | 1 | 1 | 2 |

$$\text{Gini(Wind)} = \frac{3}{5}(0.44) + \frac{2}{5}(0.5) = 0.464$$
After comparing the attributes for choosing the right node for the sunny branch, i.e.
Temperature, Humidity and Wind the node will be Humidity since it has the minimum Gini
Impurity
💡 If every instance in a branch has the same label (all yes or all no), it is a pure node.
Outlook = Rain

| Temperature | Yes | No | Total |
| --- | --- | --- | --- |
| Mild | 1 | 2 | 3 |
| Cool | 1 | 1 | 2 |

$$\text{Gini(Temp)} = \frac{3}{5}(0.44) + \frac{2}{5}(0.5) = 0.464$$

| Humidity | Yes | No | Total |
| --- | --- | --- | --- |
| High | 1 | 1 | 2 |
| Normal | 2 | 1 | 3 |

$$\text{Gini(Humidity)} = \frac{2}{5}(0.5) + \frac{3}{5}(0.44) = 0.464$$

| Wind | Yes | No | Total |
| --- | --- | --- | --- |
| Weak | 3 | 0 | 3 |
| Strong | 0 | 2 | 2 |

$$\text{Gini(Wind)} = \frac{3}{5}(0) + \frac{2}{5}(0) = 0$$
After comparing the attributes for choosing the right node for the rain branch, i.e. Temperature,
Humidity and Wind the node will be Wind since it has the minimum Gini Impurity
ID3 Algorithm
Iterative Dichotomiser 3
Calculate Entropy - For each attribute ID3 calculates the entropy, which measures the
impurity or disorder in the dataset
Information Gain - For each attribute, ID3 computes the information gain which measures
how much uncertainty in the class label is reduced after splitting the dataset based on the
attribute.
Select the best attribute - Choose the attribute with the highest info gain. This attribute is
the best choice for splitting the dataset at this node.
Create branches - divide the dataset into subsets based on the selected attributes values,
creating branches in the decision tree.
Stopping Criteria - for each subset, check if all instances belong to the same class (leaf
node), if no attributes are left to split on (leaf node), or if the subset is empty (prune this branch).
Repeat the above steps for each subset, treating each as a new dataset; once the tree is fully
grown, assign class labels to leaf nodes based on the majority class in that subset.
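A small Python sketch of the entropy and information gain computations used by ID3. The class counts mirror the play-tennis example worked through below; the exact row ordering is an assumption made only for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)) over attribute values."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lbl for val, lbl in zip(values, labels) if val == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Play-tennis style check: 9 Yes / 5 No overall, Outlook counts as in the notes.
labels = ["Yes"] * 9 + ["No"] * 5
outlook = (["Sunny"] * 2 + ["Overcast"] * 4 + ["Rain"] * 3 +   # the 9 "Yes" rows
           ["Sunny"] * 3 + ["Rain"] * 2)                        # the 5 "No" rows
print(round(entropy(labels), 3))                     # ~0.940
print(round(information_gain(outlook, labels), 3))   # ~0.247
```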
Entropy(Outlook)
$$\text{Entropy}(S) = -\sum_{i=1}^{m} P_i \log_2 P_i$$
$$\text{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$
$$\text{Gain(Outlook)} = 0.94 - \left[\frac{5}{14}(0.97) + \frac{5}{14}(0.97) + \frac{4}{14}(0)\right] = 0.247$$
Temperature
$$\text{Entropy(Hot)} = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$$
$$\text{Entropy(Mild)} = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} = 0.918$$
$$\text{Gain(Temperature)} = 0.03$$
Humidity
$$\text{Entropy(High)} = -\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7} = 0.985$$
$$\text{Gain(Humidity)} = 0.16$$
Wind
$$\text{Entropy(Weak)} = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} = 0.811$$
$$\text{Entropy(Strong)} = 1$$
$$\text{Gain(Wind)} = 0.94 - \left[\frac{8}{14}(0.811) + \frac{6}{14}(1)\right] = 0.05$$
Since Outlook has the highest information gain, it is selected as the root node.
Outlook = Sunny
$$\text{Entropy(Sunny)} = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.52 + 0.442 = 0.962$$
Temperature
$$\text{Entropy(Hot)} = -\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2} = 0$$
$$\text{Entropy(Mild)} = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$$
$$\text{Entropy(Cool)} = -\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1} = 0$$
$$\text{Gain(Temperature)} = 0.962 - \left[\frac{2}{5}(0) + \frac{2}{5}(1) + \frac{1}{5}(0)\right] = 0.562$$
Humidity
$$\text{Entropy(High)} = -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0$$
$$\text{Entropy(Normal)} = -\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2} = 0$$
$$\text{Gain(Humidity)} = 0.962 - \left[\frac{3}{5}(0) + \frac{2}{5}(0)\right] = 0.962 - 0 = 0.962$$
Wind
$$\text{Gain(Wind)} = 0.962 - \left[\frac{3}{5}(0.917) + \frac{2}{5}(1)\right] = 0.02$$
After the second iteration we find that Humidity is the best node for Outlook = Sunny, and
Outlook = Overcast is a pure node, so we have formed two sub-branches.
Outlook = Rain
Temperature
$$\text{Entropy(Mild)} = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.917$$
$$\text{Entropy(Cool)} = 1$$
$$\text{Gain(Temp)} = 0.962 - \left[\frac{3}{5}(0.917) + \frac{2}{5}(1)\right] = 0.02$$
Humidity
$$\text{Entropy(High)} = 1$$
$$\text{Entropy(Normal)} = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918$$
$$\text{Gain(Humidity)} = 0.962 - \left[\frac{2}{5}(1) + \frac{3}{5}(0.918)\right] = 0.02$$
Wind
$$\text{Entropy(Strong)} = -\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2} = 0$$
$$\text{Entropy(Weak)} = -\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3} = 0$$
$$\text{Gain(Wind)} = 0.962 - \left[\frac{2}{5}(0) + \frac{3}{5}(0)\right] = 0.962$$
After the 3rd iteration we get that Wind is the node for the Rain branch, which completes the final
decision tree.
Disadvantages of ID3
overfitting
handling missing data - ID3 does not deal with the missing values
greedy approach
C4.5 Algorithm
An extension of the ID3 algorithm, designed to improve on ID3's shortcomings.
Advantages
Disadvantages
Computational Complexity - can be slower compared to simpler algorithms, especially with
larger datasets.
memory consumption - although more efficient than ID3, C4.5 uses significant memory and
processing power.
Steps
1. Compute the entropy information of the whole dataset based on the target attribute:
$$\text{Entropy Info}(T) = -\sum_{i=1}^{m} P_i \log_2 P_i$$
2. Next, for each attribute in the training set, compute the entropy information, split information, and gain ratio:
$$\text{Entropy Info}(T, A) = \sum_{i=1}^{v} \frac{|A_i|}{|T|}\,\text{Entropy Info}(A_i)$$
$$\text{Split Info}(T, A) = -\sum_{i=1}^{v} \frac{|A_i|}{|T|} \log_2 \frac{|A_i|}{|T|}$$
$$\text{Gain Ratio}(A) = \frac{\text{Information Gain}(A)}{\text{Split Info}(T, A)}$$
3. Choose the attribute for which the gain ratio is maximum as the best split attribute.
4. The root node is branched into subtrees, with each subtree as an outcome of the test
condition on the root node attribute.
5. Recursively apply the same operations on the subsets until the leaf nodes are reached.
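A short Python sketch of the gain ratio computation for a single categorical attribute. The CGPA-style class counts below follow the worked example that comes next; the exact row ordering is an assumption:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain Ratio = Information Gain / Split Info for one categorical attribute."""
    total = len(labels)
    remainder, split_info = 0.0, 0.0
    for v in set(values):
        subset = [lbl for val, lbl in zip(values, labels) if val == v]
        p = len(subset) / total
        remainder += p * entropy(subset)
        split_info += -p * math.log2(p)
    info_gain = entropy(labels) - remainder
    return info_gain / split_info if split_info > 0 else 0.0

# CGPA-style attribute with a Yes/No Job Offer target (class counts per value
# taken from the worked example; row ordering is assumed).
cgpa   = [">=9"] * 4 + [">=8"] * 4 + ["<8"] * 2
target = ["Yes", "Yes", "Yes", "No"] + ["Yes"] * 4 + ["No", "No"]
print(round(gain_ratio(cgpa, target), 3))   # ~0.366
```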
Training data: 10 examples with attributes CGPA, Interactiveness, Practical Knowledge, Communication Skills and target Job Offer (the full table is omitted here).
$$\text{Entropy}(T = \text{Job Offer}) = -\frac{7}{10}\log_2\frac{7}{10} - \frac{3}{10}\log_2\frac{3}{10} = 0.881$$
CGPA
$$\text{Entropy Info}(T, \text{CGPA}) = \sum_{i=1}^{v} \frac{|A_i|}{|T|}\,\text{Entropy}(A_i)$$
$$= \frac{4}{10}\left[-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right] + \frac{4}{10}\left[-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4}\right] + \frac{2}{10}\left[-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right] = 0.324$$
$$\text{Info Gain} = \text{Entropy}(T) - \text{Entropy Info}(T, \text{CGPA}) = 0.881 - 0.324 = 0.557$$
$$\text{Split Info}(T, \text{CGPA}) = -\frac{4}{10}\log_2\frac{4}{10} - \frac{4}{10}\log_2\frac{4}{10} - \frac{2}{10}\log_2\frac{2}{10} = 1.52$$
$$\text{Gain Ratio} = \frac{\text{Info Gain}}{\text{Split Info}} = \frac{0.557}{1.52} = 0.366$$
Interactiveness
$$\text{Entropy Info}(T, \text{Interactiveness}) = \frac{6}{10}\left[-\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6}\right] + \frac{4}{10}\left[-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right] = 0.789$$
$$\text{Info Gain} = \text{Entropy}(T) - \text{Entropy Info}(T, \text{Interactiveness}) = 0.881 - 0.789 = 0.092$$
$$\text{Split Info}(T, \text{Interactiveness}) = -\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10} = 0.971$$
$$\text{Gain Ratio} = \frac{\text{Info Gain}}{\text{Split Info}} = \frac{0.092}{0.971} = 0.095$$
Practical Knowledge
$$\text{Entropy Info}(T, \text{Pract. Knowledge}) = \frac{2}{10}\left[-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right] + \frac{5}{10}\left[-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5}\right] + \frac{3}{10}\left[-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right] = 0.6356$$
$$\text{Info Gain} = \text{Entropy}(T) - \text{Entropy Info}(T, \text{Pract. Knowledge}) = 0.881 - 0.6356 = 0.2454$$
$$\text{Split Info}(T, \text{Pract. Knowledge}) = -\frac{2}{10}\log_2\frac{2}{10} - \frac{5}{10}\log_2\frac{5}{10} - \frac{3}{10}\log_2\frac{3}{10} = 1.485$$
$$\text{Gain Ratio} = \frac{\text{Info Gain}}{\text{Split Info}} = \frac{0.2454}{1.485} = 0.165$$
Communication Skills
$$\text{Entropy Info}(T, \text{Communication}) = \frac{5}{10}\left[-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5}\right] + \frac{3}{10}\left[-\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3}\right] + \frac{2}{10}\left[-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right] = 0.3605$$
$$\text{Info Gain} = \text{Entropy}(T) - \text{Entropy Info}(T, \text{Communication}) = 0.881 - 0.3605 = 0.5205$$
$$\text{Split Info}(T, \text{Communication}) = -\frac{5}{10}\log_2\frac{5}{10} - \frac{3}{10}\log_2\frac{3}{10} - \frac{2}{10}\log_2\frac{2}{10} = 1.485$$
$$\text{Gain Ratio} = \frac{\text{Info Gain}}{\text{Split Info}} = \frac{0.5205}{1.485} = 0.350$$
The root node is CGPA since it has the max gain ratio
CGPA(≥9)
Subset with CGPA ≥ 9: 4 examples (3 Yes, 1 No for Job Offer); the full table is omitted here.
$$\text{Entropy}(T = \text{Job Offer}) = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} = 0.811$$
Interactiveness
$$\text{Entropy Info}(T, \text{Interactiveness}) = \frac{2}{4}\left[-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right] + \frac{2}{4}\left[-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right] = 0.5$$
$$\text{Info Gain} = \text{Entropy}(T) - \text{Entropy Info}(T, \text{Interactiveness}) = 0.811 - 0.5 = 0.311$$
$$\text{Split Info}(T, \text{Interactiveness}) = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$$
$$\text{Gain Ratio} = \frac{\text{Info Gain}}{\text{Split Info}} = \frac{0.311}{1} = 0.311$$
Practical Knowledge
$$\text{Entropy Info}(T, \text{Pract. Knowledge}) = \frac{2}{4}\left[-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right] + \frac{1}{4}\left[-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right] + \frac{1}{4}\left[-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right] = 0$$
$$\text{Info Gain} = \text{Entropy}(T) - \text{Entropy Info}(T, \text{Pract. Knowledge}) = 0.811 - 0 = 0.811$$
$$\text{Split Info}(T, \text{Pract. Knowledge}) = -\frac{2}{4}\log_2\frac{2}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4} = 1.5$$
$$\text{Gain Ratio} = \frac{\text{Info Gain}}{\text{Split Info}} = \frac{0.811}{1.5} = 0.540$$
Communication
$$\text{Entropy Info}(T, \text{Communication}) = \frac{2}{4}\left[-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right] + \frac{1}{4}\left[-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right] + \frac{1}{4}\left[-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right] = 0$$
$$\text{Info Gain} = \text{Entropy}(T) - \text{Entropy Info}(T, \text{Communication}) = 0.811 - 0 = 0.811$$
$$\text{Split Info}(T, \text{Communication}) = -\frac{2}{4}\log_2\frac{2}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4} = 1.5$$
$$\text{Gain Ratio} = \frac{\text{Info Gain}}{\text{Split Info}} = \frac{0.811}{1.5} = 0.540$$
💡 If two classes have same maximum gain ratio then choose any one of them
Now we have chosen the Communication Skills attribute as the node for CGPA ≥ 9; the ≥8 and <8
branches are pure nodes, so the final tree is complete.
Random Forest
robust tree based learning technique in Machine Learning
each tree is built using a random subset of the dataset and considers a random subset of
features at each partition.
Working Mechanism
1. Training Phase
a. Tree Construction - multiple decision trees are created using different subsets of the training
data.
b. Feature Selection - each tree uses a random subset of features to make decisions at each
node.
2. Prediction Phase
a. Aggregation - for classification tasks, the algorithm aggregates the results from all trees
through majority voting. For regression tasks, it averages the results of all trees.
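A minimal scikit-learn sketch of this build-then-aggregate workflow, assuming scikit-learn is available; the toy dataset and parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset standing in for real feature vectors and class labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees; max_features controls the random feature
# subset considered at each split (the "feature selection" step above).
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)

# Prediction phase: each tree votes and the majority class is returned.
print(model.score(X_test, y_test))
```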
Advantages
Reduction in Overfitting - by averaging multiple trees, the risk of overfitting is significantly
reduced.
High accuracy - random forests are known for their high accuracy in predictions (achieved
through algorithm’s unique approach by constructing multiple decision trees during training
phase)
Handling missing data - the algorithm can estimate missing data efficiently.
Disadvantages
Computational resources - requires more computational resources compared to simpler
algorithms like decision trees.
Time Consumption - takes more time to train compared to a single decision tree.
Complexity - the model can become less intuitive with an extensive collection of decision
trees, making it harder to interpret.
Bagging
Bootstrap Aggregation
Bagging involves creating different training subsets from the original data with replacement
Process
Aggregation - results from these individual models are combined using majority voting
to generate the final output.
Boosting
Boosting combines weak learners into strong learners by creating sequential models. Each
model attempts to correct the errors made by the previous ones.
Weak Learner - machine learning model which performs only slightly better than random
guessing (non-zero predictive power)
Sequential Training - models are trained sequentially with each new model focusing on
the errors of previous models.
weighted voting - instances misclassified by previous models are given more weight
and final prediction is made by weighted voting
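A brief sketch of boosting with scikit-learn's AdaBoost, assuming scikit-learn is available; the toy data is illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# AdaBoost trains weak learners sequentially (by default shallow decision
# stumps); misclassified instances get more weight for the next learner,
# and the final prediction is a weighted vote of all learners.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, y)
print(boosted.score(X, y))
```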
Bayes' Theorem
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
P (A∣B)= called the posterior probability, it is the probability of the predicted class to be A for a given entry of feature
(B). [A is true after we see the evidence B]
P (B∣A)= class-conditional probability density function for feature. We call it likelihood of A with respect to B, a term
chosen to indicate that, other things being equal, the category (or class) for which it is large is more “likely” to be the
true category. [Probability of seeing the evidence B if A is true].
P (A)= A priori probability (or simply prior) of class A. It is usually pre-determined and depends on the external factors.
It means how probable the occurrence of class A out of all the classes. [ What we believe about A before seeing
evidence B].
P (B)= called the evidence, it is merely a scaling factor that guarantees that the posterior probabilities sum to one. It is
also called Marginal Likelihood [total probability of observing the evidence B across all possible propositions].
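A tiny numeric illustration of the rule in Python; the prior and likelihood values are made up purely for demonstration:

```python
# Hypothetical numbers: P(A) is the prior, P(B|A) and P(B|not A) the likelihoods.
p_a = 0.3            # prior P(A)
p_b_given_a = 0.8    # likelihood P(B|A)
p_b_given_not_a = 0.2

# Evidence P(B) via total probability, then the posterior via Bayes' rule.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_b, 3), round(p_a_given_b, 3))   # 0.38, 0.632
```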
Proof
Let ω1, ω2 be the two classes to which our patterns belong.
The probability density function pdf(x | ωi) is sometimes referred to as the likelihood function of ωi with respect to x.
The pdf is used in classification problems to understand how the data points x are distributed for each class ωi.
Bayes Rule,
$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)}$$
where
$$P(x) = \sum_{i=1}^{2} P(x \mid \omega_i)\,P(\omega_i)$$
$$P_e = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2) = \int_{R_2} P(\omega_1 \mid x)\,p(x)\,dx + \int_{R_1} P(\omega_2 \mid x)\,p(x)\,dx$$
It is now easy to see that the error is minimized if the regions R1 and R2 of the feature space are chosen so that R1 : P(ω1 | x) > P(ω2 | x) and R2 : P(ω2 | x) > P(ω1 | x).
Indeed, since the union of the regions R1, R2 covers all the space, from the definition of a pdf we have
$$\int_{R_1} P(\omega_1 \mid x)\,p(x)\,dx + \int_{R_2} P(\omega_1 \mid x)\,p(x)\,dx = P(\omega_1)$$
This suggests that the probability of error is minimized if R1 is the region of space in which P(ω1 | x) > P(ω2 | x); R2 then becomes the region where the reverse holds.
Key Concepts
Generative Model - parametric generative methods explicitly model the joint probability distribution P(X,Y), where
X represents the features and Y the class labels.
Likelihood: P (X∣Y ), useful for evaluating the fit of the model.
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
Bayes decision theory’s decision rule - on the basis of Bayes’ theorem if the posterior probability is maximum
then the data point should be assigned to that class.
Parametric Model - assume that the data is generated from a distribution characterized by a set of parameters θ.
Parameters are estimated using methods like Maximum Likelihood Estimation (MLE) or Maximum a Posteriori
(MAP).
In Bayesian Estimation, the parameters of the model θare treated as random variables with their own probability
distributions.
The goal is to estimate the posterior distribution of the parameters P(θ | D), where D is the observed data
(features).
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$
P(D | θ) - Likelihood (the probability of observing the data D given the parameters θ).
P(θ) - Prior (encodes prior beliefs about the parameters before observing the data).
P(θ | D) - Posterior (the updated belief about the parameters after observing the data).
P(D) - Evidence (the marginal likelihood of the data, often a normalizing constant).
Maximum Likelihood Estimation
Let D_i = {x_1, x_2, x_3, ...., x_{n_i}} be the set of samples drawn from class ω_i, and let the parameters to be estimated be θ = {μ, σ} (mean and variance).
The likelihood of the whole sample set is the product $P(D_i \mid \theta) = \prod_{k=1}^{n_i} P(x_k \mid \theta)$; for many samples this product becomes indistinguishably close to zero, which can lead to underflow in computer systems, so the log-likelihood is used instead:
$$\ln P(D_i \mid \theta) = \sum_{k=1}^{n_i} \ln P(x_k \mid \theta)$$
$$\theta_{ML} = \arg\max_{\theta} \sum_{k=1}^{n_i} \ln P(x_k \mid \theta)$$
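A minimal NumPy sketch of maximum likelihood estimation for a univariate Gaussian, assuming synthetic data with made-up true parameters; for the Gaussian the ML estimates are simply the sample mean and (biased) standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # samples with assumed true mu=5, sigma=2

def gaussian_log_likelihood(x, mu, sigma):
    """Sum of ln P(x_k | theta) for a univariate Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Closed-form ML estimates: sample mean and sample standard deviation.
mu_ml, sigma_ml = x.mean(), x.std()
print(mu_ml, sigma_ml)

# The log-likelihood at the ML estimates is at least as high as at other parameters.
print(gaussian_log_likelihood(x, mu_ml, sigma_ml) >= gaussian_log_likelihood(x, 4.0, 2.5))
```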
Dimensionality Reduction
There are two main approaches: 1. Feature Selection and 2. Feature Extraction.
It is the process of taking data in a high dimensional space and mapping it into a new space whose dimensionality is
much smaller.
It is the technique used to reduce the number of features in a dataset while retaining as much of the important
information as possible.
Reasons:
Can be used for interpretability of the data, for finding meaningful structure of data, and for illustration purposes.
Feature Selection
selecting a subset of the original features that are most relevant to the problem at hand.
The goal is to reduce the dimensionality of the dataset while retaining the most important features
Feature Extraction
Involves creating new features by combining or transforming the original features
The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space.
PCA is a popular technique that projects the original features onto a lower dimensional space while preserving as much
of the variance as possible.
Advantages
helps in data compression, and hence reduced storage space.
improved visualization
overfitting prevention
improved performance
Disadvantages
may lead to some amount of data loss
PCA tends to find linear correlations between variables, which is sometimes undesirable.
the reduced dimensions may not be easily interpretable, and it may be difficult to understand the relationship between
the original features and the reduced dimensions.
Principal Component Analysis (PCA)
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of
uncorrelated variables. These new transformed features are called the Principal Components.
Terminologies
Dimensionality - number of features or variables present in the given dataset (number of columns present in the
dataset)
Correlation - how strongly two variables are related to each other (ranges from -1 to +1) : -1 if two variables are inversely
proportional to each other, +1 if two variables are directly proportional to each other.
Orthogonal - variables are not correlated to each other, correlation between them will be zero.
Eigen Vectors - given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a
scalar multiple of v.
Covariance matrix - A matrix containing covariance between the pair of variables is called the Covariance matrix.
Algorithm
1. Dataset Representation - arrange the data as a matrix X with n features (rows) and N samples (columns).
2. Compute the mean of every feature:
$$\bar{x}_i = \frac{1}{N}(x_{i1} + x_{i2} + .... + x_{iN})$$
3. Calculate the Covariance matrix
$$\text{Cov}(X_i, X_j) = \frac{1}{N-1}\sum_{k=1}^{N}(X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)$$
$$S = \begin{pmatrix} \text{cov}(x_1, x_1) & \cdots & \text{cov}(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \text{cov}(x_n, x_1) & \cdots & \text{cov}(x_n, x_n) \end{pmatrix}$$
4. Calculate the eigenvalues, eigenvectors, and normalized eigenvectors of S
a. To find the eigenvalues, solve det(S − λI) = 0; this gives n roots λ1, λ2, ..., λn, which are the eigenvalues.
b. For each eigenvalue λi, solve (S − λiI)U = 0 to obtain the corresponding eigenvector U = (u1, u2, ...., un)^T.
c. Normalize each eigenvector:
$$e_i = \frac{U_i}{\|U_i\|}, \quad \|U_i\| = \sqrt{u_1^2 + u_2^2 + .... + u_n^2}$$
💡 The unit eigenvector corresponding to the largest eigenvalue is the first principal component.
5. Derive the new dataset by projecting the mean-centred samples onto the unit eigenvectors:
$$P_{ij} = e_i^T \begin{pmatrix} X_{1j} - \bar{X}_1 \\ X_{2j} - \bar{X}_2 \\ \vdots \\ X_{nj} - \bar{X}_n \end{pmatrix}$$
Given the following data use PCA to reduce the dimension from two to one
| feature | eg1 | eg2 | eg3 | eg4 |
| --- | --- | --- | --- | --- |
| x | 4 | 8 | 13 | 7 |
| y | 11 | 4 | 5 | 14 |
Solution
Step 1. Data Representation
| feature | eg1 | eg2 | eg3 | eg4 |
| --- | --- | --- | --- | --- |
| x | 4 | 8 | 13 | 7 |
| y | 11 | 4 | 5 | 14 |
(n) No of features = 2
(N) no of samples = 4
Step 2. Find the mean of the two features x and y
$$\bar{x} = \frac{1}{4}(4 + 8 + 13 + 7) = 8$$
$$\bar{y} = \frac{1}{4}(11 + 4 + 5 + 14) = 8.5$$
Step 3. Calculate the Covariance matrix
$$\text{cov}(X_i, X_j) = \frac{1}{N-1}\sum_{k=1}^{N}(X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)$$
$$\text{cov}(x, x) = \frac{1}{4-1}\left[(4-8)^2 + (8-8)^2 + (13-8)^2 + (7-8)^2\right] = \frac{1}{3}(16 + 0 + 25 + 1) = \frac{42}{3} = 14$$
$$\text{cov}(x, y) = \frac{1}{4-1}\left[(4-8)(11-8.5) + (8-8)(4-8.5) + (13-8)(5-8.5) + (7-8)(14-8.5)\right] = \frac{1}{3}(-10 + 0 - 17.5 - 5.5) = \frac{-33}{3} = -11$$
$$\text{cov}(y, y) = \frac{1}{4-1}\left[(11-8.5)^2 + (4-8.5)^2 + (5-8.5)^2 + (14-8.5)^2\right] = \frac{1}{3}(6.25 + 20.25 + 12.25 + 30.25) = \frac{69}{3} = 23$$
$$S = \begin{pmatrix} 14 & -11 \\ -11 & 23 \end{pmatrix}$$
Step 4. Construct the eigenvalues, eigenvectors, and normalized eigenvectors
1. Eigenvalues:
$$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \lambda I = \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix}, \quad S - \lambda I = \begin{pmatrix} 14-\lambda & -11 \\ -11 & 23-\lambda \end{pmatrix}$$
$$\det\begin{pmatrix} 14-\lambda & -11 \\ -11 & 23-\lambda \end{pmatrix} = 0 \implies \lambda_1 = 30.3849 \ (\text{First PC}), \quad \lambda_2 = 6.6151$$
2. Eigenvector of λ1:
$$(S - \lambda_1 I)U_1 = 0 \implies \begin{pmatrix} 14-\lambda_1 & -11 \\ -11 & 23-\lambda_1 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
From the first row, u1 : u2 = 11 : (14 − λ1); taking the free parameter t = 1 gives u1 = 11 and u2 = 14 − λ1.
$$U_1 = \begin{pmatrix} 11 \\ 14 - \lambda_1 \end{pmatrix} = \begin{pmatrix} 11 \\ 14 - 30.3849 \end{pmatrix} = \begin{pmatrix} 11 \\ -16.384 \end{pmatrix}$$
3. Eigenvector of λ2:
$$\begin{pmatrix} 14-\lambda_2 & -11 \\ -11 & 23-\lambda_2 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Taking t = 1 gives u1 = 11 and u2 = 14 − λ2.
$$U_2 = \begin{pmatrix} 11 \\ 14 - \lambda_2 \end{pmatrix} = \begin{pmatrix} 11 \\ 14 - 6.6151 \end{pmatrix} = \begin{pmatrix} 11 \\ 7.3849 \end{pmatrix}$$
4. Normalized eigenvectors:
$$e_1 = \frac{1}{\sqrt{11^2 + (-16.384)^2}}\begin{pmatrix} 11 \\ -16.384 \end{pmatrix} = \frac{1}{19.7341}\begin{pmatrix} 11 \\ -16.384 \end{pmatrix} = \begin{pmatrix} 0.5574 \\ -0.8302 \end{pmatrix}$$
$$e_2 = \frac{1}{\sqrt{11^2 + (7.3849)^2}}\begin{pmatrix} 11 \\ 7.3849 \end{pmatrix} = \frac{1}{13.249}\begin{pmatrix} 11 \\ 7.3849 \end{pmatrix} = \begin{pmatrix} 0.8302 \\ 0.5574 \end{pmatrix}$$
Step 5. Project the mean-centred samples onto e1 (the first principal component):
$$P_{11} = \begin{pmatrix} 0.5574 & -0.8302 \end{pmatrix}\begin{pmatrix} 4-8 \\ 11-8.5 \end{pmatrix} = \begin{pmatrix} 0.5574 & -0.8302 \end{pmatrix}\begin{pmatrix} -4 \\ 2.5 \end{pmatrix} = -4.3051$$
$$P_{12} = \begin{pmatrix} 0.5574 & -0.8302 \end{pmatrix}\begin{pmatrix} 8-8 \\ 4-8.5 \end{pmatrix} = \begin{pmatrix} 0.5574 & -0.8302 \end{pmatrix}\begin{pmatrix} 0 \\ -4.5 \end{pmatrix} = 3.7359$$
$$P_{13} = \begin{pmatrix} 0.5574 & -0.8302 \end{pmatrix}\begin{pmatrix} 13-8 \\ 5-8.5 \end{pmatrix} = \begin{pmatrix} 0.5574 & -0.8302 \end{pmatrix}\begin{pmatrix} 5 \\ -3.5 \end{pmatrix} = 5.6927$$
$$P_{14} = \begin{pmatrix} 0.5574 & -0.8302 \end{pmatrix}\begin{pmatrix} 7-8 \\ 14-8.5 \end{pmatrix} = \begin{pmatrix} 0.5574 & -0.8302 \end{pmatrix}\begin{pmatrix} -1 \\ 5.5 \end{pmatrix} = -5.1235$$
| feature | eg1 | eg2 | eg3 | eg4 |
| --- | --- | --- | --- | --- |
| X1 | 4 | 8 | 13 | 7 |
| X2 | 11 | 4 | 5 | 14 |
| First PC | -4.3051 | 3.7359 | 5.6927 | -5.1235 |
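The same worked example can be checked with NumPy (a verification sketch only; eigenvector signs may differ from the hand calculation):

```python
import numpy as np

# The four samples from the worked example (features x and y as rows).
X = np.array([[4.0, 8.0, 13.0, 7.0],
              [11.0, 4.0, 5.0, 14.0]])

# Covariance matrix with the 1/(N-1) factor, as in Step 3.
S = np.cov(X)                      # [[14, -11], [-11, 23]]

# Eigen decomposition, as in Step 4.
eigvals, eigvecs = np.linalg.eigh(S)
print(S)
print(eigvals)                     # ~[6.615, 30.385]

# Projection onto the eigenvector of the largest eigenvalue (first PC), Step 5.
e1 = eigvecs[:, np.argmax(eigvals)]
centered = X - X.mean(axis=1, keepdims=True)
print(e1 @ centered)               # matches P11..P14 up to an overall sign
```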
Support Vector Machine (SVM)
The goal is to create the best line or decision boundary that can segregate the n-dimensional space into classes, so that we
can easily put a new data point into the correct category in the future.
The dimensions of the hyperplane depend on the features present in the dataset which means if there are 2 features
then the hyperplane will be a straight line.
We always create a hyperplane that has a maximum margin, which means the maximum distance between the data
points.
The perpendicular distance between the support vectors to the hyperplane is called margin.
If the datapoint’s margin is minimum it has a chance to be placed into another class.
If the data points can be separated by a single hyperplane without any misclassification, it is called a Hard Margin; if some misclassifications or margin violations are allowed, it is called a Soft Margin.
Linear SVM - Linear SVM is used for linearly separable data which means if a dataset can be classified into two classes
by using a single straight line, then such data is termed as linearly separable data, and classifier is used called as
Linear SVM classifier.
Non-linear SVM - Non-linear SVM is used for non-linearly separated data, which means if the dataset cannot be
classified by using a straight line, then such data is termed as non-linear data and classifier used is called as Non-
linear SVM classifier.
Suppose the dataset has N points (x1, y1), (x2, y2), ..., (xN, yN), where x1, x2, ..., xN are the feature vectors
and y1, y2, ..., yN are the class labels of the vectors, with yi = +1 or −1.
Hyperplane is of the form w. x + b = 0
Negative Samples can be expressed as w . x + b < 0 ∀ y = −1
Positive Samples can be expressed as w . x + b > 0 ∀ y = +1
Two parallel separating hyperplanes can be described as
w . x + b = −1
w . x + b = +1
The best separating hyperplane lies on the halfway between two parallel hyperplanes or in other words we need to
maximize the margin.
In order to find the distance of a point x = (x1, x2, ..., xn) from a hyperplane whose equation is
given by α0 + α · x = 0 (general form), i.e. b + ω · x = 0, we use
$$d = \frac{|\alpha_0 + \alpha \cdot x|}{\|\alpha\|} = \frac{|b + \omega \cdot x|}{\|\omega\|}$$
So to find the perpendicular distance from the origin to 2 parallel hyperplanes do the following:
D = Q−P
P ⟹ ω . x + b = −1
⟹ ω.x + b + 1 = 0
Q ⟹ ω . x + b = +1
⟹ ω.x + b − 1 = 0
$$\text{Distance from origin to } P = \frac{|\omega \cdot x + b + 1|}{\|\omega\|} = \frac{|b + 1|}{\|\omega\|}$$
$$\text{Distance from origin to } Q = \frac{|\omega \cdot x + b - 1|}{\|\omega\|} = \frac{|b - 1|}{\|\omega\|}$$
$$D = \frac{|b - 1 - (b + 1)|}{\|\omega\|} = \frac{|-2|}{\|\omega\|} = \frac{2}{\|\omega\|}$$
Maximizing the margin D = 2/‖ω‖ is therefore equivalent to minimizing ‖ω‖/2.
Final Definition
Minimize $\frac{\|\omega\|}{2}$ (equivalently $\frac{1}{2}\|\omega\|^2$) subject to $y_i(\omega \cdot x_i - b) \geq 1$ for $i = 1, ..., N$.
The solution to the SVM problem is a classifier known as the SVM classifier. Let ω = ω* and b = b* be a solution of the SVM problem, and let x be an unclassified data instance.
SVM Classifier
$$f(x) = \omega^* \cdot x - b^*; \quad \text{assign } x \text{ to class } +1 \text{ if } f(x) \geq 0, \text{ otherwise to class } -1$$
Using the SVM algorithm, find the hyperplane with maximum margin for the following data:

| x1 | x2 | class |
| --- | --- | --- |
| 2 | 1 | +1 |
| 4 | 3 | -1 |

So x1 = (2, 1) with y1 = +1 and x2 = (4, 3) with y2 = −1, and the classifier has the form f(x) = ω · x − b (this form is used because 1. b can be positive or negative, 2. it is consistent with the classification boundaries, and 3. it makes the optimization problem easier).
From the constraint Σ αi yi = 0 we get α1 − α2 = 0, which means α1 = α2.
$$\phi(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$= (\alpha_1 + \alpha_2) - \frac{1}{2}\left[\alpha_1\alpha_1 y_1 y_1 (x_1 \cdot x_1) + \alpha_1\alpha_2 y_1 y_2 (x_1 \cdot x_2) + \alpha_2\alpha_1 y_2 y_1 (x_2 \cdot x_1) + \alpha_2\alpha_2 y_2 y_2 (x_2 \cdot x_2)\right]$$
$$= (\alpha_1 + \alpha_2) - \frac{1}{2}\left[\alpha_1^2(1)(1)(2 \cdot 2 + 1 \cdot 1) + \alpha_1\alpha_2(1)(-1)(2 \cdot 4 + 1 \cdot 3) + \alpha_2\alpha_1(-1)(1)(4 \cdot 2 + 3 \cdot 1) + \alpha_2^2(-1)(-1)(4 \cdot 4 + 3 \cdot 3)\right]$$
$$= (\alpha_1 + \alpha_2) - \frac{1}{2}\left[5\alpha_1^2 - 22\alpha_1\alpha_2 + 25\alpha_2^2\right]$$
Assume α1 = α2:
$$\phi(\alpha) = 2\alpha_1 - \frac{1}{2}\left[5\alpha_1^2 - 22\alpha_1^2 + 25\alpha_1^2\right] = 2\alpha_1 - 4\alpha_1^2$$
$$\frac{d\phi}{d\alpha_1} = 2 - 8\alpha_1 = 0 \implies \alpha_1 = \frac{1}{4}, \quad \alpha_2 = \frac{1}{4}$$
$$\omega = \sum_{i=1}^{N}\alpha_i y_i x_i = \alpha_1 y_1 x_1 + \alpha_2 y_2 x_2 = \frac{1}{4}(+1)(2, 1) + \frac{1}{4}(-1)(4, 3) = \frac{1}{4}(-2, -2) = \left(-\frac{1}{2}, -\frac{1}{2}\right)$$
$$b = \frac{1}{2}\left(\min_{i:y_i=+1}(\omega \cdot x_i) + \max_{i:y_i=-1}(\omega \cdot x_i)\right) = \frac{1}{2}\big((\omega \cdot x_1) + (\omega \cdot x_2)\big)$$
$$= \frac{1}{2}\left(\left(-\frac{1}{2}\cdot 2 - \frac{1}{2}\cdot 1\right) + \left(-\frac{1}{2}\cdot 4 - \frac{1}{2}\cdot 3\right)\right) = \frac{1}{2}\left(\frac{-10}{2}\right) = -\frac{5}{2}$$
$$f(x) = \omega \cdot x - b = \left(-\frac{1}{2}, -\frac{1}{2}\right)\cdot(x_1, x_2) - \left(-\frac{5}{2}\right) = -\frac{1}{2}x_1 - \frac{1}{2}x_2 + \frac{5}{2} = -\frac{1}{2}(x_1 + x_2 - 5)$$
The maximum-margin hyperplane is therefore
$$-\frac{1}{2}(x_1 + x_2 - 5) = 0 \implies x_1 + x_2 - 5 = 0$$
The mean of the two support vectors, $\text{mean} = \left(\frac{2+4}{2}, \frac{1+3}{2}\right) = (3, 2)$, lies on this hyperplane (3 + 2 − 5 = 0).
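A quick NumPy check of the solution above; this only verifies the numbers, it does not re-solve the optimization:

```python
import numpy as np

# Support vectors and labels from the worked example.
X = np.array([[2.0, 1.0], [4.0, 3.0]])
y = np.array([+1.0, -1.0])

w = np.array([-0.5, -0.5])   # from w = sum(alpha_i * y_i * x_i)
b = -2.5                     # from the midpoint of the two margins

# f(x) = w.x - b should give +1 and -1 on the support vectors,
# and the decision boundary is x1 + x2 - 5 = 0.
print(X @ w - b)                   # [ 1. -1.]
print(np.sign(X @ w - b) == y)     # [ True  True]
```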
Linear Discriminant Analysis (LDA)
LDA is used to project features from a higher dimensional space into a lower dimensional space.
Algorithm Steps
1. Compute the class means:
$$\mu_1 = \frac{1}{N}\sum_{x \in \omega_1} x, \quad \mu_2 = \frac{1}{N}\sum_{x \in \omega_2} x$$
2. Compute the within-class scatter matrices:
$$S_1 = \frac{1}{N-1}\sum_{x \in \omega_1}(x - \mu_1)(x - \mu_1)^T, \quad S_2 = \frac{1}{N-1}\sum_{x \in \omega_2}(x - \mu_2)(x - \mu_2)^T$$
3. Compute the total within-class scatter matrix: $S_w = S_1 + S_2$
4. Compute the between-class scatter matrix: $S_\beta = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
5. Compute the eigenvalues and eigenvectors from the within-class ($S_w$) and between-class ($S_\beta$) scatter matrices:
$$S_w^{-1} S_\beta\, \omega = \lambda \omega, \quad (S_w^{-1} S_\beta - \lambda I)\begin{pmatrix} \omega_1 \\ \omega_2 \end{pmatrix} = 0$$
6. Obtain the LDA projection by taking the dot product of the eigenvectors with the original data.
Samples for class ω1: X1 = (x1, x2) = {(4, 2), (2, 4), (2, 3), (3, 6), (4, 4)}
Samples for class ω2: X2 = (x1, x2) = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}
Solution
$$\mu_1 = \frac{1}{N}\sum_{x \in \omega_1} x = \frac{1}{5}\left[\begin{pmatrix}4\\2\end{pmatrix} + \begin{pmatrix}2\\4\end{pmatrix} + \begin{pmatrix}2\\3\end{pmatrix} + \begin{pmatrix}3\\6\end{pmatrix} + \begin{pmatrix}4\\4\end{pmatrix}\right] = \begin{pmatrix}3\\3.8\end{pmatrix}$$
$$\mu_2 = \frac{1}{5}\left[\begin{pmatrix}9\\10\end{pmatrix} + \begin{pmatrix}6\\8\end{pmatrix} + \begin{pmatrix}9\\5\end{pmatrix} + \begin{pmatrix}8\\7\end{pmatrix} + \begin{pmatrix}10\\8\end{pmatrix}\right] = \begin{pmatrix}8.4\\7.6\end{pmatrix}$$
$$S_1 = \frac{1}{N-1}\sum_{x \in \omega_1}(x - \mu_1)(x - \mu_1)^T = \begin{pmatrix} 1 & -0.25 \\ -0.25 & 2.2 \end{pmatrix}$$
$$S_2 = \frac{1}{N-1}\sum_{x \in \omega_2}(x - \mu_2)(x - \mu_2)^T = \begin{pmatrix} 2.3 & -0.05 \\ -0.05 & 3.3 \end{pmatrix}$$
$$S_w = S_1 + S_2 = \begin{pmatrix} 3.3 & -0.3 \\ -0.3 & 5.5 \end{pmatrix}$$
$$S_\beta = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{pmatrix} -5.4 \\ -3.8 \end{pmatrix}\begin{pmatrix} -5.4 & -3.8 \end{pmatrix} = \begin{pmatrix} 29.16 & 20.52 \\ 20.52 & 14.44 \end{pmatrix}$$
$$S_w^{-1} S_\beta\, \omega = \lambda\omega \implies \det(S_w^{-1} S_\beta - \lambda I) = 0, \quad \text{with } A^{-1} = \frac{\text{Adj}(A)}{|A|}$$
$$S_w^{-1} S_\beta = \begin{pmatrix} 3.3 & -0.3 \\ -0.3 & 5.5 \end{pmatrix}^{-1}\begin{pmatrix} 29.16 & 20.52 \\ 20.52 & 14.44 \end{pmatrix} = \begin{pmatrix} 9.2213 & 6.489 \\ 4.2339 & 2.9794 \end{pmatrix}$$
$$\det\begin{pmatrix} 9.2213 - \lambda & 6.489 \\ 4.2339 & 2.9794 - \lambda \end{pmatrix} = 0 \implies \lambda^2 - 12.2007\lambda = 0 \implies \lambda(\lambda - 12.2007) = 0 \implies \lambda_1 = 0, \ \lambda_2 = 12.2007$$
Solving $(S_w^{-1} S_\beta - \lambda I)\begin{pmatrix}\omega_1\\\omega_2\end{pmatrix} = 0$ for the eigenvalues and eigenvectors gives the linear discriminant directions. Equivalently,
$$\omega^* = S_w^{-1}(\mu_1 - \mu_2) = \begin{pmatrix} 0.9088 \\ 0.4173 \end{pmatrix}$$
Find the dot product of the data points with ω* to get the 1st Linear Discriminant (normal matrix multiplication).
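A NumPy sketch that reproduces the scatter matrices and the discriminant direction from this example (the direction may come out with the opposite sign):

```python
import numpy as np

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)           # (3, 3.8) and (8.4, 7.6)
S1 = (X1 - mu1).T @ (X1 - mu1) / (len(X1) - 1)
S2 = (X2 - mu2).T @ (X2 - mu2) / (len(X2) - 1)
Sw = S1 + S2                                          # [[3.3, -0.3], [-0.3, 5.5]]

# Fisher's direction; normalizing reproduces (0.9088, 0.4173) up to sign.
w = np.linalg.solve(Sw, mu1 - mu2)
w_unit = w / np.linalg.norm(w)
print(Sw, w_unit)

# 1st linear discriminant: project every sample onto the unit direction.
print(X1 @ w_unit, X2 @ w_unit)
```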
Unsupervised Learning
type of machine learning in which models are trained using unlabeled dataset and are allowed to act on that data without
any supervision.
cannot be applied directly to regression or classification problem, because we have the input data but no corresponding
output data.
It is much like how humans learn to think from their own experience, which makes it closer to AI.
In the real world we do not always have input data with a corresponding output; unsupervised
learning is used to solve such cases.
Clustering - a method of grouping objects into clusters such that objects with the most similarities remain in one group
and have little or no similarity with the objects of another group.
Association - an unsupervised learning method used for finding relationships between variables in a
large database. It determines the set of items that occur together in the dataset. Association rules make marketing
strategies more effective.
Clustering
A method of grouping objects into clusters such that objects with the most similarities remain in one group and have
little or no similarity with the objects of another group.
Applications
Market segmentation
Image Segmentation
Types of Clustering
Hard Clustering - In this each input data point either belongs to a cluster completely or not.
Soft Clustering - In this, instead of putting each input data point into a separate cluster, a probability or likelihood of
that data point being in those clusters is assigned. (an item can exist in multiple clusters)
Partitioning Clustering
Hierarchical Clustering
Partitioning Clustering
These are iterative clustering algos in which the notion of similarity is derived by the closeness of a data point to the
centroid or cluster center of the clusters.
K-means Clustering
Suppose that the data mining task is to cluster points into three clusters
where the points are
A1 (2,10)
A2 (2,5)
A3 (8,4)
B1 (5,8)
B2 (7,5)
B3 (6,4)
C1 (1,2)
C2 (4,9)
Suppose initially we assign A1, B1, C1 as the centers of each cluster respectively.
Answer
Initial Centroids:
A1 ⇒ (2,10)
B1 ⇒ (5,8)
C1 ⇒ (1,2)
New Centroids
A1 ⇒ (2,10)
B1 ⇒ (6,6)
C1 ⇒ (1.5, 3.5)
New Centroids
A1 ⇒ (3,9.5)
B1 ⇒ (6.5, 5.25)
C1 ⇒ (1.5,3.5)
New Centroids
A1⇒(3.67, 9)
B1⇒(7,4.33)
C1⇒(1.5, 3.5)
Since the clustering is the same for the last two iterations, we can conclude that the algorithm has converged and these are the final clusters.
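A minimal NumPy sketch of the k-means iterations on these eight points, using A1, B1, C1 as the initial centres as in the question:

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centroids = points[[0, 3, 6]].copy()     # A1, B1, C1 as the initial centres

for _ in range(10):
    # Assignment step: each point goes to its nearest centroid (Euclidean distance).
    labels = np.argmin(np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
    # Update step: each centroid becomes the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)   # converges to roughly (3.67, 9), (7, 4.33), (1.5, 3.5)
print(labels)
```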
Locally Linear Embedding (LLE)
It helps in reducing the dimensionality of data while preserving its local neighborhood structure, making it easier to
identify patterns in high-dimensional datasets.
Algorithm
Input Data - Suppose we have a dataset with N data points, each having D dimensions. Let the dataset be
represented as a matrix X, where each row is a data point: X = {x1 , x2 , ...., xn }
Find Nearest Neighbors - For each data point x_i, find its K nearest neighbors. This can be done using a
distance metric such as Euclidean distance.
Compute Reconstruction Weights - For each data point x_i, compute the weights W_ij that give the best linear
reconstruction of x_i from its nearest neighbors. The weights are found by minimizing the reconstruction error:
$$\varepsilon(W) = \sum_i \left\| x_i - \sum_j W_{ij} x_j \right\|^2$$
Compute Low Dimensional Embedding - Find the low dimensional representation Y that best preserves the local
relationships defined by the weights W. This is done by minimizing the embedding cost function:
$$\phi(Y) = \sum_i \left\| y_i - \sum_j W_{ij} y_j \right\|^2$$
LLE preserves local relationships whereas PCA preserves the global structure of data.
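A minimal scikit-learn sketch of LLE, assuming scikit-learn is available; the swiss-roll data and parameter values are illustrative assumptions:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Toy 3-D data used only as an illustration.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# n_neighbors = K nearest neighbours used for the reconstruction weights,
# n_components = dimensionality of the embedding Y.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Y = lle.fit_transform(X)
print(Y.shape)                    # (1000, 2)
print(lle.reconstruction_error_)
```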
Applications are-
Face recognition
Speech processing
Image segmentation
Perceptron
Perceptron is a linear supervised machine learning algorithm
It is particularly significant in the field of pattern recognition due to its ability to classify data points into different
categories based on their features.
The main idea behind a perceptron is to classify input data into one of two categories based on its features.
The perceptron receives multiple input values, which can be features of the data. For example, if you are classifying
emails as spam or not spam, the inputs could be features like the number of links, the presence of certain keywords
etc.
Weights
Each input is associated with a weight, which determines the importance of that input in the classification process.
Weights are adjusted during the learning process to improve the model’s accuracy.
Bias
A bias term is added to weighted sum of the inputs to allow model to fit the data better. It acts as an additional
parameter that helps shift the decision boundary.
Activation Function
The perceptron uses an activation function to determine the output based on the weighted sum of inputs. The most
common activation function for a Perceptron is the step function, which outputs either 0 or 1.
1. Weighted Sum
a. For a given input vector X = [x1, x2, ......., xn] and corresponding weights W = [ω1, ω2, ....., ωn], the Perceptron
computes the weighted sum:
$$z = \sum_{i=1}^{n} \omega_i x_i + b$$
2. Activation Function
a. The output y of the Perceptron is determined by applying the activation function to the weighted sum:
$$y = f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$
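A small NumPy sketch of a perceptron with the step activation and the classic weight update rule; the AND problem, learning rate and epoch count are illustrative assumptions:

```python
import numpy as np

def step(z):
    """Step activation: 1 if z >= 0 else 0."""
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Classic perceptron learning rule: w += lr * (y - y_hat) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            y_hat = step(xi @ w + b)
            w += lr * (target - y_hat) * xi
            b += lr * (target - y_hat)
    return w, b

# Toy linearly separable problem: logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(step(X @ w + b))   # [0 0 0 1]
```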
KNN Algorithm
K Nearest Neighbor Algorithm
assumes the similarity between new data and available cases and put the new case into category that is most similar to
available categories.
when new data appears then it can be easily classified into a well suited category by using KNN algo.
also called lazy learner algo because it doesn’t learn from the training set immediately instead it stores the dataset and
at the time of classification, it performs an action on the dataset.
suppose there are 2 categories, i.e., Category A and Category B, and we have a new datapoint x1, so this data point
will lie in which of these categories.
With help of KNN we can easily identify the category or class of particular dataset.
1. Select the number K of neighbors.
2. Calculate the distance (e.g. Euclidean distance) between the new data point and each training data point.
3. Take the K nearest neighbors as per the calculated distances.
4. Among these K neighbors, count the number of data points in each category.
5. Assign the new data point to the category for which the number of neighbors is maximum.
1. There is no particular way to determine the best value for K so we need to try some values to find the best out of
them. The most preferred value is 5.
2. A very low value for K such as 1 or 2 can be noisy and lead to the effect of outliers in the model
Advantages
simple implementation
Disadvantages
high computation cost (calculating the distance between data points for all the training samples)
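A minimal scikit-learn sketch of KNN classification, assuming scikit-learn is available; the synthetic data and train/test split are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K = 5 (the "most preferred" value mentioned above); Euclidean distance by default.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)            # "lazy" - just stores the training set
print(knn.score(X_test, y_test))     # majority vote among the 5 nearest neighbours
```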
Mean Shift Clustering
It is particularly useful for datasets where the clusters have arbitrary shapes and are not well separated by linear
boundaries.
It is a mode-seeking algorithm.
Unlike K-means, mean shift does not need the number of clusters specified in advance; it is determined from the data.
main idea is to shift each data point towards the mode (the highest density) of the distribution of points within a certain
radius.
The algorithm iteratively performs these shifts until the points converge to a local maximum of density function.
Why?
Identifies Data Structure - it helps uncover the natural structure in data without assuming specific parameters like
the number or shapes of clusters.
Focus on Dense Areas- It prioritizes regions with data points densely packed.
Algorithm
1. Initialize - start with every data point as a candidate cluster centre.
2. Measure Density - define a radius (window) around each point and calculate the average position of the points within that
radius.
3. Shift points - Move each point toward the average position (dense region)
4. Repeat Until Stable - Keep repeating until movement is very small, meaning the points have settled around dense
regions.
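A minimal scikit-learn sketch of mean shift clustering, assuming scikit-learn is available; the two synthetic blobs are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Two dense blobs; mean shift should find one mode per blob.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))])

# The bandwidth plays the role of the radius / kernel width described above.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)
print(ms.cluster_centers_)     # roughly (0, 0) and (5, 5)
print(np.unique(labels))       # number of clusters found from the data
```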
How it works:
Place a smooth, bell shaped curve (kernel) over each data point
Applications
Clustering
Use: grouping similar data points into clusters without assuming fixed number of clusters.
How: each data points moves to its corresponding mode, and each points converging to same mode form a cluster.
Advantages
Image Segmentation
How
treat pixel intensities (or colors) as data points in a high dimensional space.
Object Tracking
How
use mean shift to locate the region of highest feature density (e.g. color histogram) in a search windows
Anomaly Detection
How
EM Algorithm (Expectation-Maximization)
Latent Variables - unobserved variables in statistical models that can only be inferred indirectly through their effects on
observable variables.
Likelihood - probability of observing the given data given the parameters of the model. In the EM algo, the goal is to find
the parameters that maximize the likelihood.
Log-Likelihood - the logarithm of the likelihood function, which measures how well the model fits the observed data.
(The EM algorithm seeks to maximize this.)
Maximum Likelihood Estimation (MLE) - MLE is a method to estimate the parameters of a statistical model by finding
parameter values that maximize the likelihood function, which measures how well the model explains the observed data.
Convergence - condition when the EM algo has reached a stable solution. (change in log-likelihood or parameter
estimates fall below the threshold)
Algorithm
1. Initialization - starts with initial guesses for the parameters of the model
2. E-Step (Expectation) - compute the expected values of the latent variables using the current parameter estimates.
3. M-Step (Maximization) - update the parameters to maximize the likelihood of the observed data given the expectations
from the E-step
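A minimal sketch of EM in practice via scikit-learn's GaussianMixture (which runs EM internally), assuming scikit-learn is available; the two-component synthetic data is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Data drawn from two Gaussians; the component assignments are the latent variables.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.5, 300)]).reshape(-1, 1)

# GaussianMixture runs EM internally: the E-step computes responsibilities,
# the M-step re-estimates means, variances and mixing weights until convergence.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())        # roughly [0, 6]
print(gmm.converged_, gmm.n_iter_)
print(gmm.score(X))              # average log-likelihood per sample
```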
Applications
2. Image Segmentation
Pixel Classification: Groups pixels based on intensity or color into regions (e.g., separating foreground from
background).
3. Dimensionality Reduction
Principal Component Analysis (PCA) Variants: EM helps in probabilistic approaches to PCA, enabling reduction of data
dimensions while accounting for missing data.