ML Notes - 2025
Supervised Learning
• In supervised learning, the machine is trained on a set of labelled data, which means
that the input data is paired with the desired output. The machine then learns to predict
the output for new input data.
• Supervised learning is often used for tasks such as classification and regression.
• In supervised learning, each data point in the training data contains input variables (also
known as independent variables or features), and an output variable, or label.
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised
learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means the data
must be assigned to discrete classes such as Yes-No, Male-Female, True-False, etc.
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
For example, a labelled dataset of images of Elephant, Camel and Cow would have each
image tagged with either "Elephant", "Camel", or "Cow".
Classification:
• The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then
classifies new observations into a number of classes or groups, such as Yes or No, 0 or
1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or
categories.
• Classification: The process of sorting data into categories based on specific features or
characteristics.
• There are different types of classification problems depending on how many categories
(or classes) we are working with and how they are organized. There are two main
classification types in machine learning:
1. Binary Classification
• This is the simplest kind of classification. In binary classification, the goal is to sort the
data into two distinct categories. Think of it like a simple choice between two options.
• Imagine a system that sorts emails into either spam or not spam. It works by looking
at different features of the email like certain keywords or sender details, and decides
whether it’s spam or not. It only chooses between these two options.
2. Multiclass Classification
• Here, instead of just two categories, the data needs to be sorted into more than two
categories. The model picks the one that best matches the input.
• Think of an image recognition system that sorts pictures of animals into categories
like cat, dog, and bird.
Basically, the machine looks at the features in the image (like shape, color, or texture) and
chooses which animal the picture is most likely to show, based on the training it received.
3. Multi-Label Classification
In multi-label classification, a single piece of data can belong to multiple categories at once.
Unlike multiclass classification, where each data point belongs to only one class, multi-label
classification allows data points to belong to multiple classes. A movie recommendation
system could tag a movie as both action and comedy. The system checks various features (like
movie plot, actors, or genre tags) and assigns multiple labels to a single piece of data, rather
than just one.
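As a minimal sketch of multi-label classification, assuming scikit-learn is available, the snippet below trains one binary classifier per tag on a tiny made-up movie dataset (the feature values and tags are invented purely for illustration), so that a single sample can receive several labels:

```python
# Multi-label sketch: MultiLabelBinarizer turns tag sets into a 0/1 indicator matrix,
# and OneVsRestClassifier fits one binary classifier per tag.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X = np.array([[9, 1], [8, 2], [2, 9], [1, 8], [7, 7], [6, 6]])  # e.g. [action score, comedy score]
y_tags = [{"action"}, {"action"}, {"comedy"}, {"comedy"},
          {"action", "comedy"}, {"action", "comedy"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_tags)                 # binary indicator matrix, one column per tag

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict([[7, 6]])                  # a movie scoring high on both features
print(mlb.inverse_transform(pred))            # may yield both tags, e.g. [('action', 'comedy')]
```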
Lazy vs. eager learners (comparison):
• Prediction speed: Lazy learners predict slowly, because the functions and learning are applied at the prediction stage; eager learners predict very fast, as the functions are pre-defined (built during training).
• Learning scope: Medium for both; eager learners learn from the data while training, whereas lazy learners learn from the data while testing.
Classification algorithms are widely used in many real-world applications across various
domains, including:
• Email spam filtering
• Credit risk assessment
• Medical diagnosis
• Sentiment analysis
• Fraud detection
• Recommendation systems
Classification Algorithms
To implement any classification model, it is essential to understand Logistic Regression,
which is one of the most fundamental and widely used algorithms in machine learning for
classification tasks. There are various types of classifier algorithms. Some of them are:
Linear Classifiers: Linear classifier models create a linear decision boundary between classes.
They are simple and computationally efficient. Some of the linear classification models are as
follows:
• Logistic Regression
• Support Vector Machines having kernel = ‘linear’
• Single-layer Perceptron
• Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers: Non-linear models create a non-linear decision boundary between
classes. They can capture more complex relationships between input features and target
variable. Some of the non-linear classification models are as follows:
• K-Nearest Neighbours
• Kernel SVM
• Naive Bayes
• Decision Tree Classification
• Ensemble learning classifiers (e.g., Random Forests)
• Multi-layer Artificial Neural Networks
Decision Trees:
• Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
• The decisions or tests are performed on the basis of the features of the given dataset.
• Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
• Branch/Sub Tree: A sub-tree formed by splitting a node of the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that is divided into sub-nodes is called the parent node, and
the sub-nodes are called the child nodes.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such final nodes are called leaf nodes.
Example: Suppose a candidate has a job offer and wants to decide whether to accept it.
To solve this problem, the decision tree starts with the root node (the Salary attribute,
chosen by ASM). The root node splits into the next decision node (Distance from the
office) and one leaf node, based on the corresponding labels. The Distance node further
splits into one decision node (Cab facility) and one leaf node. Finally, the Cab facility
node splits into two leaf nodes (Accept offer and Decline offer).
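As a rough sketch of such a tree, assuming scikit-learn is available, the snippet below fits a DecisionTreeClassifier on a tiny invented job-offer dataset (the salary, distance, and cab-facility values are made up purely for illustration) and prints the learned rules:

```python
# Decision tree sketch on hypothetical job-offer data.
# Columns: [salary_in_lakhs, distance_km, cab_facility(0/1)]; label: accept (1) / decline (0).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([
    [4, 10, 0], [5, 25, 0], [9,  5, 0], [10, 30, 1],
    [12, 28, 0], [11, 8, 1], [3, 4, 1], [9, 35, 0],
])
y = np.array([0, 0, 1, 1, 0, 1, 0, 0])

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))  # the learned split rules
print(tree.predict([[10, 12, 1]]))                                     # decision for a new offer
```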
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the
root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure (ASM). Using this measure, we can easily select the best
attribute for the nodes of the tree. There are two popular ASM techniques:
• Information Gain
• Gini Index
Information Gain (IG) is a metric used in decision tree algorithms (like ID3, C4.5, and
CART) to evaluate the effectiveness of a particular attribute in classifying data. It measures
the reduction in entropy (uncertainty or disorder) after a dataset is split based on a particular
attribute.
Entropy: A measure of uncertainty or disorder. If a dataset is perfectly pure (all
instances belong to the same class), the entropy is 0. If the classes are equally
distributed, the entropy is maximized (which is 1 for binary classification).
Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)
Where,
o S = the set of samples
o P(yes) = the probability of "yes" in S
o P(no) = the probability of "no" in S
Information Gain (IG): It measures the change in entropy after a dataset is split based on a
particular attribute. It is the difference between the original entropy of the dataset and the
weighted average entropy of the subsets resulting from the split.
The formula for Information Gain is:
Information Gain (Dataset, Attribute) = Entropy(Dataset) - Weighted Average
Entropy(Subsets created by splitting on Attribute)
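To make the entropy and information-gain formulas concrete, here is a minimal NumPy sketch on a made-up dataset (a 9-yes / 5-no label split and an invented Wind attribute, chosen only for illustration):

```python
# Entropy and Information Gain for a labelled dataset, computed with plain NumPy.
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """IG = Entropy(S) - weighted average entropy of the subsets split on the attribute."""
    total = len(labels)
    weighted = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        weighted += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted

labels = np.array(["yes"] * 9 + ["no"] * 5)       # toy 14-sample dataset
wind = np.array(["weak", "strong"] * 7)           # toy attribute to split on
print(entropy(labels))                            # ~0.940 for a 9/5 split
print(information_gain(labels, wind))
```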
Gini Index
Decision trees are a popular machine learning algorithm used for both classification and
regression tasks. In classification, the goal is to predict which category a data point belongs to.
The Gini Index plays a crucial role in how decision trees decide how to split the data at each
step.
Here's the basic idea:
1. Measuring Impurity: The Gini Index measures the "impurity" of a set of data points.
A pure set means all data points belong to the same category. An impure set means the
data points are mixed across different categories.
2. Finding the Best Split: When building a decision tree, the algorithm needs to decide
which feature to use for splitting the data at each node. It calculates the Gini Index for
each possible split and chooses the one that results in the greatest reduction in impurity.
This means the split that creates the most "pure" subsets of data.
How it's calculated:
The Gini Index for a set of data points is calculated as:
Gini Index = 1 − (p1)² − (p2)² − ... − (pc)²
where pi is the probability (proportion) of category i and c is the number of categories.
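A tiny sketch of this calculation in Python (the label lists are made up for illustration):

```python
# Gini Index sketch: Gini = 1 - sum(p_i^2) over the class proportions.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["A"] * 5 + ["B"] * 5))   # 0.5 -> maximally impure two-class set
print(gini(["A"] * 10))              # 0.0 -> pure set
```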
Univariate Decision Trees
Characteristics:
• Use conditions on a single feature, such as Xi ≤ V (for numerical data) or categorical rules (e.g., "Color = Red").
• Simpler to interpret and computationally efficient.
• Commonly used in CART, ID3, and C4.5 algorithms.
Example:
If we have features (Age, Income, Credit Score), a univariate decision tree may use only one
feature at a time to make a split:
• If Age ≤ 30 → Go left
• If Age > 30 → Go right
Multivariate Decision Trees
Characteristics:
• More flexible than univariate trees.
• Decision boundaries are not restricted to be parallel to axes (can be diagonal or
curved).
• Uses linear models (like Logistic Regression, SVM, PCA-based splits) to determine
the best split.
• More computationally expensive than univariate trees.
Example (Multivariate Split Rule): a single split may combine several features, e.g. a rule such as 0.5·Age + 0.3·Income ≤ 40 (a linear combination of features).
Types of Pruning
1. Pre-Pruning (Early Stopping):
o Pre-pruning involves stopping the growth of a decision tree before it becomes
too complex and overfits the training data. This can be done by setting a
maximum tree depth, a minimum number of data points in a leaf node, or a
threshold for the information gain at each decision node. Pre-pruning is simple
and computationally efficient, but it may not capture complex relationships in
the data.
• Stops the tree from growing before it becomes too complex.
• Sets a limit on conditions like:
o Max depth (max_depth)
o Min samples per split (min_samples_split)
o Min samples per leaf (min_samples_leaf)
• Advantages:
Faster training.
Prevents excessive growth.
• Disadvantages:
Might stop too early, missing some important patterns.
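As a brief sketch of pre-pruning in practice, assuming scikit-learn and its bundled Iris dataset, the snippet below compares a tree grown with the limits listed above against an unrestricted tree:

```python
# Pre-pruning (early stopping) by passing growth limits to the constructor.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                                min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # unrestricted tree

print("pruned depth:", pruned.get_depth(), "test acc:", pruned.score(X_te, y_te))
print("full   depth:", full.get_depth(),   "test acc:", full.score(X_te, y_te))
```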
Applying Bayes' theorem to a spam-filtering example: if an email contains the word "discount", there is a 63% probability that it is spam.
Step 4: Decision Making Using Loss Function
• A loss function in machine learning tells us how far our predictions are from the actual
values.
• In Bayesian learning, we make predictions based on probability distributions rather
than fixed numbers, so our loss function helps decide which predictions are best.
• Suppose misclassifying spam as non-spam has a higher cost (e.g., 5 points) than
misclassifying non-spam as spam (2 points).
• The expected loss for labelling an email as spam (ds) or non-spam (dN) is computed.
• Likelihood is a measure of how well a given set of parameters explains the observed
data.
If we have a dataset D = {x1, x2, ..., xn} and a model with parameters θ, the
likelihood function is:
L(θ) = P(D ∣ θ)
This function tells us how probable the observed data is, given the model
parameters.
3. Maximize the Likelihood: The goal of MLE is to find the parameters that maximize
the likelihood function. This means finding the parameters that make your observed
data the most probable.
• MLE finds the parameter value that maximizes the likelihood function:
θ_MLE = arg max_θ L(θ)
• arg max: This is short for "argument of the maximum." It means "find the value of
θ that maximizes the following function."
• L(θ): This is the likelihood function. It represents how likely it is to observe the data
we have, given a particular value of the parameter θ.
Since probabilities are usually very small (they involve multiplying the probabilities
of individual data points), we often take the log-likelihood instead, to avoid numerical
issues and simplify calculations:
log L(θ) = Σi log P(xi ∣ θ)
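A minimal MLE sketch, assuming NumPy and SciPy are available: it estimates the bias of a coin from invented flip data by maximizing the log-likelihood numerically, and compares the result with the closed-form answer.

```python
# MLE for a Bernoulli (coin-flip) parameter theta, via the log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])   # 7 heads out of 10 flips (made up)

def neg_log_likelihood(theta):
    # log L(theta) = sum_i [ x_i*log(theta) + (1 - x_i)*log(1 - theta) ]
    return -np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", res.x)        # ~0.7
print("closed form  :", data.mean())  # for a Bernoulli, the MLE is the sample mean
```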
Linear regression:
• Linear regression is one of the simplest statistical regression techniques used for
predictive analysis in machine learning.
• It models the linear relationship between the independent (predictor) variable on the
X-axis and the dependent (output) variable on the Y-axis; hence it is called linear
regression. If there is a single input variable X (independent variable), it is simple
linear regression.
• In a simple linear regression, there is one independent variable and one dependent
variable. The model estimates the slope and intercept of the line of best fit, which
represents the relationship between the variables.
• The slope represents the change in the dependent variable for each unit change in the
independent variable, while the intercept represents the predicted value of the
dependent variable when the independent variable is zero.
• A scatter plot of such data shows the linear relationship between the output (y) and
predictor (X) variables. The best-fit straight line is the line that, based on the given
data points, fits them best.
Simple Regression Calculation
• To calculate the best-fit line, linear regression uses the traditional slope-intercept form
given below,
Yi = β0 + β1·Xi
where Yi = dependent variable, β0 = constant/intercept, β1 = slope coefficient,
Xi = independent variable.
• This algorithm models the linear relationship between the dependent (output) variable
y and the independent (predictor) variable X using the straight line Y = B0 + B1·X.
But how does the regression find out which is the best-fit line?
• The goal of the linear regression algorithm is to find the best values for B0 and B1,
i.e. the best-fit line. The best-fit line is the line with the least error, meaning the
error between the predicted values and the actual values should be minimal.
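A short sketch of this calculation with NumPy, on made-up (x, y) points, using the standard least-squares formulas for the intercept B0 and slope B1:

```python
# Simple linear regression via the closed-form least-squares estimates.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])          # roughly y = 2x (toy data)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept
print(f"y = {b0:.3f} + {b1:.3f} * x")

y_pred = b0 + b1 * x
print("MSE:", np.mean((y - y_pred) ** 2))        # error of the fitted line
```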
Gradient Descent
• Gradient Descent is defined as one of the most commonly used iterative optimization
algorithms of machine learning to train the machine learning and deep learning models
by means of minimizing errors between actual and expected results. It helps in finding
the local minimum of a function.
• The best way to define the local minimum or local maximum of a function using
gradient descent is as follows:
• If we move towards a negative gradient or away from the gradient of the function at
the current point, it will give the local minimum of that function.
• Whenever we move towards a positive gradient, i.e. towards the gradient of the function
at the current point, we will get the local maximum of that function; this procedure is
known as Gradient Ascent. Gradient descent itself (moving along the negative gradient)
is also known as steepest descent.
The main objective of using a gradient descent algorithm is to minimize the cost
function using iteration.
The cost function is defined as the measurement of difference or error between actual values
and expected values at the current position and present in the form of a single real number.
• Calculate the first-order derivative of the function to compute the gradient (slope) of
the function at the current point.
• Move in the direction opposite to the gradient, stepping from the current point by
alpha times the gradient, where alpha is the Learning Rate: a tuning parameter in the
optimization process which helps decide the length of the steps.
To minimize the cost function, two data points are required: Direction & Learning Rate
Learning Rate: It is defined as the step size taken to reach the minimum or lowest point.
How Does Gradient Descent Work in Linear Regression?
Linear regression predicts a dependent variable value (y) based on a given independent
variable (x); hence the name linear regression.
• Initialize Parameters: Start with random initial values for the slope (m) and intercept
(b).
• Calculate the Cost Function: Compute the error using a cost function such as the Mean
Squared Error (MSE): J(m, b) = (1/n) Σi (yi − (m·xi + b))².
• Compute the Gradient: Find the gradient of the cost function with respect to m and b:
∂J/∂m = −(2/n) Σi xi·(yi − (m·xi + b)) and ∂J/∂b = −(2/n) Σi (yi − (m·xi + b)).
These gradients indicate how the cost changes when the parameters are adjusted.
• Update Parameters: Adjust m and b in the direction that reduces the cost:
m ← m − α·∂J/∂m and b ← b − α·∂J/∂b, where α is the learning rate.
• Repeat: Iterate until the cost function converges i.e. further updates make little or no
difference.
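The snippet below is a minimal NumPy sketch of these steps; the data, learning rate, and iteration count are invented for illustration:

```python
# Gradient descent for simple linear regression: init -> MSE cost -> gradients -> updates.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 2.0 * x + 1.0 + np.random.default_rng(0).normal(0, 0.1, size=x.size)  # noisy y = 2x + 1

m, b = 0.0, 0.0            # Step 1: initial parameters
alpha, n = 0.05, len(x)    # learning rate and number of samples

for _ in range(2000):
    y_pred = m * x + b
    cost = np.mean((y - y_pred) ** 2)            # Step 2: MSE cost
    dm = -(2 / n) * np.sum(x * (y - y_pred))     # Step 3: gradients
    db = -(2 / n) * np.sum(y - y_pred)
    m -= alpha * dm                              # Step 4: update parameters
    b -= alpha * db

print(f"m = {m:.3f}, b = {b:.3f}, final cost = {cost:.5f}")   # close to m=2, b=1
```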
Linear Discrimination
• Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality
reduction techniques in machine learning to solve more than two-class classification
problems. It is also known as Normal Discriminant Analysis (NDA) or Discriminant
Function Analysis (DFA).
• It is also used as a pre-processing (dimensionality reduction) step in machine learning
and in pattern classification applications.
Example:
Let's assume we have to classify two different classes, each having a set of data points, in a
2-dimensional plane.
If we classify them using a single feature, the classes may show overlapping.
To overcome this overlapping issue, LDA uses the X-Y plane to create a new axis: it separates
the classes with a straight line and projects the data onto that new axis.
In this way we can maximize the separation between the classes and reduce the 2-D plane to
1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
o It maximizes the distance between means of two classes.
o It minimizes the variance within the individual class.
LDA can be performed in 5 steps:
Step 1: Compute the mean vectors for the different classes from the dataset.
Step 2: Compute the scatter matrices (in-between-class and within-class scatter matrices).
Step 3: Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
Step 4: Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest
eigenvalues.
Step 5: Use this eigenvector matrix to transform the samples onto the new subspace.
3. Objective of LDA:
LDA tries to maximize the ratio of between-class scatter to within-class scatter to
achieve maximum class separation. This is represented mathematically as
J(w) = (w^T Sb w) / (w^T Sw w), where Sb is the between-class scatter matrix and
Sw is the within-class scatter matrix.
4. Solve for the Optimal Projection Vector:
The optimal projection directions w are the eigenvectors of Sw^(-1) Sb with the largest
eigenvalues. This transformation reduces the dimensionality of the data while maintaining
the class-related information.
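A compact sketch of these five steps for two made-up 2-D classes, using NumPy only; the class means and scatter matrices follow the description above:

```python
# Two-class LDA: class means, within-/between-class scatter, eigenpairs of Sw^-1 Sb, projection.
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, size=(50, 2))       # class 0 (toy data)
X1 = rng.normal([4, 3], 1.0, size=(50, 2))       # class 1 (toy data)

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)                      # Step 1: class means
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)         # Step 2: within-class scatter
diff = (m1 - m0).reshape(-1, 1)
Sb = diff @ diff.T                                             # between-class scatter

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)       # Step 3: eigen-decomposition
w = eigvecs[:, np.argmax(eigvals.real)].real                   # Step 4: top eigenvector
z0, z1 = X0 @ w, X1 @ w                                        # Step 5: project to 1-D
print("projected class means:", z0.mean(), z1.mean())          # well separated in 1-D
```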
Logistic regression
• Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in how it is used:
Linear Regression is used for solving regression problems, whereas Logistic Regression
is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
• Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The logistic (sigmoid) function is described below.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into a value within the range 0 to 1: σ(z) = 1 / (1 + e^(−z)).
Logistic Regression Equation:
• The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:
We know the equation of a straight line can be written as:
y = b0 + b1·x1 + b2·x2 + ... + bn·xn
o In Logistic Regression, y is the probability P(y=1) and can only be between 0 and 1,
so we divide the above expression by (1 − y):
y / (1 − y)   (which is 0 for y = 0 and infinity for y = 1)
o But we need a range from -[infinity] to +[infinity], so taking the logarithm of the
equation it becomes:
log[ y / (1 − y) ] = b0 + b1·x1 + b2·x2 + ... + bn·xn
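A brief sketch, assuming scikit-learn, that fits a logistic regression on invented 1-D data and shows how the sigmoid turns the linear score b0 + b1·x into a probability:

```python
# Logistic regression sketch: sigmoid maps the linear score to a probability in (0, 1).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # the S-shaped logistic function

X = np.array([[1], [2], [3], [4], [6], [7], [8], [9]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])           # e.g. "hours studied" -> fail/pass (toy data)

clf = LogisticRegression().fit(X, y)
b0, b1 = clf.intercept_[0], clf.coef_[0][0]
print("P(y=1 | x=5):", sigmoid(b0 + b1 * 5))     # same as clf.predict_proba([[5]])[0, 1]
print(clf.predict([[5]]))                        # class label (0 or 1) after thresholding
```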
Multilayer Perceptron:
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the
network’s weights and biases. This is achieved through backpropagation:
Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss: w ← w − η·∂L/∂w and
b ← b − η·∂L/∂b, where η is the learning rate.
Step 4: Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases
during training. Popular optimization methods include:
• Stochastic Gradient Descent (SGD): Updates the weights based on a single sample
or a small batch of data: w ← w − η·∇Li(w), where Li is the loss on that sample or
mini-batch.
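As a small sketch of an MLP trained with backpropagation and SGD, assuming scikit-learn's MLPClassifier, the snippet below learns the XOR pattern (which is not linearly separable) from repeated toy samples:

```python
# Multilayer perceptron sketch: one hidden layer, SGD updates, backpropagation under the hood.
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 50, dtype=float)
y = np.array([0, 1, 1, 0] * 50)                  # XOR labels

mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh", solver="sgd",
                    learning_rate_init=0.1, max_iter=5000, random_state=0)
mlp.fit(X, y)                                    # forward pass + backprop + SGD updates
print(mlp.predict([[0, 1], [1, 1]]))             # ideally [1 0] once trained
print("training accuracy:", mlp.score(X, y))
```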
NONPARAMETRIC METHODS:
Parametric methods, like linear regression, logistic regression, and Naive Bayes (with Gaussian
assumptions), assume a specific form for the relationship between features and the target. They
estimate a fixed set of parameters from the data. If the assumed form is incorrect, the model
might perform poorly even with a lot of data.
Nonparametric methods, on the other hand, let the data speak for itself and adapt the model
complexity accordingly. This flexibility comes at the cost of potential overfitting and
sometimes higher computational demands.
Histogram Estimator
The Histogram Estimator is a fundamental and intuitive nonparametric density estimation
technique. It provides a visual and basic approximation of the probability density function
(PDF) of a dataset by dividing the data range into intervals (bins) and representing the
frequency of data points within each bin as the height of a bar.
Kernel Density Estimator (KDE)
The Kernel Density Estimator (KDE) is a non-parametric method used to estimate the
probability density function (PDF) of a continuous random variable. Unlike parametric
methods that assume a specific underlying distribution (like a normal distribution), KDE makes
no such assumptions and instead learns the distribution directly from the data. It provides a
smooth and continuous estimate of the density.
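A minimal sketch of both estimators on made-up 1-D data, using NumPy; the histogram divides bin counts by n times the bin width, and the KDE uses a Gaussian kernel with an assumed bandwidth h = 0.3:

```python
# Histogram estimator: p(x) ~ count_in_bin / (n * bin_width).
# Kernel density estimator: p(x) = (1 / (n * h)) * sum_i K((x - x_i) / h), Gaussian K.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])  # toy sample

hist, edges = np.histogram(data, bins=20, density=True)   # density=True normalizes per bin width

def kde(x, sample, h=0.3):
    u = (x - sample[:, None]) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi), axis=0) / h

grid = np.linspace(-4, 4, 9)
print(np.round(kde(grid, data), 3))   # smooth density estimates on the grid
```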
K-Nearest Neighbor Estimator (KNN Estimator):
The K-Nearest Neighbor Estimator (KNN Estimator) is a non-parametric density estimation
technique that estimates the probability density function (PDF) of a random variable at a given
point based on the distances to its k nearest neighbors in the training data. Unlike methods like
histograms or kernel density estimation (KDE) that fix the bin width or kernel bandwidth, the
KNN estimator adapts the size of the neighborhood based on the local density of the data.
The k-NN density estimate at a point x is p(x) = k / (n · Vd(rk(x))), where:
• k is the number of nearest neighbors to consider (a pre-defined positive integer).
• n is the total number of data points in the dataset.
• rk(x) is the distance from the point x to its k-th nearest neighbor in the dataset (using a
chosen distance metric like Euclidean distance).
• Vd(r) is the volume of a d-dimensional ball (or hypercube, depending on the distance
metric) with radius r.
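A 1-D sketch of this estimator with NumPy, where the "volume" of the neighbourhood reaching the k-th nearest neighbour is simply 2·rk(x); the sample data and the value of k are invented:

```python
# k-NN density estimator in 1-D: p(x) = k / (n * V), with V = 2 * r_k(x).
import numpy as np

def knn_density(x, sample, k=10):
    dists = np.sort(np.abs(sample[:, None] - x), axis=0)   # distances to every sample point
    r_k = dists[k - 1]                                      # distance to the k-th nearest neighbour
    return k / (len(sample) * 2 * r_k)                      # k / (n * V_1(r_k))

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 500)                              # toy data ~ N(0, 1)
grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(np.round(knn_density(grid, sample), 3))               # highest near x = 0
```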
Parametric vs. non-parametric methods:
• Parametric methods use a fixed number of parameters to build the model; non-parametric methods use a flexible number of parameters.
• Parametric methods are applicable only to variables; non-parametric methods are applicable to both variables and attributes.
• Parametric methods require less data than non-parametric methods; non-parametric methods require much more data.
• The results produced by parametric methods can be easily affected by outliers, whereas the results of non-parametric methods are not seriously affected by outliers.
• Parametric methods have more statistical power than non-parametric methods.
• In terms of computation, parametric methods are faster than non-parametric methods.
Nonparametric Classification:
Nonparametric classification algorithms are a type of machine learning technique that doesn't
make strong assumptions about the underlying distribution of the data.
Characteristics:
• Flexibility: They can model complex and irregular decision boundaries because they
aren't constrained by a specific functional form.
• Fewer Assumptions: They make minimal to no assumptions about the shape of the
data or the relationship between features and classes.
• Data-Driven: The model's complexity and structure are determined by the training
data. More data can lead to more complex models.
• Potential for Overfitting: Due to their flexibility, they can be more prone to
overfitting the training data if not carefully tuned or with insufficient data.
• Computational Cost: Some nonparametric methods can be computationally
expensive, especially with large datasets, as they might need to store or compare new
instances with a significant portion of the training data.
• Interpretability: They are often less interpretable than parametric methods because
the decision-making process isn't always easily summarized by a small set of
parameters.
K-Nearest Neighbor (KNN) is a supervised learning algorithm used for
both classification and regression. It is non-parametric, meaning it doesn’t make any
assumptions about the underlying data distribution, which makes it versatile for various
applications. KNN works by analyzing the proximity or “closeness” of data points based on
specific distance metrics.
• In classification, KNN assigns a class label to a new data point based on the majority
class of its nearest neighbors. For instance, if a data point has five nearest neighbors,
and three of them belong to class A while two belong to class B, the algorithm will
classify the point as class A.
• In regression, KNN predicts continuous values by averaging the values of the k-nearest
neighbors. For example, if you’re predicting house prices, KNN will use the average
prices of the k-nearest neighbors to estimate the price of a new house.
How Does KNN Work?
The KNN algorithm follows a straightforward, step-by-step approach:
Step 1: Determine the Number of Nearest Neighbors (k)
The first step is to select the number of neighbors (k) to consider. The value of k determines
how many neighboring points will influence the classification or prediction of a new data point.
Step 2: Calculate the Distance between the Query(target) Point and Dataset Points
For each data point in the dataset, the algorithm calculates the distance between the query point
(the new point to be classified or predicted) and every other point. Various distance metrics can
be used, such as Euclidean distance, Manhattan distance, or Minkowski distance.
Step 3: Sort and select the k-Nearest Neighbors
After calculating the distances, the algorithm sorts all data points in ascending order of
distance. It then selects the k-nearest neighbors—the data points that are closest to the query
point.
Step 4: Make a Prediction
• For classification: The algorithm assigns the query point to the class label that is most
frequent among the k-nearest neighbors (majority voting).
• For regression: The algorithm predicts the value by averaging the values of the k-
nearest neighbors.
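A from-scratch sketch of these four steps with NumPy; the training points and query points are made up:

```python
# k-NN classification: choose k, compute distances, take the k closest, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    dists = np.linalg.norm(X_train - query, axis=1)     # Step 2: Euclidean distances
    nearest = np.argsort(dists)[:k]                      # Step 3: k nearest neighbours
    votes = Counter(y_train[nearest])                    # Step 4: majority vote
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]], dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))   # "A"
print(knn_predict(X_train, y_train, np.array([6.5, 6.0]), k=3))   # "B"
```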
Here are some popular non-parametric classification algorithms:
• K-Nearest Neighbors (KNN): Classifies a new data point based on the majority class
among its k closest neighbors in the training data. The decision boundary is implicitly
defined by the distribution of the training points.
• Decision Trees: Create a tree-like structure of decision rules based on features to
classify instances. The tree's structure adapts to the data.
• Random Forests: An ensemble method that builds multiple decision trees on
different subsets of the data and averages their predictions.
• Support Vector Machines (SVM) with Non-linear Kernels: While linear SVM can
be seen as somewhat parametric, using kernels like the Radial Basis Function (RBF)
allows SVM to create highly non-linear decision boundaries that are data-driven.
• Neural Networks (Deep Learning Models): With enough layers and neurons, these
models can learn extremely complex and non-linear relationships, making them
effectively nonparametric in their ability to model intricate decision boundaries.
• Kernel Density Estimation (KDE) for Classification: Estimates the probability
density function for each class using KDE and then uses Bayes' theorem to classify
new points based on these estimated densities.
Condensed nearest neighbor:
Condensed Nearest Neighbor (CNN) is a data reduction technique used in machine learning,
particularly as a preprocessing step for the k-Nearest Neighbors (k-NN) algorithm. The primary
goal of CNN is to reduce the size of the training dataset while preserving or even improving
the performance of the k-NN classifier.
Here's a breakdown of how it works:
The Problem with Standard k-NN:
The k-NN algorithm is a "lazy learner" because it stores the entire training dataset and performs
computation only at the time of prediction. For large datasets, this can lead to:
• High memory usage: Storing all training samples can be memory-intensive.
• Slow prediction times: Classifying a new instance requires calculating distances to all
training samples.
The Goal of CNN:
CNN aims to identify a smaller subset of the training data, called the "store" or "prototype set,"
which can still correctly classify the original training data using a 1-Nearest Neighbor rule.
This reduced set aims to capture the essential information needed for classification, especially
the samples near the decision boundaries between classes.
The Basic CNN Algorithm:
1. Initialize:
o Create an empty "store" (S).
o Randomly select one sample from each class in the original training set (T) and
add them to S. This ensures that all classes are initially represented.
o Move the selected samples from T to a temporary set (say, T').
2. Iterate:
o Scan through all the samples in T'.
o For each sample (x) in T', find its nearest neighbor in the current store (S).
o If the class label of the nearest neighbor in S is different from the class label of
x, then x is misclassified by the current store. In this case, move x from T' to S.
o Repeat this scan through T' until no more samples are moved to S in a complete
pass.
3. The Result: The final set S is the condensed training set.
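A rough NumPy sketch of this basic CNN procedure on two invented Gaussian clusters; it seeds the store with one sample per class and keeps moving misclassified samples into the store until a full pass adds nothing:

```python
# Condensed Nearest Neighbor: build a small store S that 1-NN-classifies the training data.
import numpy as np

def condense(X, y, rng=np.random.default_rng(0)):
    store_idx = [rng.choice(np.where(y == c)[0]) for c in np.unique(y)]  # one seed per class
    remaining = [i for i in range(len(y)) if i not in store_idx]
    changed = True
    while changed:                                   # repeat full passes over the remaining data
        changed = False
        for i in list(remaining):
            d = np.linalg.norm(X[store_idx] - X[i], axis=1)
            if y[store_idx][np.argmin(d)] != y[i]:   # misclassified by the current store
                store_idx.append(i)
                remaining.remove(i)
                changed = True
    return np.array(store_idx)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])  # toy clusters
y = np.array([0] * 100 + [1] * 100)
S = condense(X, y)
print("kept", len(S), "of", len(y), "samples")   # far fewer points, mostly near the boundary
```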
Working of CNN
Let us understand the working of CNN.
Suppose that we have a dataset D.
Step 3: We check whether the store S is ‘Training Set Consistent’ or not. If it is, we stop; else
we add a point to the store so as to improve it and make it training set consistent.
Training Set Consistency: A set is said to be training set consistent if on running KNN on the
dataset with the classifiers as the points in store, we get the same classification as when KNN
was run on the entire dataset.
Dimensionality Reduction
Dimensionality reduction techniques can be broadly classified into two categories based on the
approach used:
1. Feature Selection
This approach selects a subset of the original features or variables, discarding the
rest. Feature selection methods can be further divided into three categories -
• Filter methods:
These methods evaluate the relevance of each feature independently of the target
variable and select the most relevant ones based on a specific criterion, such as
correlation or mutual information. A few of the most common filter methods
include Correlation, Chi-Square Test, ANOVA, Information Gain, etc.
• Wrapper methods:
These methods evaluate the performance of a model trained on a subset of features and
select the best subset based on model performance. Some of the wrapper methods
include forward selection, backward selection, bi-directional elimination, etc.
• Embedded methods:
These methods combine feature selection with model training, selecting the most
relevant features during the training process. Some commonly used embedded methods
include LASSO, Ridge Regression, etc.
2. Feature Extraction
This approach transforms the original features into a new set of features, typically of
lower dimensionality, while preserving the most important information. Feature extraction
methods can be further divided into two categories:
• Linear methods:
These methods transform the data using linear transformations, such as Principal
Component Analysis (PCA) or Linear Discriminant Analysis (LDA).
• Non-linear methods:
These methods use non-linear transformations to map the data to a lower-dimensional
space, such as t-Distributed Stochastic Neighbor Embedding (t-
SNE) or Autoencoders.
2. Wrapper Methods:
These methods evaluate subsets of features by training and evaluating a specific
machine learning model on each subset. The feature subset that yields the best model
performance (based on a chosen evaluation metric) is selected.
• Advantages: Can find feature subsets that are optimally suited for a particular model
and can capture feature interactions.
• Disadvantages: Computationally expensive, especially for a large number of features,
as it involves training the model multiple times. Can also be prone to overfitting if the
feature selection process is not carefully validated.
• Techniques include:
o Forward Selection: Starts with an empty set of features and iteratively adds the
feature that best improves model performance.
o Backward Elimination: Starts with all features and iteratively removes the
least significant feature until performance degrades.
o Recursive Feature Elimination (RFE): Repeatedly builds a model and
removes the worst-performing feature until the desired number of features is
reached.
o Exhaustive Search: Evaluates all possible subsets of features (computationally
very expensive for a large number of features).
3. Embedded Methods:
These methods perform feature selection as part of the model training
process. The model itself learns which features are most important.
• Advantages: Less computationally expensive than wrapper methods and can consider
feature interactions.
• Disadvantages: Feature selection is specific to the model being used.
• Techniques include:
o L1 Regularization (Lasso): Adds a penalty to the absolute size of the
coefficients in linear models, forcing the coefficients of less important features
to become zero.
o Tree-based Feature Importance (e.g., Random Forests, Gradient
Boosting): Tree-based models naturally provide a ranking of feature importance
based on how much each feature contributes to reducing impurity.
o Feature Importance from Linear Models: The magnitude of the coefficients
in linear models can indicate feature importance (after appropriate scaling).
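As a short illustration, assuming scikit-learn and its bundled breast-cancer dataset, the sketch below applies one wrapper method (RFE) and one embedded method (L1-regularized logistic regression):

```python
# Feature selection: a wrapper method (RFE) and an embedded method (L1 / Lasso-style penalty).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Wrapper: recursive feature elimination down to 5 features.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
print("RFE keeps feature indices:", np.where(rfe.support_)[0])

# Embedded: L1 regularization typically drives many coefficients to exactly zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero L1 coefficients:", np.count_nonzero(lasso.coef_))
```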
For an eigenvalue λ and eigenvector X of a matrix A, AX = λX, i.e. (A − λI)X = 0, where I is
the identity matrix of the same shape as matrix A. This has a non-zero solution X only if
(A − λI) is non-invertible (i.e. a singular matrix). That means,
∣A − λI∣ = 0
This determinant equation is called the characteristic equation.
1. Solving it gives the eigenvalues λ,
2. and the corresponding eigenvectors can then be found using the equation AX = λX.
How This Connects to PCA?
1. In PCA, the covariance matrix C (from Step 2) acts as matrix A.
2. Eigenvectors of C are the principal components (PCs).
3. Eigenvalues represent the variance captured by each PC.
Step 4: Pick the Top Directions & Transform Data
• Keep only the top 2–3 directions (or enough to capture ~95% of the variance).
• Project the data onto these directions to get a simplified, lower-dimensional version.
Principal components are new variables that are constructed as linear combinations or mixtures
of the initial variables. These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information within the initial variables
is squeezed or compressed into the first components. So the idea is: 10-dimensional data gives
you 10 principal components, but PCA tries to put the maximum possible information in the first
component, then the maximum remaining information in the second, and so on. Plotting the
variance explained by each component (a scree plot) shows this drop-off.
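A compact sketch of these PCA steps with NumPy on made-up data (one feature is deliberately correlated with another, so a few components capture most of the variance):

```python
# PCA via the covariance matrix: centre -> covariance -> eigen-decompose -> keep top PCs -> project.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)      # create correlated features

Xc = X - X.mean(axis=0)                                  # Step 1: centre the data
C = np.cov(Xc, rowvar=False)                             # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)                     # Step 3: eigen-decomposition
order = np.argsort(eigvals)[::-1]                        # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
k = np.searchsorted(np.cumsum(explained), 0.95) + 1      # enough PCs for ~95% of the variance
Z = Xc @ eigvecs[:, :k]                                  # Step 4: project the data
print("variance ratio per PC:", np.round(explained, 3), "-> keeping", k, "PCs")
```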
Factor Analysis
• Factor Analysis is an unsupervised, probabilistic machine learning algorithm used
for dimensionality reduction.
• Factor Analysis is a statistical method used to reduce a large number of observed
variables into a smaller set of underlying unobserved (latent) variables called
"factors." It helps to uncover the underlying structure of data, identify key dimensions
or constructs, and simplify complex datasets.
• There are two main types: Exploratory Factor Analysis (EFA) and Confirmatory
Factor Analysis (CFA). EFA is used when you don't have a preconceived idea of the
factor structure, while CFA tests a pre-specified hypothesis about the factor structure.
• Exploratory Factor Analysis (EFA):
o EFA is a data-driven technique used when the researcher has no clear
hypothesis about the number of factors or how observed variables will
specifically group onto those factors.
o Its primary goal is to discover and identify the underlying factor structure.
Think of it as "exploring" your data to see what natural clusters or dimensions
emerge.
• Confirmatory Factor Analysis (CFA):
o CFA is a theory-driven technique used when the researcher has a clear, pre-
specified hypothesis about the number of factors, which specific observed
variables load onto which factors, and whether these factors are correlated. Its
primary goal is to test or confirm this hypothesized factor structure.
Multidimensional Scaling
• Multidimensional scaling (MDS) is a dimensionality reduction technique that is
used to project high-dimensional data onto a lower-dimensional space while
preserving the pairwise distances between the data points as much as possible.
• MDS is based on the concept of distance and aims to find a projection of the data
that minimizes the differences between the distances in the original space and
the distances in the lower-dimensional space.
• MDS is commonly used to visualize complex, high-dimensional data, and to
identify patterns and relationships that may not be apparent in the original space.
• It can be applied to a wide range of data types, including numerical, categorical,
and mixed data.
• MDS is implemented using numerical optimization algorithms, such as gradient
descent or simulated annealing, to minimize the difference between the distances in
the original and lower-dimensional spaces.
There are three main types of Multidimensional Scaling: Classical, Metric, and Non-
metric.
1. Classical Multidimensional Scaling (CMDS)
Classical MDS, also known as Principal Coordinates Analysis (PCoA), takes an
input matrix representing dissimilarities between pairs of items and produces a
coordinate matrix that minimizes "strain." Strain quantifies how well the distances in
the low-dimensional representation match the original dissimilarities.
Mathematically, strain is defined as Strain(x1, ..., xn) = sqrt( Σi,j (bij − xi^T xj)² / Σi,j bij² ), where bij are the entries of the double-centred matrix B and xi are the low-dimensional coordinate vectors.
The steps of a Classical MDS algorithm involve:
1. Setting up the squared proximity matrix D(2).
2. Applying double centering to compute matrix B.
3. Determining the m largest eigenvalues and corresponding eigenvectors of B.
4. Obtaining the coordinates matrix X from these eigenvalues and eigenvectors.
Classical MDS is chosen when the distance data are Euclidean and accurate
preservation of these distances is crucial.
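A short NumPy sketch of these four Classical MDS steps, starting from pairwise Euclidean distances between invented high-dimensional points:

```python
# Classical MDS (PCoA): squared distances, double centering B = -1/2 * J D^2 J,
# top eigenpairs of B, then the low-dimensional coordinate matrix X.
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 5))                                 # hidden high-dimensional points
D = np.linalg.norm(points[:, None] - points[None, :], axis=-1)    # pairwise distance matrix

n, m = D.shape[0], 2
D2 = D ** 2                                               # Step 1: squared proximity matrix
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J                                     # Step 2: double centering
eigvals, eigvecs = np.linalg.eigh(B)                      # Step 3: eigen-decomposition
idx = np.argsort(eigvals)[::-1][:m]                       # m largest eigenvalues
X = eigvecs[:, idx] * np.sqrt(eigvals[idx])               # Step 4: coordinate matrix

D_low = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print("distance correlation:", np.corrcoef(D.ravel(), D_low.ravel())[0, 1])
```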
2. Metric Multidimensional Scaling (Metric MDS)
Metric MDS is suitable when distances are non-Euclidean or when the scale of
measurement levels varies.
3. Non-metric Multidimensional Scaling (NMDS)
Non-metric Multidimensional Scaling finds a non-parametric monotonic
relationship between dissimilarities and Euclidean distances between items, along with
the location of each item in the low-dimensional space. It defines a "stress" function
to optimize, considering a monotonically increasing function f.
Non-metric MDS is beneficial for qualitative data or when only the order of distances
(not the actual distances) matters.
REINFORCEMENT LEARNING
Reinforcement learning (RL) is a type of machine learning where an agent learns to make
decisions by interacting with an environment and receiving rewards or penalties based on its
actions. The goal is to maximize cumulative rewards over time. It's a trial-and-error approach
where the agent learns through feedback, adjusting its behavior to achieve optimal outcomes.
Key Concepts:
• Agent: The entity that interacts with the environment and makes decisions.
• Environment: The world in which the agent operates.
• Action: A choice made by the agent in the environment.
• Reward: Feedback given to the agent for its actions, positive or negative.
• Policy: The agent's strategy for selecting actions based on the current state.
• Value function: A measure of how good a particular state or action is, based on the
expected future rewards.
• State: The current condition of the environment that the agent perceives.
How it Works:
1. The agent interacts with the environment and takes an action.
2. The environment responds to the action and transitions to a new state.
3. The agent receives a reward or penalty for its action.
4. The agent updates its policy (strategy) based on the feedback it receives.
5. This cycle repeats, with the agent gradually learning to make better decisions to
maximize cumulative rewards.
Examples of Applications:
• Robotics: Training robots to navigate environments, grasp objects, or perform tasks.
• Game Playing: Developing AI agents that can play games like chess, Go, or video
games.
• Autonomous Driving: Building self-driving cars that can navigate roads and make
decisions.
• Resource Management: Optimizing resource allocation in various systems, such as
power grids or data centers.
• Recommendation Systems: Personalizing recommendations to users based on their
preferences.
Types of Reinforcement Learning:
• Policy-based RL: Directly learns the policy (strategy) that maps states to actions.
• Value-based RL: Learns a value function that estimates the expected reward of being
in a particular state or taking a specific action.
• Model-based RL: Learns a model of the environment to predict how the environment
will respond to actions.
Single State Case: K-Armed Bandit (Elements of Reinforcement Learning)
• The Multi-Armed Bandit (MAB) problem is a classic problem in probability theory
and decision-making that captures the essence of balancing exploration and
exploitation.
• This problem is named after the scenario of a gambler facing multiple slot machines
(bandits) and needing to determine which machine to play to maximize their
rewards.
• The MAB problem has significant applications in various fields, including online
advertising, clinical trials, adaptive routing in networks, and more.
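A minimal epsilon-greedy sketch of the K-armed bandit with NumPy; the arm reward means and epsilon are invented for illustration. With probability epsilon the agent explores a random arm, otherwise it exploits the arm with the highest estimated value:

```python
# Epsilon-greedy K-armed bandit: balance exploration and exploitation of slot-machine arms.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])        # hidden expected reward of each arm (toy values)
K, epsilon, steps = len(true_means), 0.1, 5000

Q = np.zeros(K)                               # estimated value of each arm
N = np.zeros(K)                               # how often each arm was pulled
total = 0.0

for _ in range(steps):
    a = rng.integers(K) if rng.random() < epsilon else int(np.argmax(Q))
    reward = rng.normal(true_means[a], 1.0)   # noisy reward from the chosen arm
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]            # incremental average update
    total += reward

print("estimated values:", np.round(Q, 2))    # should approach the true means
print("best arm found  :", int(np.argmax(Q)), "average reward:", round(total / steps, 3))
```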
Model-Based Learning
• Model Learning:
o The agent interacts with the environment, collecting data in the form of (state,
action, next state, reward) tuples.
o This experience is then used to learn a model of the environment's dynamics,
s(t+1) = f(s(t), a(t)), and sometimes also the reward function r(s(t), a(t)). This
learning process can often be framed as a supervised learning problem.
• Planning:
o Once the model is learned, the agent uses it to simulate future interactions
without needing to interact with the real environment. This simulation allows
the agent to:
• Predict outcomes: Given a sequence of actions, the model can predict the
resulting states and rewards.
• Evaluate policies: By simulating different action sequences, the agent can
evaluate the potential effectiveness of various policies (strategies for acting).
• Optimize actions: The agent can use planning algorithms (e.g., Model Predictive
Control (MPC), tree search algorithms) to find the optimal sequence of actions
that maximizes expected future rewards, all within the simulated
environment.
• Action Execution and Model Update: Based on its planning, the agent takes an
action in the real environment. The new experience gained is then used to refine and
update the learned model, leading to continuous improvement.
Model-based learning algorithms:
• Monte Carlo Tree Search (MCTS)
• Model Predictive Control (MPC)
• Dyna-Q
• Model-Based Policy Optimization (MBPO)
UNIT-I
Machine learning is a growing technology which enables computers to learn automatically from past
data. Machine learning uses various algorithms for building mathematical models and making
predictions using historical data or information. Currently, it is being used for various tasks such
as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender
system, and many more.
Machine Learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past experiences on
its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it
in a summarized way as:
Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.
With the help of sample historical data, which is known as training data, machine learning algorithms
build a mathematical model that helps in making predictions or decisions without being explicitly
programmed. Machine learning brings computer science and statistics together for creating predictive
models. Machine learning constructs or uses algorithms that learn from historical data. The more
information we provide, the better the performance.
A machine has the ability to learn if it can improve its performance by gaining more data.
Machine learning is a subfield of artificial intelligence that involves training computers to learn from
data without being explicitly programmed. In other words, machine learning algorithms use statistical
techniques to find patterns in data and use these patterns to make predictions or take actions.
A Machine Learning system learns from historical data, builds the prediction models, and whenever
it receives new data, predicts the output for it. The accuracy of predicted output depends upon the
amount of data, as the huge amount of data helps to build a better model which predicts the output more
accurately.
Suppose we have a complex problem in which we need to make some predictions. Instead of writing
code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms
the machine builds the logic as per the data and predicts the output. Machine learning has changed
our way of thinking about such problems.
The need for machine learning is increasing day by day. The reason is that machine learning is
capable of doing tasks that are too complex for a person to implement directly. As humans, we have
limitations: we cannot manually access and process huge amounts of data, so we need computer
systems, and here machine learning makes things easy for us.
We can train machine learning algorithms by providing them with huge amounts of data and letting
them explore the data, construct models, and predict the required output automatically. The
performance of a machine learning algorithm depends on the amount of data, and it can be measured
by the cost function. With the help of machine learning, we can save both time and money.
The importance of machine learning can be easily understood by its use cases. Currently, machine
learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestion
by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning
models that use a vast amount of data to analyse user interest and recommend products
accordingly.
A few decades ago (about 40-50 years ago), machine learning was science fiction, but today it is part
of our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to
Amazon's virtual assistant "Alexa". The idea behind machine learning, however, is quite old and has
a long history.
Machine learning research has now advanced greatly, and it is present everywhere around us, in
self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It includes
supervised, unsupervised, and reinforcement learning with clustering, classification, decision tree,
SVM algorithms, etc.
Modern machine learning models can be used for making various predictions, including weather
prediction, disease prediction, stock market analysis, etc.
Prerequisites
Before learning machine learning, you must have basic knowledge of the following so that you can
easily understand the concepts of machine learning:
Applications of Machine Learning
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We
are using machine learning in our daily life even without knowing it, for example in Google Maps,
Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of
machine learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify
objects, persons, places, digital images, etc. The popular use case of image recognition and face detection
is, Automatic friend tagging suggestion:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo
with our Facebook friends, we automatically get a tagging suggestion with their names, and the
technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "Deep Face," which is responsible for face recognition and
person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice," it comes under speech recognition, and it's
a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech
to text", or "Computer speech recognition." At present, machine learning algorithms are widely used
by various applications of speech recognition. Google assistant, Siri, Cortana, and Alexa are
using speech recognition technology to follow the voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take help of Google Maps, which shows us the correct path with the
shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested,
in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time.
Everyone who uses Google Maps is helping to make the app better. It takes information from the
user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as
Amazon, Netflix, etc., for product recommendations to users. Whenever we search for a product on
Amazon, we start getting advertisements for the same product while surfing the internet in the same
browser, and this is because of machine learning.
Google understands the user interest using various machine learning algorithms and suggests the product
as per customer interest.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc.,
and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning
plays a significant role in self-driving cars. Tesla, a well-known car manufacturer, is working on
self-driving cars. It uses unsupervised learning methods to train the car models to detect people and
objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We
always receive important mail in our inbox with the important symbol and spam emails in our spam
box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistants:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As
the name suggests, they help us find information using our voice instructions. These assistants can
help us in various ways just by our voice instructions, such as playing music, calling someone,
opening an email, scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server in the cloud, decode them using
ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there may be various ways a fraudulent
transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of
a transaction. To detect this, a Feed Forward Neural Network helps by checking whether it is a
genuine or a fraudulent transaction.
For each genuine transaction, the output is converted into hash values, and these values become the
input for the next round. For each genuine transaction there is a specific pattern, which changes for
a fraudulent transaction; hence the system detects it and makes our online transactions more secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so long short-term memory (LSTM) neural networks are used for the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as machine learning helps us by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is also used with image recognition, and it translates text from one language to another.
Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained using well-labelled training data, and on the basis of that data, machines predict the output. Labelled data means some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a separate subset of the data held out from training), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle, and
Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
Types of Supervised Machine Learning Algorithms:
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two classes such as Yes-No, Male-Female, True-False, etc. A common example is spam filtering. Popular classification algorithms include:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
In the previous topic, we learned about supervised machine learning, in which models are trained using labeled data under supervision. But there may be many cases in which we do not have labeled data and need to find the hidden patterns in the given dataset. To solve such types of cases in machine learning, we need unsupervised learning techniques.
Unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the model itself finds the hidden patterns and insights in the given data. It can be compared to the learning which takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of
different types of cats and dogs. The algorithm is never trained upon the given dataset, which means it
does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm
is to identify the image features on their own. Unsupervised learning algorithm will perform this task by
clustering the image dataset into the groups according to similarities between images.
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think by their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more important.
o In the real world, we do not always have input data with the corresponding output, so to solve such cases, we need unsupervised learning.
Once the suitable algorithm is applied, the algorithm divides the data objects into groups according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of an association rule is Market Basket Analysis.
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Supervised and Unsupervised learning are the two techniques of machine learning. But both the
techniques are used in different scenarios and with different datasets. Below, the explanation of both learning methods along with their differences is given.
The main differences between Supervised and Unsupervised learning are given below:
o A supervised learning model takes direct feedback to check whether it is predicting the correct output or not, whereas an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output, whereas an unsupervised learning model finds the hidden patterns in data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when it is given new data, while the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
o Supervised learning needs supervision to train the model, while unsupervised learning does not need any supervision to train the model.
o A supervised learning model produces an accurate result, while an unsupervised learning model may give a less accurate result compared to supervised learning.
o Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience.
Reinforcement learning:
Reinforcement learning is a feedback-based learning technique in which an agent learns to behave in an environment by performing actions and receiving rewards or penalties for those actions.
The above image shows a robot, a diamond, and fire. The goal of the robot is to get the reward, that is, the diamond, and avoid the hurdles, which are the fire. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the least hurdles. Each right step gives the robot a reward and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, that is, the diamond.
Main points in Reinforcement learning –
• Input: The input should be an initial state from which the model will start
• Output: There are many possible outputs as there are a variety of solutions to a particular
problem
• Training: The training is based upon the input. The model will return a state, and the user will decide whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
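The robot-and-diamond example above can be sketched as a tiny tabular Q-learning agent on a grid. The 4x4 grid, the positions of the diamond and the fire, and the reward and learning-rate values below are illustrative assumptions, not part of the original notes.
```python
# Tabular Q-learning sketch for the robot / diamond / fire grid world described above.
import random

ROWS, COLS = 4, 4
START, DIAMOND, FIRE = (0, 0), (3, 3), (1, 2)        # assumed positions for illustration
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]         # up, down, left, right

alpha, gamma, epsilon = 0.1, 0.9, 0.2                # learning rate, discount, exploration rate
Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in range(4)}

def step(state, action):
    """Apply an action; reward +10 for the diamond, -10 for the fire, -1 per move."""
    dr, dc = ACTIONS[action]
    nr = max(0, min(ROWS - 1, state[0] + dr))
    nc = max(0, min(COLS - 1, state[1] + dc))
    nxt = (nr, nc)
    if nxt == DIAMOND:
        return nxt, 10.0, True
    if nxt == FIRE:
        return nxt, -10.0, True
    return nxt, -1.0, False

for episode in range(2000):
    state = START                                    # input: the initial state
    for _ in range(100):                             # cap the episode length
        # Explore occasionally; otherwise exploit the best known action.
        if random.random() < epsilon:
            action = random.randrange(4)
        else:
            action = max(range(4), key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        best_next = max(Q[(nxt, a)] for a in range(4))
        # Q-update: nudge the estimate toward reward + discounted future value.
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt
        if done:
            break

# After training, read off the greedy (maximum-reward) path from the start state.
state, path = START, [START]
for _ in range(20):
    action = max(range(4), key=lambda a: Q[(state, a)])
    state, _, done = step(state, action)
    path.append(state)
    if done:
        break
print("Learned path:", path)
```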
1. Positive –
Positive reinforcement is defined as an event that, occurring due to a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement learning:
• Maximizes performance
• Sustains change for a long period of time
A drawback is that too much reinforcement can lead to an overload of states, which can diminish the results.
2. Negative –
Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement learning:
• Increases behavior
• Provides defiance to a minimum standard of performance
A drawback is that it only provides enough to meet the minimum behavior.
Practical applications of Reinforcement Learning include game playing, robotics and industrial automation, and resource management.
Model Selection and Generalization
Model selection refers to the process of selecting the best model from a set of candidate models based
on their performance on a given task. This process typically involves splitting the available data into
training and validation sets, using the training set to train each candidate model, and then evaluating
their performance on the validation set. The model with the best performance on the validation set is
selected as the final model.
Generalization refers to the ability of a model to perform well on new, unseen data. When a model is trained on a dataset, it may overfit the training data by memorizing specific patterns in the data that are not representative of the underlying distribution. This can lead to poor performance on new data.
To ensure good generalization, it is important to evaluate a model's performance on a separate test set
that was not used during model selection or training.
To improve generalization, techniques such as regularization, early stopping, and data augmentation
can be used. Regularization involves adding a penalty term to the loss function to discourage complex
models that are prone to overfitting. Early stopping involves monitoring the validation error during
training and stopping the training process when the error begins to increase. Data augmentation
involves generating new training examples by applying transformations to existing examples, which
can increase the size and diversity of the training set and help prevent overfitting.
Overall, model selection and generalization are crucial aspects of machine learning that help ensure
that models are accurate and reliable, and can be applied successfully to new data.
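As a minimal sketch of model selection with a held-out validation set and regularization as described above (the synthetic data and the choice of Ridge regression are assumptions for illustration), the code tries several penalty strengths and keeps the one that generalizes best:
```python
# Model selection sketch: pick the Ridge regularization strength that does best on a validation set.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=300)   # synthetic data (assumption)

# Split into train / validation / test so the test set is never used for selection.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_alpha, best_err = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:        # candidate penalty strengths
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:
        best_alpha, best_err = alpha, err

# Retrain with the selected alpha and report generalization on the untouched test set.
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("chosen alpha:", best_alpha)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```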
Fig: Model Selection
The machine learning process can be broken down into the following steps:
1. Collecting Data:
As you know, machines initially learn from the data that you give them. It is of the utmost importance
to collect reliable data so that your machine learning model can find the correct patterns. The quality of
the data that you feed to the machine will determine how accurate your model is. If you have incorrect
or outdated data, you will have wrong outcomes or predictions which are not relevant.
Make sure you use data from a reliable source, as it will directly affect the outcome of your model. Good
data is relevant, contains very few missing and repeated values, and has a good representation of the
various subcategories/classes present.
2. Preparing the Data:
After you have your data, you have to prepare it. You can do this by:
• Putting together all the data you have and randomizing it. This helps make sure that data is evenly
distributed, and the ordering does not affect the learning process.
• Cleaning the data to remove unwanted data, missing values, rows and columns, duplicate values, etc., and performing data type conversion. You might even have to restructure the dataset and change the rows and columns or the index of rows and columns.
• Visualize the data to understand how it is structured and understand the relationship between
various variables and classes present.
• Splitting the cleaned data into two sets - a training set and a testing set. The training set is the set
your model learns from. A testing set is used to check the accuracy of your model after training.
Figure 3: Cleaning and Visualizing Data
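A small sketch of the preparation step under assumed names (a hypothetical data.csv file with a 'label' column): shuffle the rows, drop duplicates and missing values, then split into training and testing sets.
```python
# Data preparation sketch: randomize, clean, and split an assumed CSV file.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                 # hypothetical dataset with a 'label' column

df = df.sample(frac=1, random_state=42)      # shuffle so ordering does not affect learning
df = df.drop_duplicates().dropna()           # remove duplicate rows and missing values

X = df.drop(columns=["label"])               # input features
y = df["label"]                              # output variable

# Hold out 20% of the cleaned data for testing the trained model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```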
3. Choosing a Model:
A machine learning model determines the output you get after running a machine learning algorithm
on the collected data. It is important to choose a model which is relevant to the task at hand. Over the
years, scientists and engineers developed various models suited for different tasks like speech
recognition, image recognition, prediction, etc. Apart from this, you also have to see if your model is
suited for numerical or categorical data and choose accordingly.
4. Training the Model:
Training is the most important step in machine learning. In training, you pass the prepared data to your
machine learning model to find patterns and make predictions. It results in the model learning from the
data so that it can accomplish the task set. Over time, with training, the model gets better at predicting.
5. Evaluating the Model:
After training your model, you have to check how it's performing. This is done by testing the performance of the model on previously unseen data. The unseen data used is the testing set that you split your data into earlier. If testing were done on the same data used for training, you would not get an accurate measure, as the model is already used to the data and finds the same patterns in it as it previously did. This would give you disproportionately high accuracy.
When used on testing data, you get an accurate measure of how your model will perform and its speed.
Figure 6: Evaluating a model
6. Parameter Tuning:
Once you have created and evaluated your model, see if its accuracy can be improved in any way. This
is done by tuning the parameters present in your model. These parameters (often called hyperparameters) are the variables in the model whose values the programmer generally decides. At a particular value of a parameter, the accuracy will be at its maximum. Parameter tuning refers to finding these values.
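A hedged sketch of parameter tuning with a grid search. The RandomForest model, the built-in iris dataset, and the parameter grid here are assumptions for illustration; each candidate combination is evaluated by cross-validation and the best-scoring one is kept.
```python
# Parameter tuning sketch: search over candidate hyperparameter values with cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)            # small built-in dataset for illustration

param_grid = {
    "n_estimators": [50, 100, 200],          # candidate values chosen for demonstration
    "max_depth": [2, 4, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", round(search.best_score_, 3))
```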
7. Making Predictions
In the end, you can use your model on unseen data to make predictions accurately.
How to Implement Machine Learning Steps in Python?
You will now see how to implement a machine learning model using Python.
In this example, data collected is from an insurance company, which tells you the variables that come
into play when an insurance amount is set. Using this, you will have to predict the insurance amount
for a person. This data was collected from Kaggle.com, which has many reliable datasets.
You need to start by importing any necessary modules, as shown.
Now, clean your data by removing duplicate values, and transforming columns into numerical values
to make them easier to work with.
As you need to predict a numerical value based on some parameters, you will have to use Linear Regression. The model needs to learn on your training set. This is done by using the '.fit' command.
Now, predict your testing dataset and find how accurate your predictions are.
Figure 15: Predicting using your model
1.0 is the highest level of accuracy you can get. Now, get your parameters.
The above picture shows the hyperparameters which affect the various variables in your dataset.
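Since the original figures are not reproduced here, the following is a minimal sketch of the workflow just described. It assumes the well-known Kaggle insurance dataset saved as insurance.csv, with columns such as age, sex, bmi, children, smoker, region, and charges; the file name and column handling are assumptions.
```python
# End-to-end sketch: load the insurance data, clean it, fit Linear Regression, and inspect results.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("insurance.csv")            # assumed file and column names (Kaggle dataset)
df = df.drop_duplicates()

# Transform categorical columns into numerical values so the model can use them.
df = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)

X = df.drop(columns=["charges"])             # input variables
y = df["charges"]                            # insurance amount to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = LinearRegression()
model.fit(X_train, y_train)                  # the '.fit' step mentioned above

# R^2 score: 1.0 is the highest level of accuracy you can get.
print("R^2 on test data:", round(model.score(X_test, y_test), 3))

# Get your parameters: one coefficient per input variable, plus the intercept.
print(dict(zip(X.columns, model.coef_.round(2))))
print("intercept:", round(model.intercept_, 2))
```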
AI & ML Differences
AI is a bigger concept to create intelligent machines that can simulate human thinking capability and
behavior, whereas, machine learning is an application or subset of AI that allows machines to learn
from data without being programmed explicitly.
Below are some main differences between AI and machine learning along with the overview of Artificial
intelligence and machine learning
Artificial Intelligence
Artificial intelligence is a field of computer science which makes a computer system that can mimic human intelligence. It is made up of two words, "Artificial" and "Intelligence", which means "a human-made thinking power." Hence we can define it as,
Artificial intelligence is a technology using which we can create intelligent systems that can simulate
human intelligence.
The Artificial intelligence system does not need to be pre-programmed; instead, it uses algorithms which can work with their own intelligence. It involves machine learning algorithms such as reinforcement learning algorithms and deep learning neural networks. AI is being used in multiple places such as Siri, Google's AlphaGo, AI in chess playing, etc.
Currently, we are working with weak AI and general AI. The future of AI is strong AI, which it is said will be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from past
data or experiences without being explicitly programmed.
The key differences between Artificial Intelligence and Machine Learning are given below:
o Artificial intelligence is a technology which enables a machine to simulate human behavior, whereas machine learning is a subset of AI which allows a machine to automatically learn from past data without being explicitly programmed.
o The goal of AI is to make a smart computer system like humans to solve complex problems, while the goal of ML is to allow machines to learn from data so that they can give accurate output.
o In AI, we make intelligent systems to perform any task like a human; in ML, we teach machines with data to perform a particular task and give an accurate result.
o Machine learning and deep learning are the two main subsets of AI, while deep learning is a main subset of machine learning.
o AI has a very wide scope, whereas machine learning has a limited scope.
o AI is working to create an intelligent system which can perform various complex tasks, whereas machine learning is working to create machines that can perform only those specific tasks for which they are trained.
o An AI system is concerned with maximizing the chances of success, while machine learning is mainly concerned with accuracy and patterns.
o The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, humanoid robots, etc., while the main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
o On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI; machine learning can also be divided into three main types: supervised learning, unsupervised learning, and reinforcement learning.
o AI includes learning, reasoning, and self-correction, while machine learning includes learning and self-correction when introduced to new data.
o AI deals with structured, semi-structured, and unstructured data, whereas machine learning deals with structured and semi-structured data.
UNIT-III
Unsupervised learning is a type of machine learning where the model is trained on unlabeled
data. Unlike supervised learning, where the algorithm learns from labeled input-output pairs,
unsupervised learning finds patterns, structures, or relationships in the data without explicit
guidance. Common applications of unsupervised learning include:
• Customer segmentation
• Anomaly detection in cybersecurity
• Feature extraction for supervised learning
• Image compression
CLUSTERING
Grouping data points based on their similarity with each other is called Clustering or Cluster Analysis. This method falls under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, we don't have a target variable.
For example, in the graph given below, we can clearly see that there are 3 circular clusters forming on the basis of distance.
• It is not necessary that the clusters formed must be circular in shape. The shape of clusters can be arbitrary. There are many algorithms that work well at detecting arbitrary-shaped clusters.
• For example, in the below given graph we can see that the clusters formed are not circular in shape.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar
data points:
Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or not. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. So each data point will either belong to cluster 1 or cluster 2.
Soft Clustering: In this type of clustering, instead of assigning each data point to a separate cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. We will evaluate a probability of each data point belonging to both clusters. This probability is calculated for all data points.
Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. Now, you may reduce an entire feature set to its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID. Using the same principle, clustering data can make complex datasets simpler.
Types of Clustering Methods:
• Partitioning Clustering
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
• Fuzzy Clustering
Partitioning Clustering (Centroid based Clustering)
Partitional clustering is a method that divides a dataset into a predetermined number of non-
overlapping clusters, where each data point belongs to only one cluster, aiming to optimize a
specific objective function like minimizing intra-cluster distance.
Definition:
Partitional clustering algorithms aim to partition a dataset into a set of disjoint clusters,
meaning each data point belongs to only one cluster. It is a type of clustering that divides the
data into non-hierarchical groups. It is also known as the centroid-based method. The most
common example of partitioning clustering is the K-Means Clustering algorithm.
Process:
These algorithms require the analyst to specify the number of clusters (K) beforehand. The
algorithm then iteratively refines the cluster assignments to minimize the distance between
data points and their respective cluster centroids.
Objective Function:
The goal is to find the optimal partitioning of the data that minimizes the within-cluster
variance or maximizes the between-cluster variance.
Popular Algorithms:
K-means: A widely used algorithm that assigns data points to the nearest cluster centroid,
iteratively updating the centroids until convergence.
K-medoids: Similar to K-means, but instead of using centroids, it uses medoids (representative
data points) to define the clusters.
Mini-batch K-means: An efficient variant of K-means that uses mini-batches of data points to
speed up the clustering process.
Advantages:
Computational Efficiency: Partitional algorithms are generally computationally efficient and easy
to implement.
Suitable for Large Datasets: They can handle large datasets effectively.
Good for Clusters of Similar Shapes and Sizes: They perform well when clusters have similar
shapes and sizes.
Disadvantages:
Requires Predefined Number of Clusters: The analyst needs to specify the number of clusters (K)
in advance, which can be challenging for complex datasets.
Struggles with Clusters of Varying Shapes and Sizes: They may struggle with clusters that have
irregular shapes or sizes.
In this type, the dataset is divided into a set of k groups, where K is used to define the number
of pre-defined groups. The cluster center is created in such a way that the distance between
the data points of one cluster is minimum as compared to another cluster centroid.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected. This
algorithm does it by identifying different clusters in the dataset and connects the areas of high
densities into clusters. The dense areas in data space are divided from each other by sparser
areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
DBSCAN is a density-based clustering algorithm that groups data points that are
closely packed together and marks outliers as noise based on their density in the feature
space. It identifies clusters as dense regions in the data space, separated by areas of lower
density.
Unlike K-Means or hierarchical clustering, which assume clusters are compact and
spherical, DBSCAN excels in handling real-world data irregularities such as:
Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular or convex.
Noise and Outliers: It effectively identifies and handles noise points without assigning
them to any cluster.
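A small DBSCAN sketch on synthetic crescent-shaped data (the eps and min_samples values are illustrative assumptions); points labelled -1 are the noise/outlier points referred to above.
```python
# DBSCAN sketch: cluster arbitrary-shaped data and mark low-density points as noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)   # two crescent-shaped clusters

db = DBSCAN(eps=0.2, min_samples=5)          # neighborhood radius and density threshold (assumed)
labels = db.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", list(labels).count(-1))
```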
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability of how a dataset belongs to a particular distribution. The grouping is done by assuming some distributions, most commonly the Gaussian distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses
Gaussian Mixture Models (GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is
no requirement of pre-specifying the number of clusters to be created. In this technique, the
dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. The observations or any number of clusters can be selected by cutting the
tree at the correct level. The most common example of this method is the Agglomerative
Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one
group or cluster. Each data point has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
The clustering algorithms can be divided based on their models. There are different types of clustering algorithms published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data that we are using. For instance, some algorithms need to guess the number of clusters in the given dataset, whereas others need to find the minimum distance between the observations of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of
equal variances. The number of clusters must be specified in this algorithm. It is fast
with fewer computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model, that works
on updating the candidates for centroid to be the center of the points within a given
region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to the
mean-shift, but with some remarkable advantages. In this algorithm, the areas of high
density are separated by the areas of low density. Because of this, the clusters can be
found in any arbitrary shape.
5. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative for the k-means algorithm or for cases where K-means can fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a
single cluster at the outset and then successively merged. The cluster hierarchy can be
represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require specifying the number of clusters. In this algorithm, each data point sends messages between pairs of data points until convergence. It has O(N²T) time complexity, which is the main drawback of this algorithm.
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets
into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search
result appears based on the closest object to the search query. It does it by grouping
similar data objects in one group that is far from the other dissimilar objects. The
accurate result of a query depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers
based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
o In Land Use: The clustering technique is used in identifying areas of similar land use in the GIS database. This can be very useful for finding the purpose for which a particular piece of land should be used, that is, for which purpose it is more suitable.
o Market Segmentation – Businesses use clustering to group their customers and use
targeted advertisements to attract more audience.
o Market Basket Analysis – Shop owners analyze their sales and figure out which items are most often bought together by customers. For example, in the USA, according to a study, diapers and beer were often bought together by fathers.
o Social Network Analysis – Social media sites use your data to understand your
browsing behaviour and provide you with targeted friend recommendations or
content recommendations.
o Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic
images like X-rays.
o Anomaly Detection – To find outliers in a real-time data stream or to forecast fraudulent transactions, we can use clustering to identify them.
K-Means Clustering Algorithm
K-Means is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties.
It allows us to cluster the data into different groups and a convenient way to discover the
categories of groups in the unlabelled dataset on its own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
The algorithm takes the unlabelled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k should
be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, go to FINISH.
Step-7: The model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have studied to
calculate the distance between two points. So, we will draw a median between both
the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new
centroids will be as shown in the below image:
o As we got the new centroids so again will draw the median line and reassign the data
points. So, the image will be:
o We can see in the above image; there are no dissimilar data points on either side of
the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
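The two-variable example above can be reproduced roughly as follows; the synthetic M1/M2 points and K=2 are assumptions matching the walkthrough, with scikit-learn handling the centroid updates and reassignment internally.
```python
# K-means sketch for the two-variable (M1, M2) example with K = 2.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs standing in for the scatter plot of M1 and M2.
group1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(25, 2))
group2 = rng.normal(loc=[7.0, 6.0], scale=0.5, size=(25, 2))
X = np.vstack([group1, group2])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)               # assign each point to its closest centroid

print("final centroids:\n", kmeans.cluster_centers_.round(2))
print("points in each cluster:", np.bincount(labels))
```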
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends upon the highly efficient clusters that it forms, but choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters, and one of the most appropriate methods to find the number of clusters, or the value of K, is the Elbow Method, described below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = ∑_{Pi in Cluster1} distance(Pi, C1)² + ∑_{Pi in Cluster2} distance(Pi, C2)² + ∑_{Pi in Cluster3} distance(Pi, C3)²
Here, ∑_{Pi in Cluster1} distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within cluster 1, and the same applies to the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges
from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend in the plot, where the curve looks like an arm, is considered the best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the
elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the given data points. If we choose the
number of clusters equal to the data points, then the value of WCSS becomes zero, and that
will be the endpoint of the plot.
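A minimal sketch of the elbow method (synthetic blob data assumed): scikit-learn's inertia_ attribute is the WCSS value described above, computed here for K = 1 to 10.
```python
# Elbow method sketch: compute WCSS (inertia) for K = 1..10 and look for the bend.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # assumed 3 natural clusters

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                 # within-cluster sum of squares for this K

for k, value in enumerate(wcss, start=1):
    print(f"K={k:2d}  WCSS={value:10.1f}")
# Plotting K against WCSS (e.g., with matplotlib) shows the elbow at the best K.
```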
Hierarchical Clustering
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work, and in hierarchical clustering there is no requirement to predetermine the number of clusters as we did in the K-Means algorithm.
The hierarchical clustering technique has two approaches:
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm
starts with taking all data points as single clusters and merging them until one cluster
is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is
a top-down approach.
Need for hierarchical clustering
In K-means clustering, there are some challenges with the algorithm: it requires a predetermined number of clusters, and it always tries to create clusters of the same size. To solve these two challenges, we can opt for the hierarchical clustering algorithm because, in this algorithm, we don't need to have knowledge about the predefined number of clusters.
The working of the AHC algorithm can be explained using the below steps:
Step-1: Create each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there
will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form one cluster.
There will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters.
Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
Measure for the distance between two clusters
The closest distance between the two clusters is crucial for the hierarchical clustering. There
are various ways to calculate the distance between two clusters, and these ways decide the
rule for clustering. These measures are called Linkage methods. Some of the popular linkage
methods are given below:
Single Linkage:
It is the Shortest Distance between the closest points of the clusters. Consider the below
image:
Complete Linkage:
It is the farthest distance between the two points of two different clusters. It is one of the
popular linkage methods as it forms tighter clusters than single linkage.
Average Linkage:
It is the linkage method in which the distance between each pair of datasets is added up and
then divided by the total number of datasets to calculate the average distance between two
clusters. It is also one of the most popular linkage methods.
Centroid Linkage:
It is the linkage method in which the distance between the centroid of the clusters is
calculated. Consider the below image:
Ward Linkage
The Ward approach analyzes the variance of the clusters rather than measuring distances
directly, minimizing the variance between clusters. With the Ward method, the distance
between two clusters is related to how much the sum of squares (SS) value will increase
when combined.
In other words, the Ward method attempts to minimize the sum of the squared distances of
the points from the cluster centers. Compared to the distance-based measures described
above, the Ward method is less susceptible to noise and outliers. Therefore, Ward's method is
preferred more than others in clustering.
Dendrogram in Hierarchical clustering
The dendrogram is a tree-like structure that is mainly used to store each step as a memory
that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean
distances between the data points, and the x-axis shows all the data points of the given
dataset.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.
o As we have discussed above, firstly, the data points P2 and P3 combine together and form a cluster; correspondingly, a dendrogram is created which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one
dendrogram, and P4, P5, and P6, in another dendrogram.
o At last, the final dendrogram is created that combines all the data points together.
We can cut the dendrogram tree structure at any level as per our requirement.
Steps for implementation of AHC using Python:
The steps for implementation will be the same as the k-means clustering, except for some
changes such as the method to find the number of clusters. Below are the steps:
1. Data Pre-processing
2. Finding the optimal number of clusters using the Dendrogram
3. Training the hierarchical clustering model
4. Visualizing the clusters
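As a rough illustration of those steps (synthetic blob data and ward linkage are assumptions), SciPy draws the dendrogram and scikit-learn fits the agglomerative model once the number of clusters has been read off the tree.
```python
# Agglomerative hierarchical clustering sketch: dendrogram first, then the fitted model.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=7)     # assumed toy dataset

# Build the merge tree (ward linkage minimizes within-cluster variance).
Z = linkage(X, method="ward")
dendrogram(Z)
plt.title("Dendrogram")
plt.xlabel("data points")
plt.ylabel("Euclidean distance")
plt.show()

# Train the hierarchical clustering model with the number of clusters chosen from the tree.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print("cluster sizes:", [list(labels).count(c) for c in range(3)])
```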
DIVISIVE CLUSTERING
• Start with all the data points in one single cluster.
• Choose the best cluster among the clusters to split further; choose the one that has the largest Sum of Squared Error (SSE).
• Repeat the splitting until each data point is in its own cluster or the desired number of clusters is reached.
Limitations of Hierarchical Clustering
• It has high time and space computational complexity. Computing the proximity matrix has O(N²) time complexity, and since it takes N steps to search, the total time complexity is O(N³).
• There is no global objective function for hierarchical clustering.
• Due to the high time complexity, it cannot be used for large datasets.
• It is sensitive to noise and outliers since we use distance metrics.
Applications
There are many real-life applications of Hierarchical clustering. They include:
• Bioinformatics: grouping animals according to their biological features to reconstruct
phylogeny trees
• Business: dividing customers into segments or forming a hierarchy of employees
based on salary.
• Image processing: grouping handwritten characters in text recognition based on the
similarity of the character shapes.
• Information Retrieval: categorizing search results based on the query.
MIXTURE DENSITIES
• Mixture densities are a fundamental concept that plays a significant role in machine learning, particularly in probabilistic modelling, unsupervised learning, and density estimation tasks.
• They are used to represent complex probability distributions by combining simpler
distributions, such as Gaussians, into a weighted mixture.
• This approach is widely used for tasks like clustering, density estimation, and
generative modeling.
Below is a detailed explanation of mixture densities in machine learning:
• Mixture models are probabilistic models that assume the data is generated from a
mixture of several probability distributions. Each component of the mixture represents
a cluster or a subgroup within the data.
• The most common type is the Gaussian Mixture Model (GMM), where each
component is a Gaussian distribution.
Key Concepts:
5. Advantages
• Flexibility: Can model complex, multi-modal distributions.
• Probabilistic Framework: Provides soft assignments (probabilities) rather than hard
assignments.
6. Challenges
• Choosing the Number of Components: Determining the optimal number of components K is often non-trivial.
• Sensitivity to Initialization: The EM algorithm can converge to local optima, so
initialization is crucial.
• Computational Complexity: Fitting mixture models can be computationally expensive
for large datasets.
EM Algorithm
The EM algorithm was proposed and named in a seminal paper published in 1977 by
Arthur Dempster, Nan Laird, and Donald Rubin.
This approach is often referred to as handling missing data. By using the available instances
where the variable is observable, machine learning algorithms can learn patterns and
relationships from the observed data. These learned patterns can then be used to predict the
values of the variable in instances where it is missing or not observable.
By iteratively repeating these steps, the EM algorithm seeks to maximize the likelihood of
the observed data. It is commonly used for unsupervised learning tasks, such as clustering,
where latent variables are inferred and has applications in various fields, including machine
learning, computer vision, and natural language processing.
Some of the most commonly used key terms in the Expectation-Maximization (EM)
Algorithm are as follows:
Latent Variables: Latent variables are unobserved variables in statistical models that
can only be inferred indirectly through their effects on observable variables. They
cannot be directly measured but can be detected by their impact on the observable
variables.
Likelihood: It is the probability of observing the given data given the parameters of the
model. In the EM algorithm, the goal is to find the parameters that maximize the
likelihood.
Log-Likelihood: It is the logarithm of the likelihood function, which measures the
goodness of fit between the observed data and the model. EM algorithm seeks to
maximize the log-likelihood.
Maximum Likelihood Estimation (MLE): MLE is a method to estimate the parameters
of a statistical model by finding the parameter values that maximize the likelihood
function, which measures how well the model explains the observed data.
Posterior Probability: In the context of Bayesian inference, the EM algorithm can be
extended to estimate the maximum a posteriori (MAP) estimates, where the posterior
probability of the parameters is calculated based on the prior distribution and the
likelihood function.
Expectation (E) Step: The E-step of the EM algorithm computes the expected value or
posterior probability of the latent variables given the observed data and current
parameter estimates. It involves calculating the probabilities of each latent variable for
each data point.
Maximization (M) Step: The M-step of the EM algorithm updates the parameter
estimates by maximizing the expected log-likelihood obtained from the E-step. It
involves finding the parameter values that optimize the likelihood function, typically
through numerical optimization methods.
Convergence: Convergence refers to the condition when the EM algorithm has reached
a stable solution. It is typically determined by checking if the change in the log-
likelihood or the parameter estimates falls below a predefined threshold.
EM Algorithm Flowchart
1. Initialization:
Initially, a set of initial values of the parameters are considered. A set of incomplete
observed data is given to the system with the assumption that the observed data
comes from a specific model.
2. E-Step (Expectation Step): In this step, we use the observed data in order to
estimate or guess the values of the missing or incomplete data. It is basically used to
update the variables.
Compute the posterior probability or responsibility of each latent variable given the
observed data and current parameter estimates.
• Estimate the missing or incomplete data values using the current parameter
estimates.
• Compute the log-likelihood of the observed data based on the current
parameter estimates and estimated missing data.
3. M-step (Maximization Step): In this step, we use the complete data generated in
the preceding “Expectation” – step in order to update the values of the parameters. It is
basically used to update the hypothesis.
• Update the parameters of the model by maximizing the expected complete
data log-likelihood obtained from the E-step.
• This typically involves solving optimization problems to find the parameter
values that maximize the log-likelihood.
• The specific optimization technique used depends on the nature of the
problem and the model being used.
4. Convergence: In this step, it is checked whether the values are converging or not, if
yes, then stop otherwise repeat step-2 and step-3 i.e. “Expectation” – step and
“Maximization” – step until the convergence occurs.
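A brief sketch of the E-step/M-step loop in practice, using scikit-learn's GaussianMixture, which runs EM internally; the two-component synthetic data is an assumption for illustration.
```python
# EM sketch: fit a 2-component Gaussian Mixture Model; sklearn alternates E and M steps internally.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Data drawn from two assumed Gaussian components.
X = np.concatenate([rng.normal(0.0, 1.0, 400), rng.normal(6.0, 1.5, 600)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=200, random_state=1)
gmm.fit(X)                                   # iterates E-step and M-step until convergence

print("converged:", gmm.converged_, "in", gmm.n_iter_, "iterations")
print("means:", gmm.means_.ravel().round(2))
print("weights:", gmm.weights_.round(2))

# E-step output for new points: posterior probability (responsibility) of each component.
print(gmm.predict_proba([[1.0], [5.5]]).round(3))
```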
The Gaussian Mixture Model, or GMM, is defined as a mixture model that combines several probability distribution functions whose parameters are unspecified. GMM therefore requires estimating statistics such as the mean and standard deviation (the parameters) of each component. It is used to estimate the parameters of the probability distributions to best fit the density of a given training dataset.
Although there are plenty of techniques available to estimate the parameter of the Gaussian
Mixture Model (GMM), the Maximum Likelihood Estimation is one of the most popular
techniques among them.
The processes used to generate the data points represent latent variables, or unobservable data. In such cases, the Expectation-Maximization algorithm is one of the best techniques to help us estimate the parameters of the Gaussian distributions. In the EM algorithm, the E-step estimates the expected value for each latent variable, whereas the M-step optimizes the parameters using Maximum Likelihood Estimation (MLE). Further, this process is repeated until a good set of latent values and a maximum likelihood that fits the data are achieved.
Given a finite set of probability density functions p1(x), ..., pn(x), or corresponding cumulative distribution functions P1(x), ..., Pn(x), and weights w1, ..., wn such that wi ≥ 0 and ∑wi = 1, the mixture distribution can be represented by writing either the density, f, or the distribution function, F, as a sum (which in both cases is a convex combination):
F(x) = ∑_{i=1}^{n} wi Pi(x),    f(x) = ∑_{i=1}^{n} wi pi(x)
This type of mixture, being a finite sum, is called a finite mixture, and in applications, an unqualified reference to a "mixture density" usually means a finite mixture. The case of a countably infinite set of components is covered formally by allowing n = ∞.
MIXTURE OF LATENT VARIABLE MODELS
Key Concepts:
1. Mixture Models:
o A mixture model is a probabilistic model that assumes the data is generated
from a mixture of several distributions (e.g., Gaussian distributions). Each
component in the mixture corresponds to a cluster or subgroup in the data.
o The most common example is the Gaussian Mixture Model (GMM), where
each component is a Gaussian distribution.
2. Latent Variable Models:
o Latent variable models assume that the observed data is generated from some
underlying latent (unobserved) variables. These models are often used for
dimensionality reduction or to capture hidden structure in the data.
o Examples include Factor Analysis, Probabilistic Principal Component
Analysis (PPCA), and Variational Autoencoders (VAEs).
3. Mixture of Latent Variable Models:
o This combines the two ideas: the data is assumed to be generated from a
mixture of distributions, where each component is itself a latent variable
model.
o For example, a mixture of factor analyzers (MFA) assumes that the data is
generated from a mixture of factor analysis models, where each component
has its own latent variables.
Applications:
1. Clustering:
o The mixture components can represent different clusters in the data, while the
latent variables capture the structure within each cluster.
o For example, in a mixture of factor analysers, each cluster has its own low-
dimensional representation.
2. Dimensionality Reduction:
o Latent variable models reduce the dimensionality of the data, and the mixture
model allows for multiple low-dimensional representations corresponding to
different clusters.
3. Density Estimation:
o Mixture of latent variable models can model complex, multi-modal data
distributions more effectively than simple models.
Advantages:
• Flexibility: Can model complex, multi-modal data distributions.
• Interpretability: Latent variables provide insights into the underlying structure of the
data.
• Scalability: Can handle high-dimensional data by reducing dimensionality within
each cluster.
Challenges:
1. Model Selection:
o Choosing the number of mixture components and the dimensionality of the
latent variables can be difficult.
o Techniques like cross-validation, Bayesian Information Criterion (BIC), or
Dirichlet Process Mixtures can help.
2. Optimization:
o Training mixture of latent variable models often involves non-convex
optimization, which can be computationally expensive and prone to local
optima.
o Expectation-Maximization (EM) or variational inference are commonly used.
3. Overfitting:
o Complex models with many parameters can overfit the data, especially with
limited samples. Regularization or Bayesian approaches can mitigate this.
After clustering, supervised learning can be applied in several ways depending on the specific
goal of your analysis. Here are a few common approaches:
Label Propagation: If you have labels for a subset of your data, you can propagate these
labels to the entire dataset based on the clusters. For example, if most of the data points in a
cluster have a certain label, you can assign that label to all data points in that cluster.
Cluster as a Feature: You can treat the cluster assignments as additional features in your
dataset and then use these features in a supervised learning model. This can sometimes
improve the performance of the model, especially if the clusters capture useful information
about the data.
Cluster-Specific Models: You can train a separate supervised learning model for each
cluster. This allows you to capture the different characteristics of each cluster and can
sometimes lead to better performance compared to a single model for the entire dataset.
Cluster Ensemble: You can create an ensemble of supervised learning models, where each
model is trained on a different cluster. This can help capture the heterogeneity of the data and
improve the overall performance of the ensemble.
Supervised learning after clustering is a hybrid approach where clustering is first used to
identify structure in the data, followed by supervised learning to make predictions based on
the discovered clusters. This technique is useful when labels are scarce or noisy, and
clustering can help extract meaningful representations before classification or regression.
1. Perform Clustering
o Apply clustering algorithms like K-Means, DBSCAN, Hierarchical
Clustering, or Gaussian Mixture Models (GMM) to group similar data
points.
o This helps discover inherent structures, segment the data, or reduce noise.
2. Assign Cluster Labels
o Use the cluster assignments as new categorical features for supervised
learning.
o Optionally, analyze clusters manually and assign meaningful labels if
available.
3. Train a Supervised Learning Model
o Use labeled data (if available) or pseudo-labels derived from clusters.
o Common models: Decision Trees, Random Forests, Support Vector
Machines (SVM), Neural Networks.
o Features can include original attributes plus cluster membership.
4. Evaluate Performance
o Compare models trained with and without clustering.
o Measure metrics like accuracy, precision, recall, F1-score for classification
or RMSE, MAE for regression.
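A compact sketch of the workflow above (synthetic data assumed): cluster first, append the cluster ID as an extra feature, then compare a classifier trained with and without it.
```python
# Hybrid sketch: K-Means cluster labels added as an extra feature for a supervised classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=8, random_state=3)   # assumed data

# Steps 1-2: perform clustering and keep the cluster assignments as a new categorical feature.
clusters = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X)
X_aug = np.column_stack([X, clusters])

# Steps 3-4: train on both versions and compare test accuracy.
for name, data in [("original features", X), ("with cluster feature", X_aug)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, test_size=0.25, random_state=3)
    clf = RandomForestClassifier(random_state=3).fit(X_tr, y_tr)
    print(name, "accuracy:", round(accuracy_score(y_te, clf.predict(X_te)), 3))
```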
Use Cases
3. Semi-Supervised Learning
• Idea: Use clustering to propagate labels in a partially labeled dataset.
• Steps:
1. Perform clustering on the entire dataset (both labeled and unlabeled data).
2. Use the cluster structure to infer labels for the unlabeled data (e.g., by
assigning the majority label of the cluster).
3. Train a supervised model on the now fully labeled dataset.
• Example:
o In a text classification task, you might cluster documents and use the cluster
assignments to label unlabeled documents, then train a classifier on the
expanded dataset.
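A sketch of cluster-based label propagation under assumed synthetic data: only a small fraction of points carry labels, each cluster takes the majority label of its labelled members, and a classifier is then trained on the pseudo-labelled set.
```python
# Semi-supervised sketch: propagate the majority label of each cluster to its unlabeled points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y_true = make_blobs(n_samples=500, centers=3, random_state=5)   # assumed data, 3 classes

rng = np.random.default_rng(5)
labeled_mask = rng.random(len(X)) < 0.05      # pretend only ~5% of points carry labels

clusters = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(X)

# Assign each cluster the majority label among its labelled members.
pseudo_labels = np.empty(len(X), dtype=int)
for c in np.unique(clusters):
    members = clusters == c
    known = y_true[members & labeled_mask]
    majority = np.bincount(known).argmax() if len(known) else 0   # fallback if no labels in cluster
    pseudo_labels[members] = majority

# Train a supervised model on the now fully (pseudo-)labelled dataset.
clf = LogisticRegression(max_iter=1000).fit(X, pseudo_labels)
print("agreement with true labels:", round((clf.predict(X) == y_true).mean(), 3))
```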
4. Cluster-Based Regularization
• Idea: Use clustering to guide the training of a supervised model by incorporating
cluster information into the loss function.
• Steps:
1. Perform clustering on the dataset.
2. Modify the loss function of the supervised model to encourage similar
predictions for data points in the same cluster.
• Example:
o In a deep learning model, you might add a regularization term to the loss
function that penalizes differences in predictions for points within the same
cluster.
Practical Considerations:
• Choice of Clustering Algorithm:
o The choice of clustering algorithm (e.g., K-Means, DBSCAN, hierarchical
clustering) depends on the data and the problem. For example, DBSCAN is
better for data with noise, while K-Means is faster for large datasets.
• Number of Clusters:
o The number of clusters is a hyperparameter that can significantly impact
performance. Use techniques like the elbow method, silhouette score, or
domain knowledge to choose an appropriate number.
• Feature Scaling:
o Clustering algorithms like K-Means are sensitive to the scale of features, so
ensure proper normalization or standardization before clustering.
• Overfitting:
o Be cautious when using clustering results as features, as they might introduce
noise or overfitting if the clusters do not generalize well to new data.
Example Workflow:
1. Dataset: A dataset with features X and labels y (for supervised learning).
2. Clustering:
o Apply K-Means clustering to X to create k clusters.
o Add the cluster labels as a new feature to X.
3. Supervised Learning:
o Split the augmented dataset into training and test sets.
o Train a classifier (e.g., Random Forest) on the training set.
o Evaluate the model on the test set.
By integrating clustering with supervised learning, you can leverage the strengths of both
approaches to build more robust and interpretable models.