
ML Notes - 2025

The document discusses supervised learning, focusing on classification and regression techniques, including decision trees, logistic regression, and various algorithms. It explains the differences between binary, multiclass, and multi-label classification, as well as the concepts of lazy and eager learners. Additionally, it covers decision tree structures, attribute selection measures, and pruning techniques to enhance model performance.

Uploaded by

Bhukya Rajakumar

UNIT-II

SUPERVISED LEARNING: Classification, Decision Trees – Univariate Tree – Multivariate
Tree – Pruning, Bayesian Decision Theory, Parametric Methods - Maximum Likelihood
Estimation - Evaluating an Estimator: Bias and Variance - The Bayes Estimator, Linear
Discrimination - Gradient Descent - Logistic Discrimination - Logistic Regression, Multilayer
Perceptron - Back Propagation Algorithm

Supervised Learning
• In supervised learning, the machine is trained on a set of labelled data, which means
that the input data is paired with the desired output. The machine then learns to predict
the output for new input data.
• Supervised learning is often used for tasks such as classification and regression.
• In supervised learning, each data point in the training data contains input variables (also
known as independent variables or features), and an output variable, or label.

1. Regression
Regression algorithms are used when the output variable is continuous and depends on the input
variables. They predict continuous quantities, for example in weather forecasting and market
trend analysis. Below are some popular Regression algorithms which come under supervised
learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means it takes
one of two or more discrete classes, such as Yes-No, Male-Female, True-False, etc. Some popular
Classification algorithms are:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
For example, a labelled dataset of images of an elephant, a camel and a cow would have each
image tagged with either “Elephant”, “Camel” or “Cow”.

Classification:
• The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then
classifies each new observation into one of a number of classes or groups, such as Yes or
No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes are also called targets, labels or
categories.

• Classification: The process of sorting data into categories based on specific features or
characteristics.

• There are different types of classification problems depending on how many categories
(or classes) we are working with and how they are organized. There are three main
classification types in machine learning:
1. Binary Classification
• This is the simplest kind of classification. In binary classification, the goal is to sort the
data into two distinct categories. Think of it like a simple choice between two options.
• Imagine a system that sorts emails into either spam or not spam. It works by looking
at different features of the email like certain keywords or sender details, and decides
whether it’s spam or not. It only chooses between these two options.
2. Multiclass Classification
• Here, instead of just two categories, the data needs to be sorted into more than two
categories. The model picks the one that best matches the input.
• Think of an image recognition system that sorts pictures of animals into categories
like cat, dog, and bird.
• The machine looks at the features in the image (like shape, color, or texture) and
chooses which animal the picture most likely shows, based on the training it received.

3. Multi-Label Classification
In multi-label classification, a single piece of data can belong to multiple categories at once.
Unlike multiclass classification, where each data point belongs to only one class, multi-label
classification allows data points to belong to multiple classes. For example, a movie
recommendation system could tag a movie as both action and comedy. The system checks various
features (like movie plot, actors, or genre tags) and assigns multiple labels to a single piece
of data, rather than just one.

Learners in Classification Problems:


In classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test dataset. Classification is then done on the basis of the most closely related
data stored in the training dataset. It takes less time in training but more time for
predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model from the training
dataset before receiving a test dataset. In contrast to lazy learners, an eager learner takes
more time in learning and less time in prediction.
Example: Decision Trees, Naive Bayes, ANN.

Property | Lazy Learning | Eager Learning
Training Speed | Fast, stores the data while training | Slow, tries to learn from data while training
Prediction Speed | Slow, tries to apply functions and learnings in the prediction stage | Fast, predicts quickly as there are pre-defined functions
Learning Scope | Medium, learns from the stored data at prediction time | Medium, learns from the data at training time
Example | KNN | Linear Regression
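
The lazy-learner behaviour described above can be sketched with a tiny K-NN classifier. This is only an illustrative sketch: the class name LazyKNN and the toy data are hypothetical, and Euclidean distance is assumed. Note that fit does nothing except store the data, while predict_one does all the work.

```python
from collections import Counter
import math


class LazyKNN:
    """k-nearest-neighbours: 'training' only stores the data."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy learner: no model is built, the training set is just kept.
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x):
        # All the work happens here: measure the distance to every stored point,
        # then take a majority vote among the k closest ones.
        nearest = sorted(zip(self.X, self.y), key=lambda pair: math.dist(pair[0], x))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per point, two classes.
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_train = ["A", "A", "A", "B", "B", "B"]
knn = LazyKNN(k=3).fit(X_train, y_train)
print(knn.predict_one((1.1, 1.0)))
print(knn.predict_one((5.1, 5.0)))
```

An eager learner would instead spend its effort inside fit and keep only the fitted model for prediction.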

Classification algorithms are widely used in many real-world applications across various
domains, including:
• Email spam filtering
• Credit risk assessment
• Medical diagnosis
• Sentiment analysis
• Fraud detection
• Recommendation systems

Classification Algorithms
To implement any classification model it is essential to understand Logistic
Regression, which is one of the most fundamental and widely used algorithms in machine
learning for classification tasks. There are various types of classification algorithms. Some of
them are:
Linear Classifiers: Linear classifier models create a linear decision boundary between classes.
They are simple and computationally efficient. Some of the linear classification models are as
follows:
• Logistic Regression
• Support Vector Machines having kernel = ‘linear’
• Single-layer Perceptron
• Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers: Non-linear models create a non-linear decision boundary between
classes. They can capture more complex relationships between input features and target
variable. Some of the non-linear classification models are as follows:
• K-Nearest Neighbours
• Kernel SVM
• Naive Bayes
• Decision Tree Classification
• Ensemble learning classifiers, such as Random Forests
• Multi-layer Artificial Neural Networks
Decision Trees:
• Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
• The decisions (tests) are made on the basis of the features of the given dataset.

[Figure: general structure of a decision tree]

Decision Tree Terminologies:

• Root Node: The root node is where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be split
further after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
• Branch/Sub Tree: A section of the tree formed by splitting a node.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that is split into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes.

How does the Decision Tree algorithm Work?


• In a decision tree, to predict the class of a given record, the algorithm starts from
the root node of the tree.
• The algorithm compares the value of the root attribute with the corresponding attribute
of the record and, based on the comparison, follows the branch and jumps to the next node.
• At the next node, the algorithm again compares the attribute value with those of the
sub-nodes and moves further. It continues the process until it reaches a leaf node of the
tree. The complete process can be better understood using the algorithm below:

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
Step-3: Divide S into subsets, one for each possible value of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; each such final node is called a leaf node.
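
Steps 1 to 5 can be sketched as a short recursive function. This is a minimal ID3-style sketch, assuming categorical attributes and information gain as the ASM; the function names and the six-row job-offer dataset are hypothetical and only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    # Entropy of a list of class labels, in bits.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    # Entropy before the split minus the weighted entropy after it.
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    # Leaf: the node is pure, or no attributes are left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))  # Step-2
    node = {"attribute": best, "children": {}}                            # Step-4
    for value in set(row[best] for row in rows):                          # Step-3
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(                             # Step-5
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
        )
    return node


def predict(tree, row):
    # Walk from the root to a leaf, following the branch that matches the record.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree


# Hypothetical job-offer records (Step-1: the root holds the complete dataset).
rows = [
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "far"},
    {"salary": "low", "distance": "near"},
    {"salary": "low", "distance": "near"},
    {"salary": "low", "distance": "far"},
]
labels = ["accept", "accept", "decline", "decline", "decline", "decline"]
tree = build_tree(rows, labels, ["salary", "distance"])
print(tree["attribute"])
print(predict(tree, {"salary": "high", "distance": "near"}))
```

On this toy data the salary attribute gives the larger information gain, so it becomes the root; the sketch does not handle attribute values that were never seen during training.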

Example: Suppose a candidate has a job offer and wants to decide whether to accept
the offer or not. To solve this problem, the decision tree starts with
the root node (the Salary attribute, chosen by ASM). The root node splits into the next
decision node (distance from the office) and one leaf node, based on the corresponding
labels. That decision node splits further into one decision node (Cab facility)
and one leaf node. Finally, the last decision node splits into two leaf nodes (Accepted
offer and Declined offer).
[Figure: decision tree for the job-offer example]
Attribute Selection Measures
While implementing a Decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve this problem there is a technique
called Attribute Selection Measure or ASM. With this measure, we can select the
best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

• Information Gain
• Gini Index
Information Gain (IG) is a metric used in decision tree algorithms (like ID3, C4.5, and
CART) to evaluate the effectiveness of a particular attribute in classifying data. It measures
the reduction in entropy (uncertainty or disorder) after a dataset is split based on a particular
attribute.
Entropy: A measure of uncertainty or disorder. If a dataset is perfectly pure (all
instances belong to the same class), the entropy is 0. If the classes are equally
distributed, the entropy is maximized (which is 1 for binary classification).
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
o S = the set of samples
o P(yes) = probability of yes
o P(no) = probability of no

Information Gain (IG): It measures the change in entropy after a dataset is split based on a
particular attribute. It is the difference between the original entropy of the dataset and the
weighted average entropy of the subsets resulting from the split.
The formula for Information Gain is:
Information Gain (Dataset, Attribute) = Entropy(Dataset) - Weighted Average
Entropy(Subsets created by splitting on Attribute)
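
As a worked example, assume a hypothetical dataset S of 14 samples (9 yes, 5 no) that an attribute splits into S1 with 8 samples (6 yes, 2 no) and S2 with 6 samples (3 yes, 3 no). The counts are illustrative and not taken from these notes.

```latex
\begin{align*}
\text{Entropy}(S)   &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
\text{Entropy}(S_1) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
\text{Entropy}(S_2) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1 \\
\text{Information Gain} &= 0.940 - \left(\tfrac{8}{14}(0.811) + \tfrac{6}{14}(1)\right) \approx 0.048
\end{align*}
```

A gain this close to 0 means the split barely reduces uncertainty, so an attribute with a higher gain would be preferred.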

Gini Index
Decision trees are a popular machine learning algorithm used for both classification and
regression tasks. In classification, the goal is to predict which category a data point belongs to.
The Gini Index plays a crucial role in how decision trees decide how to split the data at each
step.
Here's the basic idea:
1. Measuring Impurity: The Gini Index measures the "impurity" of a set of data points.
A pure set means all data points belong to the same category. An impure set means the
data points are mixed across different categories.
2. Finding the Best Split: When building a decision tree, the algorithm needs to decide
which feature to use for splitting the data at each node. It calculates the Gini Index for
each possible split and chooses the one that results in the greatest reduction in impurity.
This means the split that creates the most "pure" subsets of data.
How it's calculated:
The Gini Index for a set of data points is calculated as:
Gini Index = 1 - (probability of category 1) ^2 - (probability of category 2) ^2 - ... -
(probability of category c) ^2
where c is the number of categories.
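
The formula can be tried out with a few lines of Python. This is a small sketch: the helper names and the two candidate splits are hypothetical, and a binary split is assumed when weighting the two sides.

```python
from collections import Counter


def gini(labels):
    # Gini Index = 1 - sum of squared class probabilities.
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(left, right):
    # Impurity of a binary split: each side weighted by its share of the samples.
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


# Hypothetical node with 10 samples and two candidate splits.
parent = ["yes"] * 5 + ["no"] * 5
split_a = (["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4)
split_b = (["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3)

print(gini(parent))  # 0.5, the maximum for two classes
print(weighted_gini(*split_a))
print(weighted_gini(*split_b))
```

The split with the lower weighted Gini Index (here split_a) produces the purer subsets and would be chosen.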

Types of Decision Tree Algorithms


1. CART (Classification and Regression Trees)
• Used for both classification & regression.
• Uses Gini Index for classification.
• Uses Mean Squared Error (MSE) for regression.
• Forms binary splits (only two child nodes at each step).
2. ID3 (Iterative Dichotomiser 3)
• Used only for classification.
• Uses Entropy & Information Gain to decide splits.
• Splits can have multiple child nodes (not just binary).
• Works best with categorical data.

3. C4.5 (Successor of ID3)
• Improves ID3 by handling:
o Numerical data (by discretizing it).
o Missing values (by assigning probabilities).
o Pruning (to reduce overfitting).
• Uses Entropy & Information Gain Ratio to split nodes.
4. C5.0 (Improved C4.5)
• Faster and more memory-efficient than C4.5.
• Uses Boosting (combining multiple trees).
• Generates smaller trees for better generalization.
• Used in commercial decision tree software.

Univariate Decision Trees


A univariate decision tree splits nodes based on a single feature at each step. This means that
each decision rule involves only one attribute (one dimension) of the data.
Characteristics:
• Splits are parallel to feature axes.

• Uses conditions like Xi ≤ V (for numerical data) or categorical rules (e.g., "Color =
Red").
• Simpler to interpret and computationally efficient.
• Commonly used in CART, ID3, and C4.5 algorithms.
Example:
If we have features (Age, Income, Credit Score), a univariate decision tree may use only one
feature at a time to make a split:
• If Age ≤ 30 → Go left
• If Age > 30 → Go right

Multivariate Decision Trees


A multivariate decision tree considers multiple features in a single split. Instead of using a
simple condition like Xi ≤ V it uses linear combinations of features to make decisions.

Characteristics:
• More flexible than univariate trees.
• Decision boundaries are not restricted to be parallel to axes (can be diagonal or
curved).
• Uses linear models (like Logistic Regression, SVM, PCA-based splits) to determine
the best split.
• More computationally expensive than univariate trees.
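Example (Multivariate Split Rule):
$$w_1 X_1 + w_2 X_2 + \dots + w_n X_n \le V$$
where the w_i are weights learned for the features X_i and V is a threshold.
This allows complex decision boundaries like:
• If (0.3 × Age + 0.7 × Income) ≤ 50 → Go left
• If (0.3 × Age + 0.7 × Income) > 50 → Go right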
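
The difference between the two kinds of split can be sketched in code. The function names and the sample record are hypothetical; the thresholds and weights are the ones used in the examples above, and Income is assumed to be expressed in thousands.

```python
def univariate_split(record, feature="Age", threshold=30):
    # Tests one feature: an axis-parallel boundary.
    return "left" if record[feature] <= threshold else "right"


def multivariate_split(record, weights, threshold):
    # Tests a weighted sum of several features: an oblique boundary.
    score = sum(w * record[name] for name, w in weights.items())
    return "left" if score <= threshold else "right"


# Hypothetical applicant.
applicant = {"Age": 40, "Income": 60}
weights = {"Age": 0.3, "Income": 0.7}

print(univariate_split(applicant))                 # Age 40 > 30
print(multivariate_split(applicant, weights, 50))  # 0.3*40 + 0.7*60 = 54 > 50
```

The univariate test looks at Age alone, while the multivariate test can send two applicants of the same age down different branches depending on their income.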
Pruning

• Pruning is a technique used in decision trees to reduce overfitting and improve
generalization. It removes unnecessary branches from the tree, making it simpler and
more robust.
• Decision trees tend to overfit the training data by creating overly complex trees.
• Overfitting means the tree captures noise rather than true patterns.
• Pruning helps by reducing tree size and improving accuracy on unseen data.

Types of Pruning
1. Pre-Pruning (Early Stopping):
o Pre-pruning involves stopping the growth of a decision tree before it becomes
too complex and overfits the training data. This can be done by setting a
maximum tree depth, a minimum number of data points in a leaf node, or a
threshold for the information gain at each decision node. Pre-pruning is simple
and computationally efficient, but it may not capture complex relationships in
the data.
• Stops the tree from growing before it becomes too complex.
• Sets a limit on conditions like:
o Max depth (max_depth)
o Min samples per split (min_samples_split)
o Min samples per leaf (min_samples_leaf)
• Advantages:
Faster training.
Prevents excessive growth.
• Disadvantages:
Might stop too early, missing some important patterns.

2. Post-Pruning (Pruning After Tree Growth):


o Post-pruning involves growing a decision tree to its maximum depth and then
removing the unnecessary branches that do not improve the model's
performance on the validation data. This can be done by calculating a measure
of impurity reduction or error rate reduction for each subtree and removing the
subtree that does not meet a certain criterion. Post-pruning is more
computationally intensive than pre-pruning, but it can capture complex
relationships and improve the accuracy of the model.
• First, grow the full tree (allowing overfitting).
• Then, remove unnecessary branches based on performance.
• Scikit-learn implements this as cost complexity pruning (CCP), controlled by the
ccp_alpha parameter.
• Advantages:
Often more accurate than pre-pruning.
Uses real performance to decide cuts.
• Disadvantages:
Slower, since it grows a full tree first.
Needs tuning.
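
Both kinds of pruning can be tried with Scikit-learn's DecisionTreeClassifier. This is a sketch under assumptions: the Iris dataset, the 70/30 split and the specific limit values are illustrative choices, not part of these notes, and the pruning strength is picked by accuracy on the held-out validation split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Unpruned tree: grown until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: growth is stopped early by depth and sample-count limits.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=4, min_samples_leaf=2, random_state=0
).fit(X_train, y_train)

# Post-pruning: get the candidate alphas for the fully grown tree, then keep
# the alpha whose pruned tree scores best on the validation data.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
candidates = [
    DecisionTreeClassifier(ccp_alpha=max(float(a), 0.0), random_state=0).fit(X_train, y_train)
    for a in alphas
]
post_pruned = max(candidates, key=lambda tree: tree.score(X_val, y_val))

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, tree.tree_.node_count, round(tree.score(X_val, y_val), 3))
```

Larger values of ccp_alpha remove more of the tree, which is why the alpha itself "needs tuning", as noted in the disadvantages above.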

Bayesian Decision Theory

• Bayesian Decision Theory is a probabilistic approach to decision-making under
uncertainty. It helps in minimizing risk and making optimal decisions by using
probability distributions and cost functions.
• Bayes' theorem is a fundamental concept in probability theory that plays a crucial role
in various machine learning algorithms, especially in the fields of Bayesian statistics
and probabilistic modelling.
• It provides a way to update probabilities based on new evidence or information. In the
context of machine learning, Bayes' theorem is often used in Bayesian inference and
probabilistic models.
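• Bayes' theorem may be derived from the definition of conditional probability:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
• Bayes' theorem is stated mathematically as the following equation:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where P(A∣B) is the posterior, P(B∣A) the likelihood, P(A) the prior and P(B) the evidence.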
Example: Spam Email Classification
Consider a binary classification problem where we classify an email as Spam (S) or Not
Spam (N).
Step 1: Define Prior Probabilities
Assume from past data:
• P(S)=0.3 (30% of emails are spam)
• P(N)=0.7 (70% of emails are not spam)

Step 2: Compute Likelihood

• Suppose we analyze a word, say "discount", in an email:
• Probability of "discount" appearing in a spam email: P(W∣S) = 0.8
• Probability of "discount" appearing in a non-spam email: P(W∣N) = 0.2

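Step 3: Compute Posterior Using Bayes' Theorem
$$P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid N)\,P(N)} = \frac{0.8 \times 0.3}{0.8 \times 0.3 + 0.2 \times 0.7} = \frac{0.24}{0.38} \approx 0.63$$
Thus, if an email contains the word "discount," there's a 63% probability it's spam.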
Step 4: Decision Making Using Loss Function
• A loss function in machine learning tells us how far our predictions are from the actual
values.
• In Bayesian learning, we make predictions based on probability distributions rather
than fixed numbers, so our loss function helps decide which predictions are best.
• Suppose misclassifying spam as non-spam has a higher cost (e.g., 5 points) than
misclassifying non-spam as spam (2 points).

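• The expected loss for labelling an email as spam (dS) or non-spam (dN) is computed,
taking the loss of a correct decision as 0:
$$R(d_S \mid W) = 2 \times P(N \mid W) = 2 \times 0.368 \approx 0.74$$
$$R(d_N \mid W) = 5 \times P(S \mid W) = 5 \times 0.632 \approx 3.16$$
• Choose the decision that minimizes expected loss (Bayes Risk). Here labelling the email
as spam has the lower expected loss, so the email is labelled as spam.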
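
The four steps can be checked numerically. The sketch below uses the priors, likelihoods and loss values from the example; the variable names are hypothetical, and a correct decision is assumed to cost 0.

```python
# Priors and likelihoods from Steps 1 and 2.
p_spam, p_not_spam = 0.3, 0.7
p_word_given_spam, p_word_given_not_spam = 0.8, 0.2

# Step 3, Bayes' theorem: posterior = likelihood * prior / evidence.
evidence = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam
post_spam = p_word_given_spam * p_spam / evidence
post_not_spam = 1.0 - post_spam

# Step 4, loss values; a correct decision is assumed to cost 0.
loss_miss_spam = 5    # spam labelled as non-spam
loss_false_alarm = 2  # non-spam labelled as spam

# Expected loss of each decision, given that the word was seen.
risk_label_spam = loss_false_alarm * post_not_spam
risk_label_not_spam = loss_miss_spam * post_spam

decision = "spam" if risk_label_spam < risk_label_not_spam else "not spam"
print(round(post_spam, 3), round(risk_label_spam, 3), round(risk_label_not_spam, 3), decision)
```

Because missing a spam email is given the larger loss, the decision would remain "spam" even for posteriors somewhat below 0.5.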


Conclusion
Bayesian Decision Theory allows us to:
• Integrate prior knowledge with observed data.
• Make probabilistic decisions to minimize risk.
• Apply in real-world scenarios like email filtering, medical diagnosis, and fraud
detection.
Parametric methods
• Parametric methods simplify the learning process by assuming a specific functional
form for the relationship between inputs and outputs.
• This means we're essentially trying to fit the data to a pre-defined equation with a fixed
number of parameters.
Key Characteristics
1. Assumptions: Parametric methods rely on pre-defined assumptions about the data
distribution or the relationship between variables. These assumptions might include:
o Data follows a normal distribution
o Relationship between variables is linear
2. Fixed Parameters: The model has a fixed number of parameters that are learned from
the training data. This number does not change with the size of the dataset.
3. Two-Step Process:
o Select a functional form for the mapping function based on assumptions.
o Learn the coefficients (parameters) of the function from the training data.
Examples with Detailed Explanations
1. Linear Regression
o Assumption: The relationship between the input features (X) and the output (Y)
is linear.
o Functional Form: Y = wx+ b
▪ w: Weight vector (parameters to be learned)
▪ b: Bias (parameter to be learned)
o Example: Predicting house prices based on size (in square feet). We assume a
linear relationship: larger houses cost more.
o Learning: We use the training data to find the best values for w and b that
minimize the difference between predicted and actual prices.
2. Logistic Regression: Used for classification tasks, assumes a logistic function to model
the probability of a class.
3. Linear Discriminant Analysis (LDA): Assumes that the data for each class follows a
normal distribution with the same covariance matrix.
4. Naive Bayes: Assumes that features are independent of each other given the class label.
5. Perceptron: A simple neural network with a single layer, assumes that the classes are
linearly separable.
Advantages of Parametric Methods
• Simplicity: Easier to understand and interpret due to the strong assumptions.
• Speed: Often faster to train compared to non-parametric methods.
• Less Data: Can work reasonably well with smaller datasets.
Limitations of Parametric Methods
• Constrained: The pre-defined functional form limits the model's flexibility to capture
complex relationships in the data.
• Limited Complexity: May not be suitable for problems with highly intricate patterns.
• Poor Fit: If the assumptions are incorrect, the model might not accurately represent
the underlying data, leading to poor performance.
Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is a method used in statistics and machine
learning to find the best parameters for a model by maximizing the likelihood of the
observed data.
How it Works
1. Assume a Probability Distribution: First, you assume that your data follows a specific
probability distribution. This could be a normal distribution, a binomial distribution, or
any other distribution that you think is appropriate for your data.
2. The Likelihood Function: The likelihood function measures how likely it is to observe
your data given a particular set of parameters for the chosen distribution. In simpler
terms, it tells you how well the distribution with those parameters fits your data.

• Likelihood is a measure of how well a given set of parameters explains the observed
data.

If we have a dataset D = {x1, x2, ..., xn} and a model with parameters θ, the
likelihood function is:

L(θ)=P(x1∣θ) P(x2∣θ) P(x3∣θ) … P(xn∣θ)

or

L(θ) =P (D ∣ θ)
This function tells us how probable the observed data is, given the model
parameters.

3. Maximize the Likelihood: The goal of MLE is to find the parameters that maximize
the likelihood function. This means finding the parameters that make your observed
data the most probable.
• MLE finds the θ that maximizes the likelihood function:
$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$

• θ̂ (theta-hat): This represents the estimated value of the parameter θ, i.e. the value
that maximizes the likelihood function L(θ).

• arg max: This is short for "argument of the maximum." It means "find the value of
θ that maximizes the following function."

• L(θ): This is the likelihood function. It represents how likely it is to observe the data
we have, given a particular value of the parameter θ.

Since the likelihood multiplies the probabilities of the individual data points, it is
usually a very small number, so we often take the log-likelihood instead (to
avoid numerical issues and simplify calculations):
$$\log L(\theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)$$
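
The three steps can be sketched for a simple case. The example assumes a Bernoulli distribution (coin flips) and hypothetical data; it maximizes the log-likelihood by brute force over a grid of candidate θ values and compares the result with the known closed-form answer for a Bernoulli, the sample mean.

```python
import math

# Hypothetical data: 10 coin flips (1 = heads), modelled as Bernoulli(theta).
data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]


def log_likelihood(theta):
    # log L(theta) = sum over the data of log P(x_i | theta).
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)


# arg max by brute force over a grid of candidate values for theta.
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=log_likelihood)

print(theta_hat)              # grid maximizer
print(sum(data) / len(data))  # closed-form MLE for a Bernoulli: the sample mean
```

In practice the maximum is usually found analytically or with an optimizer rather than a grid; the grid is used here only to make the arg max explicit.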

Evaluating an Estimator: Bias and Variance


• In machine learning and statistics, we evaluate an estimator based on bias and
variance, which describe how well it generalizes to new data.
Bias Variance Tradeoff
• If the algorithm is too simple (a hypothesis with a linear equation), it may have high
bias and low variance and is therefore error-prone. If the algorithm is too complex
(a hypothesis with a high-degree equation), it may have high variance and low bias.
• In the latter condition, the model will not perform well on new entries. There is a middle
ground between these two conditions, known as the Trade-off or Bias Variance Trade-off.
• This tradeoff in complexity is why there is a tradeoff between bias and variance. An
algorithm can’t be more complex and less complex at the same time. The perfect
tradeoff is the level of complexity at which the total error is lowest.
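
The two quantities can be stated precisely for an estimator θ̂ of a parameter θ. The block below gives the standard definitions, written with the θ̂ notation of the MLE section; the expectation E[·] is assumed to be taken over the different training samples the estimator could have been computed from.

```latex
\begin{align*}
\text{Bias}(\hat{\theta}) &= E[\hat{\theta}] - \theta \\
\text{Var}(\hat{\theta})  &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] \\
\text{MSE}(\hat{\theta})  &= E\big[(\hat{\theta} - \theta)^2\big]
                           = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})
\end{align*}
```

The last line shows why the two must be traded off: the mean squared error is the sum of both terms, so lowering one only helps if the other does not grow by more.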
The Bayes Estimator
• The Bayes Estimator is a fundamental concept in Bayesian statistics used for making
decisions or estimations when dealing with uncertainty.
• In machine learning, it's a method for estimating parameters of a model or making
predictions by combining prior knowledge (beliefs) with observed data.
How it Works
1. Start with a prior: Begin with a prior probability distribution representing your initial
beliefs about the parameters.
2. Observe data: Gather data relevant to the problem.
3. Calculate the likelihood: Determine how likely the observed data is for different
values of the parameters.
4. Apply Bayes' theorem: Use Bayes' theorem to update the prior distribution with the
likelihood, resulting in the posterior distribution.
5. Choose an estimator: Select an estimator (a specific value or decision) that minimizes
the expected loss under the posterior distribution. This estimator is the Bayes Estimator.
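Mathematical Expression (Bayes' Theorem)
$$P(\theta \mid X) = \frac{P(X \mid \theta)\,P(\theta)}{P(X)}$$
where P(θ) is the prior, P(X∣θ) the likelihood of the observed data X, and P(θ∣X) the posterior.
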
Types of Bayes estimators:

• Minimum Mean Square Error (MMSE) estimator


• Maximum A Posteriori (MAP) estimator.
• Posterior Median estimator.
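
The difference between these estimators can be seen in a small example. The sketch below assumes a coin with unknown heads probability θ, a Beta(a, b) prior and hypothetical data of k heads in n flips; for this model the posterior is Beta(a + k, b + n - k), so the MMSE and MAP estimates have closed forms.

```python
# Hypothetical setup: estimate a coin's heads probability theta.
# Prior: Beta(a, b). Data: k heads in n flips. Posterior: Beta(a + k, b + n - k).
a, b = 2, 2   # prior belief: theta is probably near 0.5
k, n = 7, 10  # observed data

post_a, post_b = a + k, b + n - k

mle = k / n                                          # ignores the prior
mmse = post_a / (post_a + post_b)                    # posterior mean, minimizes squared error
map_estimate = (post_a - 1) / (post_a + post_b - 2)  # posterior mode

print(mle, round(mmse, 3), round(map_estimate, 3))
```

Both Bayes estimates lie between the prior's centre (0.5) and the maximum likelihood estimate (0.7), which is how the prior knowledge is combined with the observed data.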
Bayes estimators are used in various machine learning tasks, including:
• Classification: Naive Bayes classifier, Bayes optimal classifier
• Regression: Bayesian linear regression, Bayesian logistic regression
• Model selection: Bayesian model comparison
• Hyperparameter optimization: Bayesian optimization

Linear regression:
• Linear regression is one of the simplest statistical regression techniques used for
predictive analysis in machine learning.
• It models a linear relationship between the independent (predictor) variable, plotted on
the X-axis, and the dependent (output) variable, plotted on the Y-axis. If there is a
single input variable X (independent variable), it is called simple linear
regression.
• In a simple linear regression, there is one independent variable and one dependent
variable. The model estimates the slope and intercept of the line of best fit, which
represents the relationship between the variables.
• The slope represents the change in the dependent variable for each unit change in the
independent variable, while the intercept represents the predicted value of the
dependent variable when the independent variable is zero.

• The graph above presents the linear relationship between the output(y) and predictor(X)
variables. The blue line is referred to as the best-fit straight line. Based on the given
data points, we attempt to plot a line that fits the points the best.
Simple Regression Calculation
• To calculate the best-fit line, linear regression uses the traditional slope-intercept form
given below:
Yi = β0 + β1 Xi
where Yi = dependent variable, β0 = constant/intercept, β1 = slope,
Xi = independent variable.
• This algorithm explains the linear relationship between the dependent (output) variable
Y and the independent (predictor) variable X using a straight line Y = β0 + β1 X.
But how does the regression find out which is the best-fit line?
• The goal of the linear regression algorithm is to get the best values for β0 and β1 to
find the best-fit line. The best-fit line is the line with the least error, which means the
error between predicted values and actual values should be minimum.
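
For simple linear regression the least-error line can be computed directly. The sketch below assumes that "error" means the sum of squared differences (ordinary least squares), for which the slope and intercept have closed-form solutions; the function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    # Least-squares estimates of the intercept (b0) and slope (b1).
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

Because the points lie exactly on a line, the fit recovers an intercept of 1 and a slope of 2; with noisy data it returns the line with the smallest total squared error.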

Gradient Descent
• Gradient Descent is defined as one of the most commonly used iterative optimization
algorithms of machine learning to train the machine learning and deep learning models
by means of minimizing errors between actual and expected results. It helps in finding
the local minimum of a function.
• Gradient-based search locates a local minimum or local maximum of a function as
follows:
• If we move in the direction of the negative gradient, i.e. away from the gradient of the
function at the current point, we approach a local minimum of that function. This
procedure is Gradient Descent, which is also known as steepest descent.
• If we move in the direction of the positive gradient, i.e. towards the gradient of the
function at the current point, we approach a local maximum of that function. This
procedure is known as Gradient Ascent.
The main objective of using a gradient descent algorithm is to minimize the cost
function using iteration.

The cost function is defined as the measurement of difference or error between actual values
and expected values at the current position and present in the form of a single real number.

To achieve this goal, it performs two steps iteratively:

• Calculates the first-order derivative of the function to compute the gradient or slope of
that function.
• Moves in the direction opposite to the gradient, i.e. downhill from the current point, by
a step of alpha times the gradient, where alpha is defined as the Learning Rate. It is a
tuning parameter in the optimization process which helps to decide the length of the steps.
To minimize the cost function, two quantities are required: Direction & Learning Rate
Learning Rate: It is defined as the step size taken to reach the minimum or lowest point.
How Does Gradient Descent Work in Linear Regression?
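Linear regression predicts the value of a dependent variable (y) from a given
independent variable (x) by fitting a straight line y = mx + b, hence the name Linear
Regression. Gradient descent fits the line as follows:

• Initialize Parameters: Start with random initial values for the slope (m) and intercept
(b).
• Calculate the Cost Function: Compute the error using a cost function such as Mean
Squared Error (MSE):
$$J(m, b) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - (m x_i + b)\big)^2$$
• Compute the Gradient: Find the gradient of the cost function with respect to m and b.
These gradients indicate how the cost changes when the parameters are adjusted:
$$\frac{\partial J}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \big(y_i - (m x_i + b)\big), \qquad \frac{\partial J}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \big(y_i - (m x_i + b)\big)$$
• Update Parameters: Adjust m and b in the direction that reduces the cost, using the
learning rate α:
$$m \leftarrow m - \alpha \frac{\partial J}{\partial m}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$$
• Repeat: Iterate until the cost function converges, i.e. further updates make little or no
difference.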
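
The loop above can be written out in a few lines. This is a minimal sketch with assumptions: the cost is the mean squared error, the parameters start at zero rather than at random values so that the run is repeatable, a fixed number of iterations replaces a convergence check, and the data and learning rate are hypothetical.

```python
def gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    m, b = 0.0, 0.0  # initial parameters (zeros instead of random, for repeatability)
    n = len(xs)
    for _ in range(epochs):
        errors = [y - (m * x + b) for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) with respect to m and b.
        grad_m = -2.0 / n * sum(x * e for x, e in zip(xs, errors))
        grad_b = -2.0 / n * sum(errors)
        # Step against the gradient, scaled by the learning rate alpha.
        m -= alpha * grad_m
        b -= alpha * grad_b
    return m, b


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = gradient_descent(xs, ys)
print(round(m, 4), round(b, 4))
```

With a learning rate that is too large the updates overshoot and the cost grows instead of shrinking, which is why alpha is treated as a tuning parameter.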

Types of Gradient Descent


Based on how many training examples are used to compute the error for each update, the
Gradient Descent learning algorithm can be divided into Batch gradient descent, stochastic
gradient descent, and mini-batch gradient descent. Let's understand these different types of
gradient descent:

1. Batch Gradient Descent:


Batch gradient descent (BGD) computes the error for each point in the training set and
updates the model only after evaluating all training examples. One such pass over the whole
training set is known as a training epoch. In simple words, we have to sum over all
examples for each update.
Advantages of Batch gradient descent:
o It produces less noise in comparison to other gradient descent variants.
o It produces stable gradient descent convergence.
o It is computationally efficient per epoch, as all training samples are processed together
in a single update.

2. Stochastic gradient descent


Stochastic gradient descent (SGD) is a type of gradient descent that runs one training
example per iteration. In other words, within each epoch it visits the examples of the
dataset one at a time and updates the model parameters after each example. As it
requires only one training example at a time, it is easier to fit in allocated memory.
Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example, which gives it a
few advantages over other gradient descent.
o It fits easily in memory, since only one example is processed at a time.
o Each update is faster to compute than in batch gradient descent.
o It is more efficient for large datasets.
3. Mini Batch Gradient Descent:
Mini Batch gradient descent is the combination of both batch gradient descent and stochastic
gradient descent. It divides the training dataset into small batches and then performs an
update on each batch separately. Splitting the training dataset into smaller batches strikes a
balance between the computational efficiency of batch gradient descent and the speed of
stochastic gradient descent.
Advantages of Mini Batch gradient descent:
o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.
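
The three variants differ only in how many examples feed each update, so one loop with a `batch_size` argument can illustrate all of them. The sketch below assumes the same straight-line model and MSE cost as in the linear regression section; the function name, learning rate and data are illustrative choices, not part of the notes.

```python
import numpy as np

def gradient_descent(x, y, batch_size, alpha=0.02, n_epochs=2000, seed=0):
    """batch_size = len(x): batch GD; 1: SGD; anything in between: mini-batch GD."""
    rng = np.random.default_rng(seed)
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_epochs):
        order = rng.permutation(n)               # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (m * x[idx] + b) - y[idx]    # errors on this batch only
            m -= alpha * 2.0 * np.mean(error * x[idx])
            b -= alpha * 2.0 * np.mean(error)
    return m, b

x = np.linspace(0.0, 4.0, 20)
y = 2.0 * x + 1.0                                # noise-free line, slope 2, intercept 1

batch = gradient_descent(x, y, batch_size=len(x))   # 1 update per epoch
sgd = gradient_descent(x, y, batch_size=1)          # 20 updates per epoch
mini = gradient_descent(x, y, batch_size=5)         # 4 updates per epoch
print(batch, sgd, mini)
```

With noisy data the SGD path would fluctuate around the minimum, while the batch path would be smooth but need a full pass over the data for every step.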

Linear Discrimination
• Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality
reduction techniques in machine learning to solve more than two-class classification
problems. It is also known as Normal Discriminant Analysis (NDA) or Discriminant
Function Analysis (DFA).
• It is also considered a pre-processing step for modelling differences in ML and
applications of pattern classification.
Example:
Let's assume we have to classify two different classes, given as two sets of data points in a
2-dimensional plane.
When we classify them using a single feature, the two classes may overlap.

To overcome the overlapping issue in the classification process, we would have to keep
increasing the number of features.
Here, LDA instead uses both the X and Y axes to create a new axis, chosen so that a straight
line separates the classes, and projects the data onto this new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane into
1-D.

To create a new axis, Linear Discriminant Analysis uses the following criteria:
o It maximizes the distance between means of two classes.
o It minimizes the variance within the individual class.
LDA can be performed in 5 steps:

Step 1: Compute the mean vectors for the different classes from the dataset.

Step 2: Compute the scatter matrices (between-class and within-class scatter matrices).

Step 3: Compute the eigenvectors and corresponding eigenvalues of the matrix formed from the
scatter matrices (the inverse of the within-class scatter times the between-class scatter).

Step 4: Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest
eigenvalues.

Step 5: Use this eigenvector matrix to transform the samples onto the new subspace.

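1. Compute the Mean Vectors:
Calculate the mean vector $\mu_c$ for each class c and the overall mean $\mu$ of the entire dataset.

2. Calculate Scatter Matrices:
Within-class scatter: $S_W = \sum_{c} \sum_{x \in c} (x - \mu_c)(x - \mu_c)^T$
Between-class scatter: $S_B = \sum_{c} N_c (\mu_c - \mu)(\mu_c - \mu)^T$, where $N_c$ is the number of samples in class c.

3. Objective of LDA:
LDA tries to maximize the ratio of between-class scatter to within-class scatter to
achieve maximum class separation. This is represented mathematically as:
$J(w) = \frac{w^T S_B w}{w^T S_W w}$
4. Solve for the Optimal Projection Vector:
The vector w that maximizes J(w) is the eigenvector of $S_W^{-1} S_B$ with the largest
eigenvalue: $S_W^{-1} S_B w = \lambda w$

5. Project Data onto the New Axis:
$y = w^T x$

This transformation reduces the dimensionality of the data while maintaining the class-
related information.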
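
As a rough illustration of the five steps, the sketch below computes the scatter matrices and the leading eigenvector with NumPy and projects a small 2-D dataset onto one axis. The function name `lda_direction` and the ten data points are made up for this example; the two classes are built so that they overlap on either single feature but separate along the LDA axis.

```python
import numpy as np

def lda_direction(X, y):
    """Return the unit vector w that maximizes between-class over within-class scatter."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))                       # within-class scatter
    S_B = np.zeros((d, d))                       # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                   # Step 1: class mean
        S_W += (Xc - mu_c).T @ (Xc - mu_c)       # Step 2
        diff = (mu_c - overall_mean).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)
    # Steps 3-4: eigenvector of inv(S_W) @ S_B with the largest eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return w / np.linalg.norm(w)

# Hypothetical data: class 1 is class 0 shifted down by 2 units
X = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5],
              [1, 0], [2, 1], [3, 1], [4, 3], [5, 3]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

w = lda_direction(X, y)
z = X @ w                                        # Step 5: project onto the new axis
print(w, z)
```

The sign of `w` returned by the eigen-solver is arbitrary, so the classes may land on either side of the new axis; only their separation matters.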
Logistic regression
• Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression; the difference lies in how they
are used. Linear Regression is used for solving Regression problems, whereas Logistic
regression is used for solving the classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, whose output is bounded by the two extreme values 0 and 1.
• Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
• Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities:
$\sigma(z) = \frac{1}{1 + e^{-z}}$
o It maps any real value into another value within a range of 0 and 1, which gives the
S-shaped curve.
Logistic Regression Equation:
• The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:
o We know the equation of the straight line can be written as:
$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
Here y stands for the predicted probability P(y=1).

o In Logistic Regression y can be between 0 and 1 only, so for this let's divide y
by (1-y):
$\frac{y}{1-y}$, which is 0 for y = 0 and tends to infinity as y approaches 1

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
expression, and the equation becomes:
$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$

The above equation is the final equation for Logistic Regression.
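
A short numerical check can make the relationship concrete: applying the sigmoid to a linear score gives a probability, and taking the log-odds of that probability recovers the linear score. The coefficients `b0` and `b1` below are hypothetical values chosen only for illustration, as is the 0.5 decision threshold.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

b0, b1 = -4.0, 2.0                  # hypothetical intercept and slope
x = np.array([0.0, 2.0, 4.0])

linear_score = b0 + b1 * x          # can be any real number
p = sigmoid(linear_score)           # probability that y = 1
labels = (p >= 0.5).astype(int)     # threshold at 0.5 to get a class
print(p, labels, logit(p))
```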


Types of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, the dependent variable can take only two
possible values, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, the dependent variable can take 3 or
more possible unordered values, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, the dependent variable can take 3 or more
possible ordered values, such as "Low", "Medium", or "High".

Multilayer Perceptron:

• A Multi-Layer Perceptron (MLP) is a type of artificial neural network that consists of
multiple layers of neurons, or nodes, arranged in a hierarchical structure. It is one
of the simplest and most widely used types of neural networks, particularly for
supervised learning tasks such as classification and regression.
• MLP consists of fully connected dense layers that transform input data from one
dimension to another. It is called “multi-layer” because it contains an input layer, one
or more hidden layers, and an output layer. The purpose of an MLP is to model complex
relationships between inputs and outputs, making it a powerful tool for various machine
learning tasks.
Key Components of Multi-Layer Perceptron (MLP)
• Input Layer: Each neuron (or node) in this layer corresponds to an input feature.
For instance, if you have three input features, the input layer will have three neurons.
• Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received from
the input layer.
• Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.
An MLP is fully connected: every node in one layer connects to every node in the next
layer. As the data moves through the network, each layer transforms it until the final
output is generated in the output layer.

Working of Multi-Layer Perceptron


Let's look at how the multi-layer perceptron works. The key mechanisms are forward
propagation, the loss function, backpropagation, and optimization.
Step 1: Forward Propagation
In forward propagation, the data flows from the input layer to the output layer, passing
through any hidden layers. Each neuron in the hidden layers processes the input as follows:
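• Weighted Sum: The neuron computes the weighted sum of the inputs:
$z = \sum_{i} w_i x_i + b$
where $x_i$ are the inputs, $w_i$ the corresponding weights and b the bias.
• Activation Function: The weighted sum z is passed through an activation function to
introduce non-linearity. Common activation functions include:
• Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
• ReLU (Rectified Linear Unit): $f(z) = \max(0, z)$
• Tanh (Hyperbolic Tangent): $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$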
Step 2: Loss Function
Once the network generates an output, the next step is to calculate the loss using
a loss function. In supervised learning, this compares the predicted output to the
actual label.
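For regression problems, the mean squared error (MSE) is often used:
$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual label and $\hat{y}_i$ the predicted output.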

Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the
network’s weights and biases. This is achieved through backpropagation:
Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss:
$w \leftarrow w - \eta \frac{\partial L}{\partial w}$, where $\eta$ is the learning rate.

Step 4: Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases
during training. Popular optimization methods include:
• Stochastic Gradient Descent (SGD): Updates the weights based on a single sample
or a small batch of data:
$w \leftarrow w - \eta \frac{\partial L_i}{\partial w}$, where $L_i$ is the loss on that sample or batch.

• Adam Optimizer: An extension of SGD that incorporates momentum and adaptive
learning rates for more efficient training.
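
To make Step 1 concrete, here is a minimal forward pass through a network with one hidden layer. The layer sizes (3 inputs, 4 hidden neurons, 1 output), the random weights and the choice of ReLU in the hidden layer with a sigmoid output are all assumptions made for the sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One forward pass: weighted sum then activation, layer by layer."""
    hidden = relu(W1 @ x + b1)        # hidden layer: z = Wx + b, then ReLU
    return sigmoid(W2 @ hidden + b2)  # output layer squashed into (0, 1)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input (3) -> hidden (4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hidden (4) -> output (1)

x = np.array([0.5, -1.0, 2.0])                  # one example with 3 features
y_hat = forward(x, W1, b1, W2, b2)
print(y_hat)
```

Each row of a weight matrix holds the incoming weights of one neuron, which is why the matrix shapes mirror the "every node connects to every node" structure.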
Example:
Let's break down how to count the number of weights in the following network:

Understanding the Network Architecture


• Input Layer: The input is a 3D feature vector, meaning it has 3 input nodes.
• Hidden Layer: The hidden layer has 30 nodes.
• Output Layer: Since it's a binary classification problem, the output layer has 1 node
(representing the probability of belonging to one of the two classes).
• Fully Connected: Each node in one layer is connected to every node in the next layer.
• No Bias Nodes: This simplifies the calculation as we don't need to account for bias
terms.
Calculating the Number of Parameters (Weights)
1. Weights between Input and Hidden Layer:
o Each of the 3 input nodes is connected to each of the 30 hidden nodes.
o Number of weights = (Number of input nodes) * (Number of hidden nodes) =
3 * 30 = 90
2. Weights between Hidden and Output Layer:
o Each of the 30 hidden nodes is connected to the 1 output node.
o Number of weights = (Number of hidden nodes) * (Number of output nodes) =
30 * 1 = 30
3. Total Number of Weights:
o Add the weights from the input-to-hidden layer and the hidden-to-output layer.
o Total weights = 90 + 30 = 120
Answer
The correct answer is b. 120
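
The same count can be reproduced programmatically. The helper below is a hypothetical function written for this example; it multiplies the sizes of consecutive layers and, optionally, adds one bias per non-input node to show how the total would change if bias nodes were present.

```python
def count_weights(layer_sizes, use_bias=False):
    """Count trainable parameters of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out          # one weight per connection
        if use_bias:
            total += n_out             # one bias per node in the next layer
    return total

no_bias = count_weights([3, 30, 1])                    # 3*30 + 30*1
with_bias = count_weights([3, 30, 1], use_bias=True)   # adds 30 + 1 biases
print(no_bias, with_bias)
```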
Back Propagation Algorithm:
• Backpropagation is also known as "Backward Propagation of Errors" and it is a method
used to train neural networks. Its goal is to reduce the difference between the model's
predicted output and the actual output by adjusting the weights and biases in the
network.
• Backpropagation is a technique used in deep learning to train artificial neural
networks, particularly feed-forward networks. It works iteratively to adjust weights
and biases to minimize the cost function.
• In each epoch the model adapts these parameters, reducing the loss by following the error
gradient. Backpropagation is usually combined with optimization algorithms like gradient
descent or stochastic gradient descent. The algorithm computes the gradient using the
chain rule from calculus, which lets it pass the error back through every layer of the
neural network to minimize the cost function.

Training with backpropagation repeats the four steps described under "Working of Multi-Layer
Perceptron":
Step 1: Forward Propagation
The data flows from the input layer to the output layer, passing through any hidden layers.
Each neuron computes the weighted sum z of its inputs and passes it through an activation
function (Sigmoid, ReLU or Tanh) to introduce non-linearity.
Step 2: Loss Function
The predicted output is compared to the actual label using a loss function, for example the
mean squared error (MSE) for regression problems.
Step 3: Backpropagation
The error is propagated backward from the output layer, and the chain rule gives the gradient
of the loss with respect to every weight and bias.
Step 4: Optimization
The weights and biases are moved in the opposite direction of the gradient, using gradient
descent, Stochastic Gradient Descent (SGD) or the Adam Optimizer, and the steps are repeated
until the loss stops decreasing.
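
The chain-rule computation can be written out explicitly for a small network. The sketch below assumes one hidden layer, sigmoid activations everywhere and an MSE loss; the XOR-style data, the hidden size of 4, the learning rate and all names are illustrative assumptions rather than anything prescribed by the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, y, W1, b1, W2, b2):
    # Forward propagation
    H = sigmoid(X @ W1 + b1)                 # hidden activations, shape (n, hidden)
    y_hat = sigmoid(H @ W2 + b2)             # predictions, shape (n, 1)
    loss = np.mean((y_hat - y) ** 2)         # MSE loss
    # Backward propagation (chain rule, output layer first)
    n = X.shape[0]
    d_out = (2.0 / n) * (y_hat - y) * y_hat * (1.0 - y_hat)   # dL/dz at output
    gW2 = H.T @ d_out
    gb2 = d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * H * (1.0 - H)                    # dL/dz at hidden layer
    gW1 = X.T @ d_hid
    gb1 = d_hid.sum(axis=0)
    return loss, (gW1, gb1, gW2, gb2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

lr = 0.5
losses = []
for _ in range(2000):
    loss, (gW1, gb1, gW2, gb2) = loss_and_grads(X, y, W1, b1, W2, b2)
    losses.append(loss)
    W1 -= lr * gW1                           # gradient descent update
    b1 -= lr * gb1
    W2 -= lr * gW2
    b2 -= lr * gb2

print(losses[0], losses[-1])
```

The factors `y_hat * (1 - y_hat)` and `H * (1 - H)` are the derivative of the sigmoid; with a different activation function only those two terms would change.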
UNIT-IV
NONPARAMETRIC METHODS- Nonparametric Density Estimation- k-Nearest Neighbor
Estimator-Nonparametric Classification- Condensed Nearest Neighbor
DIMENSIONALITY REDUCTION-Subset Selection-Principal Components Analysis-
Factor Analysis-Multidimensional Scaling-Linear Discriminant Analysis

NONPARAMETRIC METHODS:
Parametric methods, like linear regression, logistic regression, and Naive Bayes (with Gaussian
assumptions), assume a specific form for the relationship between features and the target. They
estimate a fixed set of parameters from the data. If the assumed form is incorrect, the model
might perform poorly even with a lot of data.
Nonparametric methods, on the other hand, let the data speak for itself and adapt the model
complexity accordingly. This flexibility comes at the cost of potential overfitting and
sometimes higher computational demands.

Nonparametric Density Estimation:


Nonparametric Density Estimation is a set of techniques in statistics and machine learning used
to estimate the probability density function (PDF) of a random variable without making strong
assumptions about the underlying distribution.
Nonparametric Density Estimation provides a flexible way to learn the shape of your data's
distribution directly from the data itself, without being constrained by predefined distributional
forms.
There are three nonparametric density estimation methods:
i. Histogram Estimator
ii. Kernel Density Estimator (KDE)
iii. KNN estimator (K – Nearest Neighbor Estimator)

Histogram Estimator
The Histogram Estimator is a fundamental and intuitive nonparametric density estimation
technique. It provides a visual and basic approximation of the probability density function
(PDF) of a dataset by dividing the data range into intervals (bins) and representing the
frequency of data points within each bin as the height of a bar.
Kernel Density Estimator (KDE)
The Kernel Density Estimator (KDE) is a non-parametric method used to estimate the
probability density function (PDF) of a continuous random variable. Unlike parametric
methods that assume a specific underlying distribution (like a normal distribution), KDE makes
no such assumptions and instead learns the distribution directly from the data. It provides a
smooth and continuous estimate of the density.
K-Nearest Neighbor Estimator (KNN Estimator):
The K-Nearest Neighbor Estimator (KNN Estimator) is a non-parametric density estimation
technique that estimates the probability density function (PDF) of a random variable at a given
point based on the distances to its k nearest neighbors in the training data. Unlike methods like
histograms or kernel density estimation (KDE) that fix the bin width or kernel bandwidth, the
KNN estimator adapts the size of the neighborhood based on the local density of the data.
The estimate at a point x is:
$\hat{p}(x) = \frac{k}{n \, V_d(r_k(x))}$
Where:
• k is the number of nearest neighbors to consider (a pre-defined positive integer).
• n is the total number of data points in the dataset.
• $r_k(x)$ is the distance from the point x to its k-th nearest neighbor in the dataset (using a
chosen distance metric like Euclidean distance).
• $V_d(r)$ is the volume of a d-dimensional ball (or hypercube, depending on the distance
metric) with radius r.
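
The three estimators can be compared side by side on a small one-dimensional sample. This is a sketch under stated assumptions: the data values, bin origin, bin width, bandwidth `h` and `k` are arbitrary, a Gaussian kernel is assumed for the KDE, and in one dimension the "ball" of radius r is simply an interval of length 2r.

```python
import numpy as np

def histogram_density(x, data, origin, width):
    """Fraction of points falling in x's bin, divided by the bin width."""
    bin_index = np.floor((x - origin) / width)
    in_bin = np.floor((data - origin) / width) == bin_index
    return in_bin.sum() / (len(data) * width)

def kde_density(x, data, h):
    """Average of Gaussian kernels of bandwidth h centred on the data points."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)) / h

def knn_density(x, data, k):
    """k / (n * V_1(r_k)); in 1-D the volume of radius r_k is 2 * r_k."""
    r_k = np.sort(np.abs(data - x))[k - 1]   # distance to the k-th nearest neighbor
    return k / (len(data) * 2.0 * r_k)

data = np.array([1.0, 2.0, 2.5, 3.0, 6.0])
x = 2.2
p_hist = histogram_density(x, data, origin=0.0, width=1.0)
p_kde = kde_density(x, data, h=0.5)
p_knn = knn_density(x, data, k=3)
print(p_hist, p_kde, p_knn)
```

The histogram and KDE fix the neighborhood size (bin width, bandwidth) and count points, whereas the k-NN estimator fixes the count k and lets the neighborhood size adapt.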

Difference Between Parametric and Non-Parametric


The differences between Parametric and Non-Parametric Methods are as follows:

| Parametric Methods | Non-Parametric Methods |
|---|---|
| Use a fixed number of parameters to build the model. | Use a flexible number of parameters to build the model. |
| Parametric analysis tests group means. | Non-parametric analysis tests medians. |
| Applicable only to variables. | Applicable to both variables and attributes. |
| Always make strong assumptions about the data. | Generally make fewer assumptions about the data. |
| Require less data than non-parametric methods. | Require much more data than parametric methods. |
| The data are assumed to follow a normal distribution. | No distribution is assumed. |
| Handle interval data or ratio data. | Handle the original data. |
| Results or outputs can be easily affected by outliers. | Results or outputs are not seriously affected by outliers. |
| Can perform well in many situations, but performance is at its peak when the spread of each group is different. | Can perform well in many situations, but performance is at its peak when the spread of each group is the same. |
| Have more statistical power than non-parametric methods. | Have less statistical power than parametric methods. |
| Computationally faster than non-parametric methods. | Computationally slower than parametric methods. |
| Examples: Logistic Regression, Naïve Bayes Model, etc. | Examples: KNN, Decision Tree Model, etc. |

Nonparametric Classification:
Nonparametric classification algorithms are a type of machine learning technique that doesn't
make strong assumptions about the underlying distribution of the data.
Characteristics:
• Flexibility: They can model complex and irregular decision boundaries because they
aren't constrained by a specific functional form.
• Fewer Assumptions: They make minimal to no assumptions about the shape of the
data or the relationship between features and classes.
• Data-Driven: The model's complexity and structure are determined by the training
data. More data can lead to more complex models.
• Potential for Overfitting: Due to their flexibility, they can be more prone to
overfitting the training data if not carefully tuned or with insufficient data.
• Computational Cost: Some nonparametric methods can be computationally
expensive, especially with large datasets, as they might need to store or compare new
instances with a significant portion of the training data.
• Interpretability: They are often less interpretable than parametric methods because
the decision-making process isn't always easily summarized by a small set of
parameters.
K-Nearest Neighbor (KNN) is a supervised learning algorithm used for
both classification and regression. It is non-parametric, meaning it doesn’t make any
assumptions about the underlying data distribution, which makes it versatile for various
applications. KNN works by analyzing the proximity or “closeness” of data points based on
specific distance metrics.

• In classification, KNN assigns a class label to a new data point based on the majority
class of its nearest neighbors. For instance, if a data point has five nearest neighbors,
and three of them belong to class A while two belong to class B, the algorithm will
classify the point as class A.
• In regression, KNN predicts continuous values by averaging the values of the k-nearest
neighbors. For example, if you’re predicting house prices, KNN will use the average
prices of the k-nearest neighbors to estimate the price of a new house.
How Does KNN Work?
The KNN algorithm follows a straightforward, step-by-step approach:
Step 1: Determine the Number of Nearest Neighbors (k)
The first step is to select the number of neighbors (k) to consider. The value of k determines
how many neighboring points will influence the classification or prediction of a new data point.
Step 2: Calculate the Distance between the Query (target) Point and Dataset Points
The algorithm calculates the distance between the query point (the new point to be classified
or predicted) and every data point in the dataset. Various distance metrics can
be used, such as Euclidean distance, Manhattan distance, or Minkowski distance.
Step 3: Sort and select the k-Nearest Neighbors
After calculating the distances, the algorithm sorts all data points in ascending order of
distance. It then selects the k-nearest neighbors—the data points that are closest to the query
point.
Step 4: Make a Prediction
• For classification: The algorithm assigns the query point to the class label that is most
frequent among the k-nearest neighbors (majority voting).
• For regression: The algorithm predicts the value by averaging the values of the k-
nearest neighbors.
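
The four steps translate almost line for line into code. The sketch below assumes Euclidean distance and a tiny made-up dataset; the function name `knn_predict` and its `task` argument are illustrative choices.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, task="classification"):
    # Step 2: distance from the query to every training point (Euclidean)
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 3: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    neighbors = [y_train[i] for i in nearest]
    # Step 4: majority vote or average
    if task == "classification":
        return Counter(neighbors).most_common(1)[0][0]
    return float(np.mean(neighbors))

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]], dtype=float)
labels = ["A", "A", "A", "B", "B", "B"]
prices = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]   # hypothetical house prices

query = np.array([2.0, 2.0])
predicted_class = knn_predict(X_train, labels, query, k=3)          # Step 1: k = 3
predicted_class_k5 = knn_predict(X_train, labels, query, k=5)       # 3 votes A, 2 votes B
predicted_price = knn_predict(X_train, prices, query, k=3, task="regression")
print(predicted_class, predicted_class_k5, predicted_price)
```

Nothing is learned before prediction time: all the work happens inside `knn_predict`, which is why every query costs a pass over the whole training set.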
The Problem with Standard k-NN:
The k-NN algorithm is a "lazy learner" because it stores the entire training dataset and performs
computation only at the time of prediction. For large datasets, this can lead to:
• High memory usage: Storing all training samples can be memory-intensive.
• Slow prediction times: Classifying a new instance requires calculating distances to all
training samples.
Here are some popular non-parametric classification algorithms:
• K-Nearest Neighbors (KNN): Classifies a new data point based on the majority class
among its k closest neighbors in the training data. The decision boundary is implicitly
defined by the distribution of the training points.
• Decision Trees: Create a tree-like structure of decision rules based on features to
classify instances. The tree's structure adapts to the data.
• Random Forests: An ensemble method that builds multiple decision trees on
different subsets of the data and averages their predictions.
• Support Vector Machines (SVM) with Non-linear Kernels: While linear SVM can
be seen as somewhat parametric, using kernels like the Radial Basis Function (RBF)
allows SVM to create highly non-linear decision boundaries that are data-driven.
• Neural Networks (Deep Learning Models): With enough layers and neurons, these
models can learn extremely complex and non-linear relationships, making them
effectively nonparametric in their ability to model intricate decision boundaries.
• Kernel Density Estimation (KDE) for Classification: Estimates the probability
density function for each class using KDE and then uses Bayes' theorem to classify
new points based on these estimated densities.
Condensed nearest neighbor:
Condensed Nearest Neighbor (CNN) is a data reduction technique used in machine learning,
particularly as a preprocessing step for the k-Nearest Neighbors (k-NN) algorithm. The primary
goal of CNN is to reduce the size of the training dataset while preserving or even improving
the performance of the k-NN classifier.
Here's a breakdown of how it works:
The Goal of CNN:
CNN aims to identify a smaller subset of the training data, called the "store" or "prototype set,"
which can still correctly classify the original training data using a 1-Nearest Neighbor rule.
This reduced set aims to capture the essential information needed for classification, especially
the samples near the decision boundaries between classes.
The Basic CNN Algorithm:
1. Initialize:
o Create an empty "store" (S).
o Randomly select one sample from each class in the original training set (T) and
add them to S. This ensures that all classes are initially represented.
o Place the remaining samples of T (those not added to S) in a temporary set (say, T').
2. Iterate:
o Scan through all the samples in T'.
o For each sample (x) in T', find its nearest neighbor in the current store (S).
o If the class label of the nearest neighbor in S is different from the class label of
x, then x is misclassified by the current store. In this case, move x from T' to S.
o Repeat this scan through T' until no more samples are moved to S in a complete
pass.
3. The Result: The final set S is the condensed training set.
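
A compact version of this procedure is sketched below, using the 1-Nearest Neighbor rule and Euclidean distance. The function name, the random seed and the eight-point dataset are assumptions for illustration; instead of physically moving samples between T' and S, the code keeps a list of indices that belong to the store.

```python
import numpy as np

def condensed_nearest_neighbor(X, y, seed=0):
    """Return indices of a reduced training set that 1-NN classifies X consistently with."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: one randomly chosen sample per class goes into the store
    store = [int(rng.choice(np.flatnonzero(y == c))) for c in np.unique(y)]
    # 2. Iterate: keep scanning until a full pass adds nothing
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            dists = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(dists))]
            if y[nearest] != y[i]:       # misclassified by the current store
                store.append(i)          # move the sample into the store
                changed = True
    # 3. Result: the condensed training set
    return np.array(sorted(store))

# Two well-separated clusters: one prototype per class is enough
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

store_idx = condensed_nearest_neighbor(X, y)
print(store_idx, X[store_idx])
```

When classes overlap, more points near the boundary end up in the store, and the result depends on the scan order and on the initial random picks.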
Working of CNN
Let us understand the working of CNN.
Suppose that we have a dataset D, given by
$D = \{(x_i, y_i)\}_{i=1}^{n}$
where, xᵢ is a data point and yᵢ is its original classification


Step 1: Choose 'k': CNN works with a fixed value of 'k', and the final result depends on
the value chosen. Let us assume k = 3 for our case.
Step 2: Start the iteration by randomly choosing any k (here 3) points to keep in the store S.

Step 3: We check whether the store S is 'Training Set Consistent' or not. If it is, we stop; else
we add a point to the store so as to improve it and make it training set consistent.
Training Set Consistency: A store is said to be training set consistent if running KNN on the
dataset with only the points in the store as reference points gives the same classification as
running KNN with the entire dataset,
i.e.
$g_S(a_i) = g_D(a_i)$ for every point $a_i$ in D
where gₓ(aᵢ) represents the classification of aᵢ with respect to dataset X


Now, we check if our store S with the 3 selected points is consistent or not.
Consider the dark blue point.
According to the store, it must be blue (since out of the 3 points in store, 2 are blue),
i.e. 3-NN with S as the dataset classifies the dark blue point as blue.
However, according to the complete dataset, that point must be red, because out of its 3
nearest neighbors, 2 are red,
i.e. 3-NN on the complete dataset classifies the dark blue point as red.

Clearly, we have a training set inconsistency.
Step 4: We select a random point from the dataset to add to the store such that the
inconsistency in the classification of the dark blue point is resolved, keeping the prediction
on the complete dataset as the gold standard,
i.e. select a random point xᵢ from the dataset such that on adding xᵢ we have
$g_{S \cup \{x_i\}}(a) = g_D(a)$ for the dark blue point a.

So we add the dark red point (in square) to the store S.

Step 5: Repeat till the Store S is training set consistent.


DIMENSIONALITY REDUCTION
Dimensionality reduction in machine learning is the process of transforming high-
dimensional data into a lower-dimensional space while preserving as much relevant
information as possible. It helps simplify complex datasets, improve model performance, and
make data easier to visualize and interpret.

Dimensionality reduction techniques aim to:


• Reduce the number of features: This simplifies the data without losing crucial
information.
• Transform data: New features (or combinations of existing ones) are created to
represent the data in a lower-dimensional space.
• Preserve important properties: The goal is to retain essential patterns and structures
in the original data.
Benefits of dimensionality reduction:
• Improved model performance: Simplified data can lead to faster training and better
generalization.
• Reduced computational cost: Working with fewer features can make models faster
and more efficient.
• Simplified analysis: Lower-dimensional representations can make it easier to visualize
and understand data.
• Enhanced interpretability: Reduced complexity can make it easier to understand the
relationships between features and the target variable.

Dimensionality reduction techniques can be broadly classified into two categories based on the
approach used:

1. Feature Selection
This approach selects a subset of the original features or variables, discarding the
rest. Feature selection methods can be further divided into three categories -
• Filter methods:
These methods evaluate the relevance of each feature independently of any learning
model and select the most relevant ones based on a specific criterion, such as
correlation or mutual information with the target variable. A few of the most common
filter methods include Correlation, Chi-Square Test, ANOVA, Information Gain, etc.
• Wrapper methods:
These methods evaluate the performance of a model trained on a subset of features and
select the best subset based on model performance. Some of the wrapper methods
include forward selection, backward selection, bi-directional elimination, etc.
• Embedded methods:
These methods combine feature selection with model training, selecting the most
relevant features during the training process. Some commonly used embedded methods
include LASSO and, to a lesser extent, Ridge Regression (which shrinks coefficients
but does not set them exactly to zero), etc.

2. Feature Extraction
This approach transforms the original features into a new set of features, typically of
lower dimensionality, while preserving the most important information. Feature extraction
methods can be further divided into two categories:
• Linear methods:
These methods transform the data using linear transformations, such as Principal
Component Analysis (PCA) or Linear Discriminant Analysis (LDA).
• Non-linear methods:
These methods use non-linear transformations to map the data to a lower-dimensional
space, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) or Autoencoders.

Subset Selection (Feature selection):


Feature selection in machine learning is the process of identifying and selecting a
subset of the most relevant features (input variables) from a dataset to be used in building
a predictive model. The goal is to improve model performance, reduce computational cost,
and enhance interpretability by eliminating irrelevant or redundant features.
Why Feature Selection Is Important
• Improved Model Accuracy: Irrelevant or noisy features can confuse the learning
algorithm and lead to lower accuracy. Selecting the right features can result in a model
that generalizes better to unseen data.
• Reduced Overfitting: Using too many features, especially when some are irrelevant,
can lead to overfitting, where the model learns the training data too well but performs
poorly on new data.
• Faster Training Times: Fewer features mean less data to process, resulting in faster
model training.
• Enhanced Model Interpretability: A model with fewer features is often easier to
understand and explain. Identifying the key drivers of the prediction can provide
valuable insights.
• Mitigation of the Curse of Dimensionality: In high-dimensional datasets, the data
becomes sparse, and the distance between data points becomes less meaningful. Feature
selection helps reduce the dimensionality.
Feature Selection Methods:
Feature selection techniques are broadly categorized into three main types:
1. Filter Methods:
• These methods evaluate the intrinsic properties of each feature independently of
any specific machine learning model. They use statistical measures to score and rank
features, then select the top-ranking ones.
• Advantages: Computationally efficient and can be used as a preprocessing step before
applying any model.
• Disadvantages: They don't consider the interaction between features or the specific
model being used.
• Techniques include:
o Variance Thresholding: Removing features with low variance.
o Correlation Analysis: Identifying and removing highly correlated features.
o Chi-squared Test: Assessing the statistical relationship between categorical
features and the target variable.
o ANOVA (Analysis of Variance): Assessing the statistical relationship between
numerical features and a categorical target variable.
o Information Gain / Mutual Information: Measuring the reduction in entropy
or uncertainty about the target variable given a feature.
o Statistical Tests (e.g., t-test, F-test): Assessing the significance of the
relationship between features and the target variable.
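As a minimal sketch of a filter method, the snippet below combines variance thresholding with a correlation-based ranking. The function names (`filter_select`, `pearson`, `variance`), the threshold value, and the toy data are assumptions made for illustration; they are not part of any particular library.

```python
from math import sqrt

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_select(features, target, min_variance=1e-8, top_k=2):
    """Drop near-constant features, then keep the top_k by |correlation| with the target."""
    kept = {name: col for name, col in features.items() if variance(col) > min_variance}
    ranked = sorted(kept, key=lambda name: abs(pearson(kept[name], target)), reverse=True)
    return ranked[:top_k]

# Toy data (made up for illustration): 'constant' has zero variance,
# 'signal' tracks the target closely, 'noise' does not.
features = {
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "signal":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise":    [2.0, -1.0, 2.0, -1.0, 2.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]
selected = filter_select(features, target, top_k=1)
print(selected)
```

Each feature is scored on its own, without training any model, which is why filter methods are cheap but blind to feature interactions.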

2. Wrapper Methods:
These methods evaluate subsets of features by training and evaluating a specific
machine learning model on each subset. The feature subset that yields the best model
performance (based on a chosen evaluation metric) is selected.
• Advantages: Can find feature subsets that are optimally suited for a particular model
and can capture feature interactions.
• Disadvantages: Computationally expensive, especially for a large number of features,
as it involves training the model multiple times. Can also be prone to overfitting if the
feature selection process is not carefully validated.
• Techniques include:
o Forward Selection: Starts with an empty set of features and iteratively adds the
feature that best improves model performance.
o Backward Elimination: Starts with all features and iteratively removes the
least significant feature until performance degrades.
o Recursive Feature Elimination (RFE): Repeatedly builds a model and
removes the worst-performing feature until the desired number of features is
reached.
o Exhaustive Search: Evaluates all possible subsets of features (computationally
very expensive for a large number of features).
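Forward selection can be sketched as a greedy loop around a scoring function. In the sketch below, `forward_selection` and `toy_score` are hypothetical names; in practice the score would come from training a model on the candidate subset and measuring validation performance, which the toy function only imitates.

```python
def forward_selection(all_features, score, max_features=None):
    """Greedy forward selection.

    all_features: list of feature names.
    score: callable taking a list of feature names and returning a number
           (higher is better), e.g. validation accuracy of a model.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(all_features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining feature and keep the best candidate.
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical scoring function standing in for "train a model and measure
# validation performance": each useful feature adds value, and every
# feature carries a small cost.
USEFUL = {"age": 0.30, "income": 0.25, "clicks": 0.10}

def toy_score(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.05 * len(subset)

chosen, final_score = forward_selection(
    ["age", "noise_a", "income", "clicks", "noise_b"], toy_score
)
print(chosen, round(final_score, 2))
```

The loop calls `score` once per remaining feature in every round, which illustrates why wrapper methods become expensive when each call means training a model.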

3. Embedded Methods:
These methods perform feature selection as part of the model training
process. The model itself learns which features are most important.
• Advantages: Less computationally expensive than wrapper methods and can consider
feature interactions.
• Disadvantages: Feature selection is specific to the model being used.
• Techniques include:
o L1 Regularization (Lasso): Adds a penalty to the absolute size of the
coefficients in linear models, forcing the coefficients of less important features
to become zero.
o Tree-based Feature Importance (e.g., Random Forests, Gradient
Boosting): Tree-based models naturally provide a ranking of feature importance
based on how much each feature contributes to reducing impurity.
o Feature Importance from Linear Models: The magnitude of the coefficients
in linear models can indicate feature importance (after appropriate scaling).

Principal Components Analysis:


• Principal Component Analysis (PCA) is a powerful unsupervised dimensionality
reduction technique widely used in machine learning and data analysis.
• Its primary goal is to transform high-dimensional datasets into a lower-dimensional
representation while retaining the most significant information or variance present in
the original data.
• PCA aims to find a new set of uncorrelated variables, called principal components,
which are linear combinations of the original features.
• The idea of PCA is simple: reduce the number of variables of a data set, while
preserving as much information as possible.
How PCA Works for Dimensionality Reduction:
Step 1: Standardize the Data
Make sure all features (e.g., height, weight, age) are on the same scale. Why? Otherwise a feature
like "salary" (ranging 0–100,000) could dominate "age" (0–100).
Standardizing the dataset ensures that each variable has a mean of 0 and a standard
deviation of 1: each value x is replaced by z = (x − μ) / σ, where μ and σ are the mean and
standard deviation of that feature.

Step 2: Find Relationships


Calculate how features move together using a covariance matrix. Covariance measures
the strength of joint variability between two variables, indicating how much they
change in relation to each other. For two features x1 and x2 observed on n samples, the
sample covariance is:
$$\operatorname{cov}(x_1, x_2) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)$$
The value of covariance can be positive, negative, or zero.
• Positive: As x1 increases, x2 also increases.
• Negative: As x1 increases, x2 decreases.
• Zero: No linear relation.
Step 3: Find the "Magic Directions" (Principal Components)
• PCA identifies new axes (like rotating a camera) where the data spreads out the most:
o 1st Principal Component (PC1): The direction of maximum variance (most
spread).
o 2nd Principal Component (PC2): The next best direction, perpendicular to
PC1, and so on.
• These directions are calculated using eigenvectors and eigenvalues: the
eigenvectors of the covariance matrix give the new axes, and the
eigenvalues rank them by how much variance each one captures.
For a square matrix A, an eigenvector X (a non-zero vector) and its
corresponding eigenvalue λ (a scalar) satisfy:
AX=λX
This means:
• When A acts on X, it only stretches or shrinks X by the scalar λ.
• The direction of X remains unchanged (hence, eigenvectors define "stable directions"
of A).
It can also be written as:
AX−λX=0
(A−λI)X=0

where I is the identity matrix of the same shape as matrix A. For a non-zero vector X, the above
condition can hold only if (A−λI) is non-invertible (i.e. a singular matrix). That means,
∣A−λI∣=0
This determinant equation is called the characteristic equation.
1. Solving it gives the eigenvalues λ,
2. and the corresponding eigenvectors can then be found using the equation AX=λX.
How Does This Connect to PCA?
1. In PCA, the covariance matrix C (from Step 2) acts as matrix A.
2. Eigenvectors of C are the principal components (PCs).
3. Eigenvalues represent the variance captured by each PC.
Step 4: Pick the Top Directions & Transform Data
• Keep only the top 2–3 directions (or enough to capture ~95% of the variance).
• Project the data onto these directions to get a simplified, lower-dimensional version.
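The four steps can be traced end to end for the simplest case of two features, where the characteristic equation is a quadratic that can be solved by hand. The sketch below is only an illustration under that assumption: the function names and the height/weight numbers are invented, and it assumes the two features are correlated (a non-zero off-diagonal covariance).

```python
from math import sqrt

def standardize(column):
    """Rescale a column to mean 0 and (sample) standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / std for x in column]

def covariance(xs, ys):
    """Sample covariance of two columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def pca_2d(x1, x2):
    """PCA for exactly two features: returns eigenvalues, PC1 direction, PC1 scores."""
    # Step 1: standardize both features.
    z1, z2 = standardize(x1), standardize(x2)
    # Step 2: covariance matrix C = [[a, b], [b, d]].
    a, b, d = covariance(z1, z1), covariance(z1, z2), covariance(z2, z2)
    # Step 3: solve |C - lam*I| = 0, i.e. lam^2 - (a + d)*lam + (a*d - b^2) = 0.
    trace, det = a + d, a * d - b * b
    root = sqrt(trace * trace / 4 - det)
    lam1, lam2 = trace / 2 + root, trace / 2 - root  # lam1 >= lam2
    # Eigenvector for lam1 from (C - lam1*I) v = 0 (assumes b != 0).
    v = (b, lam1 - a)
    norm = sqrt(v[0] ** 2 + v[1] ** 2)
    pc1 = (v[0] / norm, v[1] / norm)
    # Step 4: project each standardized point onto PC1 (2-D -> 1-D).
    scores = [p * pc1[0] + q * pc1[1] for p, q in zip(z1, z2)]
    return (lam1, lam2), pc1, scores

# Made-up heights (cm) and weights (kg) for five people.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [50.0, 58.0, 69.0, 77.0, 91.0]
eigenvalues, pc1, scores = pca_2d(heights, weights)
explained = eigenvalues[0] / sum(eigenvalues)
print(f"PC1 explains {explained:.1%} of the variance")
```

For more than two features the characteristic equation is no longer a quadratic, and a numerical eigen-decomposition routine from a linear algebra library would normally be used instead.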

Principal Components (PCs):


• PC₁ (First Principal Component): The direction along which the data has the maximum
variance. It captures the most important information.
• PC₂ (Second Principal Component): The direction orthogonal (perpendicular) to PC₁
along which the remaining variance is largest. It captures less variance than PC₁.

Principal components are new variables that are constructed as linear combinations or mixtures
of the initial variables. These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information within the initial variables
is squeezed or compressed into the first components. So, the idea is that 10-dimensional data gives
you 10 principal components, but PCA tries to put the maximum possible information in the first
component, then the maximum remaining information in the second, and so on. A scree plot, which
charts the variance explained by each component in order, shows this steady decline.

Factor Analysis
• Factor Analysis is an unsupervised, probabilistic machine learning algorithm used
for dimensionality reduction.
• Factor Analysis is a statistical method used to reduce a large number of observed
variables into a smaller set of underlying unobserved (latent) variables called
"factors." It helps to uncover the underlying structure of data, identify key dimensions
or constructs, and simplify complex datasets.
• There are two main types: Exploratory Factor Analysis (EFA) and Confirmatory
Factor Analysis (CFA). EFA is used when you don't have a preconceived idea of the
factor structure, while CFA tests a pre-specified hypothesis about the factor structure.
• Exploratory Factor Analysis (EFA):
o EFA is a data-driven technique used when the researcher has no clear
hypothesis about the number of factors or how observed variables will
specifically group onto those factors.
o Its primary goal is to discover and identify the underlying factor structure.
Think of it as "exploring" your data to see what natural clusters or dimensions
emerge.
• Confirmatory Factor Analysis (CFA):
o CFA is a theory-driven technique used when the researcher has a clear, pre-
specified hypothesis about the number of factors, which specific observed
variables load onto which factors, and whether these factors are correlated. Its
primary goal is to test or confirm this hypothesized factor structure.

How to do Factor Analysis (Factor Analysis Steps)


1. Data Collection and Preparation
2. Assessing Suitability of Data for Factor Analysis (Factorability)
3. Factor Extraction
4. Factor Rotation
5. Interpretation of Factors
6. Compute Factor Scores (Optional)
7. Validation of the Factor Structure

Multidimensional Scaling
• Multidimensional scaling (MDS) is a dimensionality reduction technique that is
used to project high-dimensional data onto a lower-dimensional space while
preserving the pairwise distances between the data points as much as possible.
• MDS is based on the concept of distance and aims to find a projection of the data
that minimizes the differences between the distances in the original space and
the distances in the lower-dimensional space.
• MDS is commonly used to visualize complex, high-dimensional data, and to
identify patterns and relationships that may not be apparent in the original space.
• It can be applied to a wide range of data types, including numerical, categorical,
and mixed data.
• MDS is implemented using numerical optimization algorithms, such as gradient
descent or simulated annealing, to minimize the difference between the distances in
the original and lower-dimensional spaces.

There are three main types of Multidimensional Scaling: Classical, Metric, and Non-
metric.
1. Classical Multidimensional Scaling (CMDS)
Classical MDS, also known as Principal Coordinates Analysis (PCoA), takes an
input matrix representing dissimilarities between pairs of items and produces a
coordinate matrix that minimizes "strain." Strain quantifies how well the distances in
the low-dimensional representation match the original dissimilarities.
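Mathematically, strain is defined as:
$$\text{Strain}_D(x_1, \dots, x_n) = \left( \frac{\sum_{i,j} \left( b_{ij} - x_i^{T} x_j \right)^2}{\sum_{i,j} b_{ij}^2} \right)^{1/2}$$
where b_ij are the entries of the double-centered matrix B (step 2 below) and x_i is the
low-dimensional coordinate vector of item i.
The steps of a Classical MDS algorithm involve:
1. Setting up the squared proximity matrix D^(2) = [d_ij^2].
2. Applying double centering to compute matrix B = −(1/2) J D^(2) J, where
J = I − (1/n) 1 1^T and n is the number of items.
3. Determining the m largest eigenvalues and corresponding eigenvectors of B.
4. Obtaining the coordinates matrix X = E_m Λ_m^(1/2), where E_m holds the m eigenvectors
and Λ_m is the diagonal matrix of the m eigenvalues.
Classical MDS is chosen when the distance data are Euclidean and accurate
preservation of these distances is crucial.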

2. Metric Multidimensional Scaling (MMDS)
Metric Multidimensional Scaling generalizes the optimization procedure of
MDS to various loss functions and input matrices with known distances and weights. It
minimizes a cost function called "stress," which is often minimized using a procedure
called stress majorization.
Stress is defined as a residual sum of squares:
$$\text{Stress}_D(x_1, \dots, x_n) = \sqrt{ \sum_{i < j} \left( d_{ij} - \lVert x_i - x_j \rVert \right)^2 }$$
where d_ij is the given dissimilarity between items i and j, and x_i is the low-dimensional
position of item i.

Metric MDS is suitable when distances are non-Euclidean or when the scale of
measurement levels varies.
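To make the stress criterion concrete, the sketch below evaluates it for two candidate embeddings of the same three items, counting each pair once. The function name `stress` and the 3-4-5 triangle data are assumptions chosen for illustration; an actual metric MDS routine would search for the coordinates that minimize this quantity rather than just evaluate it.

```python
from math import dist, sqrt

def stress(dissimilarities, points):
    """Raw stress: sqrt of the sum over pairs i < j of (d_ij - ||x_i - x_j||)^2."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (dissimilarities[i][j] - dist(points[i], points[j])) ** 2
    return sqrt(total)

# Dissimilarities between three items forming a 3-4-5 right triangle.
D = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]

# A 2-D embedding that reproduces every distance exactly.
exact_2d = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
# A 1-D embedding: two distances are kept, the third (5) shrinks to 1.
squeezed_1d = [(0.0,), (3.0,), (4.0,)]

print(stress(D, exact_2d), stress(D, squeezed_1d))
```

A stress of zero means the low-dimensional distances match the dissimilarities perfectly; larger values quantify how much was distorted by reducing the dimension.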
3. Non-metric Multidimensional Scaling (NMDS)
Non-metric Multidimensional Scaling finds a non-parametric monotonic
relationship between dissimilarities and Euclidean distances between items, along with
the location of each item in the low-dimensional space. It defines a "stress" function
to optimize, considering a monotonically increasing function f.
Non-metric MDS is beneficial for qualitative data or when only the order of distances
(not the actual distances) matters.

Linear Discriminant Analysis


UNIT-V
REINFORCEMENT LEARNING: Introduction- Single State Case: K-Armed Bandit-
Elements of Reinforcement Learning- Model-Based Learning- Temporal Difference Learning-
Generalization- Partially Observable States

REINFORCEMENT LEARNING
Reinforcement learning (RL) is a type of machine learning where an agent learns to make
decisions by interacting with an environment and receiving rewards or penalties based on its
actions. The goal is to maximize cumulative rewards over time. It's a trial-and-error approach
where the agent learns through feedback, adjusting its behavior to achieve optimal outcomes.

Key Concepts:
• Agent: The entity that interacts with the environment and makes decisions.
• Environment: The world in which the agent operates.
• Action: A choice made by the agent in the environment.
• Reward: Feedback given to the agent for its actions, positive or negative.
• Policy: The agent's strategy for selecting actions based on the current state.
• Value function: A measure of how good a particular state or action is, based on the
expected future rewards.
• State: The current condition of the environment that the agent perceives.
How it Works:
1. The agent interacts with the environment and takes an action.
2. The environment responds to the action and transitions to a new state.
3. The agent receives a reward or penalty for its action.
4. The agent updates its policy (strategy) based on the feedback it receives.
5. This cycle repeats, with the agent gradually learning to make better decisions to
maximize cumulative rewards.

Examples of Applications:
• Robotics: Training robots to navigate environments, grasp objects, or perform tasks.
• Game Playing: Developing AI agents that can play games like chess, Go, or video
games.
• Autonomous Driving: Building self-driving cars that can navigate roads and make
decisions.
• Resource Management: Optimizing resource allocation in various systems, such as
power grids or data centers.
• Recommendation Systems: Personalizing recommendations to users based on their
preferences.
Types of Reinforcement Learning:
• Policy-based RL: Directly learns the policy (strategy) that maps states to actions.
• Value-based RL: Learns a value function that estimates the expected reward of being
in a particular state or taking a specific action.
• Model-based RL: Learns a model of the environment to predict how the environment
will respond to actions.
Single State Case: K-Armed Bandit- Elements of Reinforcement Learning
• The Multi-Armed Bandit (MAB) problem is a classic problem in probability theory
and decision-making that captures the essence of balancing exploration and
exploitation.
• This problem is named after the scenario of a gambler facing multiple slot machines
(bandits) and needing to determine which machine to play to maximize their
rewards.
• The MAB problem has significant applications in various fields, including online
advertising, clinical trials, adaptive routing in networks, and more.
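One common way to balance exploration and exploitation is the epsilon-greedy rule: with a small probability the agent tries a random arm, otherwise it plays the arm with the best estimated reward so far. The sketch below is an illustration only; the function name `run_bandit`, the Gaussian reward model, and the arm means are assumptions, not part of the notes.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy play on a K-armed bandit with Gaussian rewards."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k  # running estimate of each arm's mean reward
    counts = [0] * k       # number of times each arm was pulled
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental sample-average update of the chosen arm's estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward / steps

# Three slot machines with hidden mean payouts (made-up values).
estimates, counts, average_reward = run_bandit([0.2, 0.5, 1.5])
print(counts, [round(e, 2) for e in estimates])
```

Setting `epsilon=0` gives a purely greedy player that can lock onto a poor arm, while `epsilon=1` explores forever and never uses what it has learned.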

Formalizing the K-armed bandit problem


Model-Based Learning

• Model-based learning is a powerful paradigm in artificial intelligence and machine
learning where an agent or system explicitly constructs and utilizes a model of its
environment.
• This model is essentially a representation of how the world works, allowing the
system to predict future states, rewards, and outcomes based on its current state and
chosen actions.

• Model Learning:
o The agent interacts with the environment, collecting data in the form of (state,
action, next state, reward) tuples.
o This experience is then used to learn a model of the environment's dynamics
(s_{t+1} = f(s_t, a_t)) and sometimes also the reward function (r(s_t, a_t)). This learning
process can often be framed as a supervised learning problem.
• Planning:
o Once the model is learned, the agent uses it to simulate future interactions
without needing to interact with the real environment. This simulation allows
the agent to:
• Predict outcomes: Given a sequence of actions, the model can predict the
resulting states and rewards.
• Evaluate policies: By simulating different action sequences, the agent can
evaluate the potential effectiveness of various policies (strategies for acting).
• Optimize actions: The agent can use planning algorithms (e.g., Model Predictive
Control (MPC), tree search algorithms) to find the optimal sequence of actions
that maximizes expected future rewards, all within the simulated
environment.
• Action Execution and Model Update: Based on its planning, the agent takes an
action in the real environment. The new experience gained is then used to refine and
update the learned model, leading to continuous improvement.
Model-based learning algorithms:
• Monte Carlo Tree Search (MCTS)
• Model Predictive Control (MPC)
• Dyna-Q
• Model-Based Policy Optimization (MBPO)
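The learn-then-plan cycle described above can be sketched for a tiny deterministic environment. This is not one of the listed algorithms: it is a simplified illustration in which the model is a lookup table built from (state, action, next state, reward) tuples and planning is value iteration inside that table. The names `learn_model` and `plan` and the three-state chain are assumptions.

```python
def learn_model(transitions):
    """Build a tabular model from (state, action, next_state, reward) tuples.

    Assumes a deterministic environment, so the latest observation for a
    (state, action) pair is taken as the truth.
    """
    model = {}
    for s, a, s_next, r in transitions:
        model[(s, a)] = (s_next, r)
    return model

def plan(model, gamma=0.9, sweeps=100):
    """Value iteration run entirely inside the learned model."""
    states = {s for s, _ in model} | {s_next for s_next, _ in model.values()}
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            options = [r + gamma * values[s_next]
                       for (s0, _), (s_next, r) in model.items() if s0 == s]
            if options:  # states with no known action keep value 0 (treated as terminal)
                values[s] = max(options)
    policy = {}
    for s in states:
        actions = {a: r + gamma * values[s_next]
                   for (s0, a), (s_next, r) in model.items() if s0 == s}
        if actions:
            policy[s] = max(actions, key=actions.get)
    return values, policy

# Experience gathered on a made-up chain 0 -> 1 -> 2, where reaching
# state 2 from state 1 pays a reward of 1.
experience = [
    (0, "right", 1, 0.0),
    (0, "left", 0, 0.0),
    (1, "right", 2, 1.0),
    (1, "left", 0, 0.0),
]
values, policy = plan(learn_model(experience))
print(policy, values)
```

Once the table is built, no further interaction with the real environment is needed to evaluate actions, which is the main appeal of the model-based approach.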

Temporal Difference Learning


Temporal Difference (TD) learning is a fundamental concept in Reinforcement Learning (RL)
that bridges the gap between Monte Carlo methods and Dynamic Programming. At its
heart, TD learning is about learning to predict future rewards directly from experience,
without needing a full model of how the environment works.
Let's break down the core ideas:
The Problem: Learning Values
In Reinforcement Learning, an agent interacts with an environment. The goal is often to learn
a value function, which estimates how much accumulated reward the agent can expect to
receive from a given state, or from taking a given action in a given state.
• State-Value Function (V(s)): The expected total future reward starting from state s.
• Action-Value Function (Q (s, a)): The expected total future reward starting from state
s and taking action a.
The challenge is to learn these values when the agent doesn't know the environment's rules
(i.e., it doesn't have a model of the environment).
The "Temporal Difference" - The Core Idea
Imagine you're trying to predict the outcome of a long sequence of events, like the final score
of a cricket match.
• Monte Carlo Approach: You'd wait until the entire match is over, observe the final
score (the actual outcome), and then adjust your initial prediction based on that. This is
like waiting for an entire episode to finish.
• Temporal Difference Approach: You wouldn't wait. Instead, after each ball, over, or
significant event, you'd update your current prediction based on the immediate change
(the score added in that segment) and your new prediction for the future from the current
point.
This "immediate change" plus the "new prediction" is the essence of the TD target. The
difference between your old prediction and this new, more informed prediction is the Temporal
Difference error.
How TD Learning Works (The Simplest Form: TD(0))
Let's focus on learning the state-value function V(s).
1. Agent is in a state S_t. It has a current estimate of V(S_t).
2. Agent takes an action A_t.
3. Environment responds: The agent receives an immediate reward R_{t+1} and transitions
to a new state S_{t+1}.
4. The "TD Target":
A Better Estimate: Now, from state S_{t+1}, the agent also has an estimate of its value,
V(S_{t+1}). A better estimate for the value of S_t can be formed by combining the immediate
reward R_{t+1} with the discounted future value from S_{t+1}:
TD Target = R_{t+1} + γ V(S_{t+1})
o R_{t+1} is the actual reward just received.
o γ (gamma) is the discount factor (between 0 and 1). It indicates how much
future rewards are valued compared to immediate rewards. A γ of 0 means
only immediate rewards matter; a γ close to 1 means future rewards are nearly
as important as immediate ones.
o V(S_{t+1}) is your current best guess of the value of the next state. This is where
"bootstrapping" comes in: you're using an estimate (V(S_{t+1})) to update another
estimate (V(S_t)).
5. The "TD Error": How Wrong Were You? The TD error is the difference between
this better, updated estimate (the TD Target) and your previous estimate of V(S_t):
TD Error = (R_{t+1} + γ V(S_{t+1})) − V(S_t)
This error tells you how much your current prediction for V(S_t) was off.
6. Updating the Value Function: You then adjust your estimate of V(S_t) based on this
error.
V(S_t) ← V(S_t) + α × TD Error
o α (alpha) is the learning rate (between 0 and 1). It determines how much you
"learn" from each error. A small α means slow, steady updates; a large α
means aggressive updates.
This process is repeated at every step as the agent interacts with the environment.
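The update rule can be tried on a small example. The sketch below applies TD(0) to a made-up deterministic chain in which the agent always moves one state forward, so the true values are easy to check by hand; the function name `td0_chain` and the reward layout are assumptions.

```python
def td0_chain(rewards, gamma=0.9, alpha=0.1, episodes=500):
    """TD(0) value estimation on a deterministic chain 0 -> 1 -> ... -> terminal.

    rewards[i] is the reward received when leaving state i.
    """
    n = len(rewards)
    values = [0.0] * (n + 1)  # last entry is the terminal state, fixed at 0
    for _ in range(episodes):
        for state in range(n):
            next_state = state + 1
            reward = rewards[state]
            td_target = reward + gamma * values[next_state]  # R + gamma * V(S')
            td_error = td_target - values[state]             # how wrong V(S) was
            values[state] += alpha * td_error                # move V(S) toward the target
    return values[:n]

# Three states; only the final step pays a reward of 1.
values = td0_chain([0.0, 0.0, 1.0])
print([round(v, 3) for v in values])
```

With γ = 0.9 the exact values are 0.81, 0.9 and 1.0 for the three states, and the estimates approach them without the agent ever being told the transition rules. The value of each state is updated after every step, not at the end of the episode.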
Generalization
• Reinforcement Learning (RL) is a type of machine learning in which an agent learns to
make decisions by performing actions in an environment to reach a specific goal.
• Generalization in RL refers to an agent's ability to effectively apply learned strategies
(policies) to new, previously unseen, but similar states, tasks, or environments.
• Achieving good performance on the training environments doesn't guarantee success in
novel situations, so an agent must be judged on how well its learned experience
transfers to conditions it has not encountered before.
Partially Observable States
In Reinforcement Learning (RL), an agent learns to make decisions in an environment to
maximize a cumulative reward. The success of many RL algorithms relies on the assumption
that the agent can fully observe the "state" of the environment at each time step. However, this
is often not the case in real-world scenarios.
Partially observable states in reinforcement learning refer to situations where the agent does
not have complete information about the true underlying state of the environment. Instead, the
agent receives an observation which is only a partial or noisy reflection of the actual state.
A Markov Decision Process (MDP) is a mathematical framework used to model sequential
decision-making problems in situations where outcomes are partly random and partly under
the control of a decision-maker (often called an "agent"). It's a foundational concept in
Reinforcement Learning (RL), providing the theoretical basis for how an intelligent agent
learns to make optimal decisions in an environment to maximize a long-term cumulative
reward.
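• An MDP is formally defined by a 5-tuple: (S,A,P,R,γ)
• States (S): This is a finite set of all possible "situations" or configurations the agent
can be in.
• Actions (A): This is a finite set of all possible actions the agent can take.
• Transition Probability Function (P): Also denoted as P(s′∣s,a) or T(s,a,s′). It gives the
probability of moving to state s′ after taking action a in state s.
• Reward Function (R): Also denoted as R(s,a) or R(s,a,s′). It gives the immediate reward
for taking action a in state s (and, in the second form, arriving in s′).
• Discount Factor (γ): A value between 0 and 1 (inclusive), 0≤γ≤1, that sets how much
future rewards count relative to immediate ones.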
How an MDP Works in Reinforcement Learning
In RL, the agent's interaction with the MDP environment proceeds in discrete time steps:
1. At time t, the agent observes the current state S_t.
2. Based on S_t, the agent chooses an action A_t according to its policy (π). A policy is
essentially a strategy or a mapping from states to actions (or probabilities of actions).
3. The environment transitions to a new state S_{t+1} with probability P(S_{t+1}∣S_t, A_t).
4. The agent receives an immediate reward R_{t+1} based on the state transition.
5. This process repeats.
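The 5-tuple and the interaction loop can be written down directly as data and a sampling function. The two-state MDP below is invented for illustration (state names, actions, probabilities and rewards are all assumptions); it only shows how S, A, P, R and γ fit together when an agent follows a fixed policy.

```python
import random

# Hypothetical two-state MDP (all names and numbers are made up).
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9
# P[(s, a)] lists the possible outcomes as (probability, next_state, reward).
P = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    ("hot", "slow"):  [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
    ("hot", "fast"):  [(1.0, "hot", -10.0)],
}

def step(state, action, rng):
    """Sample the next state and reward from P(. | state, action)."""
    outcomes = P[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def rollout(policy, start, horizon, seed=0):
    """Follow a policy for `horizon` steps and return the discounted return."""
    rng = random.Random(seed)
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        action = policy[state]                     # choose A_t from the policy
        state, reward = step(state, action, rng)   # environment transition
        total += discount * reward                 # accumulate gamma^t * reward
        discount *= GAMMA
    return total

cautious = {"cool": "slow", "hot": "slow"}
ret = rollout(cautious, start="cool", horizon=3)
print(round(ret, 2))
```

Under the cautious policy the agent stays in "cool" and collects 1 + 0.9 + 0.81 over three steps. In the partially observable case the agent would not see `state` directly and would have to act on an observation derived from it.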
SIDDARTHA INSTITUTE OF SCIENCE AND TECHNOLOGY:: PUTTUR

(AUTONOMOUS) L T P C
3 - - 3
III B.Tech. – II Sem.

(20CS0535) MACHINE LEARNING


(Professional Elective Course-II)

COURSE OBJECTIVES

The objectives of this course:


1. To investigate various Supervised Learning models of machine learning
2. To investigate various Unsupervised Learning models of machine learning
3. To investigate various Reinforcement Learning models of machine learning
4. To expose students to the Dimensionality Reduction

COURSE OUTCOMES (COs)


On successful completion of this course, the student will be able to
1. Understand the basics of Machine Learning
2. Apply the various supervised learning algorithms to classification and regression
problems
3. Analyze the various unsupervised learning techniques like k-means, EM algorithm and to
apply for real world problems
4. Understand the concepts of Clustering Techniques.
5. Identify the need of Parametric methods and Dimensionality Reduction Techniques in
machine learning.
6. Infer the theoretical and practical concepts of Reinforcement Learning
UNIT-I

INTRODUCTION: What is machine learning? - Examples of machine learning applications - Types of
machine learning - Model selection and generalization - Guidelines for Machine Learning Experiments

UNIT-II

SUPERVISED LEARNING: Classification, Decision Trees – Univariate Tree – Multivariate Tree –
Pruning, Bayesian Decision Theory, Parametric Methods- Maximum Likelihood Estimation - Evaluating
an Estimator Bias and Variance - The Bayes' Estimator, Linear Discrimination- Gradient Descent-
Logistic Discrimination- Discrimination by Regression, Multilayer Perceptron- Perceptron- Multilayer
Perceptrons- Back Propagation Algorithm

UNIT-III

UNSUPERVISED LEARNING: Clustering- Introduction- Mixture Densities- k-Means Clustering-
Expectation-Maximization Algorithm- Mixtures of Latent Variable Models- Supervised Learning after
Clustering- Hierarchical Clustering
UNIT-IV

NONPARAMETRIC METHODS- Nonparametric Density Estimation- k-Nearest
Neighbor Estimator- Nonparametric Classification- Condensed Nearest Neighbor

DIMENSIONALITY REDUCTION- Subset Selection- Principal Components Analysis- Factor
Analysis- Multidimensional Scaling- Linear Discriminant Analysis.

UNIT-V

REINFORCEMENT LEARNING: Introduction- Single State Case: K-Armed Bandit- Elements of
Reinforcement Learning- Model-Based Learning- Temporal Difference Learning- Generalization-
Partially Observable States

TEXT BOOKS

1. Ethem Alpaydin, Introduction to Machine Learning, MIT Press, Second Edition, 2010.

REFERENCES

1. Tom M Mitchell, Machine Learning, First Edition, McGraw Hill Education, 2013.

2. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press.
UNIT-I

INTRODUCTION: What is machine learning? - Examples of machine learning applications - Types of
machine learning - Model selection and generalization - Guidelines for Machine Learning Experiments

Machine learning is a growing technology that enables computers to learn automatically from past
data. It uses various algorithms to build mathematical models and make predictions from
historical data or information. Currently, it is used for tasks such as image recognition,
speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.

What is Machine Learning?

Machine learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past experience on
its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it
in a summarized way as:

Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.
With the help of sample historical data, known as training data, machine learning algorithms
build a mathematical model that helps in making predictions or decisions without being explicitly
programmed. Machine learning brings computer science and statistics together to create predictive
models. It constructs or uses algorithms that learn from historical data. The more information
we provide, the better the performance tends to be.

A machine has the ability to learn if it can improve its performance by gaining more data.

Machine learning is a subfield of artificial intelligence that involves training computers to learn from
data without being explicitly programmed. In other words, machine learning algorithms use statistical
techniques to find patterns in data and use these patterns to make predictions or take actions.

How does Machine Learning work?

A machine learning system learns from historical data, builds prediction models, and, whenever
it receives new data, predicts the output for it. The accuracy of the predicted output depends in
part on the amount of data: a larger amount of representative data generally helps to build a
better model that predicts the output more accurately.

Suppose we have a complex problem where we need to make predictions. Instead of writing
code for it, we feed the data to generic algorithms, and with the help of these algorithms the
machine builds the logic from the data and predicts the output. Machine learning has changed our
way of thinking about such problems. The block diagram below explains the working of a machine
learning algorithm:

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that it also deals with huge amounts of
data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that it is capable of
doing tasks that are too complex for a person to implement directly. As humans, we have
limitations: we cannot process huge amounts of data manually, so we need computer systems,
and this is where machine learning makes things easier for us.
We can train machine learning algorithms by providing them with a huge amount of data and letting
them explore the data, construct models, and predict the required output automatically. The
performance of a machine learning algorithm depends on the amount of data, and it can be measured
by a cost function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood from its use cases. Currently, machine
learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestion
by Facebook, etc. Top companies such as Netflix and Amazon have built machine learning
models that use a vast amount of data to analyze user interest and recommend products
accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid growth in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data.

History of Machine Learning

Some years ago (about 40-50 years), machine learning was science fiction, but today it is part of
our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to
Amazon's virtual assistant "Alexa". However, the idea behind machine learning is old and has a
long history. Some milestones in the history of machine learning are given below:

The early history of Machine Learning (Pre-1940):


o 1834: In 1834, Charles Babbage, the father of the computer, conceived a device that could be
programmed with punch cards. The machine was never built, but all modern computers
rely on its logical structure.
o 1936: In 1936, Alan Turing gave a theory of how a machine can determine and execute a set
of instructions.

The era of stored program computers:


o 1945: In 1945, "ENIAC", the first electronic general-purpose computer, was completed; it was
programmed manually. After that, stored-program computers such as EDSAC in
1949 and EDVAC in 1951 were built.
o 1943: In 1943, a human neural network was modeled with an electrical circuit. In 1950,
scientists started putting this idea to work and analyzed how human neurons might work.

Computer machinery and intelligence:


o 1950: In 1950, Alan Turing published a seminal paper, "Computing Machinery and
Intelligence," on the topic of artificial intelligence. In his paper, he asked, "Can machines
think?"

Machine intelligence in Games:


o 1952: Arthur Samuel, a pioneer of machine learning, created a program that helped
an IBM computer play checkers. It performed better the more it played.
o 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.

The first "AI" winter:


o The period from 1974 to 1980 was a tough time for AI and ML researchers, and it came to be
called the AI winter.
o In this period, machine translation failed to deliver, and people lost interest
in AI, which led to reduced government funding for research.

Machine Learning from theory to reality


o 1959: In 1959, the first neural network was applied to a real-world problem: removing echoes
over phone lines using an adaptive filter.
o 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural network, NETtalk,
which was able to teach itself how to correctly pronounce 20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against the reigning world champion
Garry Kasparov, becoming the first computer to beat a reigning world chess champion in a match.

Machine Learning in the 21st century


o 2006: In 2006, computer scientist Geoffrey Hinton gave neural net research a new name,
"deep learning," and it has since become one of the most popular technologies.
o 2012: In 2012, Google created a deep neural network which learned to recognize images of
humans and cats in YouTube videos.
o 2014: In 2014, the chatbot "Eugene Goostman" cleared the Turing Test. It was the first chatbot
to convince 33% of the human judges that it was not a machine.
o 2014: DeepFace was a deep neural network created by Facebook, which claimed that it
could recognize a person with the same precision as a human.
o 2016: AlphaGo beat Lee Sedol, then the world's number two player, at the game of Go. In 2017
it beat the number one player, Ke Jie.
o 2017: In 2017, Alphabet's Jigsaw team built an intelligent system that learned to detect
online trolling. It read millions of comments from different websites to learn how to stop
online trolling.

Machine Learning at present:

Machine learning research has now advanced considerably, and it is present all around us in
self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It
includes supervised, unsupervised, and reinforcement learning, with algorithms for
clustering, classification, decision trees, SVMs, etc.

Modern machine learning models can be used for making various predictions, including weather
prediction, disease prediction, stock market analysis, etc.

Prerequisites

Before learning machine learning, you should have basic knowledge of the following so that you can
easily understand its concepts:

o Fundamental knowledge of probability and linear algebra.
o The ability to code in a programming language, preferably Python.
o Knowledge of calculus, especially derivatives of single-variable and multivariate functions.



Examples of Machine Learning Applications

Machine learning is a buzzword in today's technology, and it is growing rapidly. We
use machine learning in our daily life, often without knowing it, through Google Maps, Google
Assistant, Alexa, etc. Below are some of the most popular real-world applications of machine learning:
1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to identify
objects, persons, places, etc. in digital images. A popular use case of image recognition and face
detection is automatic friend tagging suggestion:

Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our
Facebook friends, we automatically get a tagging suggestion with their names, and the technology behind
this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and
person identification in the picture.

2. Speech Recognition

While using Google, we get the option "Search by voice." It comes under speech recognition, which is
a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it is also known as
"speech to text" or "computer speech recognition." At present, machine learning algorithms are widely
used in speech recognition applications. Google Assistant, Siri, Cortana, and Alexa
use speech recognition technology to follow voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with
the shortest route and predicts the traffic conditions.

It predicts whether traffic is clear, slow-moving, or heavily congested
using two sources:

o The real-time location of vehicles from the Google Maps app and sensors
o The average time taken on past days at the same time of day

Everyone who uses Google Maps helps to make the app better. It takes information from the
user and sends it back to its database to improve performance.
4. Product recommendations:

Machine learning is widely used by various e-commerce and entertainment companies such as
Amazon, Netflix, etc., for recommending products to the user. Whenever we search for a product
on Amazon, we start seeing advertisements for the same product while browsing the internet in the
same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests products
according to those interests.

Similarly, when we use Netflix, we find recommendations for series, movies, etc.,
and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars, in which machine
learning plays a significant role. Tesla, a popular car manufacturer, is working on
self-driving cars. It uses an unsupervised learning method to train the car models to detect people and
objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, or spam. Important
mail arrives in our inbox marked with the important symbol, and spam emails go to the spam folder;
the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the
name suggests, they help us find information using voice instructions. These assistants can help us
in various ways just through voice instructions, such as playing music, calling someone, opening an
email, scheduling an appointment, etc.

Machine learning algorithms are an important part of these virtual assistants.

The assistants record our voice instructions, send them to a server in the cloud, decode them using ML
algorithms, and act accordingly.
8. Online Fraud Detection:

Machine learning makes our online transactions safe and secure by detecting fraudulent transactions.
Whenever we perform an online transaction, fraud can take place in various ways,
such as fake accounts, fake IDs, and money stolen in the middle of a transaction. To
detect this, a feed-forward neural network checks whether a transaction is genuine or
fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become
the input for the next round. Each genuine transaction follows a specific pattern, which changes
for a fraudulent transaction; the network detects this change and makes our online transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk
of ups and downs in share prices, so long short-term memory (LSTM) neural networks are used
for the prediction of stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With this, medical technology is
growing very fast and is able to build 3D models that can predict the exact position of lesions in the
brain. It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and do not know the language, it is not a problem,
because machine learning helps us by converting text into a language we know. Google's
GNMT (Google Neural Machine Translation) provides this feature. It is a neural machine translation
system that translates text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which
is used together with image recognition to translate text from one language to another.

Types of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning


Supervised Machine Learning

Supervised learning is the type of machine learning in which machines are trained using well-labelled
training data and, on the basis of that data, predict the output. Labelled data means that the input
data is already tagged with the correct output.

In supervised learning, the training data provided to the machine works as the supervisor that teaches
the machine to predict the output correctly. It applies the same concept as a student learning under the
supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the
input variable (x) to the output variable (y).

In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.

How Does Supervised Learning Work?

In supervised learning, models are trained using a labelled dataset, from which the model learns about
each type of data. Once the training process is completed, the model is tested on test data (a held-out
portion of the labelled data that was not used for training), and it predicts the output for those examples.

The working of supervised learning can be easily understood from the example and diagram below:

Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and
polygons. The first step is to train the model on each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the
shape on the basis of its number of sides and predicts the output.

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect the labelled training data.
o Split the labelled data into a training dataset, a validation dataset, and a test dataset.
o Determine the input features of the training dataset, which should carry enough information for
the model to accurately predict the output.
o Determine the suitable algorithm for the model, such as support vector machine, decision tree,
etc.
o Execute the algorithm on the training dataset. Sometimes we need a validation set to tune the control
parameters; it is held out from the training data.
o Evaluate the accuracy of the model on the test set. If the model predicts the correct
outputs, the model is accurate.
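
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy shape data, the 60/20/20 split, and the nearest-centroid classifier (with the hypothetical helper names split_dataset, train_centroids, predict, and accuracy) are choices made for this sketch, not something prescribed by these notes.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle labelled rows and split them into train/validation/test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def train_centroids(rows):
    """Learn one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=sq_dist)

def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(centroids, f) == y for f, y in rows) / len(rows)

# Toy labelled data (an assumption): (number_of_sides, all_sides_equal) -> label
data = ([((3, 1), "triangle")] * 10
        + [((4, 1), "square")] * 10
        + [((6, 1), "hexagon")] * 10)

train, val, test = split_dataset(data)   # split the labelled data
model = train_centroids(train)           # execute the algorithm on the training set
print("validation accuracy:", accuracy(model, val))
print("test accuracy:", accuracy(model, test))  # final evaluation on unseen data
```

The test set is touched only once, at the end, so its accuracy is an honest estimate of how the model behaves on unseen data.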

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when the output variable is continuous and depends on the input
variables. They are used for the prediction of continuous quantities, such as weather forecasting, market
trends, etc. Below are some popular regression algorithms which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
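
As a small illustration of the first algorithm in the list, linear regression with one input variable can be fitted in closed form by ordinary least squares. The function name fit_line and the sunshine/temperature numbers are hypothetical and chosen only for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated from y = 2x + 1: hours of sunshine vs. temperature.
hours = [1.0, 2.0, 3.0, 4.0]
temps = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(hours, temps)
print(slope, intercept)                 # recovers slope 2 and intercept 1
print(slope * 5.0 + intercept)          # prediction for a new input x = 5
```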

2. Classification

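Classification algorithms are used when the output variable is categorical, which means it takes one of
a fixed set of classes such as Yes-No, Male-Female, True-False, etc. Spam filtering is a typical
example. Below are some popular classification algorithms which come under supervised learning: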

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
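
To make the idea of classification concrete, here is a minimal sketch of logistic regression with a single feature, trained by batch gradient descent. The feature (a count of suspicious words per email), the learning rate, and the number of epochs are assumptions chosen for this toy example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def classify(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical data: suspicious-word count per email; label 1 = spam, 0 = not spam.
counts = [0, 1, 2, 3, 7, 8, 9, 10]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(counts, labels)
print([classify(w, b, x) for x in counts])
```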

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the basis of prior
experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.

Disadvantages of supervised learning:


o Supervised learning models are not well suited to very complex tasks.
o Supervised learning may not predict the correct output if the test data differs substantially from
the training data.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning

In the previous topic, we learned about supervised machine learning, in which models are trained using
labeled data under the supervision of training data. But there may be many cases in which we do not have
labeled data and need to find the hidden patterns in the given dataset. To solve such cases in
machine learning, we need unsupervised learning techniques.

What is Unsupervised Learning?

Unsupervised learning is a machine learning technique in which models are not supervised using a
training dataset. Instead, the models themselves find the hidden patterns and insights in the given data.
It can be compared to the learning that takes place in the human brain while learning new things. It can
be defined as:

Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem because unlike
supervised learning, we have the input data but no corresponding output data. The goal of unsupervised
learning is to find the underlying structure of dataset, group that data according to similarities,
and represent that dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of
different types of cats and dogs. The algorithm is never trained upon the given dataset, which means it
does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm
is to identify the image features on their own. Unsupervised learning algorithm will perform this task by
clustering the image dataset into the groups according to similarities between images.


Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights in the data.
o Unsupervised learning is similar to how a human learns to think from their own experiences,
which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it
all the more important.
o In the real world, we do not always have input data with the corresponding output, so to solve such
cases, we need unsupervised learning.

Working of Unsupervised Learning

Working of unsupervised learning can be understood by the below diagram:


Here, we have taken unlabeled input data, which means it is not categorized and the corresponding
outputs are not given. This unlabeled input data is fed to the machine learning model in order
to train it. First, the model interprets the raw data to find the hidden patterns in it, and then it
applies a suitable algorithm such as k-means clustering or hierarchical clustering.

Once the suitable algorithm is applied, it divides the data objects into groups according to
the similarities and differences between the objects.
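
A minimal sketch of k-means clustering, the first algorithm named above, shows how unlabeled points are grouped by similarity. The sample points, the choice of two clusters, and the use of two of the points as initial centroids are assumptions made for this illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid-update step."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled 2-D points forming two visually separate groups (toy data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are used anywhere: the two groups emerge only from the distances between the points.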

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:

o Clustering: Clustering is a method of grouping objects into clusters such that objects with
the most similarities remain in one group and have few or no similarities with the objects of another
group. Cluster analysis finds the commonalities between the data objects and categorizes them
according to the presence and absence of those commonalities.
o Association: Association rule learning is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that occur
together in the dataset. Association rules make marketing strategy more effective. For example, people
who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical
example of association rule learning is Market Basket Analysis.
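
The bread-and-butter rule above is usually quantified with two measures, support and confidence. The sketch below computes both on a hypothetical list of shopping baskets; the basket contents are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

bread_support = support(baskets, {"bread"})                     # 3 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})    # 2 of the 3 bread baskets
print(bread_support, rule_confidence)
```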
Unsupervised Learning algorithms:
Below is a list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning


o Unsupervised learning is used for more complex tasks than supervised learning
because it does not require labeled input data.
o Unsupervised learning is often preferable because unlabeled data is easier to obtain than labeled
data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning as it does not
have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as input data is not
labeled, and algorithms do not know the exact output in advance.

Difference between Supervised and Unsupervised Learning

Supervised and unsupervised learning are two techniques of machine learning that are
used in different scenarios and with different datasets.
The main differences between supervised and unsupervised learning are given below:

Supervised Learning | Unsupervised Learning

Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.

Supervised learning model takes direct feedback to check if it is predicting correct output or not. | Unsupervised learning model does not take any feedback.

Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.

In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.

The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.

Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.

Supervised learning can be categorized in Classification and Regression problems. | Unsupervised learning can be classified in Clustering and Association problems.

Supervised learning can be used for those cases where we know the input as well as corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.

Supervised learning model produces an accurate result. | Unsupervised learning model may give less accurate result as compared to supervised learning.

Supervised learning is not close to true Artificial Intelligence as in this, we first train the model for each data, and then only it can predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence as it learns similarly to how a child learns daily routine things from experience.

It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering, KNN, and Apriori algorithm.

Reinforcement learning:

Reinforcement learning is an area of machine learning. It is about taking suitable actions to
maximize reward in a particular situation. It is employed by various software and machines to find
the best possible behavior or path to take in a specific situation. Reinforcement learning
differs from supervised learning in that, in supervised learning, the training data comes with an
answer key, so the model is trained with the correct answers, whereas in reinforcement learning
there is no answer key and the reinforcement agent decides what to do to perform the given task. In the
absence of a training dataset, it is bound to learn from its experience.
Example: We have an agent and a reward, with many hurdles in
between. The agent is supposed to find the best possible path to reach the reward. The following
scenario illustrates the problem.

The image shows a robot, a diamond, and fire. The goal of the robot is to get the reward, which
is the diamond, and to avoid the hurdles, which are the fires. The robot learns by trying the possible
paths and then choosing the path that gives it the reward with the fewest hurdles. Each right step
gives the robot a reward, and each wrong step subtracts from its reward. The total reward
is calculated when it reaches the final reward, the diamond.
Main points in Reinforcement learning –

• Input: The input should be an initial state from which the model will start.
• Output: There are many possible outputs, as there are a variety of solutions to a particular
problem.
• Training: The training is based upon the input. The model returns a state, and the user
decides whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
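
The robot, diamond, and fire example can be sketched with tabular Q-learning, one common reinforcement learning algorithm. Everything specific here is an assumption made for illustration: the 3x3 grid layout, the reward values (+10 for the diamond, -10 for fire, -1 per step), pure random exploration, and a learning rate of 1.0, which is only reasonable because this toy environment is deterministic.

```python
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIZE, START, FIRE, DIAMOND = 3, (0, 0), (1, 1), (2, 2)

def step(state, action):
    """Deterministic move on the grid; returns (next_state, reward, done)."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)   # walls keep the robot on the grid
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    nxt = (row, col)
    if nxt == DIAMOND:
        return nxt, 10.0, True     # reward for reaching the diamond
    if nxt == FIRE:
        return nxt, -10.0, True    # penalty for stepping into the fire
    return nxt, -1.0, False        # small cost for every ordinary step

def q_learning(episodes=3000, alpha=1.0, gamma=0.9, max_steps=50, seed=0):
    """Learn action values Q(state, action) from trial-and-error episodes."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}
    for _episode in range(episodes):
        state = START
        for _step in range(max_steps):
            action = rng.choice(list(ACTIONS))      # explore by acting randomly
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q

def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from the start state."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _reward, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

q = q_learning()
path = greedy_path(q)
print(path)
```

After training, following the best-valued action at each cell gives a route to the diamond that goes around the fire, which corresponds to "the best solution is decided based on the maximum reward."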

Difference between Reinforcement learning and Supervised learning:


Reinforcement learning | Supervised learning

Reinforcement learning is all about making decisions sequentially. In simple words, the output depends on the state of the current input, and the next input depends on the output of the previous input. | In supervised learning, the decision is made on the initial input or the input given at the start.

In reinforcement learning, decisions are dependent, so we give labels to sequences of dependent decisions. | In supervised learning, the decisions are independent of each other, so labels are given to each decision.

Example: Chess game | Example: Object recognition

Types of Reinforcement: There are two types of reinforcement:

1. Positive:
Positive reinforcement occurs when an event, caused by a particular behavior,
increases the strength and frequency of that behavior. In other words, it has a positive effect
on behavior.
Advantages of positive reinforcement:
• Maximizes performance
• Sustains change for a long period of time
Disadvantage of positive reinforcement:
• Too much reinforcement can lead to an overload of states, which can diminish the
results
2. Negative:
Negative reinforcement is the strengthening of a behavior because a negative condition is
stopped or avoided.
Advantages of negative reinforcement:
• Increases behavior
• Provides defiance to a minimum standard of performance
Disadvantage of negative reinforcement:
• It only provides enough to meet the minimum behavior
Various practical applications of Reinforcement Learning:

• RL can be used in robotics for industrial automation.
• RL can be used in machine learning and data processing.
• RL can be used to create training systems that provide custom instruction and materials
according to the requirements of students.
• RL can be used in large environments in the following situations:
  o A model of the environment is known, but an analytic solution is not available.
  o Only a simulation model of the environment is given (the subject of simulation-based
  optimization).
Model Selection and Generalization

Model selection refers to the process of selecting the best model from a set of candidate models based
on their performance on a given task. This process typically involves splitting the available data into
training and validation sets, using the training set to train each candidate model, and then evaluating
their performance on the validation set. The model with the best performance on the validation set is
selected as the final model.
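
The procedure described above can be sketched as follows. The candidate models here are k-nearest-neighbour classifiers with different values of k, and the tiny one-feature dataset (including one deliberately noisy label) is invented for illustration; neither is prescribed by these notes.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (one feature)."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def accuracy(train, rows, k):
    """Fraction of rows that the k-NN model built on `train` labels correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)

def select_model(train, validation, candidate_ks):
    """Score every candidate on the validation set and return the best one."""
    scores = {k: accuracy(train, validation, k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy data: the point at 3.5 carries a noisy label.
train = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (3.5, "high"),
         (6.0, "high"), (7.0, "high"), (8.0, "high")]
validation = [(1.5, "low"), (3.4, "low"), (6.5, "high"), (7.5, "high")]

best_k, scores = select_model(train, validation, candidate_ks=[1, 3, 7])
print(scores)
print("selected k:", best_k)
```

Here k = 1 follows the noisy point too closely and k = 7 ignores the input entirely, so the intermediate model scores best on the validation set and is the one selected.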

Fig:Model Selection and Generalization

Generalization refers to the ability of a model to perform well on new, unseen data. When a model is
trained on a dataset, it may overfit the training data by memorizing specific patterns in the data that
are not representative of the underlying distribution. This can lead to poor performance on new data.
To ensure good generalization, it is important to evaluate a model's performance on a separate test set
that was not used during model selection or training.

To improve generalization, techniques such as regularization, early stopping, and data augmentation
can be used. Regularization involves adding a penalty term to the loss function to discourage complex
models that are prone to overfitting. Early stopping involves monitoring the validation error during
training and stopping the training process when the error begins to increase. Data augmentation
involves generating new training examples by applying transformations to existing examples, which
can increase the size and diversity of the training set and help prevent overfitting.
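
As a small worked example of the penalty term, consider fitting y ≈ w·x with an L2 (ridge) penalty: minimizing Σ(y − w·x)² + λ·w² gives w = Σxy / (Σx² + λ). The single-weight model without an intercept, the function name ridge_weight, and the data below are simplifying assumptions for this sketch.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimiser of sum((y - w*x)**2) + lam * w**2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = ridge_weight(xs, ys, lam=0.0)    # no penalty: ordinary least squares
w_ridge = ridge_weight(xs, ys, lam=14.0)   # strong penalty shrinks the weight
print(w_plain, w_ridge)
```

A larger λ pulls the weight toward zero, which is how regularization discourages overly complex fits at the cost of some accuracy on the training data.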

Overall, model selection and generalization are crucial aspects of machine learning that help ensure
that models are accurate and reliable, and can be applied successfully to new data.
Fig: Model Selection

GUIDELINES FOR MACHINE LEARNING EXPERIMENTS

Machine Learning Steps


The task of imparting intelligence to machines can seem daunting, but it can be broken down into
7 major steps:

1. Collecting Data:

As you know, machines initially learn from the data that you give them. It is of the utmost importance
to collect reliable data so that your machine learning model can find the correct patterns. The quality of
the data that you feed to the machine will determine how accurate your model is. If you have incorrect
or outdated data, you will have wrong outcomes or predictions which are not relevant.

Make sure you use data from a reliable source, as it will directly affect the outcome of your model. Good
data is relevant, contains very few missing and repeated values, and has a good representation of the
various subcategories/classes present.
2. Preparing the Data:

After you have your data, you have to prepare it. You can do this by:

• Putting together all the data you have and randomizing it. This helps make sure that data is evenly
distributed, and the ordering does not affect the learning process.

• Cleaning the data to remove unwanted data, missing values, unneeded rows and columns, and duplicate
values, and to convert data types. You might even have to restructure the dataset and change the rows
and columns or the index of rows and columns.

• Visualizing the data to understand how it is structured and to understand the relationships between
the various variables and classes present.

• Splitting the cleaned data into two sets: a training set and a testing set. The training set is the set
your model learns from. The testing set is used to check the accuracy of your model after training.
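
The preparation steps above (cleaning, randomizing, and splitting) can be sketched with the standard library alone. The row layout (age, smoker, charge), the 25% test fraction, and the helper name prepare are assumptions for this illustration.

```python
import random

def prepare(rows, test_frac=0.25, seed=42):
    """Drop rows with missing values and exact duplicates, shuffle, then split."""
    seen, cleaned = set(), []
    for row in rows:
        if any(value is None for value in row):
            continue                      # skip rows with a missing value
        if row in seen:
            continue                      # skip exact duplicate rows
        seen.add(row)
        cleaned.append(row)
    random.Random(seed).shuffle(cleaned)  # randomise so ordering cannot bias learning
    n_test = max(1, int(len(cleaned) * test_frac))
    return cleaned[n_test:], cleaned[:n_test]   # (training set, testing set)

# Hypothetical raw records: (age, smoker, charge)
raw = [(25, "no", 1200.0), (25, "no", 1200.0), (40, "yes", None),
       (31, "yes", 5400.0), (52, "no", 2100.0), (19, "no", 900.0)]

train_rows, test_rows = prepare(raw)
print(len(train_rows), "training rows,", len(test_rows), "testing row(s)")
```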
Figure 3: Cleaning and Visualizing Data

3. Choosing a Model:

A machine learning model determines the output you get after running a machine learning algorithm
on the collected data. It is important to choose a model which is relevant to the task at hand. Over the
years, scientists and engineers developed various models suited for different tasks like speech
recognition, image recognition, prediction, etc. Apart from this, you also have to see if your model is
suited for numerical or categorical data and choose accordingly.

Figure 4: Choosing a model


4. Training the Model:

Training is the most important step in machine learning. In training, you pass the prepared data to your
machine learning model to find patterns and make predictions. It results in the model learning from the
data so that it can accomplish the task set. Over time, with training, the model gets better at predicting.

Figure 5: Training a model

5. Evaluating the Model:

After training your model, you have to check how it's performing. This is done by testing the
performance of the model on previously unseen data. The unseen data used is the testing set that you
split your data into earlier. If testing were done on the same data that was used for training, you would
not get an accurate measure, as the model has already seen the data and finds the same patterns in it as
it previously did. This would give you disproportionately high accuracy.

When used on testing data, you get an accurate measure of how your model will perform and its speed.
Figure 6: Evaluating a model
6. Parameter Tuning:

Once you have created and evaluated your model, see if its accuracy can be improved in any way. This
is done by tuning the hyperparameters of your model. Hyperparameters are the settings of the model that
the programmer generally decides, as opposed to the parameters that the model learns from the data. At a
particular value of a hyperparameter, the accuracy will be at its
maximum. Parameter tuning refers to finding these values.

Figure 7: Parameter Tuning

7. Making Predictions
In the end, you can use your model on unseen data to make predictions accurately.
How to Implement Machine Learning Steps in Python?
You will now see how to implement a machine learning model using Python.
In this example, data collected is from an insurance company, which tells you the variables that come
into play when an insurance amount is set. Using this, you will have to predict the insurance amount
for a person. This data was collected from Kaggle.com, which has many reliable datasets.
You need to start by importing any necessary modules, as shown.

Figure 8: Importing necessary modules

Following this, you will import the data.

Figure 9: Importing data


Figure 10: Insurance dataset

Now, clean your data by removing duplicate values, and transforming columns into numerical values
to make them easier to work with.

Figure 11: Cleaning Data

The final dataset becomes as shown.


Figure 12: Cleaned dataset

Now, split your dataset into training and testing sets.

Figure 13: Splitting the dataset

As you need to predict a numerical value based on some parameters, you will have to use Linear
Regression. The model needs to learn from your training set. This is done by using the '.fit' command.

Figure 14: Choosing and training your model

Now, predict your testing dataset and find how accurate your predictions are.
Figure 15: Predicting using your model
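
For a regression model, the accuracy figure reported at this step is commonly the coefficient of determination (R²), which is what the score method of scikit-learn's LinearRegression returns. That the walkthrough used this library and metric is an assumption, since its code appears only in figures that are not reproduced here. A minimal sketch of the metric:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0]
perfect = r2_score(actual, [1.0, 2.0, 3.0])    # predictions match exactly
baseline = r2_score(actual, [2.0, 2.0, 2.0])   # always predicting the mean
print(perfect, baseline)
```

A score of 1.0 means the predictions match the true values exactly, while a model that always predicts the mean of the targets scores 0.0.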

1.0 is the highest level of accuracy you can get. Now, get your parameters.

Figure 16: Model Parameters

The above picture shows the learned model parameters (the coefficients), which indicate how each variable in your dataset affects the prediction.
AI & ML Differences

AI is a bigger concept to create intelligent machines that can simulate human thinking capability and
behavior, whereas, machine learning is an application or subset of AI that allows machines to learn
from data without being programmed explicitly.

Below are some main differences between AI and machine learning along with the overview of Artificial
intelligence and machine learning

Artificial Intelligence

Artificial intelligence is a field of computer science that builds computer systems able to mimic
human intelligence. The term is made up of two words, "Artificial" and "intelligence," which together
mean "a human-made thinking power." Hence we can define it as:

Artificial intelligence is a technology using which we can create intelligent systems that can simulate
human intelligence.

An artificial intelligence system does not need to be pre-programmed; instead, it uses
algorithms that can work with their own intelligence. It involves machine learning algorithms such as
reinforcement learning algorithms and deep learning neural networks. AI is being used in multiple places
such as Siri, Google's AlphaGo, chess playing, etc.

Based on capabilities, AI can be classified into three types:


o Weak AI
o General AI
o Strong AI

Currently, we are working with weak AI and general AI. The future of AI is Strong AI, which, it is
said, will be more intelligent than humans.

Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from past
data or experiences without being explicitly programmed.

| Artificial Intelligence | Machine learning |
| --- | --- |
| Artificial intelligence is a technology which enables a machine to simulate human behavior. | Machine learning is a subset of AI which allows a machine to automatically learn from past data without programming explicitly. |
| The goal of AI is to make a smart computer system like humans to solve complex problems. | The goal of ML is to allow machines to learn from data so that they can give accurate output. |
| In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result. |
| Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning. |
| AI has a very wide range of scope. | Machine learning has a limited scope. |
| AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained. |
| AI system is concerned about maximizing the chances of success. | Machine learning is mainly concerned about accuracy and patterns. |
| The main applications of AI are Siri, customer support using chatbots, Expert System, Online game playing, intelligent humanoid robot, etc. | The main applications of machine learning are Online recommender system, Google search algorithms, Facebook auto friend tagging suggestions, etc. |
| On the basis of capabilities, AI can be divided into three types, which are Weak AI, General AI, and Strong AI. | Machine learning can also be divided into mainly three types, which are Supervised learning, Unsupervised learning, and Reinforcement learning. |
| It includes learning, reasoning, and self-correction. | It includes learning and self-correction when introduced with new data. |
| AI completely deals with Structured, semi-structured, and unstructured data. | Machine learning deals with Structured and semi-structured data. |

UNIT-III

UNSUPERVISED LEARNING: Clustering - Introduction - Mixture Densities - k-Means
Clustering - Expectation-Maximization Algorithm - Mixtures of Latent Variable
Models - Supervised Learning after Clustering - Hierarchical Clustering

Unsupervised learning is a type of machine learning where the model is trained on unlabeled
data. Unlike supervised learning, where the algorithm learns from labeled input-output pairs,
unsupervised learning finds patterns, structures, or relationships in the data without explicit
guidance.

Working process of Unsupervised Learning

• Unsupervised learning models are given unlabeled data.
• The models discover patterns and insights without explicit guidance.
• The models can identify natural groupings within the data.

Unsupervised learning is broadly categorized into the following types.

Types of Unsupervised Learning

1. Clustering – Groups similar data points together.
o Examples:
▪ K-Means
▪ DBSCAN
▪ Hierarchical Clustering
2. Dimensionality Reduction – Reduces the number of features while preserving
essential information.
o Examples:
▪ Principal Component Analysis (PCA)
▪ t-SNE
▪ Autoencoders
3. Association Rule Learning – Identifies relationships between variables in large
datasets.
o Examples:
▪ Apriori Algorithm
▪ FP-Growth Algorithm
4. Anomaly Detection – Identifies rare or unusual data points.
o Examples:
▪ Isolation Forest
▪ One-Class SVM

Applications of Unsupervised Learning

• Customer segmentation
• Anomaly detection in cybersecurity
• Feature extraction for supervised learning
• Image compression

CLUSTERING

Clustering, also called cluster analysis, is the task of grouping data points based on their
similarity to each other. It falls under the branch of Unsupervised Learning, which aims at
gaining insights from unlabelled data points; that is, unlike supervised learning, we don't
have a target variable.

Clustering aims at forming groups of homogeneous data points from a heterogeneous
dataset. It evaluates similarity using a metric such as Euclidean distance, cosine
similarity, or Manhattan distance, and then groups the points with the highest similarity
score together.
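
The three metrics named above can be sketched in a few lines. This is a minimal illustration with the standard library; the function names and the two sample points `p` and `q` are chosen here for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(cosine_similarity(p, q))  # roughly 0.992
```

Note that Euclidean and Manhattan are distances (smaller means more similar), while cosine similarity is a similarity (larger means more similar).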

For example, in the graph given below, we can clearly see that there are 3 circular
clusters forming on the basis of distance.
• It is not necessary that the clusters formed be circular in shape. The shape
of clusters can be arbitrary. There are many algorithms that work well at detecting
arbitrarily shaped clusters.
• For example, in the graph given below we can see that the clusters formed are not
circular in shape.

Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar
data points:
Hard Clustering: In this type of clustering, each data point either belongs to a cluster
completely or does not belong to it at all. For example, let's say there are 4 data points and
we have to cluster them into 2 clusters. Each data point will then belong either to cluster 1
or to cluster 2.
Soft Clustering: In this type of clustering, instead of assigning each data point to a
single cluster, a probability or likelihood of that point belonging to each cluster is evaluated.

For example, let's say there are 4 data points and we have to cluster them into 2 clusters. We
then evaluate, for every data point, the probability of it belonging to each of the two clusters.

Simplify working with large datasets – Each example is given a cluster ID after clustering is
complete. You may then condense an example's entire feature set into its cluster ID.
Clustering is effective when it can represent a complicated example with a straightforward
cluster ID. Using the same principle, clustering can make complex datasets simpler.

Types of Clustering methods


The clustering methods are broadly divided into hard clustering (each data point belongs to only
one group) and soft clustering (a data point can belong to more than one group). Various other
approaches to clustering also exist. The main clustering methods in machine learning are:
• Centroid-based Clustering (Partitioning methods)
• Density-based Clustering
• Distribution-based Clustering (Model-based methods)
• Connectivity-based Clustering (Hierarchical clustering)
• Fuzzy Clustering
Partitioning Clustering (Centroid based Clustering)
Partitional clustering is a method that divides a dataset into a predetermined number of non-
overlapping clusters, where each data point belongs to only one cluster, aiming to optimize a
specific objective function like minimizing intra-cluster distance.

Definition:

Partitional clustering algorithms aim to partition a dataset into a set of disjoint clusters,
meaning each data point belongs to only one cluster. It is a type of clustering that divides the
data into non-hierarchical groups. It is also known as the centroid-based method. The most
common example of partitioning clustering is the K-Means Clustering algorithm.

Process:
These algorithms require the analyst to specify the number of clusters (K) beforehand. The
algorithm then iteratively refines the cluster assignments to minimize the distance between
data points and their respective cluster centroids.
Objective Function:
The goal is to find the optimal partitioning of the data that minimizes the within-cluster
variance or maximizes the between-cluster variance.
Popular Algorithms:
K-means: A widely used algorithm that assigns data points to the nearest cluster centroid,
iteratively updating the centroids until convergence.
K-medoids: Similar to K-means, but instead of using centroids, it uses medoids (representative
data points) to define the clusters.
Mini-batch K-means: An efficient variant of K-means that uses mini-batches of data points to
speed up the clustering process.
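
The difference between a centroid (used by K-means) and a medoid (used by K-medoids) can be seen on a tiny example. The sketch below is only an illustration; the helper names and the sample `cluster` are assumptions made for this example.

```python
import math

def centroid(points):
    """Coordinate-wise mean; it need not coincide with any data point."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def medoid(points):
    """The actual data point with the smallest total distance to all the others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Three nearby points and one far-away outlier.
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0), (10.0, 10.0)]
print(centroid(cluster))  # (4.0, 3.5): pulled toward the outlier
print(medoid(cluster))    # (2.0, 2.0): stays on a real, central data point
```

Because the medoid must be one of the data points, it is usually less affected by outliers than the mean.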

Advantages:
Computational Efficiency: Partitional algorithms are generally computationally efficient and easy
to implement.
Suitable for Large Datasets: They can handle large datasets effectively.
Good for Clusters of Similar Shapes and Sizes: They perform well when clusters have similar
shapes and sizes.

Disadvantages:
Requires Predefined Number of Clusters: The analyst needs to specify the number of clusters (K)
in advance, which can be challenging for complex datasets.
Struggles with Clusters of Varying Shapes and Sizes: They may struggle with clusters that have
irregular shapes or sizes.
In this type, the dataset is divided into K groups, where K is the pre-defined number of
groups. The cluster centers are placed in such a way that each data point is closer to the
centroid of its own cluster than to the centroid of any other cluster.

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so clusters of
arbitrary shape can be formed as long as the dense regions can be connected. The algorithm
identifies the dense regions in the dataset and joins neighbouring areas of high density into
clusters. The dense areas in the data space are separated from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.

DBSCAN is a density-based clustering algorithm that groups data points that are
closely packed together and marks outliers as noise based on their density in the feature
space. It identifies clusters as dense regions in the data space, separated by areas of lower
density.
Unlike K-Means or hierarchical clustering, which assume clusters are compact and
spherical, DBSCAN excels in handling real-world data irregularities such as:
Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular or convex.
Noise and Outliers: It effectively identifies and handles noise points without assigning
them to any cluster.
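
A compact sketch of the DBSCAN idea described above is given below. It is a simplified teaching version using a brute-force neighbour search, not a library implementation; the parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a core point) follow common usage, and the sample data is made up.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # may become a border point later
            continue
        cluster_id += 1                    # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point reached from a core point
            if labels[j] is not None:
                continue                   # already assigned, do not expand again
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours) # j is also a core point: keep growing
    return labels

# Two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point is labelled -1 (noise) instead of being forced into a cluster, which is the behaviour described above. Note that the number of clusters is not given as input.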

Fig. Density-based clustering


Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is done by
assuming some distribution, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization clustering algorithm, which uses
Gaussian Mixture Models (GMM).
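
As a rough illustration of Expectation-Maximization with a GMM, the sketch below fits a two-component mixture to one-dimensional data. It is a simplified version under stated assumptions: exactly two components, initial means at the smallest and largest data values, and a small floor on the variance to avoid division by zero. The function names and the sample data are invented for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    means = [min(data), max(data)]   # simple initialisation (an assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)   # floor to keep the pdf well defined
    return weights, means, variances

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
weights, means, variances = em_gmm_1d(data)
print(weights)  # close to [0.5, 0.5]
print(means)    # close to [1.0, 5.0]
```

Unlike K-means, each point contributes to every component in proportion to its responsibility, which makes this a soft clustering method.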

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitional clustering, as there is
no requirement to pre-specify the number of clusters to be created. In this technique, the
dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. Any desired number of clusters can be obtained by cutting the
tree at the appropriate level. The most common example of this method is the Agglomerative
Hierarchical algorithm.
Fuzzy Clustering

Fuzzy clustering is a type of soft method in which a data object may belong to more than one
group or cluster. Each data point has a set of membership coefficients, which express its
degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this
type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
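
The membership coefficients can be illustrated with the standard Fuzzy C-means membership formula, in which the membership of a point in cluster j is 1 / Σ_k (d_j / d_k)^(2/(m-1)), where d_j is the distance to center j and m > 1 is the fuzzifier. The sketch below computes only the memberships for fixed centers; the function name, the centers, and the choice m = 2 are assumptions for this example.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-means membership of one point in each cluster (values sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if 0.0 in dists:
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]

centers = [(0.0, 0.0), (4.0, 0.0)]
near_first = fcm_memberships((1.0, 0.0), centers)
halfway = fcm_memberships((2.0, 0.0), centers)
print(near_first)  # about [0.9, 0.1]
print(halfway)     # [0.5, 0.5]
```

A hard assignment would put the first point entirely in cluster 0; the fuzzy memberships keep the information that it is also somewhat close to cluster 1.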

Clustering Algorithms

Clustering algorithms can be divided based on the models described above. Many clustering
algorithms have been published, but only a few are commonly used. The choice of clustering
algorithm depends on the kind of data that we are using. For example, some algorithms need the
number of clusters in the given dataset to be guessed in advance, whereas others work from the
minimum distance between the observations of the dataset.

Here we discuss the popular clustering algorithms that are widely used in machine
learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It partitions the dataset by dividing the samples into clusters of
roughly equal variance. The number of clusters must be specified in this algorithm. It is
fast, with few computations required: each iteration has linear complexity, O(n), in the
number of samples.
2. Mean-shift algorithm: The mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model that works
by updating the candidates for centroids to be the center of the points within a given
region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to
mean-shift, but with some remarkable advantages. In this algorithm, the areas of high
density are separated by the areas of low density. Because of this, the clusters can be
found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as
an alternative to the k-means algorithm or in those cases where K-means may
fail. In GMM, it is assumed that the data points are generated from a mixture of Gaussian
distributions.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm
performs bottom-up hierarchical clustering. Each data point is treated as a
single cluster at the outset, and the clusters are then successively merged. The cluster
hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not
require the number of clusters to be specified. In this algorithm, messages are exchanged
between pairs of data points until convergence. It has O(N²T) time complexity, where N is
the number of samples and T is the number of iterations, which is the main drawback of this
algorithm.

Applications of Clustering

Below are some commonly known applications of clustering technique in Machine Learning:

o In Identification of Cancer Cells: Clustering algorithms are widely used for the
identification of cancerous cells. They divide the cancerous and non-cancerous data points
into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search
result appears based on the closest object to the search query. It does it by grouping
similar data objects in one group that is far from the other dissimilar objects. The
accurate result of a query depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers
based on their choice and preferences.
o In Biology: It is used in biology to classify different species of plants and
animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of similar land
use in a GIS database. This can be very useful for finding the purpose for which a
particular piece of land is most suitable.
o Market Segmentation – Businesses use clustering to group their customers and use
targeted advertisements to attract more audience.
o Market Basket Analysis – Shop owners analyze their sales to figure out which
items are frequently bought together by customers. For example, a frequently cited
(possibly anecdotal) study from the USA found that diapers and beer were usually bought
together by fathers.
o Social Network Analysis – Social media sites use your data to understand your
browsing behaviour and provide you with targeted friend recommendations or
content recommendations.
o Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic
images like X-rays.
o Anomaly Detection – Clustering can be used to find outliers in a real-time data stream,
for example to flag potentially fraudulent transactions.

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that need to
be created in the process: if K=2, there will be two clusters, for K=3 there will be
three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a
way that each data point belongs to only one group of points with similar properties.

It lets us cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabelled dataset on its own, without the need for any labelled
training data.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of squared distances between the data points and
the centroids of their corresponding clusters.

The algorithm takes the unlabelled dataset as input, divides the dataset into K
clusters, and repeats the process until the cluster assignments stop changing. The value of K
should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best positions for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are nearest to a
particular k-center form a cluster.

Hence each cluster has data points with some commonalities and is well separated from the
other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:


Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.

Step-4: Compute the mean of the points in each cluster and place the new centroid of that
cluster there.

Step-5: Repeat the third step, which means reassign each data point to the new closest
centroid.

Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.

Step-7: The model is ready.
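
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation: it assumes Euclidean distance, picks the initial centroids from the dataset itself, and keeps the old centroid if a cluster ever becomes empty. The function name `kmeans`, its `seed` parameter, and the sample points are choices made for this example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # Step-2: K random starting centroids
    labels = None
    for _ in range(max_iter):
        # Step-3 / Step-5: assign every point to its closest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:               # Step-6: no reassignment, so finish
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                        # keep the old centroid if empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)      # the first three points share one label, the last three the other
print(centroids)
```

Which cluster receives label 0 depends on the random starting centroids, so only the grouping, not the label numbers, is meaningful.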

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:

o Let's take the number of clusters K=2, so we will try to group the data points into two
different clusters.
o We need to choose K random points or centroids to form the clusters. These points
can be either points from the dataset or any other points. Here we select
the below two points as K-points, which are not part of our dataset. Consider the
below image:

o Now we will assign each data point of the scatter plot to its closest K-point or
centroid, using the usual formula for the distance between two points. To visualize this,
we draw the perpendicular bisector of the segment joining the two centroids (referred to
here as the median line). Consider the below image:

From the above image, it is clear that the points to the left of the line are nearer to K1,
the blue centroid, and the points to the right of the line are closer to the yellow centroid.
Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we repeat the process by choosing new
centroids. To choose the new centroids, we compute the center of gravity (mean) of the
data points in each cluster, and find the new centroids as below:
o Next, we reassign each data point to the new closest centroid. For this, we repeat the
same process of finding the median line. The median line will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the line and two
blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we again go to Step-4, which is finding new
centroids or K-points.
o We repeat the process by finding the center of gravity of the points in each cluster, so
the new centroids will be as shown in the below image:

o As we have new centroids, we again draw the median line and reassign the data
points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of
the line, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the quality of the clusters
that it forms, but choosing the optimal number of clusters is a big task. There are several
ways to find the optimal number of clusters; here we discuss one of the most widely used
methods for finding the number of clusters, or value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:

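$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} \operatorname{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \operatorname{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} \operatorname{distance}(P_i, C_3)^2$$

In the above formula of WCSS,

$\sum_{P_i \in \text{Cluster}_1} \operatorname{distance}(P_i, C_1)^2$: It is the sum of the squares of the distances between each
data point and its centroid within Cluster 1, and the same holds for the other two terms.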

To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
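
The WCSS formula can be computed directly from the points, their cluster labels, and the centroids. The sketch below assumes Euclidean distance; the function name `wcss` and the four sample points are made up for this illustration.

```python
import math

def wcss(points, labels, centroids):
    """Sum over all points of the squared distance to the centroid of their own cluster."""
    return sum(math.dist(p, centroids[label]) ** 2
               for p, label in zip(points, labels))

points = [(1.0, 1.0), (1.0, 3.0), (6.0, 5.0), (8.0, 5.0)]

# K = 1: a single cluster whose centroid is the mean of all points.
one_cluster = wcss(points, [0, 0, 0, 0], [(4.0, 3.5)])

# K = 2: two clusters, each centroid is the mean of its two points.
two_clusters = wcss(points, [0, 0, 1, 1], [(1.0, 2.0), (7.0, 5.0)])

# K = 4: every point is its own cluster and its own centroid.
four_clusters = wcss(points, [0, 1, 2, 3], points)

print(one_cluster, two_clusters, four_clusters)  # about 49.0, 4.0, 0.0
```

WCSS drops sharply from K=1 to K=2 and only slowly afterwards, which is the kind of bend the elbow method looks for.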

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes K-means clustering on a given dataset for different K values (for example,
ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o The point where the curve bends sharply, so that the plot looks like an arm, is
considered the best value of K.

Since the graph shows a sharp bend that looks like an elbow, this is known as the
elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters to be as large as the number of data points. If we
do, every point is its own centroid, so the value of WCSS becomes zero, and that
will be the endpoint of the plot.

Hierarchical Clustering in Machine Learning

Hierarchical clustering is another unsupervised machine learning algorithm, which is used to
group unlabelled data points into clusters. It is also known as hierarchical cluster
analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but
the two differ in how they work: in hierarchical clustering there is no requirement to
predetermine the number of clusters as we did in the K-Means algorithm.
The hierarchical clustering technique has two approaches:
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm
starts by taking every data point as a single cluster and keeps merging clusters until only
one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is
a top-down approach.
NEED of hierarchical clustering

As we have seen, K-means clustering has some challenges: the number of clusters must be
predetermined, and it tends to create clusters of similar size.
To address these two challenges, we can opt for the hierarchical clustering algorithm because,
in this algorithm, we don't need prior knowledge of the number of clusters.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group
the data points into clusters, it follows the bottom-up approach. This means the algorithm
considers each data point as a single cluster at the beginning and then starts combining the
closest pair of clusters. It does this until all the clusters are merged into a single
cluster that contains all the data points. This hierarchy of clusters is represented in the
form of the dendrogram.
The steps for agglomerative clustering are as follows:
1. Compute the proximity matrix using a distance metric.
2. Use a linkage function to group objects into a hierarchical cluster tree based on the
distance matrix computed in the step above.
3. Data points in close proximity are merged together to form a cluster.
4. Repeat steps 2 and 3 until a single cluster remains.
• The data points 1,2,...6 are each assigned to an individual cluster.
• After calculating the proximity matrix, based on the similarity the points 2,3 and 4,5
are merged together to form clusters.
• Again, the proximity matrix is computed, and the cluster with points 4,5 is merged with
point 6.
• And again, the proximity matrix is computed; then the clusters with points 4,5,6 and
2,3 are merged together to form a cluster.
• As a final step, the remaining clusters are merged together to form a single cluster.
PROXIMITY MATRIX AND LINKAGE
The proximity matrix is a matrix consisting of the distance between each pair of data points.
The distance is computed by a distance function. Euclidean distance is one of the most
commonly used distance functions.
The above proximity matrix covers n points x1, ..., xn, and d(xi, xj) denotes the
distance between the points xi and xj.
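
A proximity matrix can be built with a nested loop over all pairs of points. The sketch below assumes Euclidean distance; the function name and the three sample points are chosen for this illustration.

```python
import math

def proximity_matrix(points):
    """n x n matrix whose entry [i][j] is the distance d(x_i, x_j)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = proximity_matrix(points)
for row in matrix:
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper (or lower) triangle needs to be stored.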

The working of the AHC algorithm can be explained using the below steps:
Step-1: Create each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.

Step-2: Take the two closest data points or clusters and merge them to form one cluster. So,
there will now be N-1 clusters.

Step-3: Again, take the two closest clusters and merge them together to form one cluster.
There will be N-2 clusters.

Step-4: Repeat Step-3 until only one cluster is left. So, we will get the following clusters.
Consider the below images:
Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
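
The merging procedure above can be sketched as follows. This is a simplified illustration that assumes single linkage (distance between the closest members) and a brute-force search for the closest pair; the function names, the `n_clusters` stopping parameter, and the sample points are assumptions made for this example. Stopping at `n_clusters` plays the role of cutting the dendrogram at a chosen level.

```python
import math

def single_linkage(a, b):
    """Distance between the closest pair of points, one taken from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, n_clusters=1):
    clusters = [[p] for p in points]            # Step-1: every point is a cluster
    merges = []                                 # history that a dendrogram would draw
    while len(clusters) > n_clusters:
        # Step-2 / Step-3: find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)                 # Step-4: repeat with one cluster fewer
    return clusters, merges

points = [(1.0,), (2.0,), (6.0,), (7.0,), (20.0,)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)                       # [[(20.0,)], [(1.0,), (2.0,), (6.0,), (7.0,)]]
print([d for _, _, d in merges])      # merge distances: [1.0, 1.0, 4.0]
```

The recorded merge distances are exactly the heights at which a dendrogram would join the corresponding clusters.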
Measure for the distance between two clusters
The distance between two clusters is crucial for hierarchical clustering. There
are various ways to calculate the distance between two clusters, and these ways decide the
rule for clustering. These measures are called linkage methods. Some of the popular linkage
methods are given below:
Single Linkage:
It is the shortest distance between the closest points of the two clusters. Consider the below
image:

Complete Linkage:
It is the farthest distance between two points of two different clusters. It is one of the
popular linkage methods as it forms tighter clusters than single linkage.

Average Linkage:
It is the linkage method in which the distances between every pair of data points, one from
each cluster, are added up and then divided by the total number of pairs to calculate the
average distance between two clusters. It is also one of the most popular linkage methods.

Centroid Linkage:
It is the linkage method in which the distance between the centroids of the clusters is
calculated. Consider the below image:
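
The four linkage methods described so far can be compared on the same pair of clusters. The sketch below assumes Euclidean distance; the function names and the two small clusters `a` and `b` are made up for this illustration.

```python
import math

def single_linkage(a, b):
    """Shortest distance between a point of a and a point of b."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Farthest distance between a point of a and a point of b."""
    return max(math.dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Mean distance over all pairs (one point from each cluster)."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_linkage(a, b):
    """Distance between the centroids (coordinate-wise means) of the clusters."""
    ca = tuple(sum(x) / len(a) for x in zip(*a))
    cb = tuple(sum(x) / len(b) for x in zip(*b))
    return math.dist(ca, cb)

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
print(single_linkage(a, b))    # 3.0
print(complete_linkage(a, b))  # sqrt(13), about 3.606
print(average_linkage(a, b))   # about 3.303
print(centroid_linkage(a, b))  # 3.0
```

Single linkage is always the smallest of the pairwise-based values and complete linkage the largest, with average linkage in between.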

Ward Linkage
The Ward approach analyzes the variance of the clusters rather than measuring distances
directly, and merges the pair of clusters that gives the smallest increase in within-cluster
variance. With the Ward method, the distance between two clusters is related to how much the
sum of squares (SS) value will increase when they are combined.
In other words, the Ward method attempts to minimize the sum of the squared distances of
the points from the cluster centers. Compared to the distance-based measures described
above, the Ward method is less susceptible to noise and outliers. For this reason, Ward's
method is often preferred over the others in clustering.
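
The increase in the sum of squares caused by a merge can be computed either directly or with the closed form n_a * n_b / (n_a + n_b) * ||c_a - c_b||², where n_a, n_b are the cluster sizes and c_a, c_b the centroids. The sketch below checks that the two agree; the function names and the sample clusters are assumptions for this illustration.

```python
def centroid(points):
    return tuple(sum(x) / len(points) for x in zip(*points))

def sum_of_squares(points):
    """Sum of squared Euclidean distances of the points from their centroid."""
    c = centroid(points)
    return sum(sum((x - y) ** 2 for x, y in zip(p, c)) for p in points)

def ward_increase(a, b):
    """How much the total sum of squares grows if clusters a and b are merged."""
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)

def ward_increase_closed_form(a, b):
    """Same quantity from the sizes and centroids only."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
print(ward_increase(a, b))              # 16.0
print(ward_increase_closed_form(a, b))  # 16.0
```

At every step Ward linkage merges the pair of clusters for which this increase is smallest.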
Dendrogram in Hierarchical clustering

The dendrogram is a tree-like structure that records each merge step the HC algorithm
performs. In the dendrogram plot, the Y-axis shows the Euclidean distances at which clusters
are merged, and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part shows how clusters are created in agglomerative
clustering, and the right part shows the corresponding dendrogram.
o As discussed above, first the data points P2 and P3 combine to form a cluster, and
correspondingly a dendrogram link is created that connects P2 and P3
with a rectangular shape. Its height is decided by the Euclidean distance
between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding link is
created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a
little greater than that between P2 and P3.
o Again, two new links are created: one combines P1, P2, and P3, and the other combines
P4, P5, and P6.
o At last, the final link is created that combines all the data points together.
We can cut the dendrogram tree structure at any level as per our requirement.
Steps for implementation of AHC using Python:
The steps for implementation will be the same as the k-means clustering, except for some
changes such as the method to find the number of clusters. Below are the steps:
1. Data Pre-processing
2. Finding the optimal number of clusters using the Dendrogram
3. Training the hierarchical clustering model
4. Visualizing the clusters

DIVISIVE CLUSTERING

Divisive Clustering works just the opposite of agglomerative clustering. It starts by
considering all the data points as one big cluster and then repeatedly splits it into smaller
clusters until every data point is in its own cluster. Thus, it is good at identifying large
clusters. It follows a top-down approach and can be more efficient than agglomerative
clustering when only the top levels of the hierarchy are needed. But, due to its complexity in
implementation, ready-made implementations are less common in the major machine learning
frameworks.

STEPS IN DIVISIVE CLUSTERING


• Consider all the data points as a single cluster.
• Split into clusters using any flat-clustering method, say K-Means.
• Choose the best cluster among the clusters to split further: choose the one that has the
largest Sum of Squared Error (SSE).
• Repeat steps 2 and 3 until each data point is in its own cluster (or the desired number of
clusters is reached).
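
One possible way to sketch these steps is a bisecting procedure that repeatedly splits the cluster with the largest SSE using K-Means with K=2. The version below is a simplified illustration, not a standard library routine: the deterministic seeding inside `two_means` (the point farthest from the centroid, and the point farthest from that one), the `n_clusters` stopping parameter, and all names are assumptions made for this example.

```python
import math

def centroid(points):
    return tuple(sum(x) / len(points) for x in zip(*points))

def sse(points):
    """Sum of squared distances of the points from their centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)

def two_means(points, max_iter=50):
    """Split one cluster into two halves with K-Means (K=2)."""
    c = centroid(points)
    a = max(points, key=lambda p: math.dist(p, c))   # seed 1: farthest from the mean
    b = max(points, key=lambda p: math.dist(p, a))   # seed 2: farthest from seed 1
    centers = [a, b]
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            nearer_first = math.dist(p, centers[0]) <= math.dist(p, centers[1])
            halves[0 if nearer_first else 1].append(p)
        new_centers = [centroid(h) if h else centers[i] for i, h in enumerate(halves)]
        if new_centers == centers:
            break
        centers = new_centers
    return halves

def divisive(points, n_clusters):
    clusters = [list(points)]                 # start with one big cluster
    while len(clusters) < n_clusters:
        worst = max(clusters, key=sse)        # cluster with the largest SSE
        if len(set(worst)) < 2:               # nothing left that can be split
            break
        clusters.remove(worst)
        clusters.extend(two_means(worst))     # replace it by its two halves
    return clusters

points = [(1.0,), (2.0,), (6.0,), (7.0,), (20.0,), (21.0,)]
clusters = divisive(points, n_clusters=3)
print(clusters)  # [[(20.0,), (21.0,)], [(1.0,), (2.0,)], [(6.0,), (7.0,)]]
```

Setting `n_clusters` to the number of data points continues the splitting until every point is in its own cluster.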

In the above figure,

• The data points 1,2,...6 are assigned to one large cluster.
• After calculating the proximity matrix, based on the dissimilarity the points are split
up into separate clusters.
• The proximity matrix is again computed until each point is assigned to an individual
cluster.
Limits of Hierarchical Clustering
Hierarchical clustering isn’t a fix-all; it does have some limits. Among them:

• It has high time and space complexity. Computing the proximity matrix takes O(N²) time,
and since the closest pair is searched for in each of about N merge steps, the total time
complexity is O(N³).
• There is no global objective function that hierarchical clustering directly optimizes.
• Due to its high time complexity, it is impractical for large datasets.
• It is sensitive to noise and outliers since it relies on distance metrics.
• It has difficulty handling large clusters.

Applications
There are many real-life applications of Hierarchical clustering. They include:
• Bioinformatics: grouping animals according to their biological features to reconstruct
phylogeny trees
• Business: dividing customers into segments or forming a hierarchy of employees
based on salary.
• Image processing: grouping handwritten characters in text recognition based on the
similarity of the character shapes.
• Information Retrieval: categorizing search results based on the query.

MIXTURE DENSITIES
• Mixture densities are a fundamental concept in machine learning, particularly in
probabilistic modelling, unsupervised learning, and density estimation tasks.
• They are used to represent complex probability distributions by combining simpler
distributions, such as Gaussians, into a weighted mixture.

• This approach is widely used for tasks like clustering, density estimation, and
generative modeling.
Below is a detailed explanation of mixture densities in machine learning:

• Mixture models are probabilistic models that assume the data is generated from a
mixture of several probability distributions. Each component of the mixture represents
a cluster or a subgroup within the data.
• The most common type is the Gaussian Mixture Model (GMM), where each
component is a Gaussian distribution.

Key Concepts:

• Components: The individual distributions (e.g., Gaussians) in the mixture.


• Mixing Coefficients: The weights assigned to each component, representing the
probability that a data point belongs to that component.
• Latent Variables: Unobserved variables that indicate which component a data point
belongs to.
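To make the three concepts concrete, the following minimal sketch builds a one-dimensional mixture of two Gaussians. The specific weights, means and standard deviations are made-up values chosen only for illustration, and the helper names are hypothetical.

```python
import numpy as np

# Assumed example parameters (not taken from any dataset).
weights = np.array([0.3, 0.7])   # mixing coefficients: non-negative, sum to 1
means = np.array([-2.0, 3.0])    # one mean per component
stds = np.array([0.5, 1.0])      # one standard deviation per component

def gaussian_pdf(x, mean, std):
    """Density of a single Gaussian component."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_pdf(x):
    """Mixture density: the weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

def sample(n, seed=0):
    """Draw n points; z is the latent variable naming each point's component."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(weights), size=n, p=weights)  # pick a component per point
    x = rng.normal(means[z], stds[z])                # then draw from that component
    return x, z

x, z = sample(5000)
print(mixture_pdf(np.array([-2.0, 0.0, 3.0])))
```

In real data only `x` is observed; `z` is hidden, which is why it is called a latent variable.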

Applications in Unsupervised Learning


• Clustering: Mixture models can be used to cluster data points into groups based on
their probabilistic assignments to components.
• Density Estimation: Mixture models can model the underlying probability
distribution of the data, which is useful for tasks like anomaly detection.
• Dimensionality Reduction: Techniques like mixture of factor analyzers combine
mixture models with dimensionality reduction.

Advantages
• Flexibility: Can model complex, multi-modal distributions.
• Probabilistic Framework: Provides soft assignments (probabilities) rather than hard
assignments.
Challenges
• Choosing the Number of Components: Determining the optimal number of
components K is often non-trivial.
• Sensitivity to Initialization: The EM algorithm can converge to local optima, so
initialization is crucial.
• Computational Complexity: Fitting mixture models can be computationally expensive
for large datasets.

Extensions and Variants


• Dirichlet Process Mixture Models (DPMM): Non-parametric models that
automatically infer the number of components.
• Mixture of Experts: Combines mixture models with conditional distributions for
supervised learning tasks.
• Variational Inference: An alternative to EM for fitting mixture models, often more
scalable.

EM Algorithm
The EM algorithm was proposed and named in a seminal paper published in 1977 by
Arthur Dempster, Nan Laird, and Donald Rubin.
EM is often described as a way of handling missing data. By using the instances in which
a variable is observed, the algorithm learns patterns and relationships from the observed
data. These learned patterns can then be used to estimate the values of the variable in
instances where it is missing or not observable.

The Expectation-Maximization (EM) algorithm is an iterative optimization method for
estimating the parameters of models that involve unobserved latent variables; many
unsupervised learning methods, such as mixture-model clustering, are built on it.
The EM algorithm is commonly used for latent variable models and can handle missing
data. It consists of an expectation step (E-step) and a maximization step (M-step), forming
an iterative process to improve model fit.
• In the E step, the algorithm uses the current parameter estimates to compute the
posterior distribution of the latent variables and, from it, the expected
log-likelihood.
• In the M step, the algorithm determines the parameters that maximize the
expected log-likelihood obtained in the E step, and the model parameters are
updated accordingly.
Expectation-Maximization in EM Algorithm

By iteratively repeating these steps, the EM algorithm seeks to maximize the likelihood of
the observed data. It is commonly used for unsupervised learning tasks such as clustering,
where latent variables are inferred, and it has applications in various fields, including
machine learning, computer vision, and natural language processing.

Key Terms in Expectation-Maximization (EM) Algorithm

Some of the most commonly used key terms in the Expectation-Maximization (EM)
Algorithm are as follows:

Latent Variables: Latent variables are unobserved variables in statistical models that
can only be inferred indirectly through their effects on observable variables. They
cannot be directly measured but can be detected by their impact on the observable
variables.
Likelihood: It is the probability of observing the given data given the parameters of the
model. In the EM algorithm, the goal is to find the parameters that maximize the
likelihood.
Log-Likelihood: It is the logarithm of the likelihood function, which measures the
goodness of fit between the observed data and the model. EM algorithm seeks to
maximize the log-likelihood.
Maximum Likelihood Estimation (MLE): MLE is a method to estimate the parameters
of a statistical model by finding the parameter values that maximize the likelihood
function, which measures how well the model explains the observed data.
Posterior Probability: In the context of Bayesian inference, the EM algorithm can be
extended to estimate the maximum a posteriori (MAP) estimates, where the posterior
probability of the parameters is calculated based on the prior distribution and the
likelihood function.
Expectation (E) Step: The E-step of the EM algorithm computes the expected value or
posterior probability of the latent variables given the observed data and current
parameter estimates. It involves calculating the probabilities of each latent variable for
each data point.
Maximization (M) Step: The M-step of the EM algorithm updates the parameter
estimates by maximizing the expected log-likelihood obtained from the E-step. It
involves finding the parameter values that optimize this function, either in closed
form (as for Gaussian mixtures) or through numerical optimization methods.
Convergence: Convergence refers to the condition when the EM algorithm has reached
a stable solution. It is typically determined by checking if the change in the log-
likelihood or the parameter estimates falls below a predefined threshold.

Working principle of Expectation-Maximization (EM) Algorithm

The essence of the Expectation-Maximization algorithm is to use the available observed
data of the dataset to estimate the missing data and then use that data to update the values
of the parameters. Let us understand the EM algorithm in detail.

EM Algorithm Flowchart

1. Initialization:
Initially, a set of initial values of the parameters are considered. A set of incomplete
observed data is given to the system with the assumption that the observed data
comes from a specific model.
2. E-Step (Expectation Step): In this step, we use the observed data in order to
estimate or guess the values of the missing or incomplete data. It is basically used to
update the variables.
Compute the posterior probability or responsibility of each latent variable given the
observed data and current parameter estimates.

• Estimate the missing or incomplete data values using the current parameter
estimates.
• Compute the log-likelihood of the observed data based on the current
parameter estimates and estimated missing data.
3. M-step (Maximization Step): In this step, we use the complete data generated in
the preceding “Expectation” – step in order to update the values of the parameters. It is
basically used to update the hypothesis.
• Update the parameters of the model by maximizing the expected complete
data log-likelihood obtained from the E-step.
• This typically involves solving optimization problems to find the parameter
values that maximize the log-likelihood.
• The specific optimization technique used depends on the nature of the
problem and the model being used.
4. Convergence: In this step, it is checked whether the values are converging or not, if
yes, then stop otherwise repeat step-2 and step-3 i.e. “Expectation” – step and
“Maximization” – step until the convergence occurs.

• Check for convergence by comparing the change in log-likelihood or the
parameter values between iterations.
• If the change is below a predefined threshold, stop and consider the
algorithm converged.
• Otherwise, go back to the E-step and repeat the process until convergence is
achieved.
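The four steps above can be traced on the classic two-coin problem: several sets of tosses are recorded, but which of two biased coins produced each set is unobserved. The sketch below is a minimal illustration under stated assumptions: the head counts, the starting guesses, the equal prior over the two coins, and the function name `em_two_coins` are all chosen for this example.

```python
def em_two_coins(heads, n_tosses, theta_a, theta_b, tol=1e-6, max_iter=1000):
    """Estimate the head probabilities of two coins when the coin used
    for each set of tosses is a latent (unobserved) variable."""
    for _ in range(max_iter):
        # E-step: responsibility of each coin for each set of tosses,
        # then the expected head/tail counts attributed to each coin.
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h in heads:
            t = n_tosses - h
            like_a = theta_a ** h * (1.0 - theta_a) ** t
            like_b = theta_b ** h * (1.0 - theta_b) ** t
            resp_a = like_a / (like_a + like_b)  # assumes equal prior on coins
            resp_b = 1.0 - resp_a
            heads_a += resp_a * h
            tails_a += resp_a * t
            heads_b += resp_b * h
            tails_b += resp_b * t
        # M-step: closed-form maximum-likelihood update of each bias.
        new_a = heads_a / (heads_a + tails_a)
        new_b = heads_b / (heads_b + tails_b)
        # Convergence check: stop when the parameters barely change.
        change = max(abs(new_a - theta_a), abs(new_b - theta_b))
        theta_a, theta_b = new_a, new_b
        if change < tol:
            break
    return theta_a, theta_b

# Assumed data: number of heads in five sets of ten tosses each.
heads = [5, 9, 8, 4, 7]
theta_a, theta_b = em_two_coins(heads, n_tosses=10, theta_a=0.6, theta_b=0.5)
print(round(theta_a, 2), round(theta_b, 2))
```

The binomial coefficient is left out of the likelihoods because it is the same for both coins and cancels when the responsibilities are normalised.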
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent variables
through observed data in datasets. The EM algorithm or latent variable model has a broad
range of real-life applications in machine learning. These are as follows:
o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (natural language processing).
o It is used to estimate the parameters of mixture models such as the Gaussian
Mixture Model, and of mixed models in quantitative genetics.
o It is also used in psychometrics for estimating item parameters and latent abilities of
item response theory models.
o It is also applicable in medical image reconstruction and in structural
engineering.
o It is used to estimate the parameters of Gaussian densities.
Advantages of EM algorithm
o The two basic steps, the E-step and the M-step, are often easy to implement for
many machine learning problems.
o The likelihood is guaranteed not to decrease after any iteration.
o The M-step often has a closed-form solution.
Disadvantages of EM algorithm
o Convergence of the EM algorithm can be slow.
o It is only guaranteed to converge to a local optimum, not the global one.
o It requires both forward and backward probabilities, whereas direct numerical
optimization requires only the forward probability.

See: Expectation-Maximization | EM | Algorithm Steps Uses Advantages and Disadvantages, by
Mahesh Huddar.

For a worked example using the EM algorithm, see: Expectation Maximization | EM Algorithm
Solved Example | Coin Flipping Problem | EM, by Mahesh Huddar.

Gaussian Mixture Model (GMM)

The Gaussian Mixture Model or GMM is a mixture model whose density is a weighted
combination of Gaussian probability distributions with unknown parameters. Each
component requires estimated statistics such as a mean and a standard deviation. Fitting a
GMM means estimating the parameters of these distributions so that they best fit the
density of a given training dataset.

Although there are plenty of techniques available to estimate the parameters of the Gaussian
Mixture Model (GMM), Maximum Likelihood Estimation is one of the most popular
among them.
Which component generated each data point is not observed, so it is treated as a latent
variable. In such cases, the Expectation-Maximization algorithm is one of the best
techniques for estimating the parameters of the Gaussian distributions. In the EM algorithm,
the E-step estimates the expected value of each latent variable, whereas the M-step
re-estimates the parameters using Maximum Likelihood Estimation (MLE) given those
expected values. This process is repeated until the latent values and parameters stop
changing appreciably and a (local) maximum of the likelihood is reached.
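The E-step and M-step for a Gaussian mixture can be sketched as follows for one-dimensional data. This is an illustrative sketch, not a production implementation: the function name `fit_gmm_1d`, the quantile-based initialisation, the small floor on the standard deviations, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def fit_gmm_1d(x, n_components=2, n_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture with EM; returns weights, means, stds."""
    # Assumed initialisation: means spread over quantiles, shared std,
    # equal mixing weights.
    means = np.quantile(x, np.linspace(0.1, 0.9, n_components))
    stds = np.full(n_components, x.std())
    weights = np.full(n_components, 1.0 / n_components)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: weighted density of every point under every component,
        # normalised per point to give the responsibilities.
        dens = (weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2)
                / (stds * np.sqrt(2.0 * np.pi)))
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # Log-likelihood of the data under the parameters used in this E-step.
        ll = float(np.log(total).sum())
        # M-step: closed-form MLE updates weighted by the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)
        stds = np.maximum(stds, 1e-6)   # guard against a collapsing component
        # Convergence: stop when the log-likelihood stops improving.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, stds

# Synthetic data drawn from two Gaussians (assumed for illustration).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
weights, means, stds = fit_gmm_1d(x)
print(weights, means, stds)
```

Because EM only finds a local maximum, a different initialisation can give a different fit; running it from several starting points and keeping the best log-likelihood is a common precaution.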

Finite and countable mixtures

Given a finite set of probability density functions $p_1(x), \ldots, p_n(x)$, or corresponding
cumulative distribution functions $P_1(x), \ldots, P_n(x)$, and weights $w_1, \ldots, w_n$ such
that $w_i \ge 0$ and $\sum_i w_i = 1$, the mixture distribution can be represented by writing
either the density, $f$, or the distribution function, $F$, as a sum (which in both cases is a
convex combination):

$$F(x) = \sum_{i=1}^{n} w_i P_i(x), \qquad f(x) = \sum_{i=1}^{n} w_i p_i(x).$$

This type of mixture, being a finite sum, is called a finite mixture, and in
applications, an unqualified reference to a "mixture density" usually means a finite
mixture. The case of a countably infinite set of components is covered formally by
allowing $n = \infty$.
MIXTURE OF LATENT VARIABLE MODELS

A mixture of latent variable models is a powerful framework in machine learning and
statistics that combines the strengths of mixture models and latent variable models. This
approach is particularly useful for capturing complex data distributions, clustering, and
dimensionality reduction.

Key Concepts:
1. Mixture Models:
o A mixture model is a probabilistic model that assumes the data is generated
from a mixture of several distributions (e.g., Gaussian distributions). Each
component in the mixture corresponds to a cluster or subgroup in the data.
o The most common example is the Gaussian Mixture Model (GMM), where
each component is a Gaussian distribution.
2. Latent Variable Models:
o Latent variable models assume that the observed data is generated from some
underlying latent (unobserved) variables. These models are often used for
dimensionality reduction or to capture hidden structure in the data.
o Examples include Factor Analysis, Probabilistic Principal Component
Analysis (PPCA), and Variational Autoencoders (VAEs).
3. Mixture of Latent Variable Models:
o This combines the two ideas: the data is assumed to be generated from a
mixture of distributions, where each component is itself a latent variable
model.
o For example, a mixture of factor analyzers (MFA) assumes that the data is
generated from a mixture of factor analysis models, where each component
has its own latent variables.
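The generative story behind a mixture of factor analyzers can be sketched by sampling from one: first pick a component, then draw a low-dimensional latent vector, then map it linearly into data space and add noise. Everything specific here is an assumption for illustration: the function name `sample_mfa`, the choice of a diagonal noise term shared by all components, and the example parameters.

```python
import numpy as np

def sample_mfa(n, weights, means, loadings, noise_std, seed=0):
    """Sample n points from a mixture of factor analyzers.

    weights   : (K,) mixing coefficients
    means     : (K, d) component means
    loadings  : list of K factor-loading matrices, each of shape (d, q)
    noise_std : (d,) per-dimension noise standard deviation (assumed shared)
    """
    rng = np.random.default_rng(seed)
    d, q = loadings[0].shape
    comp = rng.choice(len(weights), size=n, p=weights)  # discrete latent: component
    z = rng.standard_normal((n, q))                     # continuous latent factors
    eps = rng.standard_normal((n, d)) * noise_std       # independent noise
    x = np.empty((n, d))
    for i, k in enumerate(comp):
        # Linear transformation of the latent factors plus mean and noise.
        x[i] = means[k] + loadings[k] @ z[i] + eps[i]
    return x, comp, z

# Assumed example: two components in 3-D, each with a 1-D latent space.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
loadings = [np.array([[1.0], [0.5], [0.0]]), np.array([[0.0], [1.0], [1.0]])]
noise_std = np.array([0.1, 0.1, 0.1])
x, comp, z = sample_mfa(2000, weights, means, loadings, noise_std)
print(x.shape, np.bincount(comp))
```

Each component spreads its points along its own direction (the column of its loading matrix), which is what "each component has its own latent variables" means in practice.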

Applications:
1. Clustering:
o The mixture components can represent different clusters in the data, while the
latent variables capture the structure within each cluster.
o For example, in a mixture of factor analyzers, each cluster has its own
low-dimensional representation.
2. Dimensionality Reduction:
o Latent variable models reduce the dimensionality of the data, and the mixture
model allows for multiple low-dimensional representations corresponding to
different clusters.
3. Density Estimation:
o Mixture of latent variable models can model complex, multi-modal data
distributions more effectively than simple models.

Examples of Mixture of Latent Variable Models:


1. Mixture of Factor Analyzers (MFA):
o Each component in the mixture is a factor analysis model, which assumes that
the data is generated from a linear transformation of latent variables plus
noise.
o MFA is useful for high-dimensional data clustering and dimensionality
reduction.
2. Mixture of Probabilistic Principal Component Analyzers (MPPCA):
o Similar to MFA, but each component is a probabilistic PCA model. This is a
special case of MFA where the noise covariance is isotropic.
3. Mixture of Variational Autoencoders:
o Combines the flexibility of VAEs with the clustering ability of mixture
models. Each component in the mixture is a VAE, which can model complex
non-linear relationships in the data.
4. Mixture of Hidden Markov Models (MHMM):
o Used in sequential data, where each component is an HMM that captures
temporal dependencies in the data.

Advantages:
• Flexibility: Can model complex, multi-modal data distributions.
• Interpretability: Latent variables provide insights into the underlying structure of the
data.
• Scalability: Can handle high-dimensional data by reducing dimensionality within
each cluster.

Challenges:
1. Model Selection:
o Choosing the number of mixture components and the dimensionality of the
latent variables can be difficult.
o Techniques like cross-validation, Bayesian Information Criterion (BIC), or
Dirichlet Process Mixtures can help.
2. Optimization:
o Training mixture of latent variable models often involves non-convex
optimization, which can be computationally expensive and prone to local
optima.
o Expectation-Maximization (EM) or variational inference are commonly used.
3. Overfitting:
o Complex models with many parameters can overfit the data, especially with
limited samples. Regularization or Bayesian approaches can mitigate this.
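As a small illustration of model selection with BIC, the criterion can be computed as BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of samples and L the maximized likelihood; lower values are preferred. In the sketch below the function names are hypothetical and the log-likelihood values are made-up numbers used only to show the mechanics, not results from any real fit.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

def gmm_1d_param_count(n_components):
    """Free parameters of a 1-D Gaussian mixture:
    K means + K variances + (K - 1) independent mixing weights."""
    return 3 * n_components - 1

# Made-up maximized log-likelihoods for K = 1, 2, 3 (illustration only).
n_samples = 1000
log_likelihoods = {1: -2100.0, 2: -1850.0, 3: -1848.0}

scores = {k: bic(ll, gmm_1d_param_count(k), n_samples)
          for k, ll in log_likelihoods.items()}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

With these numbers, the third component improves the likelihood only slightly, so the penalty term outweighs the gain and the two-component model gets the lowest BIC.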

SUPERVISED LEARNING AFTER CLUSTERING

After clustering, supervised learning can be applied in several ways depending on the specific
goal of your analysis. Here are a few common approaches:

Label Propagation: If you have labels for a subset of your data, you can propagate these
labels to the entire dataset based on the clusters. For example, if most of the data points in a
cluster have a certain label, you can assign that label to all data points in that cluster.
Cluster as a Feature: You can treat the cluster assignments as additional features in your
dataset and then use these features in a supervised learning model. This can sometimes
improve the performance of the model, especially if the clusters capture useful information
about the data.
Cluster-Specific Models: You can train a separate supervised learning model for each
cluster. This allows you to capture the different characteristics of each cluster and can
sometimes lead to better performance compared to a single model for the entire dataset.
Cluster Ensemble: You can create an ensemble of supervised learning models, where each
model is trained on a different cluster. This can help capture the heterogeneity of the data and
improve the overall performance of the ensemble.

Supervised learning after clustering is a hybrid approach where clustering is first used to
identify structure in the data, followed by supervised learning to make predictions based on
the discovered clusters. This technique is useful when labels are scarce or noisy, and
clustering can help extract meaningful representations before classification or regression.

Steps in Supervised Learning After Clustering

1. Perform Clustering
o Apply clustering algorithms like K-Means, DBSCAN, Hierarchical
Clustering, or Gaussian Mixture Models (GMM) to group similar data
points.
o This helps discover inherent structures, segment the data, or reduce noise.
2. Assign Cluster Labels
o Use the cluster assignments as new categorical features for supervised
learning.
o Optionally, analyze clusters manually and assign meaningful labels if
available.
3. Train a Supervised Learning Model
o Use labeled data (if available) or pseudo-labels derived from clusters.
o Common models: Decision Trees, Random Forests, Support Vector
Machines (SVM), Neural Networks.
o Features can include original attributes plus cluster membership.
4. Evaluate Performance
o Compare models trained with and without clustering.
o Measure metrics like accuracy, precision, recall, F1-score for classification
or RMSE, MAE for regression.

Use Cases

• Anomaly Detection: Clustering helps identify rare patterns before classification.
• Customer Segmentation: Clustered customers are then classified by purchasing
behavior.
• Medical Diagnosis: Clusters may represent hidden disease subtypes, later used for
classification.
• Cybersecurity: Clustering suspicious behavior in network traffic before supervised
detection.

Supervised learning after clustering


Combining clustering (an unsupervised learning technique) with supervised learning is a
common strategy in machine learning, especially when you want to leverage the structure
discovered by clustering to improve the performance of a supervised model. Depending on
how the clusters are used, this approach is referred to as semi-supervised learning (when
clusters help label unlabeled data) or as feature engineering using clustering.
Below is a detailed explanation of how to use clustering in conjunction with supervised
learning:

1. Clustering as Feature Engineering


• Idea: Use the results of clustering (e.g., cluster labels or distances to cluster centroids)
as additional features in a supervised learning model.
• Steps:
1. Perform clustering on the dataset (e.g., using K-Means, Gaussian Mixture
Models, or DBSCAN).
2. Extract cluster-related features, such as:
▪ Cluster labels (categorical feature).
▪ Distance to each cluster centroid (continuous features).
▪ Probability of belonging to each cluster (for soft clustering like GMM).
3. Add these features to the original dataset.
4. Train a supervised model (e.g., regression, classification) on the augmented
dataset.
• Example:
o In a customer segmentation problem, you might cluster customers based on
their behavior and then use the cluster labels as a feature in a churn prediction
model.
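The feature-extraction step can be sketched as below. The sketch assumes that cluster centroids have already been obtained (for example from K-Means); the function name `cluster_features` and the tiny arrays are hypothetical and used only for illustration.

```python
import numpy as np

def cluster_features(X, centroids):
    """Append cluster-derived features to X.

    Returns the augmented matrix [X | one-hot cluster label | distances]
    together with the hard cluster labels.
    """
    # Distance from every point to every centroid (continuous features).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Hard cluster label = index of the nearest centroid.
    labels = dists.argmin(axis=1)
    # One-hot encoding so the label is treated as categorical, not ordinal.
    one_hot = np.eye(len(centroids))[labels]
    return np.hstack([X, one_hot, dists]), labels

# Assumed toy data and centroids.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
centroids = np.array([[0.5, 0.0], [10.0, 10.0]])
X_aug, labels = cluster_features(X, centroids)
print(X_aug.shape, labels)
```

For a soft clustering method such as a GMM, the one-hot block would be replaced by the per-cluster membership probabilities.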

2. Cluster-Based Model Ensembling


• Idea: Train separate supervised models for each cluster and combine their predictions.
• Steps:
1. Cluster the data into distinct groups.
2. Train a supervised model (e.g., classifier or regressor) on each cluster
separately.
3. For a new data point, first assign it to a cluster, then use the corresponding
model to make a prediction.
• Example:
o In a recommendation system, you might cluster users into different groups
based on their preferences and train separate recommendation models for each
group.
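A minimal sketch of this idea follows, using one least-squares linear regressor per cluster. The class name `PerClusterLinearRegression`, the use of fixed, previously computed centroids, and the toy data are assumptions for illustration; any supervised model could take the place of the linear fit.

```python
import numpy as np

class PerClusterLinearRegression:
    """Fits a separate linear model inside each cluster and routes each
    new point to the model of its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = np.asarray(centroids, dtype=float)
        self.coefs = {}   # cluster index -> coefficients (slopes..., intercept)

    def _assign(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return d.argmin(axis=1)

    def fit(self, X, y):
        labels = self._assign(X)
        for k in range(len(self.centroids)):
            mask = labels == k
            if mask.any():
                # Design matrix with an extra column of ones for the intercept.
                A = np.column_stack([X[mask], np.ones(mask.sum())])
                self.coefs[k] = np.linalg.lstsq(A, y[mask], rcond=None)[0]
        return self

    def predict(self, X):
        # Note: a cluster that had no training points has no model, so a
        # point routed to it would raise a KeyError in this simple sketch.
        labels = self._assign(X)
        A = np.column_stack([X, np.ones(len(X))])
        return np.array([A[i] @ self.coefs[k] for i, k in enumerate(labels)])

# Toy data: y = 2x + 1 near x = 0, and y = -x + 3 near x = 10.
X = np.array([[-1.0], [0.0], [1.0], [9.0], [10.0], [11.0]])
y = np.array([-1.0, 1.0, 3.0, -6.0, -7.0, -8.0])
model = PerClusterLinearRegression(centroids=[[0.0], [10.0]]).fit(X, y)
pred = model.predict(np.array([[0.5], [10.5]]))
print(pred)
```

A single linear model over all six points could not capture both trends, which is the situation in which cluster-specific models tend to help.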

3. Semi-Supervised Learning
• Idea: Use clustering to propagate labels in a partially labeled dataset.
• Steps:
1. Perform clustering on the entire dataset (both labeled and unlabeled data).
2. Use the cluster structure to infer labels for the unlabeled data (e.g., by
assigning the majority label of the cluster).
3. Train a supervised model on the now fully labeled dataset.
• Example:
o In a text classification task, you might cluster documents and use the cluster
assignments to label unlabeled documents, then train a classifier on the
expanded dataset.
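Step 2, assigning the majority label of each cluster to its unlabeled members, can be sketched as follows. The function name `propagate_labels` and the convention that `-1` marks an unlabeled point are assumptions made for this example.

```python
import numpy as np

def propagate_labels(cluster_ids, labels, unlabeled=-1):
    """Fill in missing labels with the majority label of each cluster.

    Clusters that contain no labeled point are left unlabeled.
    """
    filled = labels.copy()
    for c in np.unique(cluster_ids):
        in_cluster = cluster_ids == c
        known = labels[in_cluster & (labels != unlabeled)]
        if known.size == 0:
            continue                      # nothing to propagate from
        values, counts = np.unique(known, return_counts=True)
        majority = values[counts.argmax()]
        filled[in_cluster & (labels == unlabeled)] = majority
    return filled

# Toy example: three clusters, only a few points carry labels.
cluster_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2])
labels = np.array([1, -1, 1, 0, -1, -1, -1, -1])
filled = propagate_labels(cluster_ids, labels)
print(filled)
```

The points that remain `-1` would either be dropped or labeled by hand before the supervised model is trained in step 3.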

4. Cluster-Based Regularization
• Idea: Use clustering to guide the training of a supervised model by incorporating
cluster information into the loss function.
• Steps:
1. Perform clustering on the dataset.
2. Modify the loss function of the supervised model to encourage similar
predictions for data points in the same cluster.
• Example:
o In a deep learning model, you might add a regularization term to the loss
function that penalizes differences in predictions for points within the same
cluster.
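One simple way to express such a penalty is to add the within-cluster variation of the predictions to an ordinary squared-error loss. The sketch below is only one possible form: the function name `cluster_regularized_loss`, the choice of mean squared error as the base loss, and the weighting factor `lam` are assumptions for illustration.

```python
import numpy as np

def cluster_regularized_loss(pred, target, cluster_ids, lam=0.1):
    """Mean squared error plus a penalty on how much the predictions
    differ among points that share a cluster."""
    mse = np.mean((pred - target) ** 2)
    penalty = 0.0
    for c in np.unique(cluster_ids):
        p = pred[cluster_ids == c]
        # Squared deviation of each prediction from its cluster's mean.
        penalty += np.sum((p - p.mean()) ** 2)
    return float(mse + lam * penalty / len(pred))

# Toy example: perfect predictions, but cluster 0 has differing outputs.
pred = np.array([1.0, 3.0, 5.0, 5.0])
target = np.array([1.0, 3.0, 5.0, 5.0])
cluster_ids = np.array([0, 0, 1, 1])
loss = cluster_regularized_loss(pred, target, cluster_ids, lam=0.5)
print(loss)
```

Setting `lam` to zero recovers the plain loss; larger values push the model towards giving similar outputs within each cluster, at the cost of fitting the targets less closely.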

5. Cluster-Based Data Augmentation


• Idea: Use clustering to generate synthetic data for training a supervised model.
• Steps:
1. Perform clustering on the dataset.
2. For each cluster, generate synthetic data points (e.g., by sampling from the
cluster distribution or using techniques like SMOTE).
3. Train a supervised model on the augmented dataset.
• Example:
o In an imbalanced classification problem, you might oversample minority
clusters to balance the dataset.

6. Cluster-Based Model Interpretation


• Idea: Use clustering to interpret or explain the behavior of a supervised model.
• Steps:
1. Train a supervised model on the dataset.
2. Cluster the data based on the model's predictions or intermediate
representations (e.g., hidden layer activations in a neural network).
3. Analyze the clusters to understand how the model is making decisions.
• Example:
o In a fraud detection model, you might cluster the predictions to identify
different types of fraudulent behavior.

Practical Considerations:
• Choice of Clustering Algorithm:
o The choice of clustering algorithm (e.g., K-Means, DBSCAN, hierarchical
clustering) depends on the data and the problem. For example, DBSCAN is
better for data with noise, while K-Means is faster for large datasets.
• Number of Clusters:
o The number of clusters is a hyperparameter that can significantly impact
performance. Use techniques like the elbow method, silhouette score, or
domain knowledge to choose an appropriate number.
• Feature Scaling:
o Clustering algorithms like K-Means are sensitive to the scale of features, so
ensure proper normalization or standardization before clustering.
• Overfitting:
o Be cautious when using clustering results as features, as they might introduce
noise or overfitting if the clusters do not generalize well to new data.
Example Workflow:
1. Dataset: A dataset with features X and labels y (for supervised learning).
2. Clustering:
o Apply K-Means clustering to X to create k clusters.
o Add the cluster labels as a new feature to X.
3. Supervised Learning:
o Split the augmented dataset into training and test sets.
o Train a classifier (e.g., Random Forest) on the training set.
o Evaluate the model on the test set.
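The workflow can be sketched end to end with NumPy alone. Several things here are assumptions made to keep the sketch self-contained: the synthetic dataset, the small hand-written K-Means, and, in particular, a 1-nearest-neighbour classifier standing in for the Random Forest mentioned above. K-Means is also fitted on the training rows only, a slight reordering of the steps, so that the test rows do not influence the clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Basic K-Means (Lloyd's algorithm); returns the centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters unchanged
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    return np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)

def predict_1nn(X_train, y_train, X_test):
    """1-nearest-neighbour classifier (stand-in for a Random Forest)."""
    d = np.linalg.norm(X_test[:, None] - X_train[None], axis=2)
    return y_train[d.argmin(axis=1)]

# 1. Dataset: three synthetic blobs; two of them share class 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
y = np.repeat([0, 1, 0], 100)

# Train/test split (75% / 25%) on shuffled indices.
order = rng.permutation(len(X))
train, test = order[:225], order[225:]

# 2. Clustering: fit on training rows, then add the cluster label as a feature.
centroids = kmeans(X[train], k=3)
X_aug = np.column_stack([X, assign(X, centroids)])

# 3. Supervised learning: train on the augmented features and evaluate.
pred = predict_1nn(X_aug[train], y[train], X_aug[test])
accuracy = float((pred == y[test]).mean())
print(accuracy)
```

To judge whether the cluster feature actually helps, the same classifier can be run on `X` without the extra column and the two accuracies compared.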

Benefits of Combining Clustering with Supervised Learning:


• Improved Performance: Clustering can reveal hidden patterns in the data that
improve the predictive power of the supervised model.
• Interpretability: Clusters can provide insights into the structure of the data, making
the model's predictions more interpretable.
• Handling Unlabeled Data: Clustering can be used to infer labels in semi-supervised
learning scenarios.

By integrating clustering with supervised learning, you can leverage the strengths of both
approaches to build more robust and interpretable models.
