Module 3_ Machine Learning Algorithms

Module 3 covers machine learning classification, detailing how algorithms categorize data into binary, multiclass, and multi-label classifications using labeled datasets. It explains the steps of model training, evaluation, and prediction, along with examples of real-life applications. Additionally, it introduces artificial neural networks, their structure, learning process, and various types, alongside regression techniques like linear regression for predicting continuous values.

Uploaded by

727824TUIO042
Module 3: Machine Learning Algorithms
CLASSIFICATION
Classification teaches a machine to sort items into categories.

It learns from labeled examples (e.g., emails marked as "spam" or "not spam").

After learning, it can categorize new items (e.g., identifying if a new email is spam).

Example: A model trained on images of dogs and cats can predict the class of new images
based on features like color, texture, and shape.

The horizontal axis represents the combined values of color and texture features.

The vertical axis represents the combined values of shape and size features.

Each colored dot represents an individual image, with the color indicating the model's
prediction (dog or cat).

Shaded areas show the decision boundary, which the model uses to decide which category
an image belongs to.

Types of Classification
Classification sorts data into categories based on features.

1. Binary Classification
Sorts data into two distinct categories.

Like making a choice between two options.

Example: A system that sorts emails into spam or not spam.

It examines email features (keywords, sender details) and decides if it's spam.

2. Multiclass Classification
Sorts data into more than two categories.

The model picks the category that best matches the input.

Example: An image recognition system that sorts pictures of animals into categories like
cat, dog, and bird.

The system looks at features (shape, color, texture) and chooses the most likely animal.

3. Multi-Label Classification
A single piece of data can belong to multiple categories at once.
Differs from multiclass classification, where each data point belongs to only one class.

Example: A movie recommendation system tags a movie as both action and comedy.

The system checks features (plot, actors, genre tags) and assigns multiple labels to a single
piece of data.

How Classification in Machine Learning Works


Classification involves training a model using a labeled dataset.

Each input is paired with its correct output label.

The model learns patterns and relationships in the data.

It can then predict labels for new, unseen inputs.

Steps:
1. Data Collection: Start with a dataset where each item is labeled with the correct class (e.g.,
"cat" or "dog").

2. Feature Extraction: Identify features (color, shape, texture) that distinguish one class from
another.

3. Model Training: The classification algorithm uses labeled data to learn how to map the
features to the correct class, looking for patterns and relationships.

4. Model Evaluation: Test the trained model on unseen data to check its classification
accuracy.

5. Prediction: The model predicts the class of new data based on learned features.

6. Model Re-evaluation: Check how well the model performs on new data using different
metrics.

If the quality metric is not satisfactory, adjust the ML algorithm or hyperparameters and
retrain the model until satisfactory performance is achieved.

Classification in machine learning involves using labeled data to teach the model how to
predict the class of new, unlabeled data based on learned patterns.
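
The steps above can be sketched with a very small classifier. This is only an illustration, not the method the notes prescribe: it assumes a nearest-centroid rule, and the two numeric features and all data values are hypothetical stand-ins for features extracted from images.

```python
# Toy labeled dataset: (feature vector, label).
# The two features are hypothetical numeric summaries of an image.
train = [
    ((0.9, 0.2), "cat"), ((0.8, 0.3), "cat"), ((0.7, 0.1), "cat"),
    ((0.2, 0.9), "dog"), ((0.3, 0.8), "dog"), ((0.1, 0.7), "dog"),
]
# Held-out examples the model never sees during training.
test = [((0.85, 0.25), "cat"), ((0.15, 0.75), "dog")]


def fit_centroids(examples):
    """Model training: average the feature vectors of each class."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Prediction: choose the class whose centroid is closest."""
    def squared_distance(center):
        return sum((f - c) ** 2 for f, c in zip(features, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, examples):
    """Model evaluation: fraction of examples classified correctly."""
    correct = sum(predict(centroids, x) == y for x, y in examples)
    return correct / len(examples)


centroids = fit_centroids(train)
test_accuracy = accuracy(centroids, test)
new_label = predict(centroids, (0.75, 0.2))  # classify a new, unlabeled item
```

If `test_accuracy` were too low, the retraining step would correspond to changing the features or swapping the nearest-centroid rule for another algorithm.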

Examples of Machine Learning Classification in Real Life
Email spam filtering

Credit risk assessment: Predicts loan default likelihood based on credit score, income, and
loan history.

Medical diagnosis: Classifies whether a patient has a condition (e.g., cancer, diabetes) based
on medical data.

Image classification: Used in facial recognition, autonomous driving, and medical imaging.

Sentiment analysis: Determines if the sentiment of a text is positive, negative, or neutral.

Fraud detection: Detects fraudulent activities by analyzing transaction patterns.

Recommendation systems: Recommends products or content based on past user behavior.


Classification Modeling in Machine Learning
Classification modeling uses machine learning algorithms to categorize data into
predefined classes or labels.

Key characteristics:
1. Class Separation: Distinguishes between distinct classes.

2. Decision Boundaries: Draws decision boundaries in the feature space.

3. Sensitivity to Data Quality: Requires well-labeled, representative data.

4. Handling Imbalanced Data: Uses techniques like resampling or weighting to handle class
imbalances.

5. Interpretability: Some algorithms like Decision Trees offer higher interpretability.

Classification Algorithms
Linear Classifiers:
Create a linear decision boundary between classes.

Simple and computationally efficient.

Examples: Logistic Regression, Support Vector Machines (with linear kernel), Single-layer
Perceptron, Stochastic Gradient Descent (SGD) Classifier

Non-linear Classifiers:
Create a non-linear decision boundary between classes.

Capture more complex relationships between input features and the target variable.

Examples: K-Nearest Neighbors, Kernel SVM, Naive Bayes, Decision Tree Classification

Ensemble learning classifiers:


Random Forests, AdaBoost, Bagging Classifier, Voting Classifier, Extra Trees Classifier,
Multi-layer Artificial Neural Networks

Decision Tree in Machine Learning


A supervised learning algorithm used for classification and regression tasks.

Models decisions as a tree-like structure.

Internal nodes represent attribute tests.

Branches represent attribute values.

Leaf nodes represent final decisions or predictions.

Versatile, interpretable, and widely used for predictive modeling.

Intuition behind the Decision Tree


Imagine you’re deciding whether to buy an umbrella:
1. Step 1 – Ask a Question (Root Node): Is it raining? If yes, you might decide to
buy an umbrella. If no, you move to the next question.

2. Step 2 – More Questions (Internal Nodes):

Is it likely to rain later? If yes, you buy an umbrella; if no, you don’t.

3. Step 3 – Decision (Leaf Node): Based on your answers, you either buy or skip the
umbrella

Example: Predicting Whether a Person Likes Computer Games


1. Start with the Root Question (Age):

The first question is: Is the person's age less than 15?

If Yes, move to the left.

If No, move to the right.

2. Branch Based on Age:

If the person is younger than 15, they are likely to enjoy computer games (+2
prediction score).

If the person is 15 or older, ask the next question: Is the person male?

3. Branch Based on Gender (For Age 15+):

If the person is male, they are somewhat likely to enjoy computer games (+0.1
prediction score).

If the person is not male, they are less likely to enjoy computer games (-1
prediction score)

Example: Predicting Whether a Person Likes Computer Games Using Two Decision Trees
Tree 1: Age and Gender

The first tree asks two questions:

Is the person’s age less than 15?

If Yes, they get a score of +2. If No, proceed to the next question.

Is the person male?

If Yes, they get a score of +0.1. If No, they get a score of -1.

Tree 2: Computer Usage

The second tree focuses on daily computer usage:

Does the person use a computer daily?

If Yes, they get a score of +0.9. If No, they get a score of -0.9.

Combining Trees: Final Prediction

The final prediction score is the sum of scores from both trees
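
The two trees and their combination can be written out directly. This is a minimal sketch that hard-codes the questions and scores from the example; the function and parameter names are illustrative choices, not part of the notes.

```python
def tree_age_gender(age, is_male):
    """Tree 1: first ask about age, then about gender."""
    if age < 15:
        return 2.0
    return 0.1 if is_male else -1.0


def tree_computer_use(uses_computer_daily):
    """Tree 2: a single question about daily computer usage."""
    return 0.9 if uses_computer_daily else -0.9


def predict_score(age, is_male, uses_computer_daily):
    """Final prediction: the sum of the scores from both trees."""
    return tree_age_gender(age, is_male) + tree_computer_use(uses_computer_daily)


young_daily_user = predict_score(age=12, is_male=True, uses_computer_daily=True)
older_non_user = predict_score(age=40, is_male=False, uses_computer_daily=False)
```

A higher combined score means the person is predicted to be more likely to enjoy computer games: the 12-year-old daily user receives 2 + 0.9, while the 40-year-old non-user receives -1 + (-0.9).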

Attribute Selection Measures


1. Information Gain

2. Gini Index

Information Gain

Measures the usefulness of a question (or feature) for splitting data into groups.

Tells us how much the uncertainty decreases after the split.

A good question creates clearer groups.

The feature with the highest Information Gain is chosen to make the decision.

Example: Splitting people into "Young" and "Old" based on age, where all young people
bought a product while all old people did not, would result in high Information Gain
because the split perfectly separates the groups with no uncertainty.

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$


Where:

S is a set of instances

A is an attribute

$S_v$ is the subset of S for which attribute A has the value v

v represents an individual value that the attribute A can take

Values(A) is the set of all possible values of A

Entropy

Measures the uncertainty of a random variable.

Characterizes the impurity of an arbitrary collection of examples.

Higher entropy means more information content.

Example: If a dataset has an equal number of "Yes" and "No" outcomes, the entropy is high
because it's uncertain which outcome to predict. If all outcomes are the same, the entropy
is 0.

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$

Where:

S is a set of instances, A is an attribute,

$S_v$ is the subset of S with A = v, and

Values(A) is the set of all possible values of A.
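
The gain formula can be checked numerically. The sketch below assumes the standard Shannon definition of entropy, Entropy(S) = −Σ pᵢ log₂ pᵢ over the class proportions pᵢ, which the notes describe in words but do not write out. The tiny dataset and the column names (`age`, `city`, `bought`) are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    parent = entropy([row[target] for row in rows])
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return parent - weighted


# Hypothetical data: age separates buyers perfectly, city tells us nothing.
rows = [
    {"age": "young", "city": "A", "bought": "yes"},
    {"age": "young", "city": "B", "bought": "yes"},
    {"age": "old", "city": "A", "bought": "no"},
    {"age": "old", "city": "B", "bought": "no"},
]
gain_age = information_gain(rows, "age", "bought")
gain_city = information_gain(rows, "city", "bought")
```

Here `age` would be chosen for the split because it has the higher gain, matching the "Young" versus "Old" example above.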

ARTIFICIAL NEURAL NETWORKS


Artificial Neural Networks contain artificial neurons which are called units.

These units are arranged in series of layers that together constitute the whole Artificial
Neural Network in a system.

A layer can have anywhere from a dozen units to millions of units, depending on how
complex the network needs to be to learn the hidden patterns in the dataset.
Commonly, an Artificial Neural Network has an input layer, an output layer, and hidden
layers.

The input layer receives data from the outside world which the neural network needs to
analyze or learn about.

Then this data passes through one or multiple hidden layers that transform the input into
data that is valuable for the output layer.

Finally, the output layer provides an output in the form of a response of the Artificial Neural
Networks to input data provided.

In the majority of neural networks, units are interconnected from one layer to another.

Each of these connections has weights that determine the influence of one unit on another
unit.

As the data transfers from one unit to another, the neural network learns more and more
about the data which eventually results in an output from the output layer.

Neural Networks Architecture


Input Layer
Hidden Layers
Output Layer
The structures and operations of human neurons serve as the basis for artificial neural
networks.
They are also known as neural networks or neural nets.
The input layer of an artificial neural network is the first layer, and it receives input from
external sources and releases it to the hidden layer, which is the second layer.
In the hidden layer, each neuron receives input from the previous layer neurons, computes
the weighted sum, and sends it to the neurons in the next layer.
These connections are weighted: the effect of each input from the previous layer is scaled
up or down by the weight assigned to it, and these weights are adjusted during training to
improve model performance.
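
The weighted-sum computation of a layer can be sketched as follows. The bias term and the sigmoid activation are assumptions added for the illustration (the text above mentions only the weighted sum), and all numbers are made up.

```python
from math import exp


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one layer.

    weights[j] holds the connection weights from every input to neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(weighted_sum))
    return outputs


# Two inputs feeding a hidden layer with two neurons (hypothetical values).
hidden = layer_forward([1.0, 2.0], weights=[[0.5, -0.5], [1.0, 1.0]], biases=[0.0, 0.0])
```

Stacking calls to `layer_forward`, with each layer's outputs used as the next layer's inputs, gives the input-to-hidden-to-output flow described above.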

Artificial neurons vs Biological neurons


The concept of artificial neural networks comes from biological neurons found in animal
brains

As a result, they share many similarities in both structure and function.

Structure:

The structure of artificial neural networks is inspired by biological neurons.

A biological neuron has a cell body or soma to process the impulses, dendrites to receive
them, and an axon that transfers them to other neurons.

The input nodes of artificial neural networks receive input signals, the hidden
layer nodes compute these input signals, and the output layer nodes compute
the final output by processing the hidden layer's results using activation
functions.
Synapses:

Synapses are the links between biological neurons that enable the transmission of impulses
from one neuron to the dendrites of the next.

In artificial neural networks, the counterparts of synapses are the weights that join the
nodes of one layer to the nodes of the next layer.

The strength of the links is determined by the weight value.

Learning:

In biological neurons, learning happens in the cell body, or soma, whose nucleus helps to
process the impulses.

An action potential is produced and travels through the axons if the impulses are powerful
enough to reach the threshold.

This becomes possible by synaptic plasticity, which represents the ability of synapses to
become stronger or weaker over time in reaction to changes in their activity.

In artificial neural networks, backpropagation is a technique used for learning, which
adjusts the weights between nodes according to the error or differences between predicted
and actual outcomes.

Activation:

In biological neurons, activation is the firing rate of the neuron which happens when the
impulses are strong enough to reach the threshold.

In artificial neural networks, a mathematical function known as an activation function maps
the input to the output and performs the activation.

How do Artificial Neural Networks learn?


Artificial neural networks are trained using a training set.

For example, suppose you want to teach an ANN to recognize a cat.

Then it is shown thousands of different images of cats so that the network can learn to
identify a cat.

Once the neural network has been trained enough using images of cats, then you need to
check if it can identify cat images correctly.

This is done by making the ANN classify the images it is provided by deciding whether they
are cat images or not.

The output obtained by the ANN is corroborated by a human-provided description of
whether the image is a cat image or not.

If the ANN identifies incorrectly then back-propagation is used to adjust whatever it has
learned during training.

Backpropagation is done by fine-tuning the weights of the connections in ANN units based
on the error rate obtained.
This process continues until the artificial neural network can correctly recognize a cat in an
image with minimal possible error rates.
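
The idea of adjusting weights from the error can be illustrated with a single neuron. This is a deliberately reduced sketch: full backpropagation applies the same error-driven update through several layers using the chain rule. The toy task (logical OR), the learning rate, and the epoch count are all assumptions made for the example.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict(weights, bias, inputs):
    """Output of one sigmoid neuron."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(data, learning_rate=0.5, epochs=2000):
    """Repeatedly nudge the weights to reduce the prediction error."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in data:
            # Error: difference between predicted and actual outcome.
            error = predict(weights, bias, inputs) - target
            # For a sigmoid neuron with cross-entropy loss, the gradient
            # with respect to each weight is error * input.
            for i, x in enumerate(inputs):
                weights[i] -= learning_rate * error * x
            bias -= learning_rate * error
    return weights, bias


# Hypothetical labeled training set: logical OR.
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
weights, bias = train(data)
predictions = [1 if predict(weights, bias, x) >= 0.5 else 0 for x, _ in data]
```

After training, the neuron's thresholded outputs agree with the labels, which is the single-neuron analogue of the network eventually recognizing cats with a low error rate.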

What are the types of Artificial Neural Networks?


Feedforward Neural Network

Convolutional Neural Network

Modular Neural Network

Radial basis function Neural Network

Recurrent Neural Network

Feedforward Neural Network:

The feedforward neural network is one of the most basic artificial neural networks.

In this ANN, the data or the input provided travels in a single direction.

It enters into the ANN through the input layer and exits through the output layer while
hidden layers may or may not exist.

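So in a feedforward neural network the signal propagates forward only; it has no feedback
connections that carry outputs back into earlier layers. Backpropagation, where used, only
adjusts the weights during training.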

Convolutional Neural Network:

A convolutional neural network has some similarities to the feedforward neural network,
where the connections between units have weights that determine the influence of one
unit on another unit.

But a CNN has one or more than one convolutional layer that uses a convolution operation
on the input and then passes the result obtained in the form of output to the next layer.

CNNs have applications in speech and image processing and are particularly useful in
computer vision.
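
The convolution operation itself can be shown in one dimension. This sketch slides a small kernel over a signal without flipping it, which is the form commonly used in deep learning (strictly, a cross-correlation). The signal and kernel values are hypothetical; a real CNN learns its kernel values during training.

```python
def convolve1d(signal, kernel):
    """Slide the kernel over the signal and take a weighted sum at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A kernel that responds to changes between neighboring values (an edge detector).
signal = [0, 0, 1, 1, 0, 0]
edges = convolve1d(signal, kernel=[-1, 1])
```

The result is non-zero only where the signal steps up or down, which is how a convolutional layer picks out local features such as edges before passing them to the next layer.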

Modular Neural Network:

A Modular Neural Network contains a collection of different neural networks that work
independently towards obtaining the output with no interaction between them.

Each of the different neural networks performs a different sub-task by obtaining unique
inputs compared to other networks.

The advantage of this modular neural network is that it breaks down a large and complex
computational process into smaller components, thus decreasing its complexity while still
obtaining the required output.

Radial basis function Neural Network:

Radial basis functions are functions whose value depends on the distance of a point from a
center.

RBF networks have two layers.


In the first layer, the input is mapped into all the Radial basis functions in the hidden layer
and then the output layer computes the output in the next step.

Radial basis function nets are normally used to model the data that represents any
underlying trend or function.
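
A two-layer RBF network can be sketched in a few lines. The Gaussian form exp(−γ·‖x − c‖²) is an assumption here (it is a common choice of radial basis function, but the notes do not name one), and the centers, output weights, and γ are hypothetical values.

```python
from math import exp


def gaussian_rbf(x, center, gamma=1.0):
    """Response depends only on the distance between x and the center."""
    squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return exp(-gamma * squared_distance)


def rbf_network(x, centers, output_weights, gamma=1.0):
    """Hidden layer: one RBF per center. Output layer: weighted sum of responses."""
    hidden = [gaussian_rbf(x, c, gamma) for c in centers]
    return sum(w * h for w, h in zip(output_weights, hidden))


at_center = gaussian_rbf((1.0, 2.0), (1.0, 2.0))
far_away = gaussian_rbf((0.0, 0.0), (3.0, 4.0))
output = rbf_network((0.0,), centers=[(0.0,), (10.0,)], output_weights=[2.0, 5.0])
```

An input sitting on a center activates that unit fully, while distant centers contribute almost nothing, so the output near x = 0 is dominated by the first unit's weight.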

Recurrent Neural Network:

The Recurrent Neural Network saves the output of a layer and feeds this output back to the
input to better predict the outcome of the layer.

The first layer in the RNN is quite similar to the feed-forward neural network and the
recurrent neural network starts once the output of the first layer is computed.

After this layer, each unit will remember some information from the previous step so that it
can act as a memory cell in performing computations.
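
The "memory" of a recurrent unit can be illustrated with a single scalar state. This is a minimal sketch assuming the common update hₜ = tanh(wₓ·xₜ + w_h·hₜ₋₁ + b); the weight names and values are hypothetical.

```python
from math import tanh


def run_rnn(inputs, w_x, w_h, bias=0.0):
    """Process a sequence, feeding each step's output back into the next step."""
    state = 0.0  # nothing remembered before the first input
    states = []
    for x in inputs:
        state = tanh(w_x * x + w_h * state + bias)
        states.append(state)
    return states


sequence = [1.0, 0.0, 0.0]
with_memory = run_rnn(sequence, w_x=1.0, w_h=0.5)
without_memory = run_rnn(sequence, w_x=1.0, w_h=0.0)
```

With a non-zero recurrent weight, the state stays positive after the input drops to zero because the unit still carries information from the first step. With `w_h=0.0` the unit forgets immediately.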

APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS:


Social Media

Marketing and Sales

Healthcare

Personal Assistants

REGRESSION
Regression in machine learning refers to a supervised learning technique where the goal is
to predict a continuous numerical value based on one or more independent features.

It finds relationships between variables so that predictions can be made.

There are two types of variables in regression:

Dependent Variable (Target): The variable we are trying to predict, e.g. house price.

Independent Variables (Features): The input variables that influence the prediction, e.g.
locality, number of rooms.

A problem is a regression problem when the output variable is a real or continuous value
such as “salary” or “weight”.

Many different regression models can be used, but the simplest of them is linear
regression.

LINEAR REGRESSION
Linear regression is a statistical method used to model the relationship between a
dependent variable and one or more independent variables.

It provides valuable insights for prediction and data analysis.

Linear regression is also a type of supervised machine-learning algorithm that learns from
labelled datasets and maps the data points to the best-fitting linear function, which can
then be used for prediction on new datasets.
It computes the linear relationship between the dependent variable and one or more
independent features by fitting a linear equation to observed data.

It predicts continuous output variables based on the independent input variables.

For example, to predict a house price we consider various factors such as house age,
distance from the main road, location, area and number of rooms. Linear regression uses
all these features to predict the house price, as it assumes a linear relation between
these features and the price of the house.

Why Linear Regression is Important?


The interpretability of linear regression is one of its greatest strengths.

The model’s equation offers clear coefficients that illustrate the influence of each
independent variable on the dependent variable, enhancing our understanding of the
underlying relationships.

Its simplicity is a significant advantage; linear regression is transparent, easy to implement,
and serves as a foundational concept for more advanced algorithms.

What is the best Fit Line?


Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a
minimum.
There will be the least error in the best-fit line.
The best Fit Line equation provides a straight line that represents the relationship between
the dependent and independent variables.
The slope of the line indicates how much the dependent variable changes for a unit change
in the independent variable(s).
Here Y is called a dependent or target variable and X is called an independent variable also
known as the predictor of Y.
There are many types of functions or models that can be used for regression.
A linear function is the simplest type of function.
Here, X may be a single feature or multiple features representing the problem.
Linear regression performs the task of predicting a dependent variable value (y) based on a
given independent variable (x).
Hence, the name is Linear Regression.
In linear regression, some assumptions are made to ensure the reliability of the model’s results.

Hypothesis function in Linear Regression


Assumptions are:
Linearity: It assumes that there is a linear relationship between the independent
and dependent variables. This means that changes in the independent variable
lead to proportional changes in the dependent variable.

Independence: The observations should be independent of each other, that is,
the errors from one observation should not influence the others.
Suppose our independent feature X is work experience and the dependent variable Y is the
corresponding salary.
Let’s assume there is a linear relationship between X and Y; then the salary can be predicted
using:

$$\hat{Y} = \theta_1 + \theta_2 X$$

OR

$$\hat{y}_i = \theta_1 + \theta_2 x_i$$

Here,
$y_i \in Y \; (i = 1, 2, \dots, n)$ are the labels of the data (supervised learning)

$x_i \in X \; (i = 1, 2, \dots, n)$ are the independent input training data (univariate: one
input variable)

$\hat{y}_i \in \hat{Y} \; (i = 1, 2, \dots, n)$ are the predicted values.

The model gets the best regression fit line by finding the best $\theta_1$ and $\theta_2$ values.

$\theta_1$: intercept

$\theta_2$: coefficient of x

Once we find the best $\theta_1$ and $\theta_2$ values, we get the best-fit line. So when we are finally
using our model for prediction, it will predict the value of y for the input value of x.

How to update $\theta_1$ and $\theta_2$ values to get the best-fit line?


To achieve the best-fit regression line, the model aims to predict the target value Ŷ such
that the error difference between the predicted value Ŷ and the true value Y is minimum.

So, it is very important to update the $\theta_1$ and $\theta_2$ values, to reach the best value that
minimizes the error between the predicted y value (pred) and the true y value (y).

$$\text{minimize} \; \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
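
One way to carry out these updates is gradient descent on the mean squared error. The sketch below assumes that method (the notes do not name the update rule), and the experience and salary figures, learning rate, and epoch count are hypothetical. The names `theta1` and `theta2` stand for the intercept and the coefficient of x.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=5000):
    """Fit y ≈ theta1 + theta2 * x by gradient descent on mean squared error."""
    theta1, theta2 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training point with the current parameters.
        errors = [(theta1 + theta2 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) w.r.t. theta1 and theta2.
        grad_theta1 = (2.0 / n) * sum(errors)
        grad_theta2 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Move both parameters a small step against the gradient.
        theta1 -= learning_rate * grad_theta1
        theta2 -= learning_rate * grad_theta2
    return theta1, theta2


# Hypothetical data: years of experience vs salary (in thousands),
# generated from salary = 30 + 5 * experience.
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 40.0, 45.0, 50.0, 55.0]
theta1, theta2 = fit_line(experience, salary)
predicted_salary = theta1 + theta2 * 6.0  # prediction for a new input
```

Because the toy data lie exactly on a line, the fitted intercept and slope recover the generating values. With noisy data the same loop would settle on the line with the smallest mean squared error.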

Types of Linear Regression


When there is only one independent feature, it is known as Simple Linear Regression or
Univariate Linear Regression.

When there is more than one feature, it is known as Multiple Linear Regression or
Multivariate Regression.

Assumptions of Simple Linear Regression


Linear regression is a powerful tool for understanding and predicting the behavior of a
variable. However, the data need to meet a few conditions for its results to be accurate and
dependable.

1. Linearity: The independent and dependent variables have a linear relationship with one
another. This implies that changes in the dependent variable follow those in the
independent variable(s) in a linear fashion. This means that there should be a straight line
that can be drawn through the data points. If the relationship is not linear, then linear
regression will not be an accurate model.
2. Independence: The observations in the dataset are independent of each other. This means
that the value of the dependent variable for one observation does not depend on the value
of the dependent variable for another observation. If the observations are not independent,
then linear regression will not be an accurate model.

3. Homoscedasticity: Across all levels of the independent variable(s), the variance of the errors
is constant. This indicates that the amount of the independent variable(s) has no impact on
the variance of the errors. If the variance of the residuals is not constant, then linear
regression will not be an accurate model.

4. Normality: The residuals should be normally distributed. This means that the residuals
should follow a bell-shaped curve. If the residuals are not normally distributed, then linear
regression will not be an accurate model.

Use Case of Simple Linear Regression


In a case study evaluating student performance, analysts use simple linear regression to
examine the relationship between study hours and exam scores.

By collecting data on the number of hours students studied and their corresponding exam
results, the analysts developed a model that revealed a correlation: for each additional hour
spent studying, students' exam scores increased by an average of 5 points.

This case highlights the utility of simple linear regression in understanding and improving
academic performance.

Another case study focuses on marketing and sales, where businesses use simple linear
regression to forecast sales based on historical data, particularly examining how factors like
advertising expenditure influence revenue.

By collecting data on past advertising spending and corresponding sales figures, analysts
developed a regression model that describes the relationship between these variables.

For instance, the analysis might reveal that for every additional dollar spent on advertising,
sales increase by $10. This predictive capability enables companies to optimize their
advertising strategies and allocate resources effectively.

Multiple Linear Regression


Multiple linear regression involves more than one independent variable and one
dependent variable.
The equation for multiple linear regression is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

where:

Y is the dependent variable

$X_1, X_2, \dots, X_n$ are the independent variables

$\beta_0$ is the intercept

$\beta_1, \beta_2, \dots, \beta_n$ are the slopes


The goal of the algorithm is to find the best-fit equation that can predict the values
based on the independent variables.
In regression, a set of records with X and Y values is used to learn a function, so that this
learned function can be used to predict Y for a previously unseen X.
Because Y is continuous in regression, the function must predict a continuous Y given the
independent features X.
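
Once the intercept and slopes are known, prediction is a single weighted sum. In this sketch the coefficients are hypothetical placeholders rather than values fitted from data, and the feature names (area, rooms) are only for illustration.

```python
def predict(intercept, slopes, features):
    """Evaluate intercept + slope_1 * x_1 + ... + slope_n * x_n."""
    if len(slopes) != len(features):
        raise ValueError("need exactly one slope per feature")
    return intercept + sum(b * x for b, x in zip(slopes, features))


# Hypothetical model: price = 20 + 0.5 * area + 10 * rooms
intercept = 20.0
slopes = [0.5, 10.0]
price = predict(intercept, slopes, features=[100.0, 3.0])
```

Each slope states how much the prediction changes when its feature increases by one unit while the other features are held fixed.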

Assumptions of Multiple Linear Regression


For Multiple Linear Regression, all four of the assumptions from Simple Linear Regression
apply. In addition, there are a few more:

1. No multicollinearity: There is little or no correlation between the independent variables.
Multicollinearity occurs when two or more independent variables are highly correlated with
each other, which can make it difficult to determine the individual effect of each variable on
the dependent variable. If there is multicollinearity, then multiple linear regression will not
be an accurate model.

2. Additivity: The model assumes that the effect of changes in a predictor variable on the
response variable is consistent regardless of the values of the other variables. This
assumption implies that there is no interaction between variables in their effects on the
dependent variable.

3. Feature Selection: In multiple linear regression, it is essential to carefully select the
independent variables that will be included in the model. Including irrelevant or redundant
variables may lead to overfitting and complicate the interpretation of the model.

4. Overfitting: Overfitting occurs when the model fits the training data too closely, capturing
noise or random fluctuations that do not represent the true underlying relationship
between variables. This can lead to poor generalization performance on new, unseen data.

Multiple linear regression sometimes faces issues like multicollinearity.

Multicollinearity
Multicollinearity is a statistical phenomenon where two or more independent variables in a
multiple regression model are highly correlated, making it difficult to assess the individual
effects of each variable on the dependent variable.

Detecting Multicollinearity includes two techniques:

Correlation Matrix: Examining the correlation matrix among the independent variables is a
common way to detect multicollinearity. High correlations (close to 1 or -1) indicate
potential multicollinearity.

VIF (Variance Inflation Factor): VIF is a measure that quantifies how much the variance of an
estimated regression coefficient increases if your predictors are correlated. A high VIF
(typically above 10) suggests multicollinearity.
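
Both checks can be sketched for the simplest case of two predictors. In general, the VIF of predictor j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on all the other predictors. With only two predictors, R²ⱼ equals the squared Pearson correlation between them, which is the shortcut assumed below. The data are hypothetical.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_correlation(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: x2 is almost a multiple of x1, x3 is unrelated to x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 11.0]
x3 = [1.0, -1.0, 0.0, -1.0, 1.0]

r_12 = pearson_correlation(x1, x2)
vif_12 = vif_two_predictors(x1, x2)  # far above 10: multicollinearity
vif_13 = vif_two_predictors(x1, x3)  # 1: no inflation
```

Note that the shortcut fails with a division by zero if the two predictors are perfectly correlated, which is the limiting case of multicollinearity.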

Use Case of Multiple Linear Regression


Multiple linear regression allows us to analyze the relationship between multiple independent
variables and a single dependent variable. Here are some use cases:
Real Estate Pricing: In real estate, MLR is used to predict property prices based
on multiple factors such as location, size, number of bedrooms, etc. This helps
buyers and sellers understand market trends and set competitive prices.

Financial Forecasting: Financial analysts use MLR to predict stock prices or
economic indicators based on multiple influencing factors such as interest rates,
inflation rates and market trends. This enables better investment strategies and
risk management.

Agricultural Yield Prediction: Farmers can use MLR to estimate crop yields based
on several variables like rainfall, temperature, soil quality and fertilizer usage.
This information helps in planning agricultural practices for optimal productivity.

E-commerce Sales Analysis: An e-commerce company can utilize MLR to assess
how various factors such as product price, marketing promotions and seasonal
trends impact sales.

Evaluation Metrics for Linear Regression


A variety of evaluation measures can be used to determine the strength of any linear
regression model.

These assessment metrics often give an indication of how well the model is producing the
observed outputs.

The most common measurements are:

Mean Absolute Error (MAE)

Root Mean Squared Error (RMSE)

Mean Absolute Error (MAE)

Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression
model.

MAE measures the average absolute difference between the predicted values and actual
values.

Mathematically, MAE is expressed as:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|$$

Here,

n is the number of observations

$Y_i$ represents the actual values.

$\hat{Y}_i$ represents the predicted values

Lower MAE value indicates better model performance.

It is less sensitive to outliers than squared-error metrics, because it uses absolute rather than squared differences.
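
As a quick illustration, MAE can be computed directly from its definition. The values are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


mae = mean_absolute_error(actual=[3.0, 5.0, 2.0], predicted=[2.0, 5.0, 4.0])
```

The three absolute errors are 1, 0 and 2, so the MAE is their average, 1.0.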

Root Mean Squared Error (RMSE)

The square root of the residuals' variance is the Root Mean Squared Error.
It describes how well the observed data points match the expected values, or the model's
absolute fit to the data.

In mathematical notation, it can be expressed as:

$$RMSE = \sqrt{\frac{RSS}{n}} = \sqrt{\frac{\sum_{i=1}^{n} (actual_i - predicted_i)^2}{n}}$$

To obtain an unbiased estimate, the sum of the squared residuals must be divided by the
number of degrees of freedom rather than by the total number of data points in the model.

This figure is then referred to as the Residual Standard Error (RSE).

In mathematical notation, it can be expressed as:

$$RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (actual_i - predicted_i)^2}{n-2}}$$

RMSE is not as good a metric as R-squared for comparing models across datasets, because
its value depends on the units of the variables (it is not a normalized measure) and
therefore changes when those units change.
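
The two quantities differ only in the divisor, which a short sketch makes explicit. The default of two estimated parameters (intercept and slope) matches the n − 2 used for simple linear regression; the data values are hypothetical.

```python
from math import sqrt


def residual_sum_of_squares(actual, predicted):
    """RSS: the sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root Mean Squared Error: divide RSS by the number of data points."""
    return sqrt(residual_sum_of_squares(actual, predicted) / len(actual))


def residual_standard_error(actual, predicted, n_params=2):
    """RSE: divide RSS by the degrees of freedom, n minus estimated parameters."""
    degrees_of_freedom = len(actual) - n_params
    return sqrt(residual_sum_of_squares(actual, predicted) / degrees_of_freedom)


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
model_rmse = rmse(actual, predicted)                   # sqrt(4 / 4)
model_rse = residual_standard_error(actual, predicted)  # sqrt(4 / 2)
```

Multiplying every value by 1000 (for example, switching units) would multiply both results by 1000 as well, which is the unit dependence described above.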

CLUSTERING
The task of grouping data points based on their similarity with each other is called
Clustering or Cluster Analysis.

This method is defined under the branch of Unsupervised Learning, which aims at gaining
insights from unlabelled data points, that is, unlike supervised learning we don't have a
target variable.

Clustering aims at forming groups of homogeneous data points from a heterogeneous
dataset.

It evaluates similarity using a metric such as Euclidean distance, cosine similarity, or
Manhattan distance, and then groups the most similar points together.
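The three metrics named above can be sketched as follows. The function names and the two sample points are assumptions for this example. Note that Euclidean and Manhattan are distances (smaller means more similar), while cosine similarity is a similarity (larger means more similar).

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative points only.
p = (1.0, 2.0)
q = (4.0, 6.0)

print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
print(round(cosine_similarity(p, q), 4))
```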

Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar
data points.

Hard Clustering:

In this type of clustering, each data point either belongs to a cluster completely or does not belong to it at all.

For example, suppose there are 4 data points and we have to group them into 2 clusters. Each
data point will then belong to either cluster 1 or cluster 2.
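The 4-point, 2-cluster example can be sketched as below. The coordinates and the two cluster centres are hypothetical values chosen for illustration, and assigning each point to its nearest centre is one simple way to produce a hard assignment, not the only one.

```python
import math

# Four hypothetical data points and two hypothetical cluster centres.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = {1: (1.0, 1.5), 2: (8.5, 9.0)}


def assign_hard(point, centers):
    """Return the label of the single nearest centre (hard assignment)."""
    return min(centers, key=lambda label: math.dist(point, centers[label]))


labels = [assign_hard(p, centers) for p in points]
print(labels)  # [1, 1, 2, 2]
```

Every point receives exactly one label, which is what distinguishes hard clustering from approaches that give a point partial membership in several clusters.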

What is clustering for?

Groups people of similar sizes together to make "small", "medium" and "large" T-Shirts

Tailor-made for each person: too expensive

One-size-fits-all: does not fit all

In marketing, segment customers according to their similarities

To do targeted marketing

Given a collection of text documents, we want to organize them according to their content
similarities

To produce a topic hierarchy

Aspects of Clustering
A (similarity, or distance or dissimilarity) function

Clustering quality

Inter-cluster distance → maximized

Intra-cluster distance → minimized

The quality of a clustering result depends on the algorithm, the distance function, and the
application
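One way to make the two quality criteria concrete is to compare the average distance between points inside a cluster with the distance between cluster centroids. The sketch below uses these two particular measures as an assumption; other definitions of intra- and inter-cluster distance exist, and the sample clusters are illustrative.

```python
import math
from itertools import combinations


def mean_intra_cluster_distance(cluster):
    """Average Euclidean distance over all pairs of points in one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


# Two hypothetical, well-separated clusters.
cluster_a = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
cluster_b = [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)]

intra_a = mean_intra_cluster_distance(cluster_a)
intra_b = mean_intra_cluster_distance(cluster_b)
inter_ab = math.dist(centroid(cluster_a), centroid(cluster_b))

print(round(intra_a, 3), round(intra_b, 3), round(inter_ab, 3))
```

A good clustering under these measures has small intra-cluster values and a large inter-cluster value, as is the case for the two clusters here.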

What is Cluster Analysis?


Finding groups of objects in data such that the objects in a group will be similar (or related)
to one another and different from (or unrelated to) the objects in other groups

Intra-cluster distances are minimized

Inter-cluster distances are maximized

Types of Clustering
A clustering is a set of clusters

Partitional Clustering

A division of data objects into non-overlapping subsets (clusters) such that each data object is
in exactly one subset

Hierarchical Clustering

A set of nested clusters organized as a hierarchical tree

Uses of Clustering
Before turning to the types of clustering algorithms, we go through their use cases.
Clustering algorithms are mainly used for:

Market Segmentation - Businesses use clustering to group their customers and
use targeted advertisements to attract more audience.

Market Basket Analysis - Shop owners analyze their sales and figure out which
items are frequently bought together by customers.

Social Network Analysis - Social media sites use your data to understand your
browsing behaviour and provide you with targeted friend recommendations or
content recommendations.

Medical Imaging - Doctors use clustering to find diseased areas in
diagnostic images like X-rays.

Anomaly Detection - Clustering can be used to find outliers in a real-time data stream
or to flag potentially fraudulent transactions.

Types of Clustering Algorithms


Centroid-based Clustering (Partitioning methods)

Density-based Clustering (Model-based methods)

Connectivity-based Clustering (Hierarchical clustering)

Distribution-based Clustering

Centroid-based Clustering (Partitioning methods)

Partitioning methods are the simplest clustering algorithms.

They group data points on the basis of their closeness.

Generally, the similarity measures chosen for these algorithms are Euclidean distance,
Manhattan distance or Minkowski distance.

The datasets are separated into a predetermined number of clusters, and each cluster is
referenced by a vector of values.

Each input data point is compared with these vectors and joins the cluster whose vector it
differs from least.

The primary drawback of these algorithms is the requirement that we establish the
number of clusters, "k," either intuitively or scientifically (using the Elbow Method) before
the algorithm starts assigning the data points.

Despite this, it is still the most popular type of clustering.

K-means and K-medoids clustering are examples of this type.
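A minimal K-means sketch, under stated assumptions, shows how the pieces above fit together: each cluster is represented by a centroid vector, each point joins its nearest centroid, and k must be fixed in advance. Using the first k points as initial centroids is a simplification chosen here to keep the example deterministic; practical implementations usually initialise differently. The function names and sample points are illustrative.

```python
import math


def k_means(points, k, max_iters=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances, the quantity plotted in the Elbow Method."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Illustrative points forming two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]

centroids, labels = k_means(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]

# Elbow Method sketch: look for the k after which inertia stops dropping sharply.
wcss = {k: inertia(points, *k_means(points, k)) for k in (1, 2, 3)}
print({k: round(v, 2) for k, v in wcss.items()})
```

For this data the inertia falls steeply from k = 1 to k = 2 and only slightly from k = 2 to k = 3, so the "elbow" suggests k = 2.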
