QUESTIONS
Naive Bayes classifiers are supervised machine learning algorithms based on Bayes' Theorem, primarily used for classification tasks such as spam filtration and sentiment analysis. They assume feature independence and come in three types: Gaussian, Multinomial, and Bernoulli, each suited for different data types. Support Vector Machines (SVM) and Decision Tree Induction are also discussed, with SVM focusing on finding optimal hyperplanes for classification and Decision Trees providing an intuitive model for predictions.


1. NAÏVE BAYES

Naive Bayes classifiers are supervised machine learning algorithms used for classification tasks. They apply Bayes' Theorem to compute class probabilities. This section gives an overview of Naive Bayes, its main variants, and its use in machine learning.
Key Features of Naive Bayes Classifiers
The main idea behind the Naive Bayes classifier is to use Bayes' Theorem to classify data based on the probabilities of different classes given the features of the data. It is used mostly in high-dimensional text classification.
• The Naive Bayes classifier is a simple probabilistic classifier with very few parameters, so its models can be built and can make predictions faster than many other classification algorithms.
• It is a probabilistic classifier that assumes each feature is independent of the other features given the class. In other words, each feature contributes to the prediction with no relation to the others.
• The Naive Bayes algorithm is used in spam filtration, sentiment analysis, article classification, and many other tasks.
Why is it Called Naive Bayes?
It is named "Naive" because it assumes that the presence of one feature does not affect the other features.
The "Bayes" part of the name refers to its basis in Bayes' Theorem.
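In symbols (a standard formulation added here for reference, not spelled out in the original notes), for a class y and features x_1, …, x_n the "naive" independence assumption turns Bayes' Theorem into:

P(y | x_1, …, x_n) = P(y) · Π_{i=1}^{n} P(x_i | y) / P(x_1, …, x_n)

and the predicted class is the y that maximizes P(y) · Π_{i=1}^{n} P(x_i | y), since the denominator is the same for every class.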
Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing golf.
Types of Naive Bayes Model
There are three types of Naive Bayes model:
Gaussian Naive Bayes
In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to follow a Gaussian distribution. A Gaussian distribution is also called a normal distribution; when plotted, it gives a bell-shaped curve that is symmetric about the mean of the feature values.
Multinomial Naive Bayes
Multinomial Naive Bayes is used when features represent the frequency of
terms (such as word counts) in a document. It is commonly applied in text
classification, where term frequencies are important.
Bernoulli Naive Bayes
Bernoulli Naive Bayes deals with binary features, where each feature
indicates whether a word appears or not in a document. It is suited for
scenarios where the presence or absence of terms is more relevant than their
frequency. Both models are widely used in document classification tasks.
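As a hedged illustration (not from the original notes), the three variants map onto separate scikit-learn classes; the toy text data below is an assumption:

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Multinomial / Bernoulli NB: word counts vs. word presence in short documents
docs = ["free prize money now", "meeting agenda attached",
        "win money free", "project status update"]
labels = [1, 0, 1, 0]                                # 1 = spam, 0 = not spam
X_counts = CountVectorizer().fit_transform(docs)

print(MultinomialNB().fit(X_counts, labels).predict(X_counts))   # uses term frequencies
print(BernoulliNB().fit(X_counts, labels).predict(X_counts))     # binarizes counts internally

# Gaussian NB: continuous features assumed normally distributed per class
X_iris, y_iris = load_iris(return_X_y=True)
print(GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris))
```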
Advantages of Naive Bayes Classifier
• Easy to implement and computationally efficient.
• Effective in cases with a large number of features.
• Performs well even with limited training data.
• It performs well in the presence of categorical features.
• For numerical features, the data is assumed to follow a normal distribution (as in Gaussian Naive Bayes).
Disadvantages of Naive Bayes Classifier
• Assumes that features are independent, which may not always hold
in real-world data.
• Can be influenced by irrelevant attributes.
• May assign zero probability to unseen events (the zero-frequency problem), leading to poor generalization; this is usually mitigated with Laplace (add-one) smoothing.
Applications of Naive Bayes Classifier
• Spam Email Filtering: Classifies emails as spam or non-spam based
on features.
• Text Classification: Used in sentiment analysis, document
categorization, and topic classification.
• Medical Diagnosis: Helps in predicting the likelihood of a disease
based on symptoms.
• Credit Scoring: Evaluates creditworthiness of individuals for loan
approval.
• Weather Prediction: Classifies weather conditions based on various
factors.

2. SVM
Support Vector Machine (SVM) is a supervised machine learning algorithm
used for classification and regression tasks. While it can handle regression
problems, SVM is particularly well-suited for classification tasks.
SVM aims to find the optimal hyperplane in an N-dimensional space to
separate data points into different classes. The algorithm maximizes the
margin between the closest points of different classes.
Support Vector Machine (SVM) Terminology
• Hyperplane: A decision boundary separating different classes in
feature space, represented by the equation wx + b = 0 in linear
classification.
• Support Vectors: The closest data points to the hyperplane, crucial
for determining the hyperplane and margin in SVM.
• Margin: The distance between the hyperplane and the support
vectors. SVM aims to maximize this margin for better classification
performance.
• Kernel: A function that maps data to a higher-dimensional space,
enabling SVM to handle non-linearly separable data.
• Hard Margin: A maximum-margin hyperplane that perfectly
separates the data without misclassifications.
• Soft Margin: Allows some misclassifications by introducing slack
variables, balancing margin maximization and misclassification
penalties when data is not perfectly separable.
• C: A regularization term balancing margin maximization and
misclassification penalties. A higher C value enforces a stricter
penalty for misclassifications.
• Hinge Loss: A loss function penalizing misclassified points or
margin violations, combined with regularization in SVM.
• Dual Problem: Involves solving for Lagrange multipliers associated
with support vectors, facilitating the kernel trick and efficient
computation.
How does Support Vector Machine Algorithm Work?
The key idea behind the SVM algorithm is to find the hyperplane that best
separates two classes by maximizing the margin between them. This margin is
the distance from the hyperplane to the nearest data points (support vectors)
on each side.

Multiple hyperplanes separate the data from two classes

The best hyperplane, also known as the maximum-margin (or "hard margin") hyperplane, is the one that maximizes the distance between the hyperplane and the nearest data points from both classes. This ensures a clear separation between the classes; in the figure above, L2 is therefore chosen as the maximum-margin hyperplane.
Let’s consider a scenario like shown below:

Selecting hyperplane for data with outlier

Here, one blue ball lies within the boundary of the red balls.
How does SVM classify the data?
The blue ball among the red ones is an outlier of the blue class. The SVM algorithm can ignore such outliers and still find the hyperplane that maximizes the margin, which makes SVM robust to outliers.

Hyperplane which is the most optimized one

A soft margin allows for some misclassifications or violations of the margin to improve generalization. The SVM optimizes the following objective to balance margin maximization and penalty minimization:

Objective Function = (1 / margin) + λ Σ penalty
The penalty used for violations is often hinge loss, which has the following
behavior:
• If a data point is correctly classified and lies outside the margin, there is no penalty (loss = 0).
• If a point is misclassified or falls inside the margin, the hinge loss increases proportionally to the size of the violation.
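A minimal numerical sketch of this behaviour (illustrative assumption: labels y ∈ {−1, +1} and f(x) is the raw decision value w·x + b):

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss for labels y in {-1, +1} and decision values fx = w.x + b."""
    return np.maximum(0.0, 1.0 - y * fx)

y = np.array([+1, +1, -1, -1])
fx = np.array([2.0, 0.4, -3.0, 0.1])   # last point is misclassified
print(hinge_loss(y, fx))               # [0.  0.6 0.  1.1]
```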
Till now, we were talking about linearly separable data (the groups of blue and red balls are separable by a straight line).
What to do if data are not linearly separable?
When data is not linearly separable (i.e., it can’t be divided by a straight line),
SVM uses a technique called kernels to map the data into a higher-
dimensional space where it becomes separable. This transformation helps
SVM find a decision boundary even for non-linear data.

Original 1D dataset for classification

A kernel is a function that maps data points into a higher-dimensional space without explicitly computing the coordinates in that space. This allows SVM to work efficiently with non-linear data by implicitly performing the mapping.
For example, consider data points that are not linearly separable. By applying
a kernel function, SVM transforms the data points into a higher-dimensional
space where they become linearly separable.
• Linear Kernel: For linear separability.
• Polynomial Kernel: Maps data into a polynomial space.
• Radial Basis Function (RBF) Kernel: Transforms data into a space
based on distances between data points.

Mapping 1D data to 2D to become able to separate the two classes

In this case, the new variable y is created as a function of distance from the
origin.
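A brief scikit-learn sketch of the kernel trick in practice (the moon-shaped toy data and the parameter values are assumptions for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel maps the data implicitly into a higher-dimensional space
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```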
Mathematical Computation: SVM
Consider a binary classification problem with two classes, labeled as +1 and -
1. We have a training dataset consisting of input feature vectors X and their
corresponding class labels Y.
The equation for the linear hyperplane can be written as:
w^T x + b = 0
Where:
• w is the normal vector to the hyperplane (the direction perpendicular to it).
• b is the offset or bias term, representing the distance of the hyperplane from the origin along the normal vector w.
Distance from a Data Point to the Hyperplane
The distance between a data point x_i and the decision boundary can be
calculated as:
d_i = (w^T x_i + b) / ||w||
where ||w|| represents the Euclidean norm of the normal vector w.
Linear SVM Classifier
The prediction of the linear SVM classifier is:
ŷ = 1 if w^T x + b ≥ 0, and ŷ = 0 if w^T x + b < 0
Where ŷ is the predicted label of a data point.
Optimization Problem for SVM
For a linearly separable dataset, the goal is to find the hyperplane that
maximizes the margin between the two classes while ensuring that all data
points are correctly classified. This leads to the following optimization
problem:
minimize_{w,b} (1/2) ||w||^2
Subject to the constraint:
y_i (w^T x_i + b) ≥ 1   for i = 1, 2, 3, …, m
Where:
• y_i is the class label (+1 or -1) for each training instance.
• x_i is the feature vector for the i-th training instance.
• m is the total number of training instances.
The condition y_i (w^T x_i + b) ≥ 1 ensures that each data point is correctly classified and lies outside the margin.
Soft Margin Linear SVM Classifier
In the presence of outliers or non-separable data, the SVM allows some misclassification by introducing slack variables ζ_i. The optimization problem is modified as:
minimize_{w,b} (1/2) ||w||^2 + C Σ_{i=1}^{m} ζ_i
Subject to the constraints:
y_i (w^T x_i + b) ≥ 1 − ζ_i   and   ζ_i ≥ 0   for i = 1, 2, …, m
Where:
• C is a regularization parameter that controls the trade-off between margin maximization and the penalty for misclassifications.
• ζ_i are slack variables that represent the degree to which each data point violates the margin.
Dual Problem for SVM
The dual problem involves maximizing the Lagrange multipliers associated
with the support vectors. This transformation allows solving the SVM
optimization using kernel functions for non-linear classification.
The dual objective function is given by:
maximize_α   Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j t_i t_j K(x_i, x_j)
Where:
• α_i are the Lagrange multipliers associated with the i-th training sample.
• t_i is the class label of the i-th training sample (+1 or -1).
• K(x_i, x_j) is the kernel function that computes the similarity between data points x_i and x_j. The kernel allows SVM to handle non-linear classification problems by mapping the data into a higher-dimensional space.
The dual formulation optimizes the Lagrange multipliers α_i, and the support vectors are those training samples with α_i > 0.
SVM Decision Boundary
Once the dual problem is solved, the decision function is given by:
f(x) = Σ_{i=1}^{m} α_i t_i K(x_i, x) + b
Where x is the test data point and b is the bias term; for a linear kernel this reduces to w^T x + b with weight vector w = Σ_i α_i t_i x_i.
Finally, the bias term b is determined from the support vectors, which satisfy:
t_i (w^T x_i + b) = 1  ⇒  b = t_i − w^T x_i
Where x_i is any support vector.
This completes the mathematical framework of the Support Vector Machine
algorithm, which allows for both linear and non-linear classification using the
dual problem and kernel trick.
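As an illustrative check (not part of the original notes), a fitted scikit-learn SVC exposes these dual quantities, so f(x) = Σ α_i t_i K(x_i, x) + b can be reproduced by hand:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

gamma = 0.5                                   # fixed so it can be reused in rbf_kernel
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# dual_coef_ holds alpha_i * t_i for each support vector
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)   # K(x_i, x)
f_manual = clf.dual_coef_ @ K + clf.intercept_         # sum_i alpha_i t_i K(x_i, x) + b

print(np.allclose(f_manual.ravel(), clf.decision_function(X)))  # True
```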
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector Machines
(SVM) can be divided into two main parts:
• Linear SVM: Linear SVMs use a linear decision boundary to
separate the data points of different classes. When the data can be
precisely linearly separated, linear SVMs are very suitable. This
means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective
classes. A hyperplane that maximizes the margin between the classes
is the decision boundary.
• Non-Linear SVM: Non-Linear SVM can be used to classify data
when it cannot be separated into two classes by a straight line (in the
case of 2D). By using kernel functions, nonlinear SVMs can handle
nonlinearly separable data. The original input data is transformed by
these kernel functions into a higher-dimensional feature space, where
the data points can be linearly separated. A linear separator found in this transformed space then corresponds to a nonlinear decision boundary in the original input space.

3. DECISION TREE INDUCTION

Decision Tree Induction in Data Mining

• Decision tree induction is a common technique in data mining that is used to generate a predictive model from a dataset. This technique involves constructing a tree-like structure, where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a prediction. The goal of decision tree induction is to build a model that can accurately predict the outcome of a given event, based on the values of the attributes in the dataset.
• To build a decision tree, the algorithm first selects the attribute that best splits the data into distinct classes. This is typically done using a measure of impurity, such as entropy or the Gini index (see the formulas after this list), which measures the degree of disorder in the data. The algorithm then repeats this process for each branch of the tree, splitting the data into smaller and smaller subsets until all of the data is classified.
• Decision tree induction is a popular technique in data mining
because it is easy to understand and interpret, and it can handle
both numerical and categorical data. Additionally, decision trees
can handle large amounts of data, and they can be updated with
new data as it becomes available. However, decision trees can be
prone to overfitting, where the model becomes too complex and
does not generalize well to new data. As a result, data scientists
often use techniques such as pruning to simplify the tree and
improve its performance.
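For reference, the two impurity measures mentioned above have standard definitions (not spelled out in the original notes). For a node whose samples belong to classes with proportions p_1, …, p_k:

Entropy = − Σ_{i=1}^{k} p_i log2(p_i)
Gini index = 1 − Σ_{i=1}^{k} p_i^2

Lower values indicate a purer node, and the attribute chosen for a split is typically the one that reduces impurity the most.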

Advantages of Decision Tree Induction

1. Easy to understand and interpret: Decision trees are a visual and intuitive model that can be easily understood by both experts and non-experts.
2. Handle both numerical and categorical data: Decision trees can
handle a mix of numerical and categorical data, which makes
them suitable for many different types of datasets.
3. Can handle large amounts of data: Decision trees can handle
large amounts of data and can be updated with new data as it
becomes available.
4. Can be used for both classification and regression tasks: Decision
trees can be used for both classification, where the goal is to
predict a discrete outcome, and regression, where the goal is to
predict a continuous outcome.

Disadvantages of Decision Tree Induction

1. Prone to overfitting: Decision trees can become too complex and may not generalize well to new data. This can lead to poor performance on unseen data.
2. Sensitive to small changes in the data: Decision trees can be
sensitive to small changes in the data, and a small change in the
data can result in a significantly different tree.
3. Biased towards attributes with many levels: Decision trees can
be biased towards attributes with many levels, and may not
perform well on attributes with a small number of levels.
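A minimal scikit-learn sketch of decision tree induction (the dataset and parameter choices are assumptions; limiting max_depth is one simple way to counter the overfitting noted above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" (default) or "entropy"; max_depth acts as simple pruning
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```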

4. KNN
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what's nearby. Imagine a streaming service wants to predict whether a new user is likely to cancel their subscription (churn) based on their age. It checks the ages of its existing users and whether they churned or stayed. If most of the "K" users closest in age to the new user cancelled their subscription, KNN will predict that the new user might churn too. The key idea is that users with similar ages tend to have similar behaviours, and KNN uses this closeness to make decisions.
Getting Started with K-Nearest Neighbors
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs the computation only at classification time.
As an example, consider a set of data points described by two features, as visualised below.

KNN Algorithm working visualization

The new point is classified as Category 2 because most of its closest neighbors are blue squares. KNN assigns the category based on the majority of nearby points.
The image shows how KNN predicts the category of a new data
point based on its closest neighbours.
• The red diamonds represent Category 1 and the blue
squares represent Category 2.
• The new data point checks its closest neighbours (circled
points).
• Since the majority of its closest neighbours are blue squares
(Category 2) KNN predicts the new data point belongs to
Category 2.
KNN works by using proximity and majority voting to make predictions.
What is ‘K’ in K Nearest Neighbour ?
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that
tells the algorithm how many nearby points (neighbours) to look at when
it makes a decision.
Example:
Imagine you’re deciding which fruit it is based on its shape and size. You
compare it to fruits you already know.
• If k = 3, the algorithm looks at the 3 closest fruits to the new
one.
• If 2 of those 3 fruits are apples and 1 is a banana, the algorithm
says the new fruit is an apple because most of its neighbours
are apples.
How to choose the value of k for KNN Algorithm?
The value of k is critical in KNN, as it determines the number of neighbors considered when making predictions. Selecting the optimal value of k depends on the characteristics of the input data. If the dataset has significant outliers or noise, a higher k can help smooth out the predictions and reduce the influence of noisy data. However, choosing a very high value can lead to underfitting, where the model becomes too simplistic.
Statistical Methods for Selecting k:
• Cross-Validation: A robust method for selecting the best k is to perform cross-validation. This involves splitting the data into folds, training the model on some folds, testing it on the remaining one, and repeating this for each fold; see the sketch after this list. The value of k that results in the highest average validation accuracy is usually the best choice.
• Elbow Method: In the elbow method we plot the model's error rate or accuracy for different values of k. As we increase k, the error usually decreases initially; however, after a certain point the error rate starts to decrease more slowly. The point where the curve forms an "elbow" is considered the best k.
• Odd Values for k: It is also recommended to choose an odd value for k, especially in binary classification tasks, to avoid ties when deciding the majority class.
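The cross-validation approach sketched above might look like this in scikit-learn (the dataset and candidate values of k are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several odd values of k and keep the one with the best CV accuracy
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", round(scores[best_k], 3))
```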
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbours; these neighbours are then used for the classification or regression task. To identify the nearest neighbours we use the distance metrics below:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two
points in a plane or space. You can think of it like the shortest path you
would walk if you were to go directly from one point to another.
distance(x, X_i) = sqrt( Σ_{j=1}^{d} (x_j − X_{ij})^2 )
2. Manhattan Distance
This is the total distance you would travel if you could only move along
horizontal and vertical lines (like a grid or city streets). It’s also called
“taxicab distance” because a taxi can only drive along the grid-like
streets of a city.
d(x, y) = Σ_{i=1}^{n} |x_i − y_i|
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes
both Euclidean and Manhattan distances as special cases.
d(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^(1/p)
From the formula above we can see that when p = 2 it is the same as the formula for the Euclidean distance, and when p = 1 we obtain the formula for the Manhattan distance.
So, you can think of Minkowski distance as a flexible distance formula that can behave like either Manhattan or Euclidean distance depending on the value of p.
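These three metrics can be written in a few lines of NumPy (purely illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))            # straight-line distance
manhattan = np.sum(np.abs(x - y))                    # grid / taxicab distance
minkowski = lambda p: np.sum(np.abs(x - y) ** p) ** (1 / p)

print(euclidean, manhattan, minkowski(2), minkowski(1))
# euclidean == minkowski(2) and manhattan == minkowski(1)
```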
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.

A step-by-step explanation of how KNN works is given below:
Step 1: Selecting the optimal value of K
• K represents the number of nearest neighbors that need to be considered while making the prediction.
Step 2: Calculating distance
• To measure the similarity between the target point and the training data points, Euclidean distance is commonly used. The distance is calculated between the target point and every data point in the dataset.
Step 3: Finding Nearest Neighbors
• The k data points with the smallest distances to the target point
are nearest neighbors.
Step 4: Voting for Classification or Taking Average for
Regression
• When you want to classify a data point into a category (like
spam or not spam), the K-NN algorithm looks at the K closest
points in the dataset. These closest points are called neighbors.
The algorithm then looks at which category the neighbors
belong to and picks the one that appears the most. This is
called majority voting.
• In regression, the algorithm still looks for the K closest points.
But instead of voting for a class in classification, it takes
the average of the values of those K neighbors. This average becomes the predicted value for the new point.
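Putting the steps together with scikit-learn (a sketch; the dataset and k = 5 are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Euclidean distance (metric="minkowski", p=2) and majority voting over 5 neighbours
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```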

Working of KNN Algorithm

The figure shows how a test point is classified based on its nearest neighbors: as the test point moves, the algorithm identifies the closest k data points (5 in this case) and assigns the test point the majority class label among them (the grey class here).

5. CLUSTERING TECHNIQUES IN DATA MINING

1. Introduction to Clustering
Clustering is an unsupervised machine learning technique used to group similar data points
into clusters. The goal is to ensure that:

• Data points within a cluster are similar, and


• Data points in different clusters are dissimilar.

Clustering is used in pattern recognition, image processing, market segmentation, recommendation systems, and bioinformatics.

2. Types of Clustering Techniques

Clustering methods can be categorized into several types:

I. Partitioning Methods

These methods divide the data set into k non-overlapping clusters, where k is user-defined.

1. K-Means Clustering

• Divides data into k clusters.


• Each cluster has a centroid (mean of points in that cluster).
• Points are assigned to the nearest centroid.
• Iteratively updates centroids until convergence.

Algorithm Steps:

1. Choose the number of clusters (k).


2. Initialize centroids randomly.
3. Assign each point to the nearest centroid.
4. Recalculate centroids.
5. Repeat steps 3–4 until convergence.
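These steps map directly onto scikit-learn's KMeans (a sketch on assumed toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 roughly spherical clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)     # assign each point to the nearest centroid
print(kmeans.cluster_centers_)     # final centroids after convergence
```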

Advantages:

• Simple and fast


• Works well for spherical clusters

Disadvantages:

• Need to specify k
• Sensitive to outliers
• Assumes equal-sized clusters

2. K-Medoids (PAM)

• Similar to k-means but uses medoids instead of means.


• More robust to noise and outliers.

II. Hierarchical Clustering

Builds a tree-like structure (dendrogram) showing how data points are merged/split at each
level.

Types:

1. Agglomerative (bottom-up):
o Start with each data point as a single cluster.
o Iteratively merge the closest clusters.
2. Divisive (top-down):
o Start with all data points in one cluster.
o Recursively split into smaller clusters.

Linkage Criteria:

• Single linkage: min distance between points in two clusters.


• Complete linkage: max distance between points.
• Average linkage: average distance between all points.
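A short agglomerative clustering sketch in scikit-learn (the linkage choice and toy data are assumptions):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up merging with average linkage; n_clusters is where the dendrogram is cut
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agg.fit_predict(X)
print(labels[:10])
```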

Advantages:

• No need to specify k
• Dendrogram gives detailed cluster structure

Disadvantages:

• Computationally expensive
• Not suitable for very large datasets

III. Density-Based Methods

These methods group data based on regions of high density separated by low-density areas.

1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

• Groups points that are closely packed together.


• Points in sparse regions are considered noise or outliers.

Parameters:

• ε (epsilon): radius of the neighborhood


• MinPts: minimum number of points in a neighborhood to form a dense region
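A DBSCAN sketch (the eps and min_samples values are illustrative assumptions; points labelled -1 are treated as noise):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = neighbourhood radius, min_samples = MinPts
labels = db.fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", list(labels).count(-1))
```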
Advantages:

• Can find arbitrarily shaped clusters


• Handles noise

Disadvantages:

• Difficult to choose ε and MinPts


• Struggles with clusters of varying density

IV. Grid-Based Clustering

These methods divide the data space into a finite number of cells to form a grid and perform
clustering on these grids.

1. STING (Statistical Information Grid)

• Divides space into rectangular cells at different levels of resolution.


• Uses statistical information stored in each cell to form clusters.

Advantages:

• Fast processing
• Suitable for large databases

Disadvantages:

• Not very accurate for discovering clusters of arbitrary shape

V. Model-Based Clustering

Assumes that the data is generated by a mixture of underlying probability distributions.

1. Gaussian Mixture Model (GMM)

• Each cluster is modeled as a Gaussian distribution.


• Uses Expectation-Maximization (EM) algorithm to estimate parameters.
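A Gaussian Mixture Model sketch (the number of components and toy data are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=0)

# EM fits one Gaussian per component; covariance_type="full" allows elliptical clusters
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)
print(gmm.means_)                 # estimated component means
print(gmm.predict_proba(X[:3]))   # soft (probabilistic) assignments
```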

Advantages:

• Can model elliptical clusters


• Probabilistic framework

Disadvantages:

• Computationally expensive
• May overfit with many components

3. Evaluation of Clustering

Since clustering is unsupervised, evaluation is not straightforward.

Common Evaluation Methods:

• Silhouette Coefficient: Measures how similar an object is to its own cluster vs.
others.
• Davies-Bouldin Index
• Dunn Index
• Intra-cluster distance (low) vs Inter-cluster distance (high)
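For instance, the silhouette coefficient and Davies-Bouldin index can be computed as follows (a sketch on assumed toy data; higher silhouette and lower Davies-Bouldin are better):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))          # closer to 1 is better
print("davies-bouldin:", davies_bouldin_score(X, labels))  # lower is better
```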

4. Applications of Clustering

• Customer Segmentation in marketing


• Document or News Grouping
• Image Segmentation
• Anomaly Detection
• Genomic Data Analysis
• Social Network Analysis

5. Advantages of Clustering

• Can discover hidden patterns in data


• No need for labeled data
• Can be used for preprocessing (e.g., grouping data before classification)

6. Disadvantages of Clustering

• Results may vary depending on initialization


• Sensitive to noise and outliers (in some methods)
• Some methods require user-defined parameters (like k)
• Difficult to interpret clusters in high-dimensional space
K-Means
Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters to be generated. In the partitioning method, given a database D containing N objects, the method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms fall under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). Below, the working of the K-Means algorithm is described in detail.
K-Means (A Centroid-Based Technique): The K-Means algorithm takes an input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among the data objects inside a cluster (intra-cluster) is high, while the similarity with data objects outside the cluster (inter-cluster) is low. The similarity of a cluster is determined with respect to the mean value of the cluster; K-Means is a squared-error-based algorithm. At the start, K objects are randomly chosen from the dataset, each representing a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then recalculated with the added data objects.
Algorithm:
K mean:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects

Output:
A dataset of K clusters
Method:
1. Randomly select K objects from the dataset D as the initial cluster centres C.
2. (Re)assign each object to the cluster whose mean it is most similar to (i.e., the nearest cluster centre).
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat steps 2–3 until no change occurs.
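A compact from-scratch sketch of these steps in NumPy (for illustration only; in practice scikit-learn's KMeans would normally be used):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]      # step 1: random initial centres
    for _ in range(n_iter):
        # step 2: assign each object to the nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recalculate each cluster mean (keep old centre if a cluster is empty)
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                                for j in range(k)])
        if np.allclose(new_centres, centres):                    # step 4: stop when nothing changes
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 6])])
labels, centres = kmeans(X, k=3)
print(centres)
```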

Figure – K-Means Clustering

Flowchart – K-Means Clustering
