
Machine Learning with Python

Machine Learning Algorithms - K-Means


Clustering

Prof. Shibdas Dutta,


Associate Professor,
DCG DATA CORE SYSTEMS INDIA PVT LTD
Kolkata

Machine Learning Algorithms – K-Means Clustering

Introduction - K-Means Clustering

[Figure: data points before and after K-Means clustering]


In general, clustering is defined as grouping data points such that points in the same group are similar or related to one another and different from points in other groups. The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data.

K-means is an unsupervised partitional clustering algorithm that groups data into k clusters by computing centroids, using the Euclidean or Manhattan method for distance calculation. Each object is assigned to the cluster whose centroid is at minimum distance.

Euclidean distance: d = sqrt((x1 - x2)^2 + (y1 - y2)^2)
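
As a quick, minimal sketch of the two distance measures named above (using two of the data points from the worked example later in this deck):

import numpy as np

p1 = np.array([1.5, 2.0])
p2 = np.array([5.0, 7.0])

# Euclidean distance: straight-line distance (Pythagoras)
euclidean = np.sqrt(np.sum((p1 - p2) ** 2))  # equivalently np.linalg.norm(p1 - p2)

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(p1 - p2))

print(euclidean)  # 6.10...
print(manhattan)  # 8.5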


ALGORITHM

1. First, initialize the number of clusters, K (the elbow method is generally used to select the number of clusters).

2. Randomly select k data points as centroids. A centroid is the imaginary or real location representing the center of a cluster.

3. Assign each data item to its closest centroid, then update each centroid's coordinates as the average of the coordinates of the items assigned to it so far.

4. Repeat the process for a number of iterations, until successive iterations assign the data items to the same groups. (A condensed NumPy sketch of these steps follows.)
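
A condensed NumPy sketch of the four steps above. This is illustrative only; the full class-based implementation appears later in this deck, and the helper name kmeans here is our own:

import numpy as np

def kmeans(data, k, max_iter=100):
    # Step 2: randomly pick k distinct data points as the initial centroids
    rng = np.random.default_rng(0)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid ...
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... and update each centroid to the mean of its points
        # (assumes no cluster ends up empty)
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once successive iterations give the same centroids
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels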

HOW IT WORKS?
In the beginning, the algorithm chooses k centroids in the dataset randomly after shuffling the data. It then calculates the distance of each point to each centroid using the Euclidean distance method. Each centroid represents a cluster, and each point is assigned to its closest cluster. At the end of the first iteration, the centroid values are recalculated, usually as the arithmetic mean of all points in the cluster. New centroid values are calculated every iteration until successive iterations produce the same centroid values.

Let's work through K-means clustering from scratch with a simple example. Suppose we have the data points (1,1), (1.5,2), (3,4), (5,7), (3.5,5), (4.5,5), (3.5,4.5), and suppose k = 2, i.e. the dataset should be grouped into two clusters. Here we are using the Euclidean distance method.

Step 1: It is already defined that k = 2 for this problem.

Step 2: Since k = 2, we randomly select two centroids, c1(1,1) and c2(5,7).

Step 3: Now we calculate the distance of each point to each centroid using the Euclidean distance method (Pythagoras' theorem):

ITERATION 01 (c1 = (1,1), c2 = (5,7))

Point        D1 to c1   D2 to c2   Remarks
(1, 1)       0          7.21       D1 < D2 : (1,1) belongs to c1
(1.5, 2)     1.12       6.10       D1 < D2 : (1.5,2) belongs to c1
(3, 4)       3.61       3.61       D1 = D2 : tie, (3,4) kept with c1
(5, 7)       7.21       0          D1 > D2 : (5,7) belongs to c2
(3.5, 5)     4.72       2.50       D1 > D2 : (3.5,5) belongs to c2
(4.5, 5)     5.32       2.06       D1 > D2 : (4.5,5) belongs to c2
(3.5, 4.5)   4.30       2.91       D1 > D2 : (3.5,4.5) belongs to c2

Note: D1 & D2 are euclidean distance between centroid (x2,y2) and data points (x1,y1)

In cluster c1 we have (1,1), (1.5,2) and (3,4) whereas centroid c2 contains (5,7), (3.5,5), (4.5,5) & (3.5,4.5). Here, a new
centroid is the algebraic mean of all the data items in a cluster.

C1(new) = ( (1+1.5+3)/3 , (1+2+4)/3) = (1.83, 2.33)


C2(new) = ((5+3.5+4.5+3.5)/4, (7+5+5+4.5)/4) = (4.125, 5.375)
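
The same centroid update can be checked with NumPy (a one-line mean per cluster):

import numpy as np

c1_points = np.array([[1, 1], [1.5, 2], [3, 4]])
c2_points = np.array([[5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

print(c1_points.mean(axis=0))  # [1.833... 2.333...]
print(c2_points.mean(axis=0))  # [4.125 5.375]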

ITERATION 02 (c1 = (1.83, 2.33), c2 = (4.12, 5.37))

Point        D1 to c1   D2 to c2   Remarks
(1, 1)       1.56       5.37       (1,1) belongs to c1
(1.5, 2)     0.46       4.27       (1.5,2) belongs to c1
(3, 4)       2.03       1.77       (3,4) belongs to c2
(5, 7)       5.64       1.84       (5,7) belongs to c2
(3.5, 5)     3.14       0.72       (3.5,5) belongs to c2
(4.5, 5)     3.77       0.53       (4.5,5) belongs to c2
(3.5, 4.5)   2.73       1.07       (3.5,4.5) belongs to c2

Now cluster c1 has (1,1) and (1.5,2), whereas cluster c2 contains (3,4), (5,7), (3.5,5), (4.5,5) and (3.5,4.5). Again, each new centroid is the arithmetic mean of all the data items in its cluster:

C1(new) = ((1 + 1.5)/2, (1 + 2)/2) = (1.25, 1.5)

C2(new) = ((3 + 5 + 3.5 + 4.5 + 3.5)/5, (4 + 7 + 5 + 5 + 4.5)/5) = (3.9, 5.1)

ITERATION 03 (c1 = (1.25, 1.5), c2 = (3.9, 5.1))

Point        D1 to c1   D2 to c2   Remarks
(1, 1)       0.56       5.02       (1,1) belongs to c1
(1.5, 2)     0.56       3.92       (1.5,2) belongs to c1
(3, 4)       3.05       1.42       (3,4) belongs to c2
(5, 7)       6.66       2.19       (5,7) belongs to c2
(3.5, 5)     4.16       0.41       (3.5,5) belongs to c2
(4.5, 5)     4.77       0.60       (4.5,5) belongs to c2
(3.5, 4.5)   3.75       0.72       (3.5,4.5) belongs to c2

The cluster memberships are unchanged: c1 has (1,1) and (1.5,2), while c2 contains (3,4), (5,7), (3.5,5), (4.5,5) and (3.5,4.5), so the centroids are also unchanged:

C1(new) = ((1 + 1.5)/2, (1 + 2)/2) = (1.25, 1.5)

C2(new) = ((3 + 5 + 3.5 + 4.5 + 3.5)/5, (4 + 7 + 5 + 5 + 4.5)/5) = (3.9, 5.1)

Step 4: The 2nd and 3rd iterations produced the same centroids, so the algorithm has converged. The final clusters are c1 = {(1,1), (1.5,2)} and c2 = {(3,4), (5,7), (3.5,5), (4.5,5), (3.5,4.5)}.
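
The whole hand calculation can be reproduced in a few lines of NumPy (a minimal sketch using the same data and the same initial centroids c1(1,1) and c2(5,7); argmin breaks ties toward c1, matching iteration 1 above):

import numpy as np

data = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7],
                 [3.5, 5], [4.5, 5], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])  # c1 and c2 from Step 2

for iteration in range(10):
    # distance from every point to each centroid, shape (7, 2)
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # same centroids as last iteration: converged
    centroids = new_centroids

print(centroids)  # [[1.25 1.5 ] [3.9  5.1 ]]
print(labels)     # [0 0 1 1 1 1 1]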

K-Means Clustering Code
So far, we have covered the introduction to the K-Means algorithm, the mathematics behind it, and how the Euclidean distance method is used to group data items into K clusters.
Here we are implementing K-means clustering from scratch in Python.
But one problem remains: how do we choose the number of clusters? In this example we assign the number of clusters ourselves; later we will discuss ways of finding the best number of clusters.
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt
import math

class K_Means:

    def __init__(self, k=2, tolerance=0.001, max_iter=500):
        self.k = k
        self.max_iterations = max_iter
        self.tolerance = tolerance

We have defined a K_Means class whose __init__ takes a default value of k = 2, an error tolerance of 0.001, and a maximum of 500 iterations.
Before diving into the code, let's recall two mathematical terms involved in K-means clustering: centroids and Euclidean distance. On a quick note, the centroid of a set of data points is their average (mean), and the Euclidean distance between two points in the coordinate plane is calculated using Pythagoras' theorem.

    def euclidean_distance(self, point1, point2):
        # equivalent to sqrt((x1-x2)^2 + (y1-y2)^2 + ...) computed by hand
        return np.linalg.norm(point1 - point2, axis=0)

We find the Euclidean distance from each point to all the centroids. For efficiency, it is better to use the NumPy function np.linalg.norm(point1 - point2, axis=0) than a hand-written square root.

    def fit(self, data):
        # use the first k data points as the initial centroids
        self.centroids = {}
        for i in range(self.k):
            self.centroids[i] = data[i]

ASSIGNING CENTROIDS
There are various methods of assigning the k initial centroids. The most common is random selection, but let's go with the most basic approach: we assign the first k points of the dataset as the initial centroids. (A sketch of random initialization appears after this walkthrough.)
        for i in range(self.max_iterations):
            # start each iteration with empty clusters
            self.classes = {}
            for j in range(self.k):
                self.classes[j] = []

            # assign every point to the cluster of its nearest centroid
            for point in data:
                distances = []
                for index in self.centroids:
                    distances.append(self.euclidean_distance(point, self.centroids[index]))
                cluster_index = distances.index(min(distances))
                self.classes[cluster_index].append(point)

So far we have defined the K_Means class, initialized some default parameters, defined the Euclidean distance function, and assigned the initial k centroids. Now, to determine which cluster a data item belongs to, we calculate its Euclidean distance to each centroid; the item belongs to the cluster whose centroid is closest.

            previous = dict(self.centroids)
            # recompute each centroid as the average of the points assigned to it
            for cluster_index in self.classes:
                self.centroids[cluster_index] = np.average(self.classes[cluster_index], axis=0)

            # stop once no centroid has moved by more than the tolerance
            # (np.abs matters here: without it, signed changes could cancel out)
            isOptimal = True
            for centroid in self.centroids:
                original_centroid = previous[centroid]
                curr = self.centroids[centroid]
                if np.sum(np.abs((curr - original_centroid) / original_centroid * 100.0)) > self.tolerance:
                    isOptimal = False
            if isOptimal:
                break

At the end of each iteration, the centroid values are recalculated as the arithmetic mean of all points in the cluster. New centroid values are computed every iteration until successive iterations change them by less than the tolerance.
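
For reference, the random initialization mentioned at the start of this walkthrough could replace the first-k assignment in fit with something like the following. This is a hypothetical helper, not part of the class above; note that scikit-learn's KMeans defaults to the smarter k-means++ seeding instead:

import numpy as np

def random_centroids(data, k):
    # pick k distinct data points at random as the starting centroids
    indices = np.random.choice(len(data), size=k, replace=False)
    return {i: data[idx] for i, idx in enumerate(indices)}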

CLUSTERING WITH DEMO DATA
We've now completed the K-Means scratch code. Let's test it by clustering randomly generated data:

# Generate dummy cluster datasets.
# Set three centers; the model should recover them.
center_1 = np.array([1, 1])
center_2 = np.array([5, 5])
center_3 = np.array([8, 1])

# Generate random data around the three centers
cluster_1 = np.random.randn(100, 2) + center_1
cluster_2 = np.random.randn(100, 2) + center_2
cluster_3 = np.random.randn(100, 2) + center_3

data = np.concatenate((cluster_1, cluster_2, cluster_3), axis=0)

Here we have created three two-dimensional groups of data around different centers, so we set k = 3. Now let's fit the model:

K = 3
k_means = K_Means(K)
k_means.fit(data)

# Plotting starts here
colors = 10 * ["r", "g", "c", "b", "k"]

# mark each centroid with an 'x'
for centroid in k_means.centroids:
    plt.scatter(k_means.centroids[centroid][0], k_means.centroids[centroid][1], s=130, marker="x")

# color the points of each cluster
for cluster_index in k_means.classes:
    color = colors[cluster_index]
    for features in k_means.classes[cluster_index]:
        plt.scatter(features[0], features[1], color=color, s=30)

plt.show()

[Figure: K-Means clustering result on the demo data, with centroids marked]
CHOOSING THE VALUE OF K
While working with the K-Means clustering scratch code, one thing we must keep in mind is the number of clusters k. We should make sure we choose the optimum number of clusters for the given dataset. But here a question arises: how do we choose the optimum value of k? We use the elbow method, which is the standard way of analyzing the optimum value of k.

The elbow method is based on the principle that the sum of squared distances of every data point from its corresponding cluster centroid should be as small as possible.
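
Written as a formula (this quantity is called WCSS in the steps below):

WCSS = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

where \mu_j is the centroid of cluster C_j.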

STEPS OF CHOOSING THE BEST K VALUE

1. Run the k-means clustering model for various values of k.

2. For each value of k, calculate the sum of squared distances of every data point from its corresponding cluster centroid; this quantity is called WCSS (Within-Cluster Sum of Squares).

3. Plot the value of WCSS against the various values of k.

4. Choose the value of k where there is a bend (knee) in the plot, i.e. where WCSS stops decreasing rapidly (see the sketch below).
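
A short scikit-learn sketch of these steps (KMeans exposes WCSS as its inertia_ attribute; the random X here is just a stand-in for whatever dataset is being analyzed):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)  # stand-in dataset

# Steps 1-2: fit k-means for each k and record WCSS (inertia_)
wcss = []
ks = range(1, 11)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

# Steps 3-4: plot WCSS against k and look for the bend
plt.plot(ks, wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS')
plt.title('Elbow method')
plt.show()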

[Figure: elbow plot of WCSS versus k]

Finding the Accuracy Score in the K-Means Clustering Algo

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

# Load the data (assumes data.csv has feature columns plus a 'label' column)
X = pd.read_csv('data.csv').drop('label', axis=1)
y = pd.read_csv('data.csv')['label']

# Create the KMeans model and fit it to the data
kmeans = KMeans(n_clusters=3)
y_pred = kmeans.fit_predict(X)

# K-Means cluster indices are arbitrary, so map each cluster to the most
# common true label among its points before scoring
mapped = np.empty_like(y.values)
for c in range(kmeans.n_clusters):
    mask = y_pred == c
    mapped[mask] = y[mask].mode()[0]

# Calculate and print the accuracy
accuracy = accuracy_score(y, mapped)
print(accuracy)

Note that the mapping step is essential: without it, accuracy_score would compare the true labels against arbitrary cluster indices and the score would be meaningless.

Plotting a Confusion Matrix for the K-Means Clustering Algo

import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

# Create the data (random demo features with random binary labels)
X = np.random.randn(100, 2)
y = np.random.randint(0, 2, size=100)

# Fit the KMeans model
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

# Predict the cluster index for each point. As with the accuracy score,
# cluster indices are arbitrary and should be mapped to the true labels
# before the matrix is interpreted.
y_pred = kmeans.predict(X)

# Create the confusion matrix
cm = confusion_matrix(y, y_pred)

# Plot the confusion matrix
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix')
plt.colorbar()
# KMeans has no classes_ attribute; use the class values present in y
classes = np.unique(y)
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)

# write each cell's count into the plot
fmt = 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], fmt),
             horizontalalignment="center",
             color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()
plt.ylabel('True/Actual label')
plt.xlabel('Predicted label')
plt.show()

PROS OF K-MEANS

1. Relatively simple to learn and understand, since the algorithm relies only on a distance calculation such as the Euclidean method.
2. K-means works by minimizing the sum of squared distances, so it is guaranteed to converge (though possibly only to a local optimum).
3. The computational cost is O(K*n*d) per iteration, so K-means is fast and efficient.

CONS OF K-MEANS

1. Difficulty in choosing the optimum number of clusters K.
2. K-means has problems when clusters have differing sizes, densities, or non-globular shapes.
3. K-means has problems when the data contains outliers.
4. As the number of dimensions increases, it becomes increasingly difficult for the algorithm to converge, due to the curse of dimensionality.
5. If clusters overlap, k-means has no intrinsic measure of uncertainty.

Applications of the K-Means Clustering Algorithm
The main goals of cluster analysis are:
· To get a meaningful intuition from the data we are working with.

· Cluster-then-predict, where different models are built for different subgroups.

K-means clustering performs well enough to fulfill these goals.

It can be used in the following applications:

· Market segmentation
· Document clustering
· Image segmentation
· Image compression
· Customer segmentation
· Analyzing trends in dynamic data

Thank You
