DSE Lab Assignment - Writeup - 7

The document discusses performing clustering analysis on workout data using K-means clustering in Python. It includes loading and exploring the data, selecting K-means modeling, training the model to assign clusters, and evaluating performance.

B. Tech Electrical and Computer Engineering

Semester: VI Subject: Data Science for Engineers


Name: Abhishek Agrawal Class: TY El&CE
Roll no.: 03 Batch: A3

Experiment No.: 07
Name of the Experiment: Clustering using Python

Aim:
Write a Python program to perform clustering on the workout data given below.
Date      Distance_km  Duration_min  Delta_last_workout  Day_category
10/17/17  4.3          21.58         1                   0
11/04/17  1.9          9.25          18                  1
11/18/17  1.9          9.0           14                  1
11/23/17  1.9          8.93          5                   0
11/28/17  2.3          11.94         5                   0
11/29/17  2.8          14.05         1                   0

To keep track of your performance, you need to identify similar workout sessions. Clustering
can help you split the data into distinct groups so that the data points in each group are
similar to each other. Perform the following steps:
i. Load the Data
ii. Exploratory Data Analysis: Pair plot, distance versus workout duration,
distance versus duration with the number of days, and scatter plots to get an
idea of the correlation between different features.
iii. Select K-means clustering for the model and get the clusters.
iv. Evaluate the performance of the model.
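
Steps i and ii can be sketched as follows. This is a minimal sketch, assuming pandas is available; the inline CSV text, the `load_workouts` helper, and its `source` parameter are illustrative names that do not come from the assignment. It loads the six sessions from the table above and computes the pairwise correlation between the numeric features.

```python
import io

import pandas as pd

# The six workout sessions from the Aim, written out as CSV text so the
# example runs without an external file.
RAW_CSV = """Date,Distance_km,Duration_min,Delta_last_workout,Day_category
10/17/17,4.3,21.58,1,0
11/04/17,1.9,9.25,18,1
11/18/17,1.9,9.0,14,1
11/23/17,1.9,8.93,5,0
11/28/17,2.3,11.94,5,0
11/29/17,2.8,14.05,1,0
"""


def load_workouts(source=None):
    """Load the workout table into a DataFrame.

    `source` is a hypothetical file path or file-like object; when it is
    omitted, the inline copy of the data is used.
    """
    if source is None:
        source = io.StringIO(RAW_CSV)
    df = pd.read_csv(source)
    # Dates are written as month/day/two-digit year.
    df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")
    return df


df = load_workouts()
numeric = df[["Distance_km", "Duration_min", "Delta_last_workout"]]
corr = numeric.corr()  # Pearson correlation between each pair of features
print(df.dtypes)
print(corr.round(3))
```

A pair plot or scatter plot of the same columns would typically be drawn with a plotting library such as matplotlib or seaborn; that part is left out here to keep the sketch short.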

Theory:
Clustering based Machine Learning:
The task of grouping data points based on their similarity to each other is called Clustering
or Cluster Analysis. This method falls under the branch of Unsupervised Learning,
which aims at gaining insights from unlabeled data points; that is, unlike supervised
learning, we don't have a target variable. Clustering aims at forming groups of homogeneous
data points from a heterogeneous dataset. It evaluates similarity using a metric such as
Euclidean distance, Cosine similarity, or Manhattan distance, and then groups the most
similar points together.
For example, in the graph given below, we can clearly see 3 circular clusters
forming on the basis of distance.

The clusters formed need not be circular in shape. The shape of clusters
can be arbitrary, and many algorithms work well at detecting arbitrarily shaped
clusters.
For example, in the graph given below, we can see that the clusters formed are not circular in
shape.
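
The three similarity metrics named above can be sketched in plain Python. This is an illustrative sketch only; the function names are chosen here, and the two sample points are the first two sessions from the Aim expressed as (Distance_km, Duration_min). Note that Euclidean and Manhattan are distances (smaller means more similar), while cosine similarity is a similarity (larger means more similar).

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two workout sessions as (Distance_km, Duration_min).
s1 = (4.3, 21.58)
s2 = (1.9, 9.25)

print("Euclidean:", euclidean(s1, s2))
print("Manhattan:", manhattan(s1, s2))
print("Cosine similarity:", cosine_similarity(s1, s2))
```

The two sessions are far apart by Euclidean and Manhattan distance because one run is much longer, yet their cosine similarity is close to 1 because the pace (the ratio of duration to distance) is nearly the same. Which metric is appropriate depends on what "similar" should mean for the data.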

Types of Clustering:
● Hard Clustering: In this type of clustering, each data point either belongs to a
cluster completely or does not belong to it at all.
● Soft Clustering: In this type of clustering, instead of assigning each data point to a
single cluster, a probability or likelihood of that point belonging to each cluster is
evaluated.
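The difference between the two can be sketched with NumPy. This is an illustration under stated assumptions, not a specific library's method: the two centroids and the `sigma` value are hypothetical, and the soft memberships are computed as a softmax over negative squared distances, which is how a Gaussian mixture with equal weights and a shared spherical variance would weight the clusters.

```python
import numpy as np


def squared_distances(points, centroids):
    """Squared Euclidean distance from every point to every centroid, shape (n, k)."""
    diff = points[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2)


def hard_assign(points, centroids):
    """Hard clustering: each point gets exactly one cluster index."""
    return squared_distances(points, centroids).argmin(axis=1)


def soft_assign(points, centroids, sigma=1.0):
    """Soft clustering: each point gets a membership weight for every cluster."""
    logits = -squared_distances(points, centroids) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# (Distance_km, Duration_min) for the six sessions in the Aim.
points = np.array([
    [4.3, 21.58], [1.9, 9.25], [1.9, 9.0],
    [1.9, 8.93], [2.3, 11.94], [2.8, 14.05],
])
# Hypothetical centroids for a "short run" and a "long run" cluster.
centroids = np.array([[2.0, 9.5], [4.0, 20.0]])

hard = hard_assign(points, centroids)
soft = soft_assign(points, centroids, sigma=3.0)
print(hard)
print(soft.round(3))
```

Each row of `soft` sums to 1, and taking the largest weight in each row recovers the hard assignment.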
Types of Clustering Algorithms:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering (Model-based methods)
Uses of Clustering: Clustering algorithms are mainly used for:
● Market Segmentation – Businesses use clustering to group their customers and use
targeted advertisements to attract a larger audience.
● Market Basket Analysis – Shop owners analyze their sales and figure out which items
are frequently bought together by customers.
● Social Network Analysis – Social media sites use your data to understand your
browsing behavior and provide you with targeted friend recommendations or content
recommendations.
● Medical Imaging – Doctors use clustering to identify diseased areas in diagnostic
images like X-rays.
● Anomaly Detection – Clustering can be used to find outliers in real-time data
streams or to flag potentially fraudulent transactions.
K-means Clustering:
Unsupervised machine learning trains an algorithm on unlabeled, unclassified data and
lets it operate on that data without supervision. With no labeled training examples,
the algorithm's job is to organize unsorted data according to similarities, patterns,
and differences.
K-means clustering assigns each data point to one of K clusters based on its distance
from the cluster centers. It starts by randomly placing the cluster centroids in the
feature space. Each data point is then assigned to the cluster whose centroid is nearest.
After every point has been assigned, each centroid is recomputed as the mean of the points
in its cluster. This process repeats until the centroids stop changing or an iteration
limit is reached.
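
The iteration described above can be sketched in NumPy. This is a minimal sketch rather than a production implementation; it assumes the initial centroids are drawn from the data points themselves (a common variant of random initialization), and the `kmeans` function name, `n_iters`, and `seed` are illustrative choices.

```python
import numpy as np


def kmeans(points, k, n_iters=100, seed=0):
    """Basic K-means: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)

    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        diff = points[:, None, :] - centroids[None, :, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)

        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids

    return labels, centroids


# (Distance_km, Duration_min) for the six sessions in the Aim.
workouts = np.array([
    [4.3, 21.58], [1.9, 9.25], [1.9, 9.0],
    [1.9, 8.93], [2.3, 11.94], [2.8, 14.05],
])
labels, centroids = kmeans(workouts, k=2)
print(labels)
print(centroids.round(2))
```

K-means can converge to different groupings depending on the starting centroids, so in practice it is usually run several times with different seeds and the run with the lowest within-cluster sum of squares is kept.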

Procedure:

1. Load the Data: Begin by loading the workout data into a Python environment. You can
use libraries such as pandas to read the data from a CSV file into a DataFrame.
2. Data Exploratory Analysis: Perform exploratory analysis on the data to understand its
structure and characteristics. Create visualizations such as pair plots to visualize the
relationships between different features. Plot Distance versus workout duration, distance
versus duration with the number of days, and correlation scatter plots to identify any
correlations between features.
3. Select K-means Clustering: Choose K-means clustering as the clustering algorithm for
the model. Determine the optimal number of clusters (K) using techniques such as the
elbow method or silhouette analysis.
4. Train the Model and Get Clusters: Train the K-means clustering model using the
workout data. Assign each data point to a cluster based on its proximity to the cluster
centroids.
5. Evaluate the Performance of the Model: Evaluate the performance of the clustering
model using metrics such as silhouette score or inertia. Visualize the clusters to gain
insights into the patterns and groupings within the data.
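
Steps 3 to 5 can be sketched as follows. This assumes scikit-learn is available (the writeup does not name a library) and makes two further assumptions that are labeled in the code: only the three numeric columns are used as features, and they are standardized before clustering. The candidate range of K is kept small because there are only six sessions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: cluster on Distance_km, Duration_min, Delta_last_workout
# (values copied from the table in the Aim).
X = np.array([
    [4.3, 21.58, 1],
    [1.9, 9.25, 18],
    [1.9, 9.0, 14],
    [1.9, 8.93, 5],
    [2.3, 11.94, 5],
    [2.8, 14.05, 1],
])

# Assumption: standardize so that no single column dominates the distance.
X_scaled = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 5):  # with six sessions, only small values of K make sense
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    results[k] = {
        "labels": model.labels_,
        # Inertia: within-cluster sum of squared distances (elbow method).
        "inertia": model.inertia_,
        # Silhouette: ranges from -1 to 1; higher means better-separated clusters.
        "silhouette": silhouette_score(X_scaled, model.labels_),
    }

for k, r in results.items():
    print(k, round(r["inertia"], 3), round(r["silhouette"], 3), r["labels"])
```

For the elbow method, inertia is plotted against K and the point where the curve flattens is chosen; for silhouette analysis, the K with the highest score is preferred. With so few data points, either measure should be read as a rough guide.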
Conclusion:
In this lab, we performed clustering analysis on workout data using the K-means clustering
algorithm. By grouping similar workout sessions together, we can gain insights into patterns
and trends in the data. Clustering analysis can be a valuable tool for organizing and analyzing
large datasets, helping to uncover hidden patterns and relationships.

Post Lab Questions:

1. What are some real-world applications of clustering algorithms, and how do they
benefit from clustering?
2. Explain the different types of clustering algorithms.
3. Discuss different techniques to calculate the distance between centroids and data
elements.
4. Describe the various performance measures that can be used for clustering
algorithms.
