Uploaded by Abu Sufian

Objectives of Clustering

1. Getting Data

Objective: Gather the raw data needed for analysis.

 Data Sources:
o Internal Sources: Databases, transaction logs, customer records.
o External Sources: APIs, web scraping, third-party datasets.
o Generated Data: Surveys, experiments.
 Data Types:
o Structured Data: Tabular data, such as CSV files or databases.
o Unstructured Data: Text, images, audio.
o Semi-structured Data: JSON, XML.

Example: If you're clustering customers, you might gather data on purchase history, website behavior, and demographic information from your company's CRM system.
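As a sketch, such gathered data might be assembled into a single table; the column names and values below are purely illustrative, not from any real CRM:

```python
import pandas as pd

# Illustrative records resembling a CRM export; the columns combine
# purchase history, website behavior, and demographics (hypothetical).
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "total_spent": [250.0, 40.5, 980.0],   # purchase history
    "site_visits": [12, 3, 45],            # website behavior
    "age": [34, 29, 52],                   # demographics
})

print(customers.shape)  # (3, 4)
```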

2. Cleaning Data

Objective: Ensure the data is free from errors and inconsistencies.

 Common Issues:
o Missing Data: Handle missing values by filling them with
mean/median values, using algorithms like K-Nearest Neighbors
(KNN), or simply removing the affected rows/columns.
o Duplicate Data: Remove duplicate entries to avoid skewing the
clustering results.
o Outliers: Detect and handle outliers that could distort cluster
formation, either by removing them or treating them separately.
 Data Cleaning Techniques:
o Imputation: Filling missing data with appropriate values.
o Normalization: Adjusting data to ensure consistency, such as
converting all dates to a single format.
o Filtering: Removing unnecessary columns or rows.

Example: You might find missing age data in customer profiles. These gaps could be filled with the median age of the customers, or the affected records could be flagged and handled separately.
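A minimal sketch of median imputation with pandas (the profile data here is invented for illustration):

```python
import pandas as pd

# Customer profiles with missing ages (illustrative values).
profiles = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, None, 29, None],
})

# Fill missing ages with the median of the observed values.
median_age = profiles["age"].median()          # (34 + 29) / 2 = 31.5
profiles["age"] = profiles["age"].fillna(median_age)
```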
3. Data Preprocessing

Objective: Prepare the data for the clustering algorithm by transforming it into a
suitable format.

 Normalization and Scaling:
o Normalization: Adjust values to a common scale without distorting
differences in the ranges of values.
o Standardization: Scale data to have a mean of 0 and a standard
deviation of 1, useful for algorithms like K-Means.
 Dimensionality Reduction:
o Techniques like Principal Component Analysis (PCA) reduce the
number of variables while retaining the most important information.
 Feature Engineering:
o Creating new features from existing data to enhance the clustering
process.
o Encoding Categorical Variables: Convert categories into numerical
values using methods like one-hot encoding.
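One-hot encoding can be sketched with pandas; the `region` column is a hypothetical categorical feature:

```python
import pandas as pd

# One-hot encoding turns each category into its own 0/1 column,
# so distance-based algorithms like K-Means can use the feature.
df = pd.DataFrame({"region": ["north", "south", "north", "east"]})
encoded = pd.get_dummies(df, columns=["region"])
# Produces columns region_east, region_north, region_south.
```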

Example: If your customer data includes income, it may vary widely. Normalizing
the income data ensures that customers with high incomes don't
disproportionately influence the clustering.
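Standardization to mean 0 and standard deviation 1 can be sketched with scikit-learn; the income figures are invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Incomes span a wide range, so raw values would dominate
# distance computations in K-Means (illustrative numbers).
income = np.array([[20_000.0], [35_000.0], [50_000.0], [250_000.0]])

# After scaling: mean approximately 0, standard deviation approximately 1.
scaled = StandardScaler().fit_transform(income)
```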

4. Data Visualization

Objective: Visualize the data to understand its structure and the results of the
clustering.

 Exploratory Visualization:
o Histograms and Box Plots: Used to understand the distribution of
individual features.
o Scatter Plots: Visualize relationships between two variables, helping
to identify potential clusters before applying any algorithm.
 Visualizing Clusters:
o 2D/3D Scatter Plots: After clustering, plot the clusters to visualize
how the data points have been grouped.
o Cluster Centroids: In K-Means, visualize the centroids to understand
the center of each cluster.
o Heatmaps: Show the correlation between features, which can help in
understanding the structure of the clusters.

Example: After clustering customers based on their purchasing behavior, you could use a 2D scatter plot to visualize the clusters, where each point represents a customer, and colors distinguish different clusters.
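Such a plot might be sketched with matplotlib; the points and labels below are synthetic stand-ins for real customer data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic 2D "customer" features in three groups, with group labels.
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2)) + np.repeat([[0, 0], [6, 6], [0, 6]], 20, axis=0)
labels = np.repeat([0, 1, 2], 20)

fig, ax = plt.subplots()
ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="viridis")
ax.set_xlabel("feature 1 (e.g. scaled spend)")
ax.set_ylabel("feature 2 (e.g. scaled visit frequency)")
fig.savefig("clusters.png")
```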

5. Clustering Process

Objective: Apply a clustering algorithm to group the data into meaningful clusters.

 Choosing the Right Algorithm:
o K-Means: Simple and widely used for partitioning data into K
clusters.
o Hierarchical Clustering: Builds a tree of clusters, useful when the
number of clusters is not known beforehand.
o DBSCAN: Useful for finding arbitrarily shaped clusters and handling
noise/outliers.
 Running the Algorithm:
o Initialize the clustering process by choosing the appropriate
parameters (e.g., number of clusters for K-Means).
o Fit the model to the data, allowing it to group similar data points
together.
 Evaluating the Clusters:
o Silhouette Score: Measures how similar a data point is to its own
cluster compared to other clusters.
o Elbow Method: Used in K-Means to determine the optimal number
of clusters by plotting the sum of squared distances and looking for
an "elbow" point.
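The elbow method can be sketched as follows; the data is synthetic with three well-separated groups, so the inertia curve flattens around k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(25, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

# Inertia (sum of squared distances to the nearest centroid)
# for k = 1..6; the "elbow" is where the decrease flattens.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```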

Example: Using K-Means to group customers into 3 clusters based on their shopping habits, you could evaluate the clustering quality with a silhouette score.
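A minimal sketch with scikit-learn, using synthetic stand-in data for the shopping habits:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic "shopping habit" features: three separated groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Silhouette score ranges from -1 to 1; values near 1 indicate
# compact, well-separated clusters.
score = silhouette_score(X, labels)
```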
