Objectives of Clustering
1. Getting Data
Data Sources:
o Internal Sources: Databases, transaction logs, customer records.
o External Sources: APIs, web scraping, third-party datasets.
o Generated Data: Surveys, experiments.
Data Types:
o Structured Data: Tabular data, such as CSV files or databases.
o Unstructured Data: Text, images, audio.
o Semi-structured Data: JSON, XML.
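As a rough illustration of how structured and semi-structured sources can end up in one common tabular form, here is a minimal sketch using only the Python standard library. The sample records, the field names (customer_id, age, income, profile), and the helper functions are all hypothetical.

```python
import csv
import io
import json

# Hypothetical structured data: a small CSV export of customer records.
csv_text = "customer_id,age,income\n1,34,52000\n2,45,61000\n"

# Hypothetical semi-structured data: JSON with a nested "profile" object.
json_text = '[{"customer_id": 3, "profile": {"age": 29, "income": 48000}}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts with integer values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: int(value) for key, value in row.items()} for row in reader]


def rows_from_json(text):
    """Flatten the nested profile so each record matches the CSV columns."""
    records = json.loads(text)
    return [
        {
            "customer_id": record["customer_id"],
            "age": record["profile"]["age"],
            "income": record["profile"]["income"],
        }
        for record in records
    ]


# Both sources now share the same columns and can be cleaned together.
customers = rows_from_csv(csv_text) + rows_from_json(json_text)
```

Unstructured data (text, images, audio) usually needs a separate feature-extraction step before it fits this kind of table, which is outside the scope of this sketch.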
2. Cleaning Data
Common Issues:
o Missing Data: Handle missing values by filling them with the mean
or median, imputing them with an algorithm such as K-Nearest
Neighbors (KNN), or removing the affected rows or columns.
o Duplicate Data: Remove duplicate entries to avoid skewing the
clustering results.
o Outliers: Detect and handle outliers that could distort cluster
formation, either by removing them or treating them separately.
Data Cleaning Techniques:
o Imputation: Filling missing data with appropriate values.
o Normalization: Making values consistent in format, such as
converting all dates to a single format. This is separate from the
numeric scaling of features covered in step 3.
o Filtering: Removing unnecessary columns or rows.
Example: Customer profiles might be missing age values. These could be filled
with the median customer age, or the affected customers could be set aside and
handled as a separate group.
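The cleaning steps above can be sketched with pandas, assuming it is available. The DataFrame contents and the column names (age_missing, income_outlier) are made up for illustration, and the 1.5 * IQR rule is only one common rule of thumb for flagging outliers, not something these notes prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical customer profiles with a missing age, a duplicate row,
# and one extreme income value.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "age": [34.0, 45.0, 45.0, np.nan, 29.0, 52.0],
    "income": [52000, 61000, 61000, 48000, 55000, 900000],
})

# Duplicate data: keep the first copy of each identical row.
df = df.drop_duplicates().reset_index(drop=True)

# Missing data: record which ages were missing, then fill with the median.
df["age_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: flag incomes more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = (
    (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
)
```

Keeping the age_missing flag makes it possible to treat those customers separately later instead of losing track of which values were imputed. Flagging outliers rather than dropping them leaves the choice between removal and separate treatment open.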
3. Data Preprocessing
Objective: Prepare the data for the clustering algorithm by transforming it into a
suitable format.
Example: Customer income can span a much wider numeric range than features such
as age. Normalizing income puts it on a scale comparable to the other features,
so that it doesn't dominate the distance calculations and disproportionately
influence the clustering.
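To make the effect of scaling concrete, here is a small sketch using NumPy with z-score standardization, which is one common way to normalize features (min-max scaling is another). The feature values and the standardize helper are hypothetical.

```python
import numpy as np

# Hypothetical features per customer: age in years, income in dollars.
X = np.array([
    [25.0, 30000.0],
    [35.0, 60000.0],
    [45.0, 90000.0],
    [55.0, 120000.0],
])


def standardize(features):
    """Rescale each column to mean 0 and standard deviation 1 (z-score).

    Assumes no column is constant; a zero standard deviation would
    cause a division by zero.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std


X_scaled = standardize(X)

# Euclidean distance between the first two customers, before and after.
raw_distance = np.linalg.norm(X[0] - X[1])
scaled_distance = np.linalg.norm(X_scaled[0] - X_scaled[1])
```

Before scaling, the distance between the first two customers is almost entirely the 30,000 difference in income, and the 10-year age gap barely registers. After scaling, both features contribute equally to the distance.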
4. Data Visualization
Objective: Visualize the data to understand its structure and the results of the
clustering.
Exploratory Visualization:
o Histograms and Box Plots: Used to understand the distribution of
individual features.
o Scatter Plots: Visualize relationships between two variables, helping
to identify potential clusters before applying any algorithm.
Visualizing Clusters:
o 2D/3D Scatter Plots: After clustering, plot the clusters to visualize
how the data points have been grouped.
o Cluster Centroids: In K-Means, visualize the centroids to understand
the center of each cluster.
o Heatmaps: Show the correlation between features, which can help in
understanding the structure of the clusters.
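The cluster visualizations above can be sketched with Matplotlib and NumPy, assuming both are installed. The points and labels below are invented, and the labels are assumed to come from an earlier clustering step. The centroids are computed as the mean of each cluster's points, which is how K-Means defines them.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 2D data and cluster labels from an earlier clustering step.
points = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.2, 0.9],
    [5.0, 5.1], [5.2, 4.8], [4.8, 5.0],
])
labels = np.array([0, 0, 0, 1, 1, 1])

# Each centroid is the mean of the points assigned to that cluster.
centroids = np.array(
    [points[labels == k].mean(axis=0) for k in np.unique(labels)]
)

# Pearson correlation between the features, used for the heatmap.
corr = np.corrcoef(points, rowvar=False)

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(9, 4))

# 2D scatter plot colored by cluster, with centroids marked as red crosses.
ax_scatter.scatter(points[:, 0], points[:, 1], c=labels)
ax_scatter.scatter(centroids[:, 0], centroids[:, 1],
                   marker="x", s=120, color="red")
ax_scatter.set_title("Clusters and centroids")

# Heatmap of the feature correlation matrix.
image = ax_heat.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
fig.colorbar(image, ax=ax_heat)
ax_heat.set_title("Feature correlation")

# Write the figure to an in-memory buffer; pass a filename to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

With more than three features, the scatter plot would need a pair of chosen features or a dimensionality reduction step first, which this sketch does not cover.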
5. Clustering Process