DMBI
ROLL: 09,15,08
DMBI CA-2
Dataset:
(https://www.kaggle.com/code/tanmay111999/clustering-pca-k-means-dbscan-hierarchical/input)
PROBLEM STATEMENT:
“HELP International is an international humanitarian NGO that is committed to
fighting poverty and providing the people of backward countries with basic
amenities and relief during times of disasters and natural calamities. HELP
International has been able to raise around $10 million. This money now needs
to be allocated strategically and effectively. Hence, in order to select the
countries that are in the direst need of aid, data-driven decisions are to be
made. It therefore becomes necessary to categorize the countries using the
socio-economic and health factors that determine the overall development of a
country. Based on these clusters of countries and their conditions, funds will
be allocated for assistance during times of disasters and natural calamities.
It is a clear-cut case of unsupervised learning, where we have to create
clusters of the countries based on the different features present.”
Data Pre-processing:
The dataset provided is already pre-processed, i.e., it does not have any missing values.
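As a quick sanity check, the absence of missing values can be verified with pandas, and the numeric features can be standardized before clustering. The sketch below is illustrative; the file name Country-data.csv and the scaling step are assumptions, not part of the original write-up.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the country dataset (file name assumed from the Kaggle listing)
df = pd.read_csv("Country-data.csv")

# Confirm that no column contains missing values
print(df.isnull().sum())

# Standardize the numeric features so that distance-based clustering
# is not dominated by large-scale columns such as income and gdpp
X = StandardScaler().fit_transform(df.drop(columns=["country"]))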
Dataset attributes:
● country : Name of the country
● child_mort : Death of children under 5 years of age per 1000 live births
● exports : Exports of goods and services per capita, given as a percentage of the GDP per capita
● health : Total health spending per capita, given as a percentage of the GDP per capita
● imports : Imports of goods and services per capita, given as a percentage of the GDP per capita
● income : Net income per person
● inflation : The measurement of the annual growth rate of the total GDP
● life_expec : The average number of years a newborn child would live if the current mortality patterns remain the same
● total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same
● gdpp : The GDP per capita, calculated as the total GDP divided by the total population
Key Concepts of DBSCAN:
● Core Points: A point is considered a core point if it has at least a
specified number of neighboring points (MinPts) within a specified radius
(ε).
● Border Points: A point that is not a core point but is within
the ε-neighborhood of a core point.
● Noise Points (Outliers): Points that are neither core points nor
border points.
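The following sketch shows how these point roles can be inspected with scikit-learn's DBSCAN; the eps and min_samples values are illustrative assumptions rather than tuned parameters from the project.

import numpy as np
from sklearn.cluster import DBSCAN

# X is the standardized feature matrix from the pre-processing step
db = DBSCAN(eps=1.5, min_samples=5).fit(X)

# Core points are listed in core_sample_indices_
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points (outliers) receive the label -1; the remaining
# non-core points that fall inside a cluster are border points
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask

print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())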
Hierarchical Clustering:
Hierarchical Clustering is a clustering algorithm that builds a hierarchy of
clusters. It can be represented as a tree (dendrogram) where each node represents
a cluster. Hierarchical clustering can be divided into two types:
1. Agglomerative (bottom-up):
● Start with each data point as a singleton cluster.
● Merge the closest pairs of clusters until only one cluster remains.
2. Divisive (top-down):
● Start with one cluster containing all data points.
● Split the cluster recursively until each cluster contains only one data point.
● Distance Metric: A measure of dissimilarity between two data points or
clusters. Common metrics include Euclidean distance, Manhattan
distance, and cosine similarity.
● Linkage Criteria: Determines the distance between clusters.
Common linkage criteria include:
○ Single Linkage: Minimum distance between points in two clusters.
○ Complete Linkage: Maximum distance between points in
two clusters.
○ Average Linkage: Average distance between points in two clusters.
○ Ward's Method: Minimizes the variance within each cluster.
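A minimal sketch of agglomerative clustering with Ward linkage on the standardized features, together with the dendrogram used to decide where to cut the tree; cutting into 3 clusters is an illustrative assumption.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Build the full merge tree with Ward linkage (minimizes within-cluster variance)
Z = linkage(X, method="ward")

# Plot the dendrogram to see where the large merges happen
plt.figure(figsize=(10, 4))
dendrogram(Z, no_labels=True)
plt.title("Ward linkage dendrogram")
plt.show()

# Cut the tree into a fixed number of clusters (3 is assumed here)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)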
B) THE DATASET WE HAVE CHOSEN FOR OUR MINI PROJECT:
https://www.kaggle.com/code/faressayah/ensemble-ml-algorithms-bagging-boosting-voting#Ensemble-Machine-Learning-Algorithms-in-Python-with-scikit-learn
2) Dataset Information.
3) Checking for missing values (df.isnull().sum()).
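These two steps can be carried out with pandas as sketched below; the file name dataset.csv is a placeholder for the chosen Kaggle dataset.

import pandas as pd

# Load the chosen dataset (placeholder file name)
df = pd.read_csv("dataset.csv")

# 2) Dataset information: column types, non-null counts, memory usage
df.info()

# 3) Check every column for missing values
print(df.isnull().sum())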
2. Silhouette Score:
● DBSCAN: -0.13552284456117616
● Hierarchical Clustering: -0.0385851544159992
Based on the ARI, Hierarchical Clustering appears to perform better than DBSCAN, which indicates a higher similarity between the clustering results and the ground truth (or between the two clustering results if no ground truth is available). However, both models have low (negative) Silhouette Scores, which suggests that the clusters are not well defined. To summarize, Hierarchical Clustering is preferred over DBSCAN based on the ARI metric, but neither model may be optimal for this dataset, given the low Silhouette Scores.
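For reference, both metrics can be computed with scikit-learn as sketched below, assuming the DBSCAN and hierarchical labels obtained earlier; in the absence of ground-truth labels, the ARI here compares the two clusterings with each other.

from sklearn.metrics import silhouette_score, adjusted_rand_score

# Silhouette Score: how cohesive and well separated the clusters are
print("Silhouette (DBSCAN):      ", silhouette_score(X, db.labels_))
print("Silhouette (Hierarchical):", silhouette_score(X, hc_labels))

# Adjusted Rand Index: agreement between the two label assignments
print("ARI (DBSCAN vs. Hierarchical):", adjusted_rand_score(db.labels_, hc_labels))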