
DMBI

The document outlines a project by HELP International to allocate $10 million in aid to countries in need, using clustering techniques to categorize countries based on socio-economic and health factors. It discusses the use of DBSCAN and Hierarchical Clustering algorithms to analyze the dataset, revealing that Hierarchical Clustering outperforms DBSCAN based on Adjusted Rand Index and Silhouette Score. The findings emphasize the importance of targeted aid programs and collaboration with other organizations to effectively address the needs of high-need countries.

GROUP: Vaishnavi Chaudhary, Simran Dung, Goutam Chandnani

ROLL: 09,15,08

DMBI CA-2
Dataset:
(https://www.kaggle.com/code/tanmay111999/clustering-pca-k-means-dbscan-hierarchical/input)

PROBLEM STATEMENT:
“HELP International is an international humanitarian NGO that is committed to
fighting poverty and providing the people of backward countries with basic
amenities and relief during the time of disasters and natural calamities. HELP
International have been able to raise around $ 10 million. This money now needs
to be allocated strategically and effectively. Hence, in order to decide the selection
of the countries that are in the direst need of aid, data driven decisions are to be
made. Thus, it becomes necessary to categorize the countries using socio-economic
and health factors that determine the overall development of the country. Thus,
based on these clusters of the countries depending on their conditions, funds will
be allocated for assistance during the time of disasters and natural calamities. It is a
clear cut case of unsupervised learning where we have to create clusters of the
countries based on the different features present.”

A) WHICH DATA MINING TASK IS NEEDED IN OUR DATASET:
Based on the dataset provided and the problem statement, the data mining task needed is
clustering. Specifically, the goal is to cluster countries based on their socio-economic and
health factors to identify those in the direst need of aid. This falls under the realm of
unsupervised learning, where the algorithm will group similar countries together without
the need for labeled data. Clustering is a common data mining task used to discover
natural groupings or patterns in data.

Data Pre-processing:
The dataset provided is already pre-processed, i.e. it doesn't have any missing values.
Dataset attributes:
● country : Name of the country

● child_mort : Death of children under 5 years of age per 1000 live births
● exports : Exports of goods and services per capita. Given as %age of the GDP per capita
● health : Total health spending per capita. Given as %age of GDP per capita
● imports : Imports of goods and services per capita. Given as %age of the GDP per capita
● income : Net income per person
● inflation : The measurement of the annual growth rate of the Total GDP
● life_expec : The average number of years a new born child would live if the
current mortality patterns are to rem...
● total_fer : The number of children that would be born to each woman if
the current age-fertility rates remain th...
● gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.

Density-Based Spatial Clustering Of Applications With Noise (DBSCAN)

DBSCAN (Density-Based Spatial Clustering of Applications with
Noise) is a popular clustering algorithm in data mining and machine learning.
Unlike traditional clustering algorithms like K-means, DBSCAN does not require
the user to specify the number of clusters beforehand. Instead, it groups together
closely packed points and identifies points that are in low-density regions as
outliers or noise.

Key Concepts:
● Core Points: A point is considered a core point if it has at least a
specified number of neighboring points (MinPts) within a specified radius
(ε).
● Border Points: A point that is not a core point but is within
the ε-neighborhood of a core point.
● Noise Points (Outliers): Points that are neither core points nor
border points.
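
The three point types above can be illustrated with a minimal sketch using scikit-learn's DBSCAN. The toy coordinates and the values eps=0.3 and min_samples=4 are made up for illustration and are not taken from the project. Note that scikit-learn's min_samples counts the point itself as part of its neighborhood.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data (made up): a tight blob, one point on its fringe, one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense blob
    [0.35, 0.1],                                      # fringe of the blob
    [5.0, 5.0],                                       # far from everything
])

# eps plays the role of the radius, min_samples the role of MinPts.
db = DBSCAN(eps=0.3, min_samples=4).fit(X)
labels = db.labels_

# Core points: indices reported by the fitted model.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points: DBSCAN marks them with the label -1.
noise_mask = labels == -1

# Border points: assigned to a cluster but not dense enough to be core.
border_mask = ~core_mask & ~noise_mask

print("core:  ", np.where(core_mask)[0])
print("border:", np.where(border_mask)[0])
print("noise: ", np.where(noise_mask)[0])
```

In this toy data the fringe point has only three points within the radius (itself included), so it is not a core point, but it lies within the radius of a core point and is therefore attached to the blob's cluster as a border point.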

Hierarchical Clustering:
Hierarchical Clustering is a clustering algorithm that builds a hierarchy of
clusters. It can be represented as a tree (dendrogram) where each node represents
a cluster. Hierarchical clustering can be divided into two types:
1. Agglomerative (bottom-up):
● Start with each data point as a singleton cluster.
● Merge the closest pairs of clusters until only one cluster remains.
2. Divisive (top-down):
● Start with one cluster containing all data points.
● Split the cluster recursively until each cluster contains only one data point.
Key Concepts:
● Distance Metric: A measure of dissimilarity between two data points or
clusters. Common metrics include Euclidean distance, Manhattan
distance, and cosine distance.
● Linkage Criteria: Determines the distance between clusters.
Common linkage criteria include:
○ Single Linkage: Minimum distance between points in two clusters.
○ Complete Linkage: Maximum distance between points in
two clusters.
○ Average Linkage: Average distance between points in two clusters.
○ Ward's Method: Merges the pair of clusters that gives the smallest increase in total within-cluster variance.
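
As a small worked example of the linkage criteria listed above, the sketch below computes the distance between two toy clusters under each criterion. The clusters A and B are made-up points, and Euclidean distance is assumed. For Ward's method the quantity shown is the increase in within-cluster sum of squares that merging A and B would cause.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two toy clusters on a line (made-up values for illustration).
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between points of A and points of B.
D = cdist(A, B)

single = D.min()     # single linkage: closest pair
complete = D.max()   # complete linkage: farthest pair
average = D.mean()   # average linkage: mean over all pairs

# Ward: increase in within-cluster sum of squares if A and B are merged,
# equal to (|A| * |B| / (|A| + |B|)) * squared distance between centroids.
n_a, n_b = len(A), len(B)
centroid_gap = A.mean(axis=0) - B.mean(axis=0)
ward_increase = (n_a * n_b) / (n_a + n_b) * float(centroid_gap @ centroid_gap)

print(single, complete, average, ward_increase)
```

At each agglomerative step, the pair of clusters with the smallest value under the chosen criterion is merged.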
B) THE DATASET WE HAVE CHOSEN FOR OUR MINI PROJECT:
https://www.kaggle.com/code/faressayah/ensemble-ml-algorithms-bagging-boosting-voting#Ensemble-Machine-Learning-Algorithms-in-Python-with-scikit-learn

C) Performing Exploratory Data Analysis (EDA):


1) Loading the dataset.

2) Dataset Information.
3) Checking for missing values (df.isnull().sum()).

4) Displaying descriptive statistics of the dataset (df.describe()).

5) Checking for categorical and continuous variables.


6) Visualizing Feature Distribution
7) Visualizing the distribution and correlation of features with respect
to the outcome using histograms.
Correlation matrix:

8) Standardization and Normalization


9) Visualizing the numerical features using boxplot
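
The non-plotting EDA steps listed above can be sketched roughly as follows. This is a minimal sketch, not the project's actual notebook: the four rows are made-up values standing in for the real CSV (which would be loaded with pd.read_csv), only a few of the dataset's columns are included, and StandardScaler is assumed as the standardization method.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration only; column names follow the dataset attributes.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "child_mort": [90.0, 10.0, 5.0, 60.0],
    "income": [1500, 30000, 45000, 4000],
    "gdpp": [500, 25000, 40000, 1200],
})

df.info()                     # step 2: column types and non-null counts
missing = df.isnull().sum()   # step 3: missing values per column
summary = df.describe()       # step 4: descriptive statistics

# Step 5: split columns into continuous (numeric) and categorical.
numeric = df.select_dtypes(include="number").columns.tolist()
categorical = [c for c in df.columns if c not in numeric]

# Correlation matrix of the numeric features.
corr = df[numeric].corr()

# Step 8: standardize each numeric feature to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric]),
    columns=numeric,
    index=df["country"],
)

print(missing)
print(summary)
print(corr.round(2))
print(scaled.round(2))
```

Standardizing matters here because distance-based clustering would otherwise be dominated by large-valued features such as income and gdpp.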

Summarizing the EDA:


● From the visualizations and the list of features of economically
backward nations, a host of insights can be gained:
● When it comes to health conditions, African countries hold higher ranks in
all the wrong situations. They hold a significant presence in high
child_mort, low life_expec and high total_fer.
● All these problems are already pretty serious and hence it is very
important to assist them during the periods of unforeseen turmoil. Despite
such numbers, Haiti grabs the top spot with high values of child_mort.
Asian & European countries are present at the other end of it.
● US citizens are the highest spenders on their health; however, they are not
present in the top 5 ranks of life_expec & total_fer. None of the
countries with a high life_expec are present in the top 5 of health. Asian
countries crowd the lower end of health.
● Singapore, Malta, Luxembourg & Seychelles are present in the top 5 of
exports as well as imports. Population size and geographical locations play
a pivotal role when it comes to imports and exports.
● Sudan is the only African nation with low imports and Brazil has the
lowest imports out of all.
● African countries display very high values of inflation whereas
countries from all the continents can be found with low inflation values.
● Citizens of Qatar are the highest paid with Singapore & Luxembourg
again grabbing spots in top 5 of income.
● For gdpp, Luxembourg is in the top ranks. Switzerland & Qatar are present
in the top 5, similar to income.
● African nations are present in the lower end of income as well as
gdpp. Colonization has taken a huge toll on the African nations.

E) THE ALGORITHMS WE ARE IMPLEMENTING ARE:


1) Hierarchical Clustering:
Hierarchical clustering is a clustering algorithm that builds a hierarchy of
clusters. It can be either agglomerative (bottom-up) or divisive (top-down).
In agglomerative hierarchical clustering, each data point starts as a singleton
cluster, and pairs of clusters are merged based on a distance metric until
only one cluster remains. Divisive hierarchical clustering starts with one
cluster containing all data points and recursively splits clusters until each
cluster contains only one data point. Hierarchical clustering does not require
specifying the number of clusters in advance and produces a dendrogram
that visualizes the clustering process.

2) DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
DBSCAN is a density-based clustering algorithm that groups together
closely packed points and identifies points that are in low-density regions as
outliers or noise. DBSCAN requires two parameters: epsilon (ε), which
defines the radius within which to search for neighboring points, and
minPts, the minimum number of points required to form a dense region
(core point). Points that are within ε distance of a core point are added to the
same cluster. DBSCAN is effective for clustering data with irregular shapes,
but because ε is a single global value it can struggle when cluster densities
vary widely. It may also struggle with high-dimensional data and requires
careful parameter tuning.
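
One common heuristic for the parameter tuning mentioned above is the k-distance plot: for a chosen minPts, sort every point's distance to its k-th nearest neighbor and pick ε near the "elbow" of that curve. The sketch below is an illustration only and is not necessarily how the project chose its parameters; the toy data and min_pts = 4 are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data (made up): two tight groups plus one far-away outlier.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(30, 2)),
    rng.normal(3.0, 0.1, size=(30, 2)),
    [[10.0, 10.0]],
])

min_pts = 4  # assumed value for illustration

# Querying the training data itself returns each point as its own nearest
# neighbor (distance 0), which matches scikit-learn's convention that
# min_samples counts the point itself.
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dist, _ = nn.kneighbors(X)

# Sorted distance to the farthest of the min_pts neighbors.
k_dist = np.sort(dist[:, -1])

print("largest k-distances:", k_dist[-3:].round(3))
```

Plotting k_dist against its index shows a flat region for points inside dense groups and a sharp rise for outliers. A value of ε just below that rise is a reasonable starting point, to be checked against the resulting clusters.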

IMPLEMENTATION OF DBSCAN AND HIERARCHICAL CLUSTERING
1. Building DBSCAN model:
Visualizing the box plots:
From the above plot we can conclude :
-1 : Noise / Outliers
0 : Might Need Help
1 : No Help Needed
2 : Help Needed
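
Cluster ids produced by DBSCAN are arbitrary integers, so the reading above has to be attached to them explicitly. A minimal sketch of that step is shown below. The label values in the Series are hypothetical and only stand in for the model's real output; the mapping itself is the one given above.

```python
import pandas as pd

# Hypothetical DBSCAN labels for six countries (illustration only).
labels = pd.Series([-1, 0, 1, 2, 0, 2], name="cluster")

# Interpretation of each cluster id, as read from the box plots.
label_names = {
    -1: "Noise / Outliers",
    0: "Might Need Help",
    1: "No Help Needed",
    2: "Help Needed",
}

category = labels.map(label_names)
counts = category.value_counts()

print(category.tolist())
print(counts)
```

Because the ids can change whenever the model is refit with different parameters, the mapping should be re-derived from the cluster profiles each time.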

2. Building Hierarchical Clustering Model :


Dendrograms:

● In this case, we need to divide the countries into 3 categories. That
is why we select 3 clusters directly. Dendrogram analysis for
this dataset is therefore somewhat redundant.
● Here, we can see that 1 blue line along with 2 red lines form
the penultimate clusters before they connect together.
● It has 3 branches, thus indicating the 3 clusters that it creates
before merging into 1.
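
Selecting 3 clusters directly corresponds to cutting the merge tree at the level where exactly 3 branches remain. The sketch below shows this with SciPy on made-up data. Ward linkage is an assumption here, since the linkage criterion used for the project's dendrogram is not stated.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (made up): three well-separated groups of 10 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in (0.0, 5.0, 10.0)])

# Build the full agglomerative merge tree (Ward linkage assumed).
# scipy.cluster.hierarchy.dendrogram(Z) would draw it with matplotlib.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

print("merge steps:", Z.shape[0])
print("cluster ids:", sorted(set(labels)))
```

The last two rows of Z describe the final merges: the three branches visible in the dendrogram are the clusters that exist just before those two merges happen.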
Prediction:

3. Silhouette score for DBSCAN and Hierarchical Clustering:

4. ARI for DBSCAN and Hierarchical Clustering:


TO IDENTIFY FROM BOTH MODELS WHICH PERFORMS THE BEST:
Based on the performance metrics calculated:
1. Adjusted Rand Index (ARI):
● DBSCAN: 0.17219374066520834
● Hierarchical Clustering: 1.0

2. Silhouette Score:
● DBSCAN: -0.13552284456117616
● Hierarchical Clustering: -0.0385851544159992
Hierarchical Clustering scores higher than DBSCAN on both metrics. Its ARI of 1.0
means its labels are identical to the reference labeling used for the comparison.
Since the dataset has no ground-truth labels, this suggests that the hierarchical
labels (or labels derived from them) served as the reference, so the ARI mainly
measures how far DBSCAN's partition departs from the hierarchical one rather than
accuracy against a known truth. Both models have negative Silhouette Scores, which
means many points lie closer to a neighbouring cluster than to their own, so the
clusters are not well-defined. To summarize, Hierarchical Clustering is preferred
over DBSCAN, but neither model may be optimal for this dataset given the negative
Silhouette Scores.
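
For reference, both metrics can be computed with scikit-learn as in the minimal sketch below. The points and the two labelings are made up for illustration and do not reproduce the project's numbers. The sketch also shows that comparing a labeling with itself always gives an ARI of 1.0.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy data (made up): two compact, well-separated groups.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

labels_a = np.array([0, 0, 0, 1, 1, 1])  # matches the visible groups
labels_b = np.array([0, 0, 1, 1, 1, 1])  # one point placed in the wrong group

# Silhouette needs only the data and one labeling; it ranges from -1 to 1.
sil_a = silhouette_score(X, labels_a)
sil_b = silhouette_score(X, labels_b)

# ARI compares two labelings; it equals 1.0 only when they are identical
# up to a renaming of the cluster ids.
ari_self = adjusted_rand_score(labels_a, labels_a)
ari_ab = adjusted_rand_score(labels_a, labels_b)

print(f"silhouette a={sil_a:.3f}, b={sil_b:.3f}")
print(f"ARI self={ari_self:.3f}, a vs b={ari_ab:.3f}")
```

Silhouette is the only one of the two that evaluates a clustering without any reference labels, which makes it the more informative metric for an unlabeled dataset.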

Business Intelligence (BI) Decision


Based on the clustering results:

● Focus on High-Need Countries: Allocate a significant portion of the aid
budget to countries identified as high-need based on the clustering
analysis. These countries have higher child mortality rates, lower incomes,
and other indicators of economic and social challenges.
● Targeted Aid Programs: Develop targeted aid programs for countries in
each cluster, focusing on specific areas such as healthcare, education,
infrastructure, and economic development. Tailoring programs to the
needs of each cluster can maximize the impact of aid.
● Partnerships and Collaboration: Collaborate with other organizations,
governments, and NGOs to address the needs of countries in each cluster
more effectively. Partnerships can help leverage resources and expertise
to make a greater impact.
