DMBI
ROLL: 09,15,08
DMBI CA-2
Dataset:
(https://www.kaggle.com/code/tanmay111999/clustering-pca-k-means-dbscan-hierarchical/input)
PROBLEM STATEMENT:
“HELP International is an international humanitarian NGO that is committed to
fighting poverty and providing the people of backward countries with basic
amenities and relief during times of disasters and natural calamities. HELP
International has been able to raise around $10 million. This money now needs
to be allocated strategically and effectively. Hence, in order to select the
countries that are in the direst need of aid, data-driven decisions are to be
made. It therefore becomes necessary to categorize the countries using the
socio-economic and health factors that determine the overall development of a
country. Based on these clusters of countries and their conditions, funds will
be allocated for assistance during times of disasters and natural calamities.
It is a clear-cut case of unsupervised learning, where we have to create
clusters of the countries based on the different features present.”
Data Pre-processing:
The dataset provided is already pre-processed, i.e., it does not have any missing values.
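As a quick sanity check, the absence of missing values can be verified with pandas, and the numeric features can be standardized before clustering. The sketch below is illustrative; the file name Country-data.csv and the scaling step are assumptions, not part of the original write-up.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the country dataset (file name assumed from the Kaggle listing)
df = pd.read_csv("Country-data.csv")

# Confirm that no column contains missing values
print(df.isnull().sum())

# Standardize the numeric features so that distance-based clustering
# is not dominated by large-scale columns such as income and gdpp
X = StandardScaler().fit_transform(df.drop(columns=["country"]))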
Dataset attributes:
● country : Name of the country
● child_mort : Death of children under 5 years of age per 1000 live births
● exports : Exports of goods and services per capita, given as a percentage of the GDP per capita
● health : Total health spending per capita, given as a percentage of the GDP per capita
● imports : Imports of goods and services per capita, given as a percentage of the GDP per capita
● income : Net income per person
● inflation : The measurement of the annual growth rate of the total GDP
● life_expec : The average number of years a newborn child would live if the current mortality patterns remain the same
● total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same
● gdpp : The GDP per capita, calculated as the total GDP divided by the total population
Key Concepts of DBSCAN:
● Core Points: A point is considered a core point if it has at least a
specified number of neighboring points (MinPts) within a specified radius
(ε).
● Border Points: A point that is not a core point but is within
the ε-neighborhood of a core point.
● Noise Points (Outliers): Points that are neither core points nor
border points.
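The following sketch shows how these point roles can be inspected with scikit-learn's DBSCAN; the eps and min_samples values are illustrative assumptions rather than tuned parameters from the project.

import numpy as np
from sklearn.cluster import DBSCAN

# X is the standardized feature matrix from the pre-processing step
db = DBSCAN(eps=1.5, min_samples=5).fit(X)

# Core points are listed in core_sample_indices_
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points (outliers) receive the label -1; the remaining
# non-core points that fall inside a cluster are border points
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask

print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())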
Hierarchical Clustering:
Hierarchical Clustering is a clustering algorithm that builds a hierarchy of
clusters. It can be represented as a tree (dendrogram) where each node represents
a cluster. Hierarchical clustering can be divided into two types:
1. Agglomerative (bottom-up):
● Start with each data point as a singleton cluster.
● Merge the closest pairs of clusters until only one cluster remains.
2. Divisive (top-down):
● Start with one cluster containing all data points.
● Split the cluster recursively until each cluster contains only one data point.
● Distance Metric: A measure of dissimilarity between two data points or
clusters. Common metrics include Euclidean distance, Manhattan
distance, and cosine similarity.
● Linkage Criteria: Determines the distance between clusters.
Common linkage criteria include:
○ Single Linkage: Minimum distance between points in two clusters.
○ Complete Linkage: Maximum distance between points in
two clusters.
○ Average Linkage: Average distance between points in two clusters.
○ Ward's Method: Minimizes the variance within each cluster.
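A minimal sketch of agglomerative clustering with Ward linkage on the standardized features, together with the dendrogram used to decide where to cut the tree; cutting into 3 clusters is an illustrative assumption.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Build the full merge tree with Ward linkage (minimizes within-cluster variance)
Z = linkage(X, method="ward")

# Plot the dendrogram to see where the large merges happen
plt.figure(figsize=(10, 4))
dendrogram(Z, no_labels=True)
plt.title("Ward linkage dendrogram")
plt.show()

# Cut the tree into a fixed number of clusters (3 is assumed here)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)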
B) THE DATASET WE HAVE CHOSEN FOR OUR MINI PROJECT:
https://www.kaggle.com/code/faressayah/ensemble-ml-algorithms-bagging-boosting-voting#Ensemble-Machine-Learning-Algorithms-in-Python-with-scikit-learn
2) Dataset Information.
3) Checking for missing values (df.isnull().sum()).
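These two steps can be carried out with pandas as sketched below; the file name dataset.csv is a placeholder for the chosen Kaggle dataset.

import pandas as pd

# Load the chosen dataset (placeholder file name)
df = pd.read_csv("dataset.csv")

# 2) Dataset information: column types, non-null counts, memory usage
df.info()

# 3) Check every column for missing values
print(df.isnull().sum())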
2. Silhouette Score:
● DBSCAN: -0.13552284456117616
● Hierarchical Clustering: -0.0385851544159992
Based on the ARI, Hierarchical Clustering appears to perform better than DBSCAN, which indicates a higher similarity between the clustering results and the ground truth (or between the two clustering results if no ground truth is available). However, both models have low (negative) Silhouette Scores, which suggests that the clusters are not well defined. To summarize, Hierarchical Clustering is preferred over DBSCAN based on the ARI metric, but neither model may be optimal for this dataset, given the low Silhouette Scores.
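For reference, both metrics can be computed with scikit-learn as sketched below, assuming the DBSCAN and hierarchical labels obtained earlier; in the absence of ground-truth labels, the ARI here compares the two clusterings with each other.

from sklearn.metrics import silhouette_score, adjusted_rand_score

# Silhouette Score: how cohesive and well separated the clusters are
print("Silhouette (DBSCAN):      ", silhouette_score(X, db.labels_))
print("Silhouette (Hierarchical):", silhouette_score(X, hc_labels))

# Adjusted Rand Index: agreement between the two label assignments
print("ARI (DBSCAN vs. Hierarchical):", adjusted_rand_score(db.labels_, hc_labels))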