Data Mining Assignment 1: Group 3: Ankita (BLP008), Arnab (BLP013), Kaustav (BLP025), Pubali (BLP040)
Objective
The company wanted to expand its insurance business into new countries. To do so, it collected socio-economic data on
168 countries around the world from different organisations under the UNO and shortlisted nine variables (deciding
factors) on the basis of which it would be able to design its insurance plans. The company then wanted to group the
countries that were similar in terms of these social and economic parameters, identify the countries that were
similar to Germany, and from them determine a list of target markets.
Methodology
• First, the reliability of the data was checked using Cronbach's alpha.
• Then the number of predictors was reduced using dimension reduction methods (PCA and EFA).
• After this, the observations (countries) were grouped using the k-means clustering algorithm.
Variables identified
Column Name   Description
Country       Name of the country
child_mort    Death of children under 5 years of age per 1000 live births
exports       Exports of goods and services per capita. Given as %age of the GDP per capita
health        Total health spending per capita. Given as %age of GDP per capita
imports       Imports of goods and services per capita. Given as %age of the GDP per capita
Income        Net income per person
Inflation     The measurement of the annual growth rate of the Total GDP
life_expec    The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer     The number of children that would be born to each woman if the current age-fertility rates remain the same
gdpp          The GDP per capita. Calculated as the Total GDP divided by the total population
Procedure and Analysis
R output
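The alpha value is 0.55. This is below the commonly used threshold of 0.70, so the data can be considered only
moderately reliable; the analysis proceeds on that basis.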
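Cronbach's alpha can be computed directly from the item variances and the variance of the row totals. The sketch below is only an illustration of that formula: `cronbach_alpha` is a hypothetical helper (the report does not show which function produced its value), and three standardised columns of the built-in mtcars data stand in for the country data.

```r
# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals).
# `cronbach_alpha` is a hypothetical helper, not the function used in the report.
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                      # number of items (variables)
  item_vars <- apply(items, 2, var)     # variance of each item
  total_var <- var(rowSums(items))      # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Stand-in data (assumption): three standardised mtcars columns.
demo_items <- scale(mtcars[, c("disp", "hp", "wt")])
alpha_demo <- cronbach_alpha(demo_items)
alpha_demo
```

Standardising first matters when the variables are on very different scales, as the socio-economic indicators here are.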
library(REdaS)  # provides KMOS() and bart_spher()
KMOS(cd2)
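The overall KMO value is about 0.685. KMO measures sampling adequacy rather than the share of variability explained;
a value above the commonly used minimum of 0.5 indicates that the data are acceptable for factor analysis.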
bart_spher(cd2)
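The p-value here is less than 0.05, i.e. the 5% level of significance. So we reject H0, which states that the
correlation matrix is an identity matrix. The variables are therefore correlated, and dimension reduction is appropriate.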
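To make the test concrete, Bartlett's statistic can be computed by hand from the determinant of the correlation matrix. This is a minimal sketch under stated assumptions: `bartlett_sphericity` is a hypothetical helper written for illustration (not the bart_spher() implementation), and four mtcars columns stand in for cd2.

```r
# Bartlett's test of sphericity, computed by hand.
# H0: the correlation matrix is an identity matrix (variables uncorrelated).
# `bartlett_sphericity` is a hypothetical helper for illustration only.
bartlett_sphericity <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)                          # number of observations
  p <- ncol(x)                          # number of variables
  R <- cor(x)                           # sample correlation matrix
  chi_sq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  p_value <- pchisq(chi_sq, df, lower.tail = FALSE)
  list(statistic = chi_sq, df = df, p.value = p_value)
}

# Stand-in data (assumption): four strongly correlated mtcars columns.
demo_test <- bartlett_sphericity(mtcars[, c("mpg", "disp", "hp", "wt")])
demo_test
```

A small p-value means the correlation matrix differs from the identity matrix, which is the condition PCA and EFA rely on.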
pc<-princomp(cd2, cor = TRUE)
summary(pc)
Here the standard deviations of components 1 and 2 are 1.98 and 1.22, so their eigenvalues (the squared standard
deviations) are both greater than 1. According to the rule of thumb, these two components should therefore be retained.
Also, from the proportion of variance we can see that components 1 and 2 explain 56.4% and 21.5% of the variability of
the data, i.e. 77.9% together.
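The link between the standard deviations printed by summary(pc) and the eigenvalues can be checked in a few lines. The sketch below squares the two standard deviations quoted above, then repeats the steps on the built-in USArrests data, which is only a stand-in for cd2; the names `pc_demo`, `eigenvalues`, `prop_var` and `retained` are illustrative.

```r
# Eigenvalues are the squared component standard deviations.
reported_sdev <- c(1.98, 1.22)          # values quoted in the report
reported_eigen <- reported_sdev^2       # roughly 3.92 and 1.49, both above 1

# Same steps on stand-in data (assumption: USArrests replaces cd2).
pc_demo <- princomp(USArrests, cor = TRUE)
eigenvalues <- pc_demo$sdev^2                     # one eigenvalue per component
prop_var <- eigenvalues / sum(eigenvalues)        # proportion of variance explained
retained <- names(eigenvalues)[eigenvalues > 1]   # rule of thumb: keep eigenvalue > 1
retained
```

With a correlation-based PCA the eigenvalues sum to the number of variables, so an eigenvalue above 1 means the component explains more than a single original variable would.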
loadings(pc)
fa<-factanal(cd2, factors = 2, rotation = "varimax", scores = "regression")
fa
Here we create a data frame, i, consisting of the country name, health, inflation and the 2 factor scores.
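The report does not show the code for this step, so the following is only a sketch of how such a data frame could be assembled from the regression scores returned by factanal(). Everything in it is an assumption for illustration: the data are simulated, and the names `indicators`, `fa_demo` and `i_demo` are hypothetical.

```r
# Simulated stand-in data (hypothetical): six indicators driven by two latent factors.
set.seed(1)
n <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
indicators <- data.frame(
  a1 = 0.9 * f1 + rnorm(n, sd = 0.4),
  a2 = 0.8 * f1 + rnorm(n, sd = 0.5),
  a3 = 0.7 * f1 + rnorm(n, sd = 0.6),
  b1 = 0.9 * f2 + rnorm(n, sd = 0.4),
  b2 = 0.8 * f2 + rnorm(n, sd = 0.5),
  b3 = 0.7 * f2 + rnorm(n, sd = 0.6)
)
fa_demo <- factanal(indicators, factors = 2, rotation = "varimax",
                    scores = "regression")

# Combine an identifier, two variables kept outside the factor model,
# and the two factor score columns (named Factor1 and Factor2 by factanal).
i_demo <- data.frame(
  country   = paste("Country", seq_len(n)),
  health    = runif(n, 2, 12),
  inflation = rnorm(n, 5, 3),
  fa_demo$scores,
  stringsAsFactors = FALSE
)
head(i_demo)
```

Keeping the identifier in the first column is what allows it to be dropped by position before scaling.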
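# Drop the country column so that only numeric variables remain
i1<-i[,-c(1)]
# Standardise the variables so that each contributes equally to the distances
c<-scale(i1)
set.seed(24)
# k-means clustering with 4 clusters
result<-kmeans(c, 4)
result$centers
result$size
# Cross-tabulate the countries against their assigned cluster
table(cd3$country, result$cluster)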
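Instead of reading the cross-table by eye, the cluster of a given country and its peers can be looked up from the cluster vector. The sketch below uses a tiny made-up data set with placeholder country labels (all values are hypothetical), and it adds nstart = 10, which the original call does not use, so that the toy result does not depend on the random starting centres.

```r
# Toy stand-in data (hypothetical values and labels, not the report's data).
demo <- data.frame(
  country = c("Target", "Peer 1", "Peer 2", "Other 1", "Other 2", "Other 3"),
  health  = c(11.5, 11.0, 10.8, 4.0, 4.5, 3.8),
  income  = c(40000, 42000, 39000, 3000, 2500, 3500),
  stringsAsFactors = FALSE
)

scaled <- scale(demo[, -1])             # standardise the numeric columns
set.seed(24)
km <- kmeans(scaled, centers = 2, nstart = 10)

# Cluster label of the reference country, then every other country sharing it.
target_cluster <- unname(km$cluster[demo$country == "Target"])
peers <- demo$country[km$cluster == target_cluster & demo$country != "Target"]
peers
```

Applied to the report's objects, the same lookup would use result$cluster and cd3$country with "Germany" as the reference country.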
Cluster Analysis
Strategies
• From the analysis, we found that Germany is in cluster 2.
• There are 28 countries in cluster 2, spread over all continents.
• Our initial strategy will be to expand into the Western European countries in cluster 2, such as Austria, Belgium,
Denmark, Finland and France, for ease of operations.
• The policies to be introduced should focus on providing premium health-sector facilities to the beneficiaries and
on charging a higher amount for them in order to derive more profit. Since the economies of these countries are
strong, customers there can afford the amount charged.