
DATA MINING

PROJECT

- Ayesha Farhat

INDEX

CONTENTS
1. Clustering
   1.1 Reading the data
   1.2 Treating missing values
   1.3 Checking for outliers
   1.4 Z-score scaling
   1.5 Dendrogram
   1.6 Elbow plot
   1.7 Silhouette scores
   1.8 Profiling the ads
   1.9 Summary
2. PCA
   2.1 Reading the data
   2.2 Exploratory analysis
   2.3 Outliers
   2.4 Z-score scaling
   2.5 Steps for PCA
   2.6 Scree plot
   2.7 Comparing PCs
   2.8 Linear equation

Clustering:

Digital Ads Data:
ads24x7 is a digital marketing company that has recently received seed funding of $10 million. The company is expanding into marketing analytics. It has collected data from its Marketing Intelligence team and now wants you (its newly appointed data analyst) to segment the ads based on the features provided. Use a clustering procedure to segment the ads into homogeneous groups.
The following three features are commonly used in digital marketing:
CPM = (Total Campaign Spend / Number of Impressions) * 1,000
CPC = Total Cost (Spend) / Number of Clicks
CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100

1.1 Clustering: Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary, null values, duplicate values, etc.
Answer:

Loading and viewing the dataset:

Viewing the top 5 rows:

Tab: 1.1

Viewing the last 5 rows:

Tab: 1.2

Checking the shape of the dataset:

The dataset has 25,857 rows and 19 columns.

Tab: 1.3

Tab: 1.4

There are no duplicate rows in the data.

Viewing the info of the data:
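A minimal sketch of the basic checks described above, in Python with pandas (the file name is an assumption):

import pandas as pd

# Load the ads data (file name is an assumption)
df = pd.read_csv("Clustering_Clean_ads_data.csv")

# Basic inspection
print(df.head())          # top 5 rows
print(df.tail())          # last 5 rows
print(df.shape)           # (25857, 19)
df.info()                 # data types and non-null counts
print(df.describe())      # summary statistics

# Null values and duplicates
print(df.isnull().sum())
print(df.duplicated().sum())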

1.2 - Clustering: Treat missing values in CPC, CTR and CPM using the formulae given.

The missing values in CPC, CTR and CPM are treated by writing a user-defined function based on the formulae below and calling it:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000

CPC = Total Cost (Spend) / Number of Clicks

CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100
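A sketch of this imputation step; the raw column names (Clicks, Spend, Impressions) are assumptions, and pandas is assumed to be imported as pd:

# Fill missing CPM, CPC and CTR from the raw campaign columns using the given formulae
def fill_missing(row):
    if pd.isnull(row["CPM"]):
        row["CPM"] = (row["Spend"] / row["Impressions"]) * 1000
    if pd.isnull(row["CPC"]):
        # guard against zero clicks; how to handle that case is a modelling choice
        row["CPC"] = row["Spend"] / row["Clicks"] if row["Clicks"] != 0 else 0
    if pd.isnull(row["CTR"]):
        row["CTR"] = (row["Clicks"] / row["Impressions"]) * 100
    return row

df = df.apply(fill_missing, axis=1)
print(df[["CPM", "CPC", "CTR"]].isnull().sum())   # should now all be zero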

The dataset also has columns such as timestamp and inventory type, which are not very useful for clustering. In addition, CTR, CPM and CPC are dependent (derived) variables, so these columns are dropped.

1.3 - Clustering: Check if there are any outliers. Do you think treating outliers is necessary for K-Means clustering? Based on your judgement, decide whether to treat outliers and if yes, which method to employ. (As an analyst, your judgement may be different from another analyst's.)

Fig: 1.2

Fig: 1.3

From the plots above, we can see that there are outliers in the variables.
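A minimal sketch of the outlier check, using boxplots and the 1.5 * IQR rule on the numeric columns (the exact column selection is an assumption):

import matplotlib.pyplot as plt

# Boxplots for a visual outlier check
num_cols = df.select_dtypes(include="number").columns
df[num_cols].boxplot(figsize=(15, 6), rot=90)
plt.show()

# Count outliers per column using the 1.5 * IQR rule
q1 = df[num_cols].quantile(0.25)
q3 = df[num_cols].quantile(0.75)
iqr = q3 - q1
outliers = ((df[num_cols] < (q1 - 1.5 * iqr)) | (df[num_cols] > (q3 + 1.5 * iqr))).sum()
print(outliers)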

1.4 - Clustering: Perform z-score scaling and discuss how it affects the speed of the
algorithm.

Dropping some columns and checking the top 5 rows:

Tab: 1.6

Tab: 1.7

Performing z-score scaling brings all the variables onto a comparable scale, so no single variable dominates the distance computation and the K-Means algorithm converges faster.
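A sketch of the column dropping and z-score scaling; the names of the dropped columns are assumptions, and scipy's zscore is used here:

from scipy.stats import zscore

# Drop columns that are not useful for clustering (names are assumptions)
cluster_df = df.drop(columns=["Timestamp", "InventoryType", "CTR", "CPM", "CPC"], errors="ignore")

# Z-score scale the remaining numeric features
scaled_df = cluster_df.select_dtypes(include="number").apply(zscore)
print(scaled_df.head())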

1.5 - Clustering: Perform hierarchical clustering by constructing a dendrogram using Ward linkage and Euclidean distance.

Constructing the dendrogram by calling the dendrogram function:

Fig: 1.4

Viewing the last 10 merged clusters using truncation (p = 10), we get:
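A sketch of this hierarchical clustering step with scipy, assuming scaled_df holds the scaled features from step 1.4:

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Ward linkage with Euclidean distance on the scaled data
wardlink = linkage(scaled_df, method="ward", metric="euclidean")

# Full dendrogram
dendrogram(wardlink)
plt.show()

# Truncated dendrogram showing only the last 10 merges (p = 10)
dendrogram(wardlink, truncate_mode="lastp", p=10)
plt.show()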

The scaled dataframe is now stored as an array.

Tab: 1.9

WSS (within-cluster sum of squares) values:

1.6 - Clustering: Make Elbow plot (up to n=10) and identify optimum number of clusters
for k-means algorithm.

Fig: 1.6

When we move from k = 1 to k = 2 there is a significant drop in WSS, and the drops from k = 2 to k = 3 and from k = 3 to k = 4 are also significant. However, from k = 4 to k = 5 and from k = 5 to k = 6 the drop becomes much smaller. In other words, the WSS does not drop significantly beyond k = 4, so 4 is the optimal number of clusters.
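A minimal sketch of the elbow plot computation, assuming scaled_df holds the scaled features:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# WSS (inertia) for k = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WSS (inertia)")
plt.title("Elbow plot")
plt.show()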
1.7 - Clustering: Print silhouette scores for up to 10 clusters and identify optimum
number of clusters.

Two functions are used here: silhouette_samples and silhouette_score.

The silhouette_score function computes the average of all the silhouette widths, while the silhouette_samples function computes the silhouette width for each individual row.

Tab: 1.10

silhouette_score:

Since the silhouette score is about 0.5, we can conclude that the clusters are reasonably well separated.

The 4 clusters that are created have a silhouette score of 0.50.

Tab: 1.11
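A sketch of the silhouette computation with sklearn, assuming scaled_df holds the scaled features:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Average silhouette width for k = 2..10 (the silhouette is undefined for k = 1)
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    print(k, silhouette_score(scaled_df, labels))

# Per-row silhouette widths for the chosen k = 4 solution
labels4 = KMeans(n_clusters=4, random_state=1).fit_predict(scaled_df)
sil_widths = silhouette_samples(scaled_df, labels4)
print(sil_widths.min())   # a negative minimum would indicate misassigned rows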

1.8 - Clustering: Profile the ads based on optimum number of clusters using silhouette
score and your domain understanding [Hint: Group the data by clusters and take sum or
mean to identify trends in Clicks, spend, revenue, CPM, CTR, & CPC based on Device
Type. Make bar plots].
Cluster Profiling:

Tab: 1.12
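A sketch of the cluster profiling, assuming labels4 holds the k = 4 cluster labels from the previous step and that the metric columns are named as below (the names are assumptions):

import matplotlib.pyplot as plt

# Attach the cluster labels to the original data
df["Cluster"] = labels4

# Profile clusters: mean of the key metrics per cluster
profile = df.groupby("Cluster")[["Clicks", "Spend", "Revenue", "CPM", "CTR", "CPC"]].mean()
print(profile)

# Example bar plot: mean CPM by cluster
profile["CPM"].plot(kind="bar", title="Mean CPM by cluster")
plt.show()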

1.9 - Clustering: Conclude the project by providing summary of your learnings.

- The dataset has 25,857 rows and 19 columns.
- The missing values in CPC, CTR and CPM are treated by writing a user-defined function based on the given formulae and calling it.
- Checking for outliers shows that there are outliers in the variables.
- The dendrogram is the visualization, and linkage computes the distances and merges the clusters from n down to 1; the output of linkage is visualized with the dendrogram.
- Linkage is created using Ward's method by running the linkage function on the usable columns of the data. The linkage stores the distances at which the n clusters are sequentially merged into a single cluster.
- Using the fit_transform function, the scaled dataframe is stored as an array, on which K-Means can then be performed.
- The one requirement before running the K-Means algorithm is to know how many clusters are required as output.
- The elbow plot is made using the WSS values. From the plot: moving from k = 1 to k = 2 gives a significant drop in WSS, as do the moves from k = 2 to k = 3 and from k = 3 to k = 4; from k = 4 to k = 5 and from k = 5 to k = 6 the drop becomes much smaller.
- In other words, the WSS does not drop significantly beyond k = 4, so 4 is the optimal number of clusters.

Part 2
PCA:

PCA FH (FT): Primary Census Abstract for female-headed households excluding institutional households (India & States/UTs - District Level), Scheduled Tribes - 2011 PCA for Female Headed Household Excluding Institutional Household. The Indian Census has the reputation of being one of the best in the world. The first Census in India was conducted in the year 1872. It was conducted at different points of time in different parts of the country. In 1881 a Census was taken for the entire country simultaneously. Since then, the Census has been conducted every ten years, without a break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh after independence and the second census of the third millennium and twenty-first century. The census has continued uninterrupted despite several adversities like wars, epidemics, natural calamities, political unrest, etc. The Census of India is conducted under the provisions of the Census Act 1948 and the Census Rules, 1990. The Primary Census Abstract, which is an important publication of the 2011 Census, gives basic information on Area, Total Number of Households, Total Population, Scheduled Castes and Scheduled Tribes Population, Population in the age group 0-6, Literates, Main Workers and Marginal Workers classified by the four broad industrial categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other Workers, and also Non-Workers. The characteristics of the Total Population include Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population, presented by sex and rural-urban residence. Census 2011 covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 towns and 6,40,867 villages.
The data collected has so many variables that it is difficult to find useful details without using data science techniques. You are tasked to perform detailed EDA and identify the optimum principal components that explain the most variance in the data. Use sklearn only.

2.1 PCA: Read the data and perform basic checks like checking head, info, summary,
nulls, and duplicates, etc.

Loading and reading the dataset.

Checking the top 5 rows using the head function.

Tab: 2.1

Checking the shape of the dataset:

Fig: 2.2

There are 640 rows and 61 columns.

Checking the appropriateness of the datatypes – non-null count, index range and data type of the dataset:

Fig: 2.3

We see there are 640 rows and 61 data columns.

Fig: 2.

59 of the 61 columns are of integer data type and 2 columns are of categorical (object) data type. There are no null values.

Checking for duplicate values.

Fig: 2.

2.2 PCA: Perform detailed exploratory analysis by creating certain questions like (i) Which state has the highest gender ratio and which has the lowest? (ii) Which district has the highest & lowest gender ratio? (Example questions.) Pick 5 variables out of the given 24 variables.
Answer:

Which state has the highest population?

Fig: 2.1

Which state has the highest total female population?

Fig: 2.2

Which state has the highest total male population?

Fig: 2.3
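A sketch of such EDA questions in pandas, assuming the census data is loaded into a dataframe census_df with a state-name column called State and the male/female totals in TOT_M and TOT_F:

# Aggregate districts up to state level
state_totals = census_df.groupby("State")[["TOT_M", "TOT_F"]].sum()

# State with the highest total population
state_totals["TOT_P"] = state_totals["TOT_M"] + state_totals["TOT_F"]
print(state_totals["TOT_P"].idxmax())

# States with the highest total female and male population
print(state_totals["TOT_F"].idxmax())
print(state_totals["TOT_M"].idxmax())

# Gender ratio (females per 1000 males): highest and lowest states
state_totals["gender_ratio"] = state_totals["TOT_F"] / state_totals["TOT_M"] * 1000
print(state_totals["gender_ratio"].idxmax(), state_totals["gender_ratio"].idxmin())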

For EDA, the variables considered are:

No_HH       - Number of Households
TOT_M       - Total Population Male
TOT_F       - Total Population Female
TOT_WORK_M  - Total Worker Population Male
TOT_WORK_F  - Total Worker Population Female

Univariate Analysis:
Plotting histograms and boxplots for the above variables:

Fig: 2.4

Bivariate Analysis:

Fig: 2.5

2.3 PCA: We choose not to treat outliers for this case. Do you think that treating
outliers for this case is necessary?

2.4 PCA: Scale the Data using z-score method. Does scaling have any impact on
outliers? Compare boxplots before and after scaling and comment.

Answer:

After dropping a few columns, this is how the dataset looks:

Tab: 2.2

We have 57 features.

Checking for the presence of outliers in each feature:

Plotting boxplots before scaling, using the data that contains only numerical columns:

Fig: 2.6

Scaling the dataset using the z-score method and checking the top 5 rows of the scaled dataset:

Tab: 2.3

The data has been scaled.

Checking for outliers in the scaled data:

Fig: 2.7

Fig: 2.8

Comparing the boxplots before (Fig: 2.6) and after (Fig: 2.7, 2.8) scaling, the outliers are still present: z-score scaling only changes the scale of each variable, not the relative positions of the observations, so scaling does not remove outliers.
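A sketch of the before/after comparison, assuming the census data is in a dataframe census_df:

from scipy.stats import zscore
import matplotlib.pyplot as plt

# Numeric features only
num_df = census_df.select_dtypes(include="number")

# Boxplots before scaling
num_df.boxplot(figsize=(18, 6), rot=90)
plt.title("Before scaling")
plt.show()

# Z-score scaling, then boxplots after scaling
scaled_census = num_df.apply(zscore)
scaled_census.boxplot(figsize=(18, 6), rot=90)
plt.title("After scaling")
plt.show()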

2.5 PCA: Perform all the required steps for PCA (use sklearn only): create the covariance matrix, get the eigenvalues and eigenvectors.

Answer:
Extracting the eigenvectors and looking at the PCA components:

Tab: 2.4

Tab: 2.5

Explained variance = (eigenvalue of each PC) / (sum of the eigenvalues of all PCs)

Checking the explained variance for each PC:

Tab: 2.6

Organizing the above explained variance in a dataframe:

Tab: 2.7
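A sketch of the PCA steps with numpy and sklearn, assuming scaled_census holds the 57 scaled features from step 2.4:

import numpy as np
from sklearn.decomposition import PCA

# Covariance matrix of the scaled data
cov_matrix = np.cov(scaled_census.T)

# Eigenvalues and eigenvectors from the covariance matrix
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

# PCA with sklearn on the scaled data
pca = PCA(n_components=57, random_state=1)
pca.fit(scaled_census)

print(pca.components_)                # eigenvectors (loadings)
print(pca.explained_variance_)        # eigenvalues
print(pca.explained_variance_ratio_)  # eigenvalue of each PC / sum of all eigenvalues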

2.6 PCA: Identify the optimum number of PCs (for this project, take at least 90%
explained variance). Show Scree plot.

Fig: 2.9

Tab: 2.8
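A sketch of the scree plot and of reading off the number of PCs needed for at least 90% explained variance, using the pca object fitted above:

import numpy as np
import matplotlib.pyplot as plt

# Scree plot
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.show()

# Number of PCs needed to reach at least 90% cumulative explained variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_pcs = int(np.argmax(cum_var >= 0.90)) + 1
print(n_pcs)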

2.7 PCA: Compare PCs with Actual Columns and identify which is explaining most
variance. Write inferences about all the Principal components in terms of actual
variables.

Fig: 2.10

Fig: 2.10
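A sketch of comparing the PCs with the actual columns via their loadings; seaborn is used here for the heatmap (an assumption, any plotting approach works), with pca, n_pcs and scaled_census from the previous steps:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Loadings of the selected PCs against the original columns
loadings = pd.DataFrame(pca.components_[:n_pcs],
                        columns=scaled_census.columns,
                        index=[f"PC{i + 1}" for i in range(n_pcs)])

# The heatmap shows which original variables drive each PC
sns.heatmap(loadings, cmap="coolwarm")
plt.show()

# For each PC, the original variable with the largest absolute loading
print(loadings.abs().idxmax(axis=1))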

2.8 PCA: Write linear equation for first PC.

PC1 = a1*x1 + a2*x2 + a3*x3 + a4*x4 + ... + a57*x57

where a1, ..., a57 are the loadings of the first principal component (the components of the first eigenvector) and x1, ..., x57 are the 57 scaled features.
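A sketch of printing the fitted coefficients of the first PC as a linear equation, using the pca object and scaled_census from the earlier steps:

# Write out PC1 as a linear combination of the scaled features
pc1 = pca.components_[0]
terms = [f"({coef:.3f} * {col})" for coef, col in zip(pc1, scaled_census.columns)]
print("PC1 = " + " + ".join(terms))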

