
Machine Learning – 1

Project

CLUSTERING INDEX
S.No Description Page
Part 1: Clustering: Define the problem and perform Exploratory Data
1 Analysis 6
Part 1: 1.1 Checking the number of rows and columns present in the given
2 data frame 9
3 Part 1: 1.2 Checking the first five rows in the given dataframe 9
4 Part 1: 1.3 Checking the last five rows in the given data frame 10
5 Part 1: 1.4 Checking the data types of the columns for the dataset 10
6 Part 1: 1.5 Statistical summary of the dataset 11
7 Part 1: 1.6 Univariate Analysis 13
8 Part 1: 1.7 Bivariate Analysis 39
Part 1: 1.8 Key meaningful observations on individual variables and the
9 relationship between variables 44
10 Part 1: Clustering: Data Preprocessing 46
11 Part 1: 2.2 Outlier Treatment 49
12 Part 1: 2.3 z-score scaling 52
13 Part 1: Clustering: Hierarchical Clustering 54
Part 1: 3.1 Construct a dendrogram using Ward linkage and Euclidean
14 distance 54
15 Part 1: Clustering: K-means Clustering 56
16 Part 1: 4.1 Apply K-means Clustering 56
17 Part 1: 4.2 Plot the Elbow curve 57
18 Part 1: 4.3 Check Silhouette Scores 58
19 Part 1: 4.5 Cluster Profiling 61
20 Part 1: Clustering: Actionable Insights and Recommendations 67
Part 1: 5.1 Extract meaningful insights (at least 3) from the clusters to
identify the most effective types of ads, target audiences, or marketing
21 strategies that can be inferred from each segment. 68

TABLES
S.No Description Page
1 Table 1: First five Rows 9
2 Table 2: Last Five rows 10
3 Table 3: Data Types of the dataset 10
4 Table 4: Statistical Summary 11
5 Table 5: Value counts of Ad-Length 13
6 Table 6: Value count of Ad-Width 14
7 Table 6: Value count of Ad-Size 14
8 Table 7: Value count of Available_Impression 16
9 Table 8: Value count of Matched_Queries 17
10 Table 9: Value Count of Impressions 19
11 Table 10: Value Count on Clicks 20
12 Table 11: Value count of Spend 22
13 Table 12: Value count of Fee 23
14 Table 13: Value count of Revenue 24
15 Table 14: Value count of CTR 26

16 Table 15: Value count of CPM 27
17 Table 16: Value count of CPC 28
18 Table 17: Value count of Timestamp 30
19 Table 18: Value count of InventoryType 31
20 Table 19: Value count on Ad Type 32
21 Table 20: Value count of Platform 33
22 Table 21: Value count of Device Type 34
23 Table 22: Value count of Format 35
24 Table 23: correlation between all the numerical variables 37
25 Table 24: Checking Duplicate values 46
26 Table 25: Checking the missing values 47
27 Table 26: Top five rows 48
28 Table 27: Check for presence of missing values in each feature 49
29 Table 28: Number of outliers in each variable 51
30 Table 29: First five rows of new dataset 52
31 Table 30: scaled data (rows transposed as columns) 53
32 Table 31: Rows of the scaled data 53
33 Table 32: WSS 57
34 Table 33: 5 rows of updated data frame 58
35 Table 34: "sil_width" column of the DataFrame. 60
36 Table 35: Appending Clusters to the original dataset 60
37 Table 36: Profile of each cluster after dropping irrelevant columns 61
38 Table 37: Comparison of Clusters 62
39 Table 38: Comparison of Clusters 63
40 Table 39: Comparison of Clusters 64
41 Table 40: Comparison of Clusters 65

FIGURES
S.No Description Page
1 Fig 1: Boxplot and Histogram of Ad – Length 13
2 Fig 2: Boxplot and Histogram of Ad-Width 14
3 Fig 3: Boxplot and Histogram of Ad-Size 15
4 Fig 4: Boxplot and Histogram of Available_Impression 16
5 Fig 5: Boxplot and Histogram of Matched Queries 18
6 Fig 6: Boxplot and Histogram of Impressions 19
7 Fig 7: Boxplot and Histogram of Clicks 21
8 Fig 8: Boxplot and histogram of Spend 22
9 Fig 9: Boxplot and Histogram of Fee 24
10 Fig 10: Boxplot and Histogram of Revenue 25
11 Fig 11: Boxplot and Histogram of CTR 26
12 Fig 12: Boxplot and Histogram of CPM 28
13 Fig 13: Boxplot and Histogram of CPC 29
14 Fig 14: Countplot of Timestamp 30
15 Fig 15: Countplot of InventoryType 31
16 Fig 16: Countplot on AdType 32
17 Fig 17: Countplot of Platform 33

18 Fig 18: Countplot of Device Type 34
19 Fig 19: Countplot of Format 35
20 Fig 20: Pairplot of all the numerical Variables 36
21 Fig 21: Heatmap 38
22 Fig 22: Barplot of Revenue and Ad Type 39
23 Fig 23: Barplot of Platform and Clicks 40
24 Fig 24: Barplot of Device Type and Clicks 42
25 Fig 25: Barplot of Spend and Ad Type 43
26 Fig 26: Boxplots with outliers 50
27 Fig 27: Boxplots without outliers 52
28 Fig 28: Dendrogram 54
29 Fig 29: Dendrogram with last 10 merged clusters 55
30 Fig 30: KMeans model 56
31 Fig 31: Elbow plot 57
32 Part 1: 4.4 Figure out the appropriate number of clusters 59

PCA INDEX
S.No Description Page
1 Part 2: PCA: Define the problem and perform Exploratory Data Analysis 69
2 Part 2: Data Preprocessing 86
3 Data types 76
4 Statistical Summary of the data 78
5 Perform an EDA on the data to extract useful insights 80
6 Part 2: Data Preprocessing 86
7 Check for and treat (if needed) missing values 86
8 Check for and treat (if needed) data irregularities 86
Scale the Data using the z-score method, Visualize the data before and after
9 scaling and comment on the impact on outliers 87
10 Part 2: PCA 106
11 Create covariance matrix 107
12 Get eigenvalues and eigenvectors 109
13 Identify the optimum number of PCs 112
14 Show Scree Plot 113
Compare PCs with Actual Columns and identify which is explaining most
15 variance 116
16 Write inferences about all the PCs in terms of actual variables 121
17 Write linear equation for first PC 122

TABLES
S.No Description Page
1 Table 1: First five rows of the dataset 75
2 Table 2: Last five rows of the dataset 75
3 Table 3: Statistical Summary 79

FIGURES
S.No Description Page
1 Fig 1: Data types in the dataset 77

2 Fig 2: Duplicate Values 80
3 Fig 3: Missing Value 80
4 Fig 4: Boxplots of the numerical variables 81
5 Fig 5: Relationship between No_HH and TOT_M 82
6 Fig 6: Relationship between No_HH and TOT_F 82
7 Fig 7: Relationship between No_HH and M_06 82
8 Fig 8: Relationship between No_HH and F_06 82
9 Fig 9: Visual analysis of No_HH using barplot 83
10 Fig 10: Visual analysis of TOT_M using barplot 83
11 Fig 11: Visual analysis of TOT_F using barplot 84
12 Fig 12: Visual analysis of M_06 using barplot 84
13 Fig 13: Visual analysis of F_06 using barplot 85
14 Fig 14: Missing Values 86
15 Fig 15: Duplicate Values 86
16 Fig 16: Visualization, Distribution and Outliers of each variable 87
17 Fig 17: Data after applying feature scaling 103
18 Fig 18: Output of Scaled Data 105
19 Fig 19: Boxplot after Scaling the Data 105
20 Fig 20: Presence of correlations 106
21 Fig 21: Covariance Matrix 107
22 Fig 22: Transpose of the Covariance Matrix 107
23 Fig 23: Comparing Correlation and Covariance Matrix 108
24 Fig 24: Eigenvalues and eigenvectors 110
25 Fig 25: Scaled Data 111
26 Fig 26: Scree plot 113
27 Fig 27: Display the transformed data 113
28 Fig 28: Reduced data array converted into a DataFrame 114
29 Fig 29: PCA Loadings 115
30 Fig 30: Dataframe with coefficients of all PCs 116
31 Fig 31: How original features matter to each PC 118
32 Fig 32: Compare how original features influence various PCs 119
33 Fig 33: Checking the highest and lowest values 120
34 Fig 34: Linear Equation for PC1 122

Problem Statement

Clustering

Part 1: Clustering: Define the problem and perform Exploratory Data Analysis

Problem Definition:
Digital Ads Data: ads24x7 is a digital marketing company that has recently
secured seed funding of 10 million dollars and is expanding into marketing
analytics. The company has collected data from its Marketing Intelligence team
and now wants you (their newly appointed data analyst) to segment the types of
ads based on the features provided. Use a clustering procedure to segment the
ads into homogeneous groups.

Data Description:
* Timestamp : The Timestamp of the particular Advertisement.

* InventoryType : The Inventory Type of the particular Advertisement. Format 1 to 7. This is a
Categorical Variable.

* Ad - Length : The Length Dimension of the particular Advertisement.

* Ad - Width : The Width Dimension of the particular Advertisement.

* Ad Size : The Overall Size of the particular Advertisement. Length*Width.

* Ad Type : The type of the particular Advertisement. This is a Categorical Variable.

* Platform : The platform in which the particular Advertisement is displayed. Web, Video or
App. This is a Categorical Variable.

* Device Type : The type of the device which supports the particular Advertisement. This is a
Categorical Variable.

* Format : The Format in which the Advertisement is displayed. This is a Categorical Variable.

* Available_Impressions : How often the particular Advertisement is shown. An impression is
counted each time an Advertisement is shown on a search result page or other site on a
Network.

* Matched_Queries : Matched search queries data is pulled from Advertising Platform and
consists of the exact searches typed into the search Engine that generated clicks for the
particular Advertisement.

* Impressions : The impression count of the particular Advertisement out of the total
available impressions.

* Clicks : It is a marketing metric that counts the number of times users have clicked on the
particular advertisement to reach an online property.

* Spend : It is the amount of money spent on specific ad variations within a specific campaign
or ad set. This metric helps regulate ad performance.

* Fee : The percentage of the Advertising Fees payable by Franchise Entities.

* Revenue: It is the income that has been earned from the particular advertisement.

* CPM : CPM stands for "cost per 1000 impressions." Formula used here is CPM = (Total
Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign Spend
refers to the 'Spend' Column and the Number of Impressions refers to the 'Impressions'
Column.

* CPC : CPC stands for "Cost-per-click". Cost-per-click (CPC) bidding means that you pay for
each click on your ads. The Formula used here is CPC = Total Cost (spend) / Number of Clicks.
Note that the Total Cost (spend) refers to the 'Spend' Column and the Number of Clicks refers
to the 'Clicks' Column.

The following three features are commonly used in digital marketing:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the
Total Campaign Spend refers to the 'Spend' Column in the dataset and the
Number of Impressions refers to the 'Impressions' Column in the dataset.

CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend)
refers to the 'Spend' Column in the dataset and the Number of Clicks refers to the
'Clicks' Column in the dataset.

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that
the Total Measured Clicks refers to the 'Clicks' Column in the dataset and the
Total Measured Ad Impressions refers to the 'Impressions' Column in the dataset.

Solution:

Import all the required libraries, then load and read the dataset.
Data Overview
The initial steps to get an overview of any dataset are to:

* Get information about the number of rows and columns in the dataset

* Observe the first few and last few rows of the dataset, to check whether the
dataset has been loaded properly or not

* Find out the data types of the columns to ensure that data is stored in the
preferred format and the value of each property is as expected.

* Check the statistical summary of the dataset to get an overview of the
numerical columns of the data
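These overview steps map onto a handful of standard pandas calls. The tiny stand-in frame below only illustrates the calls; in the report they would be run on the loaded ads24x7 frame (which, per section 1.1, has 23066 rows and 19 columns):

```python
import pandas as pd

# Tiny stand-in frame; the report applies the same calls to the full dataset.
df = pd.DataFrame({
    "InventoryType": ["Format1", "Format4", "Format4"],
    "Clicks": [1, 2, 3],
    "Spend": [0.0, 0.04, 1.43],
})

rows, cols = df.shape          # number of rows and columns
first = df.head()              # first five rows
last = df.tail()               # last five rows
types = df.dtypes              # data type of each column
summary = df.describe()        # statistical summary of numeric columns

print(rows, cols)
print(types)
print(summary)
```

These five calls produce, respectively, the contents of sections 1.1 through 1.5.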

Part 1: 1.1 Checking the number of rows and columns present in the
given data frame.

The ads24x7 data has 23066 rows and 19 columns.

Part 1: 1.2 Checking the first five rows in the given dataframe

Table 1: First five Rows

Part 1: 1.3 Checking the last five rows in the given data frame

Table 2: Last Five rows

Part 1: 1.4 Checking the data types of the columns for the dataset

Table 3: Data Types of the dataset


1. Object Type (6 attributes):
• These attributes likely represent categorical or textual data.
• Examples of object-type attributes might include: product names,
categories, or descriptions.
2. Integer Type (7 attributes):
• These attributes represent numerical data without fractional parts.
• Examples of integer-type attributes might include: counts, quantities, or
indices.
3. Float Type (6 attributes):
• These attributes represent numerical data with fractional parts.
• Examples of float-type attributes might include: measurements,
percentages, or continuous values.

Having this information about the data types and attributes is crucial for
performing data analysis, visualization, and modeling tasks. It helps in
understanding the nature of the variables and selecting appropriate techniques
for analysis.
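The three dtype groups described above can be separated programmatically. A minimal sketch, using a tiny stand-in frame (the real dataset has 6 object, 7 integer, and 6 float attributes):

```python
import pandas as pd

# Tiny stand-in frame illustrating one column of each type group.
df = pd.DataFrame({
    "InventoryType": ["Format1", "Format4"],   # object (categorical/text)
    "Clicks": [1, 2],                          # integer (counts)
    "Fee": [0.35, 0.33],                       # float (percentages/continuous)
})

# Group column names by dtype kind: 'O' = object, 'i' = integer, 'f' = float.
groups = {}
for col in df.columns:
    groups.setdefault(df[col].dtype.kind, []).append(col)

print(groups)
```

On the full dataset this grouping immediately identifies which columns need encoding (object) versus scaling (integer/float) before clustering.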

Part 1: 1.5 Statistical summary of the dataset

Table 4: Statistical Summary

From the above statistical summary, we can see that:

* The average of Impressions, Clicks, Spend, and Revenue in the given dataset is
1.241520e+06, 10678.518816, 2706.625689, and 3105.238410... 1924.252331 respectively.

* The standard deviation of Impressions, Clicks, Spend, and Revenue in the given
dataset is 2.429400e+06, 17353.409363, 4067.927273, and 3105.238410
respectively.

* The statistics provide insights into the typical and spread-out values for
Impressions, Clicks, Spend, and Revenue in the dataset.

* The calculated averages and standard deviations help in understanding the
central tendency and variability of these variables. For example:

1. The average Impressions is approximately 753,612, suggesting that this is the
typical number of impressions observed.

2. The standard deviation of Impressions is approximately 980,257, indicating the
spread or variability of impressions around the mean.

3. Similarly, the averages and standard deviations for Clicks, Spend, and Revenue
provide similar insights into these variables.

These metrics are essential for understanding the distribution and variability of
the data, which can be valuable for decision-making and further analysis.

Part 1: 1.6 Univariate Analysis

For performing univariate analysis, we take a look at the boxplots and
histograms of each variable to get a better understanding of the distributions.
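The per-variable observations that follow (most frequent, common, and least frequent values) come from value counts. A small helper of the kind that could back those tables is sketched below; the function name and the sample Ad - Length values are ours, not from the report:

```python
import pandas as pd

def frequency_profile(series: pd.Series, top: int = 3):
    """Return the most frequent value, the next common values,
    and the least frequent value of a column."""
    counts = series.value_counts()          # sorted, most frequent first
    return {
        "most_frequent": counts.index[0],
        "common": list(counts.index[1:1 + top]),
        "least_frequent": counts.index[-1],
    }

# Illustrative Ad - Length values with distinct frequencies.
ad_length = pd.Series([120] * 4 + [300] * 3 + [720] * 2 + [728])
profile = frequency_profile(ad_length)
print(profile)
```

Running this helper over each numeric column reproduces the kind of summary given in the bullet points below each value-count table.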

Numerical Variables

Observations on Ad – Length

Table 5: Value counts of Ad-Length

Fig 1: Boxplot and Histogram of Ad – Length


* Frequent Occurrences: 120 is the most frequently occurring value in the "Ad -
Length" attribute.
* Common Values: 300, 720, and 480 are common values observed in the "Ad-
Length" attribute.
* Least Occurrence: 728 is the least occurring value in the "Ad - Length" attribute.

* Mean Value: The mean value of the "Ad - Length" attribute is 300.
These insights provide a better understanding of the distribution and central
tendency of the "Ad - Length" attribute. They can be useful for further analysis
and decision-making processes related to this attribute.
Observations on Ad- Width

Table 6: Value count of Ad-Width

Fig 2: Boxplot and Histogram of Ad-Width


Observations on Ad Size

Table 6: Value count of Ad-Size

Fig 3: Boxplot and Histogram of Ad-Size

* Frequent Occurrences: 72,000 is the most frequently occurring value in the "Ad
Size" attribute.

* Common Values: 216,000, 75,000, and 33,600 are common values observed in
the "Ad Size" attribute.

* Least Occurrence: 180,000 is the least occurring value in the "Ad Size" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers can be crucial for understanding the data distribution and
ensuring robust analysis.

Observations on Available_Impressions

Table 7: Value count of Available_Impression

Fig 4: Boxplot and Histogram of Available_Impression

* Frequent Occurrence: 7 is the most frequently occurring value in the "Available
Impressions" attribute.

* Common Values: 9, 5, and 3 are common values observed in the "Available
Impressions" attribute.

* Least Occurrence: 114 is the least occurring value in the "Available Impressions"
attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers can be important for understanding the data distribution and
ensuring accurate analysis.

Observations on Matched_Queries

Table 8: Value count of Matched_Queries

Fig 5: Boxplot and Histogram of Matched Queries

* Frequent Occurrence: 5 is the most frequently occurring value in the "Matched
Queries" attribute.

* Common Values: 4, 3, and 2 are common values observed in the "Matched
Queries" attribute.

* Least Occurrence: 197 is the least occurring value in the "Matched Queries"
attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is essential for understanding the distribution of data and
ensuring accurate analysis.

Observations on Impressions

Table 9: Value Count of Impressions

Fig 6: Boxplot and Histogram of Impressions

* Frequent Occurrence: 2 is the most frequently occurring value in the
"Impressions" attribute.

* Common Values: 4, 5, and 3 are common values observed in the "Impressions"
attribute.

* Least Occurrence: 143 is the least occurring value in the "Impressions" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is crucial for understanding the distribution of data and
ensuring accurate analysis.

Observations on Clicks

Table 10: Value Count on Clicks

Fig 7: Boxplot and Histogram of Clicks

* Frequent Occurrence: 1 is the most frequently occurring value in the "Clicks"
attribute.

* Common Values: 2, 3, and 4 are common values observed in the "Clicks"
attribute.

* Least Occurrence: 1201 is the least occurring value in the "Clicks" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is crucial for understanding the distribution of data and
ensuring accurate analysis.

Observations on Spend

Table 11: Value count of Spend

Fig 8: Boxplot and histogram of Spend

* Frequent Occurrence: 0.00 is the most frequently occurring value in the "Spend"
attribute.

* Common Values: 0.04, 0.03, and 0.05 are common values observed in the
"Spend" attribute.

* Least Occurrence: 1.43 is the least occurring value in the "Spend" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is crucial for understanding the distribution of data and
ensuring accurate analysis.

Observations on Fee

Table 12: Value count of Fee

Fig 9: Boxplot and Histogram of Fee

* Frequent Occurrence: 0.35 is the most frequently occurring value in the "Fee"
attribute.

* Common Values: 0.33, 0.30, and 0.27 are common values observed in the "Fee"
attribute.

* Least Occurrence: 0.21 is the least occurring value in the "Fee" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is crucial for understanding the distribution of data and
ensuring accurate analysis.

Observations on Revenue

Table 13: Value count of Revenue

Fig 10: Boxplot and Histogram of Revenue

* Frequent Occurrence: 0.0000 is the most frequently occurring value in the
"Revenue" attribute.

* Common Values: 0.0260 and 0.0195 are common values observed in the
"Revenue" attribute.

* Least Occurrence: 0.9295 is the least occurring value in the "Revenue" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is crucial for understanding the distribution of data and
ensuring accurate analysis.

Observations on CTR

Table 14: Value count of CTR

Fig 11: Boxplot and Histogram of CTR

* Frequent Occurrence: 0.0024 is the most frequently occurring value in the "CTR"
attribute.

* Common Values: 0.0025, 0.0023, and 0.0026 are common values observed in
the "CTR" attribute.

* Least Occurrence: 0.1741 is the least occurring value in the "CTR" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is important for understanding the distribution of data and
ensuring accurate analysis.

Observations on CPM

Table 15: Value count of CPM

Fig 12: Boxplot and Histogram of CPM

* Frequent Occurrence: 1.66 is the most frequently occurring value in the "CPM"
attribute.

* Common Values: 1.62, 1.69, and 1.64 are common values observed in the
"CPM" attribute.

* Least Occurrence: 15.95 is the least occurring value in the "CPM" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is crucial for understanding the distribution of data and
ensuring accurate analysis.

Observations on CPC

Table 16: Value count of CPC


Fig 13: Boxplot and Histogram of CPC

* Frequent Occurrence: 1.66 is the most frequently occurring value in the "CPC"
attribute.

* Common Values: 0.09, 0.10, and 0.08 are common values observed in the "CPC"
attribute.

* Least Occurrence: 1.96 is the least occurring value in the "CPC" attribute.

* Outliers: There are outliers present in this variable.

Identifying outliers is crucial for understanding the distribution of data and
ensuring accurate analysis.

Categorical Variables

Observations on Timestamp

Table 17: Value count of Timestamp

Fig 14: Countplot of Timestamp


* Frequent Occurrence: 2020-11-13-22 is the most frequently occurring value in
the "Timestamp" attribute.
* Common Values: 2020-11-20-9, 2020-11-14-23, and 2020-10-18-1 are common
values observed in the "Timestamp" attribute.
* Least Occurrence: 2020-9-1-16 is the least occurring value in the "Timestamp"
attribute.
This countplot will visually display the frequency of each timestamp in the
dataset, providing insights into the distribution of timestamps.

Observations on InventoryType

Table 18: Value count of InventoryType

Fig 15: Countplot of InventoryType


* Frequent Occurrence: Format4 is the most frequently occurring format in the
dataset.
* Common Formats: Format5, Format1, and Format3 are commonly observed
formats in the dataset.
* Least Occurrence: Format7 is the least occurring format in the dataset.
Format4 seems to be the preferred choice or dominant format in the dataset, as it
appears with the highest frequency.
Formats like Format5, Format1, and Format3 are also quite common, indicating a
diverse range of formats used.
The low occurrence of Format7 suggests that it might be less popular or less
frequently used compared to other formats.
Observations on Ad Type

Table 19: Value count on Ad Type

Fig 16: Countplot on AdType


* Frequent Occurrence: Inter224 is the most frequently occurring item label in the
dataset.
* Common Items: Inter217, Inter223, and Inter219 are commonly observed item
labels in the dataset.
* Least Occurrence: Inter228 is the least occurring item label in the dataset.
Inter224 appears to be the dominant item label, occurring with the highest
frequency in the dataset.
Items labeled as Inter217, Inter223, and Inter219 are also quite common,
suggesting they are frequently encountered in the dataset.

The low occurrence of Inter228 indicates that it might be less prevalent or less
frequently encountered compared to other item labels.

Observations on Platform

Table 20: Value count of Platform

Fig 17: Countplot of Platform


* Frequent Occurrence: "Video" is the most frequently occurring category in the
dataset.
* Common Categories: "Web" is a commonly observed category in the dataset.
* Least Occurrence: "App" is the least occurring category in the dataset.
"Video" appears to be the dominant category, occurring with the highest
frequency in the dataset.
"Web" is also quite common, suggesting it is frequently encountered in the
dataset.
The low occurrence of "App" indicates that it might be less prevalent or less
frequently encountered compared to the other categories.

Observations on Device Type

Table 21: Value count of Device Type

Fig 18: Countplot of Device Type

* Frequent Occurrence: "Mobile" is the most frequently occurring device type in
the dataset.

* Least Occurrence: "Desktop" is the least occurring device type in the dataset.

"Mobile" appears to be the dominant device type, occurring with the highest
frequency in the dataset.

The low occurrence of "Desktop" indicates that it might be less prevalent or less
frequently used compared to mobile devices.

Observations on Format

Table 22: Value count of Format

Fig 19: Countplot of Format

* Frequent Occurrence: "Video" is the most frequently occurring advertisement
type in the dataset.

* Least Occurrence: "Display" is the least occurring advertisement type in the
dataset.

"Video" appears to be the dominant advertisement type, occurring with the
highest frequency in the dataset.

The low occurrence of "Display" indicates that it might be less prevalent or less
frequently used compared to video advertisements.

Multivariate Analysis

Show the relationship between numerical variables

Fig 20: Pairplot of all the numerical Variables

To explore the relationship between all the numerical variables using a pairplot,
we have created a scatterplot matrix that visualizes the pairwise relationships
between the variables.

* Correlation: Look for patterns and trends in the scatterplots. Positive correlation
is indicated by points moving upwards to the right, negative correlation by points
moving downwards to the right, and no correlation by a scattered distribution.

* Outliers: Check for any outliers in the data. Outliers might appear as points that
are significantly distant from the main cluster of points in the scatterplots.

* Distribution: Examine the diagonal plots to understand the distribution of each
variable. This can help identify skewed or non-normal distributions.

Correlation of Numerical Variables

Table 23: correlation between all the numerical variables

To explore the correlation between all the numerical variables, we can calculate
the correlation matrix and visualize it using a heatmap.

Check for presence of correlations

Fig 21: Heatmap

Strong Positive Correlation: We can identify pairs of variables with high positive
correlation coefficients (close to 1). This indicates that as one variable increases,
the other variable tends to increase as well.

Strong Negative Correlation: Pairs of variables with high negative correlation
coefficients (close to -1) suggest that as one variable increases, the other
variable tends to decrease.

Weak Correlation: We can also identify pairs of variables with correlation
coefficients close to 0, indicating a weak or no linear relationship between the
variables.
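The strongly correlated pairs described above can be pulled out of the correlation matrix programmatically. A sketch under the assumption that the numeric columns behave like the synthetic stand-ins below (the column names are illustrative, and the data here is generated, not the real dataset):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the numeric columns of the dataset.
rng = np.random.default_rng(0)
spend = rng.uniform(0, 100, 200)
df = pd.DataFrame({
    "Spend": spend,
    "Revenue": spend * 0.7 + rng.normal(0, 1, 200),  # strongly related to Spend
    "CPC": rng.uniform(0, 2, 200),                   # unrelated noise
})

corr = df.corr()

# List variable pairs whose absolute correlation exceeds a threshold.
strong_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]
print(strong_pairs)
```

On the real data, this listing makes the heatmap's visual impressions precise and flags redundant features worth dropping or combining before clustering.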

Part 1: 1.7 Bivariate Analysis

Numeric to Categorical

Ad Type and Revenue

Fig 22: Barplot of Revenue and Ad Type


To analyze the relationship between the revenue and advertisement type, we
have created a barplot to visualize the average revenue generated by each
advertisement type.
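The averages behind such a barplot come from a group-by aggregation. A minimal sketch with made-up revenue values (the real ad-type labels are the Inter-prefixed ones listed in Table 19):

```python
import pandas as pd

# Illustrative rows; the report aggregates the full dataset the same way.
ads = pd.DataFrame({
    "Ad Type": ["Inter224", "Inter224", "Inter217", "Inter217"],
    "Revenue": [10.0, 20.0, 40.0, 60.0],
})

# Average revenue per advertisement type: the quantity the barplot displays.
avg_revenue = ads.groupby("Ad Type")["Revenue"].mean().sort_values(ascending=False)
print(avg_revenue)
```

Plotting `avg_revenue` as a bar chart gives a figure of the kind shown in Fig 22; swapping in Platform/Clicks, Device Type/Clicks, or Ad Type/Spend yields the remaining bivariate plots in this section.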

Observations and Insights

* Revenue Disparity: Analyzing the barplot, we can observe the disparity in
average revenue generated by different advertisement types.

* Effectiveness of Advertisement Types: Certain advertisement types might be
more effective in generating revenue compared to others. This could be due to
factors such as audience engagement, targeting strategies, or the nature of the
advertisement content.

* Opportunities for Improvement: If there's a significant difference in revenue
between advertisement types, it might indicate areas where improvements or
optimizations can be made. For example, reallocating resources towards more
effective advertisement types or refining strategies for underperforming types.

Insights for Decision-making: These insights can guide decision-making processes
related to advertising strategies, budget allocations, and resource management,
aiming to maximize revenue generation based on the effectiveness of different
advertisement types.

Platform vs Clicks

Fig 23: Barplot of Platform and Clicks

To analyze the relationship between the platform and the number of clicks, we
can create a barplot to visualize the average number of clicks for each platform.

Observations and Insights:

* Clicks Variation: Analyzing the barplot, we can observe differences in the
average number of clicks across different platforms.

* Platform Performance: Certain platforms might attract more clicks compared to
others. This could be influenced by factors such as user engagement, targeting
effectiveness, or platform popularity.

* Effectiveness of Advertising Platforms: Platforms with higher average clicks may
indicate higher effectiveness in driving user engagement and interaction with
advertisements.

* Opportunities for Optimization: Identifying platforms with lower average clicks
may present opportunities for optimization or refinement of advertising
strategies to improve user engagement and click-through rates.

Insights for Decision-making: These insights can inform decisions related to
platform selection, budget allocation, and resource allocation in advertising
campaigns, aiming to maximize clicks and user engagement across various
platforms.

Device Type vs Clicks

Fig 24: Barplot of Device Type and Clicks


To analyze the relationship between device type and the number of clicks, we can
create a barplot to visualize the average number of clicks for each device type.
Observations and Insights:
* Clicks Variation by Device Type: The barplot reveals differences in the average
number of clicks across different device types.
* User Interaction Differences: Users may interact differently with advertisements
depending on the device they are using. For example, mobile users might exhibit
different click behavior compared to desktop users due to factors such as screen
size, user context, or browsing habits.
* Effectiveness of Device Types: Certain device types might result in higher
average clicks, indicating greater effectiveness in driving user engagement with
advertisements.
* Optimization Opportunities: Identifying device types with lower average clicks
may suggest opportunities for optimization or targeted strategies to improve
click-through rates and user engagement on those devices.

Insights for Decision-making: These insights can inform decisions related to
device-specific advertising strategies, budget allocations, and resource allocation
in advertising campaigns, aiming to maximize clicks and user interaction across
different device types.

Ad Type vs Spend

Fig 25: Barplot of Spend and Ad Type


To explore the relationship between advertising spend and ad types, we can
create a barplot to visualize the average spend for each ad type.
Observations and Insights:
* Spend Variation by Ad Type: The barplot highlights differences in average
spending across different ad types.
* Investment Allocation: Certain ad types may require higher spending compared
to others, possibly due to factors such as ad format complexity, placement
preferences, or targeting strategies.
* Ad Type Effectiveness: Higher spending on specific ad types may indicate their
perceived effectiveness in achieving advertising objectives, such as driving
conversions, brand awareness, or user engagement.

* Cost-effectiveness Assessment: Analyzing the average spend by ad type can help
assess the cost-effectiveness of different advertising formats or channels,
informing decisions on budget allocations and resource optimization.

* Campaign Optimization: Identifying ad types with lower average spending may
suggest opportunities for optimization or reallocation of resources to maximize
returns on investment and achieve campaign objectives effectively.

Insights for Strategy Development: These insights can guide strategic decisions
related to ad type selection, budget planning, and resource allocation in
advertising campaigns, aiming to optimize spending and achieve desired
outcomes.

Part 1: 1.8 Key meaningful observations on individual variables and the relationship between variables

To provide key meaningful observations, let's analyze some of the numerical variables and their relationships, highlighting areas for further analysis and optimization in advertising campaigns.
1. Ad Length and Ad Width:
* Both variables show a wide range of values, with Ad Length having a higher
mean compared to Ad Width.
* There might be a correlation between Ad Length and Ad Width, indicating that
larger ads in one dimension tend to be larger in the other dimension as well.
2. Available Impressions and Impressions:
* Available Impressions represent the total potential views for an ad, while
Impressions represent the actual number of times an ad was displayed.
* A higher number of Available Impressions doesn't necessarily guarantee a
proportional increase in Impressions, as factors like targeting, ad placement, and
competition can influence actual impressions.

3. Clicks and Spend:
* There might be a positive correlation between Clicks and Spend, indicating that higher spending on advertising campaigns tends to result in more clicks.
* However, outliers in spending or ineffective advertising strategies could lead to high spending without a corresponding increase in clicks.

4. Revenue and Click-Through Rate (CTR):
* Revenue represents the income generated from advertising campaigns, while CTR indicates the percentage of users who clicked on an ad after seeing it.
* A higher CTR might lead to increased revenue, but other factors like conversion rates and product pricing also influence overall revenue.

5. Device Type and Clicks:
* There might be differences in the average number of clicks generated from different device types, such as mobile devices, desktops, or tablets.
* Users might interact differently with ads based on the device they are using, impacting click-through rates and overall campaign performance.

6. Ad Type and Revenue:
* Certain ad types might be more effective in generating revenue compared to others, based on factors like format, placement, or targeting.
* Analyzing revenue by ad type can help identify the most profitable advertising formats and optimize future campaigns accordingly.

7. Platform and Spend:
* Different advertising platforms (e.g., social media, search engines) may require varying levels of spending to achieve desired campaign objectives.
* Platform selection might influence ad reach, targeting capabilities, and overall campaign effectiveness, impacting spending decisions.

===================================================================

Part 1: Clustering: Data Preprocessing


Checking the duplicate values

Table 24: Checking Duplicate values

We can see that there are no duplicate values in the given dataframe

Part 1: 2.1 Missing Value Check and Treatment

Table 25: Checking the missing values


We can see that the minimum value for several variables is 0, and there are no negative values. Variables such as Clicks, Spend, and Revenue can legitimately be 0 when a campaign generates no clicks, expenditure, or revenue, and the absence of negative values is consistent with metrics that cannot take negative values. Three variables, namely CTR (Click-Through Rate), CPM (Cost Per Mille), and CPC (Cost Per Click), are derived fields and contain missing values. Note also that the ranges of the different variables differ widely.

Missing Value Treatment:

The missing values were treated by imputing values computed with user-defined functions based on the following formulas:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the
Total Campaign Spend refers to the 'Spend' Column in the dataset and the
Number of Impressions refers to the 'Impressions' Column in the dataset.

CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend)
refers to the 'Spend' Column in the dataset and the Number of Clicks refers to the
'Clicks' Column in the dataset.

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that
the Total Measured Clicks refers to the 'Clicks' Column in the dataset and the
Total Measured Ad Impressions refers to the 'Impressions' Column in the dataset.

After creating user-defined functions to impute the missing values for CTR, CPM, and CPC, we call each function to perform the imputation.
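The imputation logic described above can be sketched with user-defined functions like these. The column names Spend, Impressions, Clicks, CTR, CPM, and CPC follow the dataset description; the small frame below is illustrative only, not the report's data:

```python
import numpy as np
import pandas as pd

def impute_cpm(row):
    # CPM = (Spend / Impressions) * 1000; fill only when missing
    if pd.isna(row["CPM"]) and row["Impressions"] > 0:
        return (row["Spend"] / row["Impressions"]) * 1000
    return row["CPM"]

def impute_cpc(row):
    # CPC = Spend / Clicks; undefined when there are no clicks
    if pd.isna(row["CPC"]) and row["Clicks"] > 0:
        return row["Spend"] / row["Clicks"]
    return row["CPC"]

def impute_ctr(row):
    # CTR = (Clicks / Impressions) * 100
    if pd.isna(row["CTR"]) and row["Impressions"] > 0:
        return (row["Clicks"] / row["Impressions"]) * 100
    return row["CTR"]

# Illustrative frame: first row has the derived fields missing
df = pd.DataFrame({
    "Spend": [10.0, 50.0],
    "Impressions": [2000, 10000],
    "Clicks": [20, 100],
    "CPM": [np.nan, 5.0],
    "CPC": [np.nan, 0.5],
    "CTR": [np.nan, 1.0],
})
df["CPM"] = df.apply(impute_cpm, axis=1)
df["CPC"] = df.apply(impute_cpc, axis=1)
df["CTR"] = df.apply(impute_ctr, axis=1)
```

Existing (non-missing) values are passed through unchanged; only the NaN entries are filled from the formulas.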
Find the top five rows of the dataset after treating the missing values:

Table 26: Top five rows


Table 27: Check for presence of missing values in each feature
From the above output, we can see that there are no missing values remaining in the dataset. Complete data allows analyses and modeling to proceed without further imputation or handling of missing values, reducing potential biases and uncertainties and enabling more robust and reliable insights.

Part 1: 2.2 Outlier Treatment


We used a 'for' loop to plot boxplots for all the numeric columns at once and check for outliers.

Fig 26: Boxplots with outliers
From the above output, we can see that all variables except Ad Length and Ad Width have outliers, as shown by the box plots.
Although the k-means clustering technique is well studied, it has limitations on real-world data: the k-means objective assumes that all points can be partitioned into k distinct clusters, which is often an unrealistic assumption in practice. Real-world data contains contamination or noise, and the k-means method can be sensitive to it.
There are possible approaches for reducing noise:
1. Treating outliers using the IQR method.
2. Treating outliers using the z-score method.
3. Using EDA results to segment the data into two or more parts and then applying the k-means algorithm to each part separately. This method is applicable only if the dataset is large and each part has a reasonable number of data points.
For this dataset, we will treat outliers using the IQR method and compare the results with a model built without outlier treatment.
Outlier Detection and Treatment using IQR method

In this method, any observation that is less than Q1 - 1.5 IQR or more than Q3 + 1.5 IQR is considered an outlier. To treat outliers, we defined a function 'remove_outlier':

* The larger values (>upper whisker) are all equated to the 95th percentile value
of the distribution

* The smaller values (<lower whisker) are all equated to the 5th percentile value
of the distribution
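A minimal sketch of such a 'remove_outlier' function, assuming each column is a pandas Series and following the capping rule described above (whiskers from the IQR rule, capping at the 5th and 95th percentiles):

```python
import pandas as pd

def remove_outlier(col):
    # Whiskers from the IQR rule
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Cap values beyond the whiskers at the 5th / 95th percentile
    # values, as described in the text
    p5, p95 = col.quantile(0.05), col.quantile(0.95)
    col = col.where(col >= lower, p5)   # replace low outliers
    col = col.where(col <= upper, p95)  # replace high outliers
    return col

# Illustrative series: 100 is a clear high outlier
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
capped = remove_outlier(s)
```

Applied column by column in a loop, this caps extreme values instead of dropping rows, so the dataset keeps its original size.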

We then count the number of outliers in each column at once using a loop:

Table 28: Number of outliers in each variable


We do outlier treatment and plot all the boxplots.

Fig 27: Boxplots without outliers

From the above output we can see that the outliers have been treated. Removing outliers can significantly affect subsequent analyses and modeling; while it is a useful preprocessing step that improves the quality and reliability of analyses, it should be applied with care.

Part 1: 2.3 z - score scaling

Here we drop the categorical variables and check the first five rows of the new
dataset after dropping the columns.

Table 29: First five rows of new dataset


We used scikit-learn's Standard Scaler to perform z-score scaling.
Below Table shows the rows of this scaled data (rows transposed as columns)

Table 30: scaled data (rows transposed as columns)

Below Table shows the rows of this scaled data

Table 31: Rows of the scaled data
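The scaling step can be sketched as follows; the two-column frame here is a stand-in for the numeric dataframe used in the report:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for the numeric columns after dropping categoricals
df1 = pd.DataFrame({
    "Clicks": [10, 200, 3000, 45000],
    "Spend":  [1.5, 20.0, 310.0, 4200.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df1)   # z = (x - mean) / std, per column
scaled_df1 = pd.DataFrame(scaled, columns=df1.columns)
```

After the transform, each column has mean 0 and unit standard deviation, which is exactly the z-score scaling described below.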


Scaling is a crucial preprocessing step in data analysis and machine learning, with several positive impacts on the analysis process.
* Scaling ensures that all variables contribute equally to the analysis by bringing them to a similar scale, which can improve the performance of many machine learning algorithms.
* Scaling can also speed up convergence and reduce numerical errors.

* Scaling of variables is important for clustering to stabilize the weights of the different variables. If there is a wide discrepancy in the ranges of the variables, cluster formation may be affected by the weight differential.

* Without scaling, the algorithm may be biased toward the variables with the largest values.

Overall, scaling is a fundamental preprocessing step that can significantly improve model performance, convergence speed, interpretability, and robustness while reducing bias and errors.

Part 1: Clustering: Hierarchical Clustering

Part 1: 3.1 Construct a dendrogram using Ward linkage and Euclidean distance

We import the dendrogram and linkage modules and use the Ward linkage method, one of the most commonly used methods for hierarchical clustering, to generate a dendrogram.
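A sketch of the dendrogram construction with SciPy, using Ward linkage and Euclidean distance; the random matrix stands in for the scaled data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # stand-in for the scaled dataframe

# Ward linkage requires Euclidean distance (the default metric)
Z = linkage(X, method="ward", metric="euclidean")

plt.figure(figsize=(10, 4))
dendrogram(Z)  # full dendrogram
# dendrogram(Z, truncate_mode="lastp", p=10)  # only the last 10 merges
plt.savefig("dendrogram.png")
```

The linkage matrix Z has one row per merge (n - 1 rows for n observations), and the third column holds the merge distances, which are nondecreasing for Ward linkage.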

Fig 28: Dendrogram


We also generate a truncated dendrogram using the Ward linkage method, showing only the last 10 merged clusters.

Fig 29: Dendrogram with last 10 merged clusters


In a dendrogram, each branch is referred to as a clade, and the terminal end of
each clade is termed a leaf. The arrangement of the clades provides insight into
the similarity between the leaves.
The height of the branching points signifies the degree of dissimilarity between
clusters: the greater the height, the greater the dissimilarity.
In the dendrogram, the longest (or tallest) branch is depicted in blue.
Segmenting the data based solely on this branch would result in only 2 clusters,
which might not be suitable for business analysis.
However, by identifying the tallest red branches separated by horizontal lines, we
can delineate 5 clusters.

Alternatively, choosing branches delineated by horizontal lines could result in 3
clusters. For this dataset, we opt for 5 clusters using the dendrogram method.

Part 1: Clustering: K-means Clustering

Part 1: 4.1 Apply K-means Clustering

Import the KMeans module from the sklearn library.


Instantiate a KMeans object and assign it to the variable k_means.
Fit the KMeans model on the scaled dataframe scaled_df1.

Fig 30: KMeans model


Retrieve the cluster labels. The following output displays the cluster assignments
for each observation.

Within Cluster sum of squares


The term k_means.inertia_ represents the within-cluster sum of squares (WCSS),
which is a measure of the dispersion of data points within each cluster.
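A minimal sketch of the KMeans fit described above; the random matrix is a stand-in for scaled_df1:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))  # stand-in for the scaled dataframe

k_means = KMeans(n_clusters=5, n_init=10, random_state=1)
k_means.fit(X)

labels = k_means.labels_   # cluster assignment for each observation
wcss = k_means.inertia_    # within-cluster sum of squares
```

Fixing random_state makes the result reproducible; n_init=10 reruns the algorithm from several centroid seeds and keeps the best solution.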

The within-cluster sum of squares (WSS) for different values of k (number of clusters) are as follows:
• K = 1 : 276,792.00
• K = 3 : 108,643.09
• K = 4 : 74,262.29
• K = 5 : 54,880.69
• K = 6 : 46,875.62

From the above outputs we can clearly see that WSS reduces as K keeps
increasing.

Calculating WSS for other values of K - Elbow Method

The WSS (within-cluster sum of squares) values are stored in the list named "wss".

Table 32: Wss
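The elbow computation can be sketched as a loop over candidate k values, collecting the inertia of each fitted model; the random matrix is a stand-in for the scaled data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))  # stand-in for the scaled dataframe

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k

# The elbow curve is then plt.plot(range(1, 11), wss, marker="o")
```

WSS is largest at k = 1 (the total sum of squares) and shrinks as k grows; the "elbow" is the k after which the drop flattens out.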

Part 1: 4.2 Plot the Elbow curve

Fig 31: Elbow plot

As we increase the number of clusters (k) from 1 to 2, there is a noticeable and
substantial decrease in the within-cluster sum of squares (WSS). Similarly, as we
progress from k=2 to k=3 and then from k=3 to k=4, there are significant drops in
WSS values. However, the reduction in WSS becomes less pronounced when
moving from k=4 to k=5 and from k=5 to k=6. In other words, beyond k=5, there is
no substantial decrease in WSS. Therefore, based on the Elbow plot analysis, we
conclude that 5 is the optimal number of clusters for this dataset.

Part 1: 4.3 Check Silhouette Scores

The KMeans clustering algorithm is applied with 5 clusters on the scaled data
(scaled_df1), and the resulting cluster labels are stored in the variable "labels".
The cluster labels obtained from KMeans clustering are added as a new column
named "Clus_kmeans" to the scaled data (scaled_df1). The first five rows of the
updated dataframe are displayed transposed for better readability.

Table 33: 5 rows of updated data frame

The silhouette score calculated for the clustering results obtained from KMeans
indicates the overall cohesion and separation of the clusters.
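A sketch of the silhouette computation, using synthetic well-separated blobs as a stand-in for the scaled data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(42)
# Three well-separated blobs of 50 points each in 4 dimensions
X = np.vstack([rng.normal(loc=c, size=(50, 4)) for c in (0, 5, 10)])

labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

score = silhouette_score(X, labels)        # overall cohesion/separation
sil_width = silhouette_samples(X, labels)  # per-observation widths
```

The overall score lies in [-1, 1]; values near 1 indicate compact, well-separated clusters, while silhouette_samples gives the per-row widths stored as the "sil_width" column in the report.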

Part 1: 4.4 Figure out the appropriate number of clusters

Based on the Silhouette scores, the optimum number of clusters is determined to be 5.

Both Hierarchical Clustering and KMeans Clustering were conducted. The Elbow
plot and Silhouette Score were utilized to identify the optimal number of clusters
in KMeans, while a dendrogram was drawn for Hierarchical Clustering. In the
Hierarchical method, 5 clusters were obtained, whereas in KMeans, 5 clusters
were identified using the elbow plot, and 6 clusters were found using the
silhouette score.

The silhouette width values calculated for each observation in the dataset based
on the clustering results obtained from KMeans are stored in the "sil_width"
column of the DataFrame.

Table 34: "sil_width" column of the DataFrame.
Appending Clusters to the original dataset

Table 35: Appending Clusters to the original dataset

Part 1: 4.5 Cluster Profiling
The count of observations in each cluster obtained from KMeans clustering is
displayed, sorted by cluster index.

The profile of each cluster after dropping irrelevant columns and grouping by the
KMeans cluster labels is displayed. The mean values of each variable across
clusters are shown along with the frequency of each cluster.

Table 36: Profile of each cluster after dropping irrelevant columns
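The profiling step can be sketched with a pandas groupby; the toy frame and values below are illustrative, not the report's data, and "Clus_kmeans" is the appended label column:

```python
import pandas as pd

# Toy frame: two numeric features plus the appended KMeans labels
df = pd.DataFrame({
    "Clicks":      [10, 12, 200, 210, 3000, 3100],
    "Spend":       [1.0, 1.2, 20.0, 22.0, 300.0, 310.0],
    "Clus_kmeans": [0, 0, 1, 1, 2, 2],
})

# Mean of each variable per cluster
profile = df.groupby("Clus_kmeans").mean()
# Frequency (number of observations) per cluster
profile["freq"] = df["Clus_kmeans"].value_counts().sort_index()
```

The resulting table has one row per cluster, giving the per-cluster means alongside the cluster sizes, which is exactly the profile shown above.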

Observations:

1. Clusters 1 and 4 encompass ads with a higher average length compared to the remaining clusters.

2. Clusters 3 and 4 exhibit ads with a significantly greater average width than the other clusters.

3. Cluster 0 is associated with the smallest ad size.

4. While there is minimal disparity in fees, Cluster 3 demonstrates notably higher mean spend and mean revenue compared to the other clusters.

5. Cluster 3 records the highest available impressions among all clusters.

Using the hint provided in the rubric, we plot bar charts by grouping the data by cluster label and taking the sum or mean of Clicks, Spend, Revenue, CTR, CPC, and CPM split by Device Type.
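One way to sketch these grouped bar charts; the frame and values are illustrative, and "Device Type" and "Clus_kmeans" are the column names assumed from the report:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame standing in for the labeled dataset
df = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Desktop"],
    "Clus_kmeans": [0, 0, 1, 1],
    "Clicks":      [100, 80, 400, 350],
})

# Total clicks per device type, one bar group per cluster
pivot = df.groupby(["Device Type", "Clus_kmeans"])["Clicks"].sum().unstack()
pivot.plot(kind="bar")
plt.ylabel("Total clicks")
plt.savefig("clicks_by_device.png")
```

Swapping "Clicks" for Spend, Revenue, CTR, CPC, or CPM (and sum for mean where appropriate) reproduces the remaining comparison charts.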
Comparison of Clusters according to device type (x-axis) and total clicks (y-axis)

Table 37: Comparison of Clusters

Observations:
The bar plot highlights that Cluster 3 exhibits the highest number of clicks across
both Desktop and Mobile device types. Additionally, Clusters 0 and 4 demonstrate
significant click counts for both device types.

Comparison of Clusters according to device type (x-axis) and total revenue (y-axis)

Table 38: Comparison of Clusters

Observations:

From the above barplot, the mobile segment within Cluster 3 generates the most revenue and may contain the best-performing ads; a similar pattern holds for desktop ads.

Comparison of Clusters according to device type (x-axis) and total spend (y-axis):

Table 39: Comparison of Clusters


Observations:
The bar plot reveals that within Cluster 3, the mobile segment exhibits the highest total spending, suggesting it may comprise premium ads. Cluster 3 also has the highest spending in the desktop segment. In the mobile segment, Clusters 2 and 4 show the next-highest spending.

Comparison of Clusters according to device type (x-axis) and average CPC, CTR, CPM

Table 40: Comparison of Clusters according

Observations:

CPM stands for "cost per 1,000 impressions". In simple words, CPM refers to the amount it costs to have an ad published a thousand times on a website and seen by users. For example, if a website publisher charges a CPM of 4.00 dollars, an advertiser must pay 4.00 dollars for every 1,000 impressions of its ads.

CPC stands for Cost Per Click: the average cost an advertiser pays each time its ad is clicked. CPC is also a widely used Google Ads metric that advertisers use to manage campaign budgets and performance. For example, if your ads get 2 clicks, one costing 0.40 dollars and the other 0.20 dollars, this totals 0.60 dollars; dividing 0.60 dollars by 2 (the total number of clicks) gives an average CPC of 0.30 dollars.

CTR, or Click-Through Rate, measures the success of online ads as the percentage of people who actually clicked on the ad to reach the linked website. For example, if an ad is clicked 200 times after being served 50,000 times, the click-through rate is (200 / 50,000) * 100 = 0.4%.

From the bar plots, it's evident that Clusters 2, 4, and 3 exhibit the highest
average CPM, implying that ads in these clusters are likely placed on expensive
and highly visited websites. Moreover, the average CTR is also highest in these
clusters. Interestingly, there doesn't seem to be a significant difference between
the mobile and desktop segments in terms of CPM and CTR.
Selling ads based on CPM imposes a revenue ceiling. To increase revenue, there's
a need to invest in expanding reach to create more ad opportunities or deliver
more ads to the same users. Conversely, selling ads based on CTR doesn't cap
revenue. Instead, it allows for increasing engagement without necessarily
increasing the number of impressions per person. Thus, with CPM, efforts are
directed towards reaching more people or risking user experience with more ads.

Part 1: Clustering: Actionable Insights and
Recommendations

Part 1: 5.1 Extract meaningful insights (at least 3) from the clusters to identify the most effective types of ads, target audiences, or marketing strategies that can be inferred from each segment.
Meaningful Insights:
1. Cluster 3 appears to be the most lucrative segment: This cluster exhibits the
highest average spending and revenue, suggesting that the ads in this group are
likely more effective in generating returns. Advertisers may want to focus their
efforts on understanding the characteristics of this cluster to replicate its success
in other campaigns.
2. Clusters 2 and 4 show potential for high engagement: These clusters demonstrate the highest average CTR, indicating that the ads within them are particularly engaging to users. Advertisers could analyze the content and placement strategies of these ads to identify the factors driving this engagement.
3. Cluster 0 may represent underperforming ads: With comparatively lower
spending and revenue metrics, Cluster 0 could indicate ads that are less effective
in driving conversions. Advertisers may need to reevaluate the targeting,
messaging, or creative elements of these ads to improve their performance and
align them more closely with the characteristics of higher-performing clusters.

Part 1: 5.2 Based on the clustering analysis and key insights, provide actionable recommendations (at least 3) to Ads24x7 on how to optimize their digital marketing efforts, allocate budgets efficiently, and tailor ad content to specific audience segments.
Recommendations:
1. Allocate budget strategically based on cluster performance: Ads24x7 should
prioritize allocating a larger portion of their budget to ads targeting clusters with
higher spending and revenue metrics, such as Cluster 3. By focusing resources on
these segments, they can maximize returns on investment and ensure efficient
use of marketing funds.

2. Tailor ad content to engage target audiences: Analyzing the characteristics of
high-performing clusters, such as Clusters 2 and 4 with high CTR, can provide
valuable insights into the preferences and behaviors of specific audience
segments. Ads24x7 should leverage these insights to tailor ad content, messaging,
and creative elements to better resonate with the identified audience
preferences, thereby increasing engagement and driving conversions.
3. Optimize ad placements and channels: Understanding which clusters
demonstrate the highest engagement rates, such as Clusters 2 and 4, can inform
decisions about ad placements and channels. Ads24x7 should consider focusing
their efforts on platforms or channels that are particularly effective in reaching
and engaging these target segments. Additionally, they can experiment with
different placement strategies to maximize exposure and engagement while
minimizing costs.

Saving the Cluster Profiles in a csv file


# df.to_csv('df1_hc.csv')

Problem Statement
PCA:
Part 2: PCA: Define the problem and perform Exploratory Data
Analysis
Problem Definition:
PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes -
2011 PCA for Female Headed Household Excluding Institutional Household. The
Indian Census has the reputation of being one of the best in the world. The first
Census in India was conducted in the year 1872. This was conducted at different
points of time in different parts of the country. In 1881 a Census was taken for
the entire country simultaneously. Since then, Census has been conducted every
ten years, without a break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh after independence, and the second census of the third millennium and twenty-first century. The census has continued uninterrupted despite several adversities such as wars, epidemics, natural calamities, and political unrest. The Census of India is conducted under the provisions of the Census Act 1948 and the Census Rules 1990. The Primary Census Abstract, an important publication of the 2011 Census, gives basic
information on Area, Total Number of Households, Total Population, Scheduled
Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates,
Main Workers and Marginal Workers classified by the four broad industrial
categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household
Industry Workers, and (iv) Other Workers and also Non-Workers. The
characteristics of the Total Population include Scheduled Castes, Scheduled
Tribes, Institutional and Houseless Population and are presented by sex and
rural-urban residence. Census 2011 covered 35 States/Union Territories, 640
districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.

The data collected has so many variables thus making it difficult to find useful
details without using Data Science Techniques. You are tasked to perform
detailed EDA and identify Optimum Principal Components that explains the most
variance in data. Use Sklearn only.

Data_Description

* State : State Code

* District : District Code

* Name : Name

* TRU1 : Area Name

* No_HH : No of Household

* TOT_M : Total Population Male

* TOT_F : Total Population Female

* M_06 : Population in the age group 0-6 Male

* F_06 : Population in the age group 0-6 Female

* M_SC : Scheduled Castes population Male

* F_SC : Scheduled Castes population Female

* M_ST : Scheduled Tribes population Male

* F_ST : Scheduled Tribes population Female

* M_LIT : Literates population Male

* F_LIT : Literates population Female

* M_ILL : Illiterate Male

* F_ILL : Illiterate Female

* TOT_WORK_M : Total Worker Population Male

* TOT_WORK_F : Total Worker Population Female

* MAINWORK_M: Main Working Population Male

* MAINWORK_F : Main Working Population Female

* MAIN_CL_M : Main Cultivator Population Male

* MAIN_CL_F : Main Cultivator Population Female

* MAIN_AL_M : Main Agricultural Labourers Population Male

* MAIN_AL_F : Main Agricultural Labourers Population Female

* MAIN_HH_M : Main Household Industries Population Male

* MAIN_HH_F : Main Household Industries Population Female

* MAIN_OT_M : Main Other Workers Population Male

* MAIN_OT_F : Main Other Workers Population Female

* MARGWORK_M : Marginal Worker Population Male

* MARGWORK_F : Marginal Worker Population Female

* MARG_CL_M : Marginal Cultivator Population Male

* MARG_CL_F : Marginal Cultivator Population Female

* MARG_AL_M : Marginal Agriculture Labourers Population Male

* MARG_AL_F : Marginal Agriculture Labourers Population Female

* MARG_HH_M : Marginal Household Industries Population Male

* MARG_HH_F : Marginal Household Industries Population Female

* MARG_OT_M : Marginal Other Workers Population Male

* MARG_OT_F : Marginal Other Workers Population Female

* MARGWORK_3_6_M : Marginal Worker Population 3-6 Male

* MARGWORK_3_6_F : Marginal Worker Population 3-6 Female

* MARG_CL_3_6_M : Marginal Cultivator Population 3-6 Male

* MARG_CL_3_6_F : Marginal Cultivator Population 3-6 Female

* MARG_AL_3_6_M : Marginal Agriculture Labourers Population 3-6 Male

* MARG_AL_3_6_F : Marginal Agriculture Labourers Population 3-6 Female

* MARG_HH_3_6_M : Marginal Household Industries Population 3-6 Male

* MARG_HH_3_6_F : Marginal Household Industries Population 3-6 Female

* MARG_OT_3_6_M : Marginal Other Workers Population Person 3-6 Male

* MARG_OT_3_6_F : Marginal Other Workers Population Person 3-6 Female

* MARGWORK_0_3_M : Marginal Worker Population 0-3 Male

* MARGWORK_0_3_F : Marginal Worker Population 0-3 Female

* MARG_CL_0_3_M : Marginal Cultivator Population 0-3 Male

* MARG_CL_0_3_F : Marginal Cultivator Population 0-3 Female

* MARG_AL_0_3_M : Marginal Agriculture Labourers Population 0-3 Male

* MARG_AL_0_3_F : Marginal Agriculture Labourers Population 0-3 Female

* MARG_HH_0_3_M : Marginal Household Industries Population 0-3 Male

* MARG_HH_0_3_F : Marginal Household Industries Population 0-3 Female

* MARG_OT_0_3_M : Marginal Other Workers Population 0-3 Male

* MARG_OT_0_3_F : Marginal Other Workers Population 0-3 Female

* NON_WORK_M : Non Working Population Male

* NON_WORK_F : Non Working Population Female

Solution:

Import the required libraries. Load and read the data.

Data Overview

The initial steps to get an overview of any dataset is to:

* get information about the number of rows and columns in the dataset

* observe the first few and last few rows of the dataset, to check whether the
dataset has been loaded properly or not

* find out the data types of the columns to ensure that data is stored in the
preferred format and the value of each property is as expected.

* check the statistical summary of the dataset to get an overview of the numerical
columns of the data

Check Shape
Checking the number of rows and columns present in the given dataframe

PCA dataset has 640 rows and 61 columns

Checking the first five rows in the given dataframe

Table 1: First five rows of the dataset


Checking the last five rows in the dataframe

Table 2: Last five rows of the dataset

Data types
Checking the data types of the columns for the dataset

Fig 1: Data types in the dataset

The dataset comprises 61 attributes, with 2 being of object type and the
remaining 59 being of integer type.

Statistical Summary of the data

Table 3: Statistical Summary
Here is the statistical summary of the population data:
* The average number of households across all states is approximately 51,222.87.
* The average total male population is around 79,940.58.
* The average total female population is approximately 122,372.08.
* The average total working male population is about 37,992.41.
* The average total working female population is around 41,295.76.
* The average non-working population of males is approximately 510.01.
* The average non-working population of females is about 704.78.

Checking the duplicate values

Fig 2: Duplicate Values

We can see that there are no duplicate values in the given dataframe

Missing Value Check

Fig 3: Missing Value

Based on the provided output, there are no null values detected in the dataset.

Perform an EDA on the data to extract useful insights


Note: 1. Pick 5 variables out of the given 24 variables below for EDA: No_HH,
TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL,
F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F,
MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M,

MAIN_HH_F, MAIN_OT_M, MAIN_OT_F 2. Example questions to answer from
EDA - (i) Which state has highest gender ratio and which has the lowest? (ii)
Which district has the highest & lowest gender ratio?

Solution:

As per the given rubric, we need to choose 5 of the given 24 variables for EDA. We'll choose 'No_HH', 'TOT_M', 'TOT_F', 'M_06', 'F_06'.

Univariate Analysis

Visual analysis of the given data in the dataset using boxplots

Fig 4: Boxplots of the numerical variables

The box plots reveal that all variables in the dataset exhibit right skewness and
contain outliers.

Bivariate Analysis

Fig 5: Relationship between No_HH and TOT_M Fig 6: Relationship between No_HH and TOT_F

Fig 7: Relationship between No_HH and M_06 Fig 8: Relationship between No_HH and F_06

From the bivariate analysis, the above scatterplots show that all variables are positively correlated with each other.

1. Which state has the highest population?

Fig 9: Visual analysis of No_HH using barplot

Answer: From the above visual analysis of No_HH using a barplot, we can see that 'Kerala' is the state with the highest population.

2. Which state has the highest total male population?

Fig 10: Visual analysis of TOT_M using barplot

Answer: From the above visual analysis of TOT_M using a barplot, we can see that 'Kerala' is the state with the highest male population.

3. Which state has the lowest total Female population?

Fig 11: Visual analysis of TOT_F using barplot

Answer: From the above visual analysis of TOT_F using a barplot, we can see that 'Arunachal Pradesh' is the state with the lowest total female population.
4. Which state has second highest male child within the age group of 0-6?

Fig 12: Visual analysis of M_06 using barplot


Answer: From the above visual analysis of M_06 using a barplot, we can see that 'West Bengal' is the state with the second-highest male child population in the age group 0-6.

5. Which state has second lowest female child within the age group of 0-6?

Fig 13: Visual analysis of F_06 using barplot

Answer: From the above visual analysis of F_06 using a barplot, we can see that 'Andaman and Nicobar Islands' has the second-lowest female child population in the age group 0-6.

Part 2: Data Preprocessing

Check for and treat (if needed) missing values

Fig: 14: Missing Values

Based on the provided output, there are no null values detected in the dataset.

Check for and treat (if needed) data irregularities

Fig 15: Duplicate Values

Scale the Data using the z-score method, Visualize the data before and after
scaling and comment on the impact on outliers

Let us visualize and check the distribution and outliers for each column in the
data

Fig 16: Visualization, Distribution and Outliers of each variable

* By examining the histograms, we can observe the shape of the distribution for
each numerical variable. Skewness values provide quantitative measures of
asymmetry.

* Positive skewness indicates that the right tail of the distribution is longer, while
negative skewness indicates a longer left tail.

* In the box plots, outliers are visualized as individual points that fall outside the
whiskers, which represent the range of typical values.

* Analyzing the distribution and skewness of numerical variables helps in
understanding the underlying characteristics of the dataset and identifying
potential data quality issues, such as outliers.

* Outliers may require further investigation to determine if they are valid data
points or errors. Depending on the context, outliers may be retained, removed, or
treated using appropriate methods.

Outlier treatment is typically unnecessary unless the outliers result from
processing errors or incorrect measurements. Genuine outliers, representing
extreme or unusual but valid values, must be kept in the data.

Scale the Data using the z-score method

The columns 'State' and 'Area Name' are of object data type, contain a large
number of unique entries, and would not add value to our analysis, so we can
drop these columns.

Importing the StandardScaler module, creating an object for the StandardScaler
function, then fitting and transforming the data. The transformed data, after
applying feature scaling, is stored in the variable scaled_data and produces the
output below.
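A minimal sketch of the drop-and-scale step, assuming the raw data is in a DataFrame `df` (the toy values below are illustrative; requires scikit-learn):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the census frame
df = pd.DataFrame({
    "State": ["Kerala", "West Bengal", "Bihar"],
    "Area Name": ["A", "B", "C"],
    "TOT_M": [100, 200, 300],
    "TOT_F": [110, 190, 310],
})

# Drop identifier columns that add no value to the analysis
numeric = df.drop(columns=["State", "Area Name"])

# Fit and transform in one step; the result is a NumPy array
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric)
print(scaled_data)
```

After this step each column has mean 0 and (population) standard deviation 1, which is what the z-score method produces.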

Fig 17: Data after applying feature scaling

Below is the output after scaling the data:

Fig 18: Output of Scaled Data
Visualize after scaling the data:

Fig 19: Boxplot after Scaling the Data

Observations:

* From the above output, we can see that after scaling the data using the
StandardScaler, the distribution of the variables remains highly skewed to the
right. This indicates that the shape of the distributions is preserved even after
standardization. Additionally, all variables except State Code and Dist. Code still
exhibit outliers, primarily located at the right end of the distributions. These
outliers are not affected by the scaling process, as the purpose of scaling is to
standardize the data's scale, not to remove outliers.

Therefore, scaling has no impact on the presence of outliers in the dataset.

Part 2: PCA: PCA
Perform all the required steps for PCA (use sklearn only)

Create the covariance matrix

Fig 20: Presence of correlations

When the dataset contains a large number of numerical variables, visualizing the
correlation matrix using a heatmap becomes challenging because of the sheer
volume of values displayed.

Create covariance matrix

Fig 21: Covariance Matrix


Even if we take the transpose of the covariance matrix, it yields the same matrix
as above, since a covariance matrix is symmetric.

Fig 22: Transpose of the Covariance Matrix
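The symmetry of the covariance matrix can be verified with a short NumPy sketch (random data stands in for the scaled census features):

```python
import numpy as np

rng = np.random.default_rng(0)
scaled = rng.standard_normal((50, 4))  # stand-in for the scaled features

# rowvar=False: columns are variables, rows are observations
cov = np.cov(scaled, rowvar=False)

# A covariance matrix is symmetric, so its transpose equals itself
print(np.allclose(cov, cov.T))
```

This is why Fig 21 and Fig 22 display identical values.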

Comparing Correlation and Covariance Matrix

Fig 23: Comparing Correlation and Covariance Matrix

With standardisation (without standardisation also, the correlation matrix yields
the same result):

Covariance indicates the direction of the linear relationship between variables.
Correlation, on the other hand, measures both the strength and direction of the
linear relationship between two variables. Correlation is a function of the
covariance: the correlation coefficient of two variables is obtained by dividing
their covariance by the product of their standard deviations.
We can state that the following three approaches yield the same eigenvector and
eigenvalue pairs:
* Eigen decomposition of the covariance matrix after standardizing the data.

* Eigen decomposition of the correlation matrix.

* Eigen decomposition of the correlation matrix after standardizing the data.

Finally, we can say that after scaling, the covariance and the correlation
matrices have the same values.
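The covariance-to-correlation relationship can be checked numerically with a small sketch (illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
y = 0.5 * x + rng.standard_normal(100)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)
std = np.sqrt(np.diag(cov))  # standard deviations from the diagonal

# correlation = covariance / (std_x * std_y), applied element-wise
corr_from_cov = cov / np.outer(std, std)
print(np.allclose(corr_from_cov, np.corrcoef(data, rowvar=False)))
```

Dividing each covariance entry by the product of the two standard deviations reproduces NumPy's own correlation matrix exactly.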

Get eigen values and eigen vectors

Fig 24: eigen values and eigen vectors
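The eigendecomposition step can be sketched as follows, using random data in place of the scaled dataset (`numpy.linalg.eigh` is appropriate here because a covariance matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(2)
scaled = rng.standard_normal((60, 3))  # stand-in for the scaled data

cov = np.cov(scaled, rowvar=False)

# eigh handles symmetric matrices: real eigenvalues, returned ascending
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Re-sort descending so the first eigenvector explains the most variance
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print(eig_vals)
```

Each column of `eig_vecs` is a principal direction, and the matching entry of `eig_vals` is the variance captured along it.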

Checking the scaled data

Fig 25: Scaled Data

The output above represents the scaled dataset, where each row corresponds to
a specific observation, and each column represents a different feature or variable
after scaling. This transformation ensures that all variables have a mean of
approximately 0 and a standard deviation of approximately 1. By transposing the
data, we can easily examine the scaled values for each feature across different
observations. This scaled data is now suitable for further analysis.

Identify the optimum number of PCs

* Defined the number of principal components (PCs) to generate based on the
number of features in the scaled dataset.

* Utilized PCA (Principal Component Analysis) to compute the principal
components for the scaled data.

* Created a DataFrame to store the transformed data after PCA.

* Computed the percentage of variance explained by each principal component
using the explained_variance_ratio_ attribute of the PCA object.

* Iterated through the explained variance ratios to find the minimum number of
principal components required to explain at least 90% of the variance in the data.

* Printed the number of principal components that collectively explain more than
90% of the variance.
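The steps above can be sketched as a short scikit-learn snippet (random data with one injected correlation stands in for the scaled census features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
scaled = rng.standard_normal((200, 10))  # stand-in for the scaled data
scaled[:, 1] = 2 * scaled[:, 0] + 0.1 * rng.standard_normal(200)  # correlated pair

pca = PCA()
pca.fit(scaled)

# Smallest number of PCs whose cumulative explained variance reaches 90%
cum = np.cumsum(pca.explained_variance_ratio_)
n_pcs = int(np.argmax(cum >= 0.90)) + 1
print(n_pcs)
```

`np.argmax` on the boolean array returns the index of the first component at which the cumulative ratio crosses the threshold; adding 1 converts it to a count.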

Show Scree Plot

Fig 26: Scree plot

Visually, we can observe a steep drop in the variance explained as the number of
PCs increases.
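A scree plot of this kind can be generated with matplotlib as sketched below (random data replaces the real scaled features; the Agg backend renders off-screen):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
scaled = rng.standard_normal((150, 8))  # stand-in for the scaled data

pca = PCA().fit(scaled)
ratios = pca.explained_variance_ratio_

# Scree plot: explained variance ratio per component
plt.plot(range(1, len(ratios) + 1), ratios, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.savefig("scree.png")
plt.close()
```

Because the ratios are sorted in decreasing order and sum to 1, the curve always falls, and the "elbow" suggests where additional PCs stop paying off.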

Using scikit-learn's PCA here: it performs all the above steps and maps the data
to the PCA dimensions in one shot. Instantiate PCA with 6 components and set the
random state, fit and transform the scaled data, and display the transformed
data as data_reduced.

Fig 27: Display the transformed data

Convert the reduced data array into a DataFrame

Fig 28: Reduced data array converted into a DataFrame
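These two steps can be sketched together (random data stands in for `scaled_data`; the `random_state` value is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
scaled_data = rng.standard_normal((100, 12))  # stand-in for the scaled features

# Map the data to 6 PCA dimensions in one shot
pca = PCA(n_components=6, random_state=42)
data_reduced = pca.fit_transform(scaled_data)

# Wrap the reduced array in a labelled DataFrame
df_reduced = pd.DataFrame(data_reduced,
                          columns=[f"PC{i}" for i in range(1, 7)])
print(df_reduced.shape)
```

`fit_transform` both learns the components and projects the observations onto them, so the resulting frame has one column per retained PC.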

Principal components represent the directions of maximum variance in the data.

Fig 29: PCA Loadings

Variance explained by each principal component

Percentage of variance explained by each principal component

For this project, we need to consider at least 90% explained variance, so the
cut-off for selecting the number of PCs is 6.

Compare PCs with actual columns and identify which is explaining most variance

Create a dataframe containing the loadings or coefficients of all PCs

Fig 30: Dataframe with coefficients of all PCs

Check how the original features matter to each PC
Note: Here we are only considering the absolute values

Fig 31: How the original features matter to each PC

The output visualizes how the original features contribute to each Principal
Component (PC) by considering only the absolute values of the loadings. Each
subplot represents one original feature, and the bars indicate the absolute
loadings of that feature on the corresponding PC.

The height of the bars represents the magnitude of the influence of each feature
on the PC. A taller bar indicates a stronger influence, while a shorter bar suggests
a weaker influence. By examining these bar plots, we can understand which
original features contribute the most to each PC in terms of their absolute
loadings. This information helps us interpret the underlying structure captured by
each PC and identify the key features driving the variability in the data.

Compare how the original features influence various PCs

Fig 32: Compare how original features influence various PCs

By comparing the loadings of the original features across different PCs, we can
gain insights into the underlying structure of the data and understand which
features are driving the variability captured by each PC.

Checking the highest and lowest values

Fig 33: Checking the highest and lowest values

The function color_high is designed to apply specific background colors to cells
in a DataFrame based on their values.
Cells with values less than or equal to -0.20 are highlighted with a pink
background.
Cells with values greater than or equal to 0.40 are highlighted with a sky blue
background.

Write inferences about all the PCs in terms of actual variables
To infer the meaning of principal components (PCs) in terms of the original
variables, we typically examine the loadings of each variable on each PC. Loadings
represent the correlation between the original variables and the principal
components. Here are the inferences about all the PCs based on their loadings:

PC1: This component captures variance primarily related to variables with high
loadings. In terms of actual variables, PC1 might represent overall spending
patterns or general engagement metrics.

PC2: Variables with high loadings on PC2 contribute the most to its variance. PC2
could represent specific advertising formats or platforms that attract distinct
audience segments.

PC3: PC3 captures variance in variables with significant loadings on this
component. It may represent demographic or geographic factors influencing ad
performance.

PC4: Variables strongly correlated with PC4 contribute to its variance. PC4 might
represent factors related to ad targeting or campaign optimization strategies.

PC5: This component captures variance primarily from variables with high
loadings on PC5. It could represent seasonal trends or temporal patterns affecting
ad effectiveness.

PC6: Variables with high loadings on PC6 contribute the most to its variance. PC6
might represent ad placement strategies or audience engagement metrics.

These interpretations provide insights into the underlying factors driving variance
in the dataset and help understand the relationships between the original
variables and the principal components.

Write linear equation for first PC

To write the linear equation for the first principal component (PC1), we use the
loadings of each original variable on PC1. The equation represents a weighted
sum of the original variables, where the weights are given by the loadings.

Fig 34: Linear Equation for PC1
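Building such an equation string from the loadings can be sketched as follows (random data and the feature names are illustrative stand-ins for the scaled dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
scaled = rng.standard_normal((80, 4))  # stand-in for the scaled data
features = ["TOT_M", "TOT_F", "M_06", "F_06"]  # illustrative feature names

pca = PCA(n_components=1).fit(scaled)
loadings = pca.components_[0]  # weights of PC1 on each original variable

# PC1 as a weighted sum of the (scaled) original variables
equation = "PC1 = " + " + ".join(
    f"({w:.3f} * {name})" for w, name in zip(loadings, features)
)
print(equation)
```

Each coefficient comes straight from `pca.components_`, so the string mirrors the weighted-sum definition given above.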

