Machine Learning-1 BUSINESS REPORT
Project
CLUSTERING INDEX
S.No Description
1 Part 1: Clustering: Define the problem and perform Exploratory Data Analysis
2 Part 1: 1.1 Checking the number of rows and columns present in the given data frame
3 Part 1: 1.2 Checking the first five rows in the given dataframe
4 Part 1: 1.3 Checking the last five rows in the given data frame
5 Part 1: 1.4 Checking the data types of the columns for the dataset
6 Part 1: 1.5 Statistical summary of the dataset
7 Part 1: 1.6 Univariate Analysis
8 Part 1: 1.7 Bivariate Analysis
9 Part 1: 1.8 Key meaningful observations on individual variables and the relationship between variables
10 Part 1: Clustering: Data Preprocessing
11 Part 1: 2.2 Outlier Treatment
12 Part 1: 2.3 Z-score scaling
13 Part 1: Clustering: Hierarchical Clustering
14 Part 1: 3.1 Construct a dendrogram using Ward linkage and Euclidean distance
15 Part 1: Clustering: K-means Clustering
16 Part 1: 4.1 Apply K-means Clustering
17 Part 1: 4.2 Plot the Elbow curve
18 Part 1: 4.3 Check Silhouette Scores
19 Part 1: 4.4 Figure out the appropriate number of clusters
20 Part 1: 4.5 Cluster Profiling
21 Part 1: Clustering: Actionable Insights and Recommendations
22 Part 1: 5.1 Extract meaningful insights (at least 3) from the clusters to identify the most effective types of ads, target audiences, or marketing strategies that can be inferred from each segment.
TABLES
S.No Description
1 Table 1: First five rows
2 Table 2: Last five rows
3 Table 3: Data types of the dataset
4 Table 4: Statistical summary
5 Table 5: Value counts of Ad-Length
6 Table 6: Value counts of Ad-Width
7 Table 7: Value counts of Ad-Size
8 Table 8: Value counts of Available_Impressions
9 Table 9: Value counts of Matched_Queries
10 Table 10: Value counts of Impressions
11 Table 11: Value counts of Clicks
12 Table 12: Value counts of Spend
13 Table 13: Value counts of Fee
14 Table 14: Value counts of Revenue
15 Table 15: Value counts of CTR
16 Table 16: Value counts of CPM
17 Table 17: Value counts of CPC
18 Table 18: Value counts of Timestamp
19 Table 19: Value counts of InventoryType
20 Table 20: Value counts of Ad Type
21 Table 21: Value counts of Platform
22 Table 22: Value counts of Device Type
23 Table 23: Value counts of Format
24 Table 24: Correlation between all the numerical variables
25 Table 25: Checking duplicate values
26 Table 26: Checking the missing values
27 Table 27: Top five rows
28 Table 28: Check for presence of missing values in each feature
29 Table 29: Number of outliers in each variable
30 Table 30: First five rows of new dataset
31 Table 31: Scaled data (rows transposed as columns)
32 Table 32: Rows of the scaled data
33 Table 33: WSS
34 Table 34: 5 rows of updated data frame
35 Table 35: "sil_width" column of the DataFrame
36 Table 36: Appending clusters to the original dataset
37 Table 37: Profile of each cluster after dropping irrelevant columns
38 Table 38: Comparison of clusters
39 Table 39: Comparison of clusters
40 Table 40: Comparison of clusters
41 Table 41: Comparison of clusters
FIGURES
S.No Description
1 Fig 1: Boxplot and Histogram of Ad-Length
2 Fig 2: Boxplot and Histogram of Ad-Width
3 Fig 3: Boxplot and Histogram of Ad-Size
4 Fig 4: Boxplot and Histogram of Available_Impressions
5 Fig 5: Boxplot and Histogram of Matched Queries
6 Fig 6: Boxplot and Histogram of Impressions
7 Fig 7: Boxplot and Histogram of Clicks
8 Fig 8: Boxplot and Histogram of Spend
9 Fig 9: Boxplot and Histogram of Fee
10 Fig 10: Boxplot and Histogram of Revenue
11 Fig 11: Boxplot and Histogram of CTR
12 Fig 12: Boxplot and Histogram of CPM
13 Fig 13: Boxplot and Histogram of CPC
14 Fig 14: Countplot of Timestamp
15 Fig 15: Countplot of InventoryType
16 Fig 16: Countplot of Ad Type
17 Fig 17: Countplot of Platform
18 Fig 18: Countplot of Device Type
19 Fig 19: Countplot of Format
20 Fig 20: Pairplot of all the numerical variables
21 Fig 21: Heatmap
22 Fig 22: Barplot of Revenue and Ad Type
23 Fig 23: Barplot of Platform and Clicks
24 Fig 24: Barplot of Device Type and Clicks
25 Fig 25: Barplot of Spend and Ad Type
26 Fig 26: Boxplots with outliers
27 Fig 27: Boxplots without outliers
28 Fig 28: Dendrogram
29 Fig 29: Dendrogram with last 10 merged clusters
30 Fig 30: KMeans model
31 Fig 31: Elbow plot
PCA INDEX
S.No Description
1 Part 2: PCA: Define the problem and perform Exploratory Data Analysis
2 Data types
3 Statistical Summary of the data
4 Perform an EDA on the data to extract useful insights
5 Part 2: Data Preprocessing
6 Check for and treat (if needed) missing values
7 Check for and treat (if needed) data irregularities
8 Scale the Data using the z-score method, Visualize the data before and after scaling and comment on the impact on outliers
9 Part 2: PCA
10 Create covariance matrix
11 Get eigenvalues and eigenvectors
12 Identify the optimum number of PCs
13 Show Scree Plot
14 Compare PCs with Actual Columns and identify which is explaining most variance
15 Write inferences about all the PCs in terms of actual variables
16 Write linear equation for first PC
TABLES
S.No Description
1 Table 1: First five rows of the dataset
2 Table 2: Last five rows of the dataset
3 Table 3: Statistical Summary
FIGURES
S.No Description
1 Fig 1: Data types in the dataset
2 Fig 2: Duplicate values
3 Fig 3: Missing values
4 Fig 4: Boxplots of the numerical variables
5 Fig 5: Relationship between No_HH and TOT_M
6 Fig 6: Relationship between No_HH and TOT_F
7 Fig 7: Relationship between No_HH and M_06
8 Fig 8: Relationship between No_HH and F_06
9 Fig 9: Visual analysis of No_HH using barplot
10 Fig 10: Visual analysis of TOT_M using barplot
11 Fig 11: Visual analysis of TOT_F using barplot
12 Fig 12: Visual analysis of M_06 using barplot
13 Fig 13: Visual analysis of F_06 using barplot
14 Fig 14: Missing values
15 Fig 15: Duplicate values
16 Fig 16: Visualization, Distribution and Outliers of each variable
17 Fig 17: Data after applying feature scaling
18 Fig 18: Output of Scaled Data
19 Fig 19: Boxplot after Scaling the Data
20 Fig 20: Presence of correlations
21 Fig 21: Covariance Matrix
22 Fig 22: Transpose of the Covariance Matrix
23 Fig 23: Comparing Correlation and Covariance Matrix
24 Fig 24: Eigenvalues and eigenvectors
25 Fig 25: Scaled Data
26 Fig 26: Scree plot
27 Fig 27: Display the transformed data
28 Fig 28: Reduced data array converted into a DataFrame
29 Fig 29: PCA Loadings
30 Fig 30: DataFrame with coefficients of all PCs
31 Fig 31: How the original features matter to each PC
32 Fig 32: Compare how original features influence various PCs
33 Fig 33: Checking the highest and lowest values
34 Fig 34: Linear Equation for PC1
Problem Statement
Clustering
Problem Definition:
Digital Ads Data: Ads24x7 is a digital marketing company that has recently secured seed funding of 10 million dollars. It is expanding into marketing analytics. The company collected data from its Marketing Intelligence team and now wants you (its newly appointed data analyst) to segment the ads based on the features provided. Use a clustering procedure to segment the ads into homogeneous groups.
Data Description:
* Timestamp : The Timestamp of the particular Advertisement.
* Platform : The platform in which the particular Advertisement is displayed. Web, Video or
App. This is a Categorical Variable.
* Device Type : The type of the device which supports the particular Advertisement. This is a
Categorical Variable.
* Format : The Format in which the Advertisement is displayed. This is a Categorical Variable.
* Available_Impressions : How often the particular Advertisement is shown. An impression is counted each time an Advertisement is shown on a search result page or other site on a Network.
* Matched_Queries : Matched search queries data is pulled from Advertising Platform and
consists of the exact searches typed into the search Engine that generated clicks for the
particular Advertisement.
* Impressions : The impression count of the particular Advertisement out of the total available impressions.
* Clicks : It is a marketing metric that counts the number of times users have clicked on the
particular advertisement to reach an online property.
* Spend : It is the amount of money spent on specific ad variations within a specific campaign
or ad set. This metric helps regulate ad performance.
* Revenue: It is the income that has been earned from the particular advertisement.
* CPM : CPM stands for "cost per 1000 impressions." Formula used here is CPM = (Total
Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign Spend
refers to the 'Spend' Column and the Number of Impressions refers to the 'Impressions'
Column.
* CPC : CPC stands for "Cost-per-click". Cost-per-click (CPC) bidding means that you pay for
each click on your ads. The Formula used here is CPC = Total Cost (spend) / Number of Clicks.
Note that the Total Cost (spend) refers to the 'Spend' Column and the Number of Clicks refers
to the 'Clicks' Column.
The following three features are commonly used in digital marketing:
CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the
Total Campaign Spend refers to the 'Spend' Column in the dataset and the
Number of Impressions refers to the 'Impressions' Column in the dataset.
CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend)
refers to the 'Spend' Column in the dataset and the Number of Clicks refers to the
'Clicks' Column in the dataset.
CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that
the Total Measured Clicks refers to the 'Clicks' Column in the dataset and the
Total Measured Ad Impressions refers to the 'Impressions' Column in the dataset.
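To make the three formulas concrete, here is a minimal pandas sketch that computes them directly from the 'Spend', 'Impressions', and 'Clicks' columns (it assumes the dataset has already been loaded into a DataFrame named df; the name is illustrative):

import pandas as pd

# Compute the three derived metrics from the raw columns
df['CPM'] = (df['Spend'] / df['Impressions']) * 1000   # cost per 1,000 impressions
df['CPC'] = df['Spend'] / df['Clicks']                 # cost per click
df['CTR'] = (df['Clicks'] / df['Impressions']) * 100   # click-through rate, in percent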
Solution:
Import all the libraries. Load and Read the data set.
Data Overview
The initial steps to get an overview of any dataset are to:
* Get information about the number of rows and columns in the dataset
* Observe the first few and last few rows of the dataset, to check whether the
dataset has been loaded properly or not
* Find out the data types of the columns to ensure that data is stored in the
preferred format and the value of each property is as expected.
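A minimal sketch of these overview steps, assuming the data is loaded from a CSV file (the file name is illustrative):

import pandas as pd

df = pd.read_csv('digital_ads_data.csv')  # illustrative file name

print(df.shape)       # number of rows and columns
print(df.head())      # first five rows
print(df.tail())      # last five rows
print(df.dtypes)      # data type of each column
print(df.describe())  # statistical summary of the numerical columns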
Part 1: 1.1 Checking the number of rows and columns present in the
given data frame.
Part 1: 1.2 Checking the first five rows in the given dataframe
Part 1: 1.3 Checking the last five rows in the given data frame
Part 1: 1.4 Checking the data types of the columns for the dataset
Having this information about the data types and attributes is crucial for
performing data analysis, visualization, and modeling tasks. It helps in
understanding the nature of the variables and selecting appropriate techniques
for analysis.
From the above statistical summary, we can see that:
* The averages of Impressions, Clicks, Spend, and Revenue in the given dataset are 1.241520e+06, 10678.518816, 2706.625689, and 1924.252331 respectively.
* The standard deviations of Impressions, Clicks, Spend, and Revenue are 2.429400e+06, 17353.409363, 4067.927273, and 3105.238410 respectively.
* Together, these statistics describe the typical values and the spread of Impressions, Clicks, Spend, and Revenue in the dataset.
These metrics are essential for understanding the distribution and variability of the data, which is valuable for decision-making and further analysis.
Part 1: 1.6 Univariate Analysis
To perform univariate analysis, we will look at boxplots and histograms to get a better understanding of the distributions.
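One way to produce these plots is a small helper that draws a boxplot and a histogram for one variable in a single figure; a minimal sketch (the function name and figure layout are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

def histogram_boxplot(data, feature, bins=30):
    # Boxplot on top, histogram below, sharing the x-axis
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2, sharex=True, gridspec_kw={'height_ratios': (0.25, 0.75)}
    )
    sns.boxplot(x=data[feature], ax=ax_box)
    sns.histplot(data=data, x=feature, bins=bins, ax=ax_hist)
    ax_box.set(xlabel='')
    fig.suptitle(feature)
    plt.show()

histogram_boxplot(df, 'Ad - Length')  # repeat for each numerical variable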
Numerical Variables
Observations on Ad – Length
* Mean Value: The mean value of the "Ad - Length" attribute is 300.
These insights provide a better understanding of the distribution and central
tendency of the "Ad - Length" attribute. They can be useful for further analysis
and decision-making processes related to this attribute.
Observations on Ad-Width
Table 7: Value counts of Ad-Size
* Frequent Occurrences: 72,000 is the most frequently occurring value in the "Ad
Size" attribute.
* Common Values: 216,000, 75,000, and 33,600 are common values observed in
the "Ad Size" attribute.
* Least Occurrence: 180,000 is the least occurring value in the "Ad Size" attribute.
Identifying outliers can be crucial for understanding the data distribution and
ensuring robust analysis.
Observations on Available_Impressions
* Frequent Occurrence: 7 is the most frequently occurring value in the "Available
Impressions" attribute.
* Least Occurrence: 114 is the least occurring value in the "Available Impressions"
attribute.
Identifying outliers can be important for understanding the data distribution and
ensuring accurate analysis.
Observations on Matched_Queries
Fig 5: Boxplot and Histogram of Matched Queries
* Least Occurrence: 197 is the least occurring value in the "Matched Queries"
attribute.
Observations on Impressions
* Frequent Occurrence: 2 is the most frequently occurring value in the
"Impressions" attribute.
* Least Occurrence: 143 is the least occurring value in the "Impressions" attribute.
Observations on Clicks
Fig 7: Boxplot and Histogram of Clicks
* Least Occurrence: 1201 is the least occurring value in the "Clicks" attribute.
Observations on Spend
* Frequent Occurrence: 0.00 is the most frequently occurring value in the "Spend"
attribute.
* Common Values: 0.04, 0.03, and 0.05 are common values observed in the
"Spend" attribute.
* Least Occurrence: 1.43 is the least occurring value in the "Spend" attribute.
Observations on Fee
Fig 9: Boxplot and Histogram of Fee
* Frequent Occurrence: 0.35 is the most frequently occurring value in the "Fee"
attribute.
* Common Values: 0.33, 0.30, and 0.27 are common values observed in the "Fee"
attribute.
* Least Occurrence: 0.21 is the least occurring value in the "Fee" attribute.
Observations on Revenue
Fig 10: Boxplot and Histogram of Revenue
* Common Values: 0.0260 and 0.0195 are common values observed in the
"Revenue" attribute.
* Least Occurrence: 0.9295 is the least occurring value in the "Revenue" attribute.
Observations on CTR
* Frequent Occurrence: 0.0024 is the most frequently occurring value in the "CTR"
attribute.
* Common Values: 0.0025, 0.0023, and 0.0026 are common values observed in
the "CTR" attribute.
* Least Occurrence: 0.1741 is the least occurring value in the "CTR" attribute.
Observations on CPM
Fig 12: Boxplot and Histogram of CPM
* Frequent Occurrence: 1.66 is the most frequently occurring value in the "CPM"
attribute.
* Common Values: 1.62, 1.69, and 1.64 are common values observed in the
"CPM" attribute.
* Least Occurrence: 15.95 is the least occurring value in the "CPM" attribute.
Observations on CPC
* Frequent Occurrence: 1.66 is the most frequently occurring value in the "CPC"
attribute.
* Common Values: 0.09, 0.10, and 0.08 are common values observed in the "CPC"
attribute.
* Least Occurrence: 1.96 is the least occurring value in the "CPC" attribute.
Categorical Variables
Observations on Timestamp
Observations on InventoryType
The low occurrence of 'Inter228' indicates that it might be less prevalent or less frequently encountered compared to the other inventory types.
Observations on Platform
Observations on Device Type
* Least Occurrence: "Desktop" is the least occurring device type in the dataset.
"Mobile" appears to be the dominant device type, occurring with the highest
frequency in the dataset.
The low occurrence of "Desktop" indicates that it might be less prevalent or less
frequently used compared to mobile devices.
Observations on Format
The low occurrence of "Display" indicates that it might be less prevalent or less
frequently used compared to video advertisements.
Multivariate Analysis
To explore the relationship between all the numerical variables using a pairplot,
we have created a scatterplot matrix that visualizes the pairwise relationships
between the variables.
* Correlation: Look for patterns and trends in the scatterplots. Positive correlation
is indicated by points moving upwards to the right, negative correlation by points
moving downwards to the right, and no correlation by a scattered distribution.
* Outliers: Check for any outliers in the data. Outliers might appear as points that
are significantly distant from the main cluster of points in the scatterplots.
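A sketch of the pairplot call (seaborn selects the numerical columns automatically):

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, diag_kind='kde')  # scatterplot matrix of all numerical variables
plt.show()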
To explore the correlation between all the numerical variables, we can calculate
the correlation matrix and visualize it using a heatmap.
Strong Positive Correlation: We can identify pairs of variables with high positive
correlation coefficients (close to 1). This indicates that as one variable increases,
the other variable tends to increase as well.
Strong Negative Correlation: When we look for pairs of variables with high
negative correlation coefficients (close to -1). This suggests that as one variable
increases, the other variable tends to decrease.
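A sketch of the correlation heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)  # correlation matrix of the numerical variables
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()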
Numeric to Categorical
Observations and Insights
Platform vs Clicks
To analyze the relationship between the platform and the number of clicks, we
can create a barplot to visualize the average number of clicks for each platform.
Device Type vs Clicks
Insights for Decision-making: These insights can inform decisions related to
device-specific advertising strategies, budget allocations, and resource allocation
in advertising campaigns, aiming to maximize clicks and user interaction across
different device types.
Ad Type vs Spend
* Cost-effectiveness Assessment: Analyzing the average spend by ad type can help
assess the cost-effectiveness of different advertising formats or channels,
informing decisions on budget allocations and resource optimization.
Insights for Strategy Development: These insights can guide strategic decisions
related to ad type selection, budget planning, and resource allocation in
advertising campaigns, aiming to optimize spending and achieve desired
outcomes.
3. Clicks and Spend:
* There might be a positive correlation between Clicks and Spend, indicating that
higher spending on advertising campaigns tends to result in more clicks.
* A higher CTR might lead to increased revenue, but other factors like conversion
rates and product pricing also influence overall revenue.
* Users might interact differently with ads based on the device they are using,
impacting click-through rates and overall campaign performance.
* Analyzing revenue by ad type can help identify the most profitable advertising
formats and optimize future campaigns accordingly.
7. Platform and Spend:
* Different advertising platforms (e.g., social media, search engines) may require
varying levels of spending to achieve desired campaign objectives.
Part 1: Clustering: Data Preprocessing
We can see that there are no duplicate values in the given dataframe.
Part 1: 2.1 Missing Value Check and Treatment
Missing Value Treatment:
The missing values were treated using the formulas given above as follows by
imputing values from the user defined functions:
CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the
Total Campaign Spend refers to the 'Spend' Column in the dataset and the
Number of Impressions refers to the 'Impressions' Column in the dataset.
CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend)
refers to the 'Spend' Column in the dataset and the Number of Clicks refers to the
'Clicks' Column in the dataset.
CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that
the Total Measured Clicks refers to the 'Clicks' Column in the dataset and the
Total Measured Ad Impressions refers to the 'Impressions' Column in the dataset.
After creating user-defined functions to impute the missing values for CTR, CPM, and CPC, we call each function to fill in the corresponding missing values.
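A minimal sketch of such imputation functions, using the formulas above (the function names are illustrative):

import numpy as np

def impute_ctr(row):
    # Fill CTR only where it is missing, using Clicks and Impressions
    if np.isnan(row['CTR']):
        return (row['Clicks'] / row['Impressions']) * 100
    return row['CTR']

def impute_cpm(row):
    if np.isnan(row['CPM']):
        return (row['Spend'] / row['Impressions']) * 1000
    return row['CPM']

def impute_cpc(row):
    if np.isnan(row['CPC']):
        return row['Spend'] / row['Clicks']
    return row['CPC']

df['CTR'] = df.apply(impute_ctr, axis=1)
df['CPM'] = df.apply(impute_cpm, axis=1)
df['CPC'] = df.apply(impute_cpc, axis=1)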
Find the top five rows of the dataset after treating the missing values:
Fig 26: Boxplots with outliers
From the above output, we can see that all the Variables except Ad – Length and
Ad – Width have outliers as shown by the Box plots.
Even though the k-means clustering technique is well studied, it can have limitations on real-world data, because the k-means objective assumes that all the points can be partitioned into k distinct clusters, which is often an unrealistic assumption in practice. Real-world data contains contamination or noise, and the k-means method may be sensitive to it.
There are possible ways or approaches for reducing noise:
1. Treating outliers using IQR method.
2. Treating outliers using z-score method.
3. Using EDA results to segment the data into two or more parts and then applying the k-means algorithm to each part separately. This method is applicable only if the dataset is large and each part has a reasonable number of data points.
Based on these considerations, for this dataset we will treat outliers using the IQR method and compare the results with a model built without outlier treatment.
Outlier Detection and Treatment using IQR method
In this method, any observation that is less than Q1 - 1.5*IQR or more than Q3 + 1.5*IQR is considered an outlier. To treat the outliers, we defined a function 'remove_outlier':
* The larger values (above the upper whisker) are capped at the 95th percentile value of the distribution.
* The smaller values (below the lower whisker) are capped at the 5th percentile value of the distribution.
We then count the number of outliers in each column using a loop:
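A sketch of the counting and capping steps described above (values beyond the whiskers are capped at the 5th/95th percentiles, as stated; the loop assumes df holds the dataset):

import numpy as np

def remove_outlier(col):
    # Cap values outside the IQR whiskers at the 5th/95th percentiles
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    capped = col.copy()
    capped[col > upper] = col.quantile(0.95)
    capped[col < lower] = col.quantile(0.05)
    return capped

for column in df.select_dtypes(include=np.number).columns:
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    n_out = ((df[column] < q1 - 1.5 * iqr) | (df[column] > q3 + 1.5 * iqr)).sum()
    print(column, n_out)                    # number of outliers in this column
    df[column] = remove_outlier(df[column])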
Fig 27: Boxplots without outliers
From the above output, we can see that the outliers have been removed. Removing outliers from the dataset can have significant implications for subsequent analyses and modeling; while it can be a useful preprocessing step, it should be applied with care to preserve the quality and reliability of the analysis.
Here we drop the categorical variables and check the first five rows of the new dataset after dropping the columns.
Table 31: Scaled data (rows transposed as columns)
* Scaling of variables is important for clustering because it stabilizes the weights of the different variables. If there is a wide discrepancy in the ranges of the variables, cluster formation may be distorted by the weight differential.
* Also, without scaling, the algorithm may be biased toward the variables with the largest values.
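A sketch of the scaling step (num_df is assumed to be the numeric DataFrame obtained after dropping the categorical columns; scaled_df1 matches the variable name used later in this report):

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(num_df)  # z-score: mean 0, standard deviation 1
scaled_df1 = pd.DataFrame(scaled, columns=num_df.columns)
print(scaled_df1.head().T)  # rows transposed as columns, as shown in Table 31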
After importing the dendrogram and linkage modules and choosing the Ward linkage method, the code generates a dendrogram using Ward linkage, one of the most commonly used methods for hierarchical clustering.
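A sketch using scipy's hierarchical-clustering utilities:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

wardlink = linkage(scaled_df1, method='ward')  # Ward linkage, Euclidean distance

plt.figure(figsize=(12, 6))
dendrogram(wardlink)
plt.title('Dendrogram (Ward linkage)')
plt.show()

# Truncated view showing only the last 10 merged clusters
dendrogram(wardlink, truncate_mode='lastp', p=10)
plt.show()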
Alternatively, choosing branches delineated by horizontal lines could result in 3
clusters. For this dataset, we opt for 5 clusters using the dendrogram method.
From the above outputs we can clearly see that WSS decreases as k increases. The WSS (within-cluster sum of squares) values are stored in the list named "wss".
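A sketch of the loop that computes WSS for k = 1 to 10 and draws the Elbow plot:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df1)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WSS')
plt.title('Elbow plot')
plt.show()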
As we increase the number of clusters (k) from 1 to 2, there is a noticeable and
substantial decrease in the within-cluster sum of squares (WSS). Similarly, as we
progress from k=2 to k=3 and then from k=3 to k=4, there are significant drops in
WSS values. However, the reduction in WSS becomes less pronounced when
moving from k=4 to k=5 and from k=5 to k=6. In other words, beyond k=5, there is
no substantial decrease in WSS. Therefore, based on the Elbow plot analysis, we
conclude that 5 is the optimal number of clusters for this dataset.
The KMeans clustering algorithm is applied with 5 clusters on the scaled data
(scaled_df1), and the resulting cluster labels are stored in the variable "labels".
The cluster labels obtained from KMeans clustering are added as a new column
named "Clus_kmeans" to the scaled data (scaled_df1). The first five rows of the
updated dataframe are displayed transposed for better readability.
The silhouette score calculated for the clustering results obtained from KMeans
indicates the overall cohesion and separation of the clusters.
Both Hierarchical Clustering and KMeans Clustering were conducted. The Elbow
plot and Silhouette Score were utilized to identify the optimal number of clusters
in KMeans, while a dendrogram was drawn for Hierarchical Clustering. In the
Hierarchical method, 5 clusters were obtained, whereas in KMeans, 5 clusters
were identified using the elbow plot, and 6 clusters were found using the
silhouette score.
The silhouette width values calculated for each observation in the dataset based
on the clustering results obtained from KMeans are stored in the "sil_width"
column of the DataFrame.
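A sketch of the silhouette computations (the column names match those used in this report):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

kmeans = KMeans(n_clusters=5, random_state=1)
labels = kmeans.fit_predict(scaled_df1)

print(silhouette_score(scaled_df1, labels))   # overall cohesion/separation

sil = silhouette_samples(scaled_df1, labels)  # one silhouette width per row
scaled_df1['Clus_kmeans'] = labels
scaled_df1['sil_width'] = sil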
Table 35: "sil_width" column of the DataFrame
Appending Clusters to the original dataset
Part 1: 4.5 Cluster Profiling
The count of observations in each cluster obtained from KMeans clustering is
displayed, sorted by cluster index.
The profile of each cluster after dropping irrelevant columns and grouping by the
KMeans cluster labels is displayed. The mean values of each variable across
clusters are shown along with the frequency of each cluster.
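A sketch of the profiling step on the original (unscaled) data, with the KMeans labels appended:

# df: original dataset with cluster labels appended as 'Clus_kmeans'
df['Clus_kmeans'] = labels

print(df['Clus_kmeans'].value_counts().sort_index())  # observations per cluster

profile = df.groupby('Clus_kmeans').mean(numeric_only=True)
profile['freq'] = df['Clus_kmeans'].value_counts().sort_index()
print(profile)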
Observations:
1. Clusters 1 and 4 encompass ads with a higher average length compared to the remaining clusters.
2. Clusters 3 and 4 exhibit ads with a significantly greater average width than the other clusters.
Using the hint provided in the rubric, we will plot bar charts by grouping the data by cluster labels and taking the sum or mean of Clicks, Spend, Revenue, CTR, CPC, and CPM based on Device Type.
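A sketch of one such bar chart, grouping by cluster label and device type (the remaining charts swap in Spend, Revenue, CTR, CPC, or CPM):

import matplotlib.pyplot as plt
import seaborn as sns

grouped = (df.groupby(['Clus_kmeans', 'Device Type'])['Clicks']
             .sum()
             .reset_index())
sns.barplot(data=grouped, x='Device Type', y='Clicks', hue='Clus_kmeans')
plt.title('Total clicks by device type and cluster')
plt.show()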
Comparison of Clusters according to device type (x-axis) and total clicks (y-axis)
Observations:
The bar plot highlights that Cluster 3 exhibits the highest number of clicks across
both Desktop and Mobile device types. Additionally, Clusters 0 and 4 demonstrate
significant click counts for both device types.
Comparison of Clusters according to device type (x-axis) and total revenue (y-
axis)
Observations:
From the above barplot, the mobile segment within Cluster 3 generates the most revenue, and these may be considered the best ads. Similarly, the corresponding desktop segment shows the highest revenue among desktop ads.
Comparison of Clusters according to device type (x-axis) and total spend (y-
axis):
Comparison of Clusters according to device type (x-axis) and average CPC, CTR,
CPM
Observations:
CPM stands for "cost per 1000 impressions". In simple words, CPM refers to the amount it costs to have an ad published a thousand times on a website and seen by users. For example, if a website publisher charges a 4.00 dollar CPM, an advertiser must pay 4.00 dollars for every 1,000 impressions of its ads.
CPC stands for Cost Per Click. It is the average amount an advertiser pays for each click on the relevant ad. CPC is also a widely used Google Ads metric that advertisers use to manage their campaign budgets and performance. Say your CPC ads get 2 clicks, one costing 0.40 dollars and the other 0.20 dollars, for a total of 0.60 dollars. Dividing 0.60 dollars by 2 (the total number of clicks) gives an average CPC of 0.30 dollars.
CTR, or Click-Through Rate, measures the success of online ads as the percentage of people who actually clicked on the ad to arrive at the hyperlinked website. For example, if an ad was clicked 200 times after being served 50,000 times, dividing clicks by impressions gives 0.004; multiplying that result by 100 gives a click-through rate of 0.4%.
From the bar plots, it's evident that Clusters 2, 4, and 3 exhibit the highest
average CPM, implying that ads in these clusters are likely placed on expensive
and highly visited websites. Moreover, the average CTR is also highest in these
clusters. Interestingly, there doesn't seem to be a significant difference between
the mobile and desktop segments in terms of CPM and CTR.
Selling ads based on CPM imposes a revenue ceiling. To increase revenue, there's
a need to invest in expanding reach to create more ad opportunities or deliver
more ads to the same users. Conversely, selling ads based on CTR doesn't cap
revenue. Instead, it allows for increasing engagement without necessarily
increasing the number of impressions per person. Thus, with CPM, efforts are
directed towards reaching more people or risking user experience with more ads.
Part 1: Clustering: Actionable Insights and
Recommendations
Part 1: 5.2 Based on the clustering analysis and key insights, provide actionable recommendations (at least 3) to Ads24x7 on how to optimize their digital marketing efforts, allocate budgets efficiently, and tailor ad content to specific audience segments.
Recommendations:
1. Allocate budget strategically based on cluster performance: Ads24x7 should
prioritize allocating a larger portion of their budget to ads targeting clusters with
higher spending and revenue metrics, such as Cluster 3. By focusing resources on
these segments, they can maximize returns on investment and ensure efficient
use of marketing funds.
2. Tailor ad content to engage target audiences: Analyzing the characteristics of
high-performing clusters, such as Clusters 2 and 4 with high CTR, can provide
valuable insights into the preferences and behaviors of specific audience
segments. Ads24x7 should leverage these insights to tailor ad content, messaging,
and creative elements to better resonate with the identified audience
preferences, thereby increasing engagement and driving conversions.
3. Optimize ad placements and channels: Understanding which clusters
demonstrate the highest engagement rates, such as Clusters 2 and 4, can inform
decisions about ad placements and channels. Ads24x7 should consider focusing
their efforts on platforms or channels that are particularly effective in reaching
and engaging these target segments. Additionally, they can experiment with
different placement strategies to maximize exposure and engagement while
minimizing costs.
Problem Statement
PCA:
Part 2: PCA: Define the problem and perform Exploratory Data
Analysis
Problem Definition:
PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes -
2011 PCA for Female Headed Household Excluding Institutional Household. The
Indian Census has the reputation of being one of the best in the world. The first
Census in India was conducted in the year 1872. This was conducted at different
points of time in different parts of the country. In 1881 a Census was taken for
the entire country simultaneously. Since then, Census has been conducted every
ten years, without a break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh after independence, and the second census of the third millennium and twenty-first century. The census has continued uninterrupted despite several adversities like wars, epidemics, natural calamities, political unrest, etc. The Census of India is conducted under the provisions of the Census Act 1948 and the Census Rules, 1990. The Primary Census Abstract, which is an important publication of the 2011 Census, gives basic information on Area, Total Number of Households, Total Population, Scheduled Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates, Main Workers and Marginal Workers classified by the four broad industrial categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other Workers, and also Non-Workers. The characteristics of the Total Population include Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population and are presented by sex and rural-urban residence. Census 2011 covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has many variables, making it difficult to find useful details without using data science techniques. You are tasked to perform a detailed EDA and identify the optimum principal components that explain the most variance in the data. Use sklearn only.
Data Description:
* Name : Name
* No_HH : Number of Households
* F_ST : Scheduled Tribes Population Female
* MAIN_OT_M : Main Other Workers Population Male
* MARG_CL_3_6_F : Marginal Cultivator Population 3-6 Female
* MARG_OT_0_3_M : Marginal Other Workers Population 0-3 Male
Solution:
Data Overview
* get information about the number of rows and columns in the dataset
* observe the first few and last few rows of the dataset, to check whether the
dataset has been loaded properly or not
* find out the data types of the columns to ensure that data is stored in the
preferred format and the value of each property is as expected.
* check the statistical summary of the dataset to get an overview of the numerical
columns of the data
Check Shape
Checking the number of rows and columns present in the given dataframe
The PCA dataset has 640 rows and 61 columns.
Data types
Checking the data types of the columns for the dataset
Fig 1: Data types in the dataset
The dataset comprises 61 attributes, with 2 being of object type and the remaining 59 being of integer type.
Statistical Summary of the data
Table 3: Statistical Summary
Here is the statistical summary of the population data:
* The average number of households across all states is approximately 51,222.87.
* The average total male population is around 79,940.58.
* The average total female population is approximately 122,372.08.
* The average total working male population is about 37,992.41.
* The average total working female population is around 41,295.76.
* The average non-working population of males is approximately 510.01.
* The average non-working population of females is about 704.78.
Checking the duplicate values
We can see that there are no duplicate values in the given dataframe
Based on the provided output, there are no null values detected in the dataset.
Example questions to answer from the EDA: (i) Which state has the highest gender ratio and which has the lowest? (ii) Which district has the highest and lowest gender ratio?
Solution:
As per the given rubric, we need to choose 5 variables out of the given 24 for EDA. We'll choose 'No_HH', 'TOT_M', 'TOT_F', 'M_06', 'F_06'.
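A sketch of how the gender-ratio questions can be answered with pandas (census is an assumed name for the loaded DataFrame; the gender ratio is computed as females per 1,000 males):

# Gender ratio: females per 1,000 males
census['gender_ratio'] = census['TOT_F'] / census['TOT_M'] * 1000

# By state: highest and lowest
state_ratio = census.groupby('State')['gender_ratio'].mean().sort_values()
print(state_ratio.tail(1))  # highest
print(state_ratio.head(1))  # lowest

# By district: highest and lowest (districts are listed under 'Area Name')
by_district = census.sort_values('gender_ratio')
print(by_district[['Area Name', 'gender_ratio']].tail(1))
print(by_district[['Area Name', 'gender_ratio']].head(1))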
Univariate Analysis
The box plots reveal that all variables in the dataset exhibit right skewness and
contain outliers.
Bivariate Analysis
Fig 5: Relationship between No_HH and TOT_M
Fig 6: Relationship between No_HH and TOT_F
Fig 7: Relationship between No_HH and M_06
Fig 8: Relationship between No_HH and F_06
From the bivariate analysis, the scatterplots above show that all variables are positively correlated with each other.
1. Which state has the highest population?
Answer: From the above visual analysis of No_HH using a barplot, we can see that 'Kerala' is the state that has the highest population.
Answer: From the above visual analysis of TOT_M using a barplot, we can see that 'Kerala' is the state that has the highest male population.
Answer: From the above visual analysis of TOT_F using a barplot, we can see that 'Arunachal Pradesh' is the state that has the lowest total female population.
4. Which state has the second highest male child count within the age group of 0-6?
5. Which state has the second lowest female child count within the age group of 0-6?
Answer: From the above visual analysis of F_06 using a barplot, we can see that 'Andaman and Nicobar Islands' is the state with the second lowest female child count within the age group of 0-6.
Part 2: Data Preprocessing
Based on the provided output, there are no null values detected in the dataset.
Scale the Data using the z-score method, Visualize the data before and after
scaling and comment on the impact on outliers
Let us visualize and check the distribution and outliers for each column in the
data
Fig 16: Visualization, Distribution and Outliers of each variable
* By examining the histograms, we can observe the shape of the distribution for
each numerical variable. Skewness values provide quantitative measures of
asymmetry.
* Positive skewness indicates that the right tail of the distribution is longer, while
negative skewness indicates a longer left tail.
* In the box plots, outliers are visualized as individual points that fall outside the
whiskers, which represent the range of typical values.
* Outliers may require further investigation to determine if they are valid data
points or errors. Depending on the context, outliers may be retained, removed, or
treated using appropriate methods.
Treating outliers is typically unnecessary unless they arise from processing errors or incorrect measurements. Genuine outliers, representing extreme or unusual but valid values, must be kept in the data.
The columns 'State' and 'Area Name' are of object data type, contain many unique entries, and would not add value to our analysis. We can drop these columns.
We import the StandardScaler module, create a StandardScaler object, and then fit and transform the data. The transformed data, after applying feature scaling, is stored in the variable scaled_data and produces the output below.
Fig 18: Output of Scaled Data
Visualize after scaling the data:
Observations:
* From the above output, we can see that after scaling the data using the
StandardScaler, the distribution of the variables remains highly skewed to the
right. This indicates that the shape of the distributions is preserved even after
standardization. Additionally, all variables except State Code and Dist. Code still
exhibit outliers, primarily located at the right end of the distributions. These
outliers are not affected by the scaling process, as the purpose of scaling is to
standardize the data's scale, not to remove outliers.
Part 2: PCA
Perform all the required steps for PCA (use sklearn only)
When the dataset contains a large number of numerical variables, visualizing the correlation matrix using a heatmap becomes challenging because of the sheer number of entries.
Create covariance matrix
Fig 23: Comparing Correlation and Covariance Matrix
Finally, we can say that after scaling, the covariance and the correlation matrices have the same values.
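A sketch of the covariance-matrix step on the scaled data (scaled_data is the array produced by the StandardScaler above):

import numpy as np

cov_matrix = np.cov(scaled_data.T)        # covariance of the scaled features
corr_matrix = np.corrcoef(scaled_data.T)  # correlation of the same features

# After z-scoring, the two matrices agree (up to the n vs n-1 normalization)
print(np.allclose(cov_matrix, corr_matrix, atol=1e-2))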
Fig 24: Eigenvalues and eigenvectors
Fig 25: Scaled Data
The output above represents the scaled dataset, where each row corresponds to
a specific observation, and each column represents a different feature or variable
after scaling. This transformation ensures that all variables have a mean of
approximately 0 and a standard deviation of approximately 1. By transposing the
data, we can easily examine the scaled values for each feature across different
observations. This scaled data is now suitable for further analysis.
* Iterated through the explained variance ratios to find the minimum number of
principal components required to explain at least 90% of the variance in the data.
* Printed the number of principal components that collectively explain more than
90% of the variance.
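A sketch of the eigen-decomposition and the search for the number of PCs that explain at least 90% of the variance, using the covariance matrix computed above, followed by the scree plot shown in the next section:

import numpy as np
import matplotlib.pyplot as plt

eig_vals, eig_vecs = np.linalg.eig(cov_matrix)  # eigenvalues and eigenvectors
order = np.argsort(eig_vals)[::-1]              # sort by decreasing eigenvalue
eig_vals = eig_vals[order]

explained = eig_vals / eig_vals.sum()           # variance ratio per component
cumulative = np.cumsum(explained)
n_pcs = int(np.argmax(cumulative >= 0.90) + 1)  # smallest count reaching 90%
print(n_pcs)

plt.plot(range(1, len(eig_vals) + 1), explained, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Variance explained')
plt.title('Scree plot')
plt.show()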
Show Scree Plot
Visually, we can observe a steep drop in the variance explained as the number of PCs increases.
We use scikit-learn's PCA here; it performs all of the above steps and maps the data to the PCA dimensions in one shot. We instantiate PCA with 6 components and a fixed random state, fit and transform the scaled data, and display the transformed data as data_reduced, as sketched below.
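A sketch of the equivalent scikit-learn steps (feature_names is an assumed list of the original column names; other variable names follow the report):

import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=6, random_state=1)
data_reduced = pca.fit_transform(scaled_data)  # map the data onto the 6 PCs

pc_labels = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6']
df_reduced = pd.DataFrame(data_reduced, columns=pc_labels)

# Loadings: how each original feature contributes to each PC
loadings = pd.DataFrame(pca.components_, columns=feature_names, index=pc_labels)
print(pca.explained_variance_ratio_)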
Convert the reduced data array into a DataFrame
Fig 29: PCA Loadings
Percentage of variance explained by each principal component: for this project, we need to cover at least 90% of the explained variance, so the cut-off for selecting the number of PCs is 6.
Fig 30: DataFrame with coefficients of all PCs
Check how the original features matter to each PC
Note: Here we are only considering the absolute values
The output visualizes how the original features contribute to each Principal
Component (PC) by considering only the absolute values of the loadings. Each
subplot represents one original feature, and the bars indicate the absolute
loadings of that feature on the corresponding PC.
The height of the bars represents the magnitude of the influence of each feature
on the PC. A taller bar indicates a stronger influence, while a shorter bar suggests
a weaker influence. By examining these bar plots, we can understand which
original features contribute the most to each PC in terms of their absolute
loadings. This information helps us interpret the underlying structure captured by
each PC and identify the key features driving the variability in the data.
Compare how the original features influence various PCs
By comparing the loadings of the original features across different PCs, we can
gain insights into the underlying structure of the data and understand which
features are driving the variability captured by each PC.
Fig 33: Checking the highest and lowest values
Write inferences about all the PCs in terms of actual variables
To infer the meaning of principal components (PCs) in terms of the original
variables, we typically examine the loadings of each variable on each PC. Loadings
represent the correlation between the original variables and the principal
components. Here are the inferences about all the PCs based on their loadings:
PC1: This component captures variance primarily related to the variables with the highest loadings on it; in census data of this kind, such a component typically reflects overall size measures, for example total population and household counts.
PC2: Variables with high loadings on PC2 contribute the most to its variance and may distinguish a different group of related census variables.
PC4: Variables strongly correlated with PC4 contribute to its variance.
PC5: This component captures variance primarily from the variables with high loadings on PC5.
PC6: Variables with high loadings on PC6 contribute the most to its variance.
These interpretations provide insights into the underlying factors driving variance
in the dataset and help understand the relationships between the original
variables and the principal components.
Write linear equation for first PC
The linear equation for the first principal component (PC1), we can use the
loadings of each original variable on PC1. The equation represents a weighted
sum of the original variables, where the weights are given by the loadings.
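A sketch that assembles the PC1 equation from the loadings DataFrame built in the PCA step above:

# Build "PC1 = w1*X1 + w2*X2 + ..." from the first row of loadings
pc1 = loadings.loc['PC1']
terms = [f"({weight:+.3f} * {name})" for name, weight in pc1.items()]
print("PC1 =", " + ".join(terms))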