Data Mining Business Report Set
ads24x7 is a digital marketing company that has received seed funding of $10 million. It is
expanding into marketing analytics. The company collected data from its Marketing Intelligence team
and now wants you (its newly appointed data analyst) to segment the types of ads based on the features
provided. Use a clustering procedure to segment the ads into homogeneous groups.
CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign
Spend refers to the 'Spend' Column in the dataset and the Number of Impressions refers to the 'Impressions'
Column in the dataset.
CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend) refers to the 'Spend'
Column in the dataset and the Number of Clicks refers to the 'Clicks' Column in the dataset.
CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100. Note that the Total Measured
Clicks refers to the 'Clicks' Column in the dataset and the Total Measured Ad Impressions refers to the
'Impressions' Column in the dataset.
The Data Dictionary and the detailed description of the formulas for CPM, CPC and CTR are given
in sheet 2 of the Clustering Clean ads_data Excel File.
1.2 Assumptions & Solutions:
1.2.1 Clustering: Read the data and perform basic analysis such as printing a few rows
(head and tail), info, data summary, null values, duplicate values, etc.
Summarizing the Digital Ads Data upon Exploratory Data Analysis
Data Summary
Statistical Description
Duplicate Checks
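The checks listed above can be sketched with pandas. This is only a minimal illustration on a small made-up frame: the column names `Impressions`, `Clicks`, `Spend` and `CTR` come from the report, while the values and the variable name `ads` are assumptions and do not reproduce the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the ads data (values are made up).
ads = pd.DataFrame({
    "Impressions": [1000, 2500, 400, 2500],
    "Clicks": [10, 50, 4, 50],
    "Spend": [5.0, 20.0, 1.0, 20.0],
    "CTR": [np.nan, 2.0, np.nan, 2.0],
})

print(ads.head(2))      # first rows
print(ads.tail(2))      # last rows
ads.info()              # column dtypes and non-null counts
print(ads.describe())   # statistical description of numeric columns

# Number of null values per column.
null_counts = ads.isnull().sum()
# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(ads.duplicated().sum())

print(null_counts)
print("duplicate rows:", duplicate_rows)
```

In this toy frame the second and fourth rows are identical, so one duplicate is reported; the report itself found no duplicates in the real data.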
1.2.2 Clustering: Treat missing values in CPC, CTR and CPM using the formula given.
The null value analysis of the dataset shows that the columns CTR, CPM and CPC
contain around 4736 null values. We treat these missing values using the CPM, CPC and CTR formulas given above.
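A reusable function for this treatment might look like the sketch below. It applies the CPM, CPC and CTR formulas stated earlier in the report only where a value is missing. The function name `fill_ad_metrics` and the sample values are hypothetical; the column names follow the report.

```python
import numpy as np
import pandas as pd

def fill_ad_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CPM, CPC and CTR from Spend, Impressions and Clicks."""
    out = df.copy()
    # Formulas as stated in the report.
    cpm = out["Spend"] / out["Impressions"] * 1000
    cpc = out["Spend"] / out["Clicks"]
    ctr = out["Clicks"] / out["Impressions"] * 100
    # fillna only touches the rows where the metric is null;
    # existing values are left unchanged.
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    out["CTR"] = out["CTR"].fillna(ctr)
    return out

# Made-up rows: the second one has all three metrics missing.
sample = pd.DataFrame({
    "Spend": [5.0, 20.0],
    "Impressions": [1000, 2500],
    "Clicks": [10, 50],
    "CPM": [5.0, np.nan],
    "CPC": [0.5, np.nan],
    "CTR": [1.0, np.nan],
})
filled = fill_ad_metrics(sample)
print(filled)
```

Note that a row with zero clicks or zero impressions would produce an infinite or undefined value under these formulas, so such rows would need separate handling.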
1.2.3 Clustering: Check if there are any outliers. Do you think treating outliers is necessary
for K-Means clustering? Based on your judgement decide whether to treat outliers
and if yes, which method to employ. (As an analyst your judgement may be different
from another analyst).
An outlier is “an observation that deviates so much from other observations as to arouse suspicion
that it was generated by a different mechanism” (Hawkins 1980).
In the k-means based outlier detection technique, the data are partitioned into k groups by assigning
each point to the closest cluster centre.
Once the points are assigned, we can compute the distance or dissimilarity between each object and its cluster
centre and pick those with the largest distances as outliers; these are the points that distort the cluster centroids.
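The distance-to-centre idea can be sketched as follows, assuming scikit-learn is available. The data are synthetic (two tight groups plus one deliberately distant point), and the choice of k = 2 is for illustration only; none of this comes from the ads dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight synthetic groups plus one far-away point (row index 40).
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
far_point = np.array([[2.5, 12.0]])
X = np.vstack([blob_a, blob_b, far_point])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to the centre of the cluster it was assigned to.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Index of the point farthest from its own centre (the outlier candidate).
suspects = np.argsort(dist)[::-1][:1]
print(suspects, dist[suspects])
```

How many of the farthest points to flag, or which distance threshold to use, remains a judgement call for the analyst.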
We can draw box plots of the numerical columns in the data set to identify the presence
of outliers.
Upon visualising the data using box plots, we could see that most of the columns do have outliers,
so we apply outlier treatment based on the Interquartile Range (IQR).
The IQR measures variability by dividing a data set into quartiles. The data is sorted in ascending
order and split into 4 equal parts. Q1, Q2 and Q3, called the first, second and third quartiles, are the values
which separate the 4 equal parts. The IQR is the difference Q3 - Q1.
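One common way to apply IQR treatment is to cap values outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The sketch below assumes that convention; the multiplier 1.5, the function name `cap_outliers_iqr` and the sample values are assumptions, since the report does not state the exact rule used.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # clip replaces values beyond a fence with the fence value itself.
    return series.clip(lower=lower, upper=upper)

# Made-up values with one obvious extreme.
spend = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
capped = cap_outliers_iqr(spend)
print(capped.tolist())
```

Capping keeps every row in the dataset, which is convenient for clustering; dropping the rows instead would be an alternative design choice.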
Post Treatment:
1.2.4 Clustering: Perform z-score scaling and discuss how it affects the speed of the
algorithm.
Standardization/normalization of data in statistics refers to the process of rescaling the values of the
variables in your data set so that all variables share a common scale.
There are two main scaling practices, Standard Scaler and Min-Max Scaler, performed as a
pre-processing step, particularly for cluster analysis. This standardization is important if we are working
with a dataset where each variable has a different unit (e.g., inches, meters, tons and kilograms), or
where the scales of the variables differ from one another (e.g., 0-1 vs 0-1000).
This matters in cluster analysis because the cluster groups are formed based on
the distance between points in mathematical space, so a variable with a larger scale would otherwise dominate the distance.
Standardization (or Z-score normalization) is the process of rescaling the features so that they have
the location and scale of a standard normal distribution, with μ = 0 and σ = 1,
where μ is the mean and σ is the standard deviation. With μ and σ taken from the original feature, the
standard scores (also called z scores) of the samples are calculated as follows:
z = (x – μ) / σ
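The formula can be checked with a short NumPy sketch. The function name `zscore` and the sample matrix are made up for illustration; the population standard deviation (ddof = 0) is used, which is also what scikit-learn's StandardScaler does.

```python
import numpy as np

def zscore(X: np.ndarray) -> np.ndarray:
    """Column-wise z = (x - mu) / sigma."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    # A constant column would have sigma = 0 and needs separate handling.
    return (X - mu) / sigma

# Two made-up features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
Z = zscore(X)
print(Z)
print(Z.mean(axis=0), Z.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance.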
Hierarchical (agglomerative) clustering is a set of methods that recursively merge two clusters at a time. The Euclidean
distance is the square root of the sum of squared differences between the two vectors. A dendrogram is a diagram that
shows the hierarchical relationship between objects.
The linkage function specifies how the distance between two clusters is computed. Under Ward's method, it is the increase
in the “error sum of squares” (ESS) after fusing two clusters into a single cluster. Ward's method
seeks to choose the successive clustering steps so as to minimize the increase in ESS at each step.
Based on the dendrogram, we can assume that 5 clusters overall would be optimum for this dataset.
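Ward linkage on Euclidean distances can be sketched with SciPy as shown below. The data are three synthetic groups, not the ads data, and cutting the tree at 3 clusters is specific to this toy example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three well-separated synthetic groups of 15 points each.
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(15, 2)) for c in centres])

# Ward linkage; the metric defaults to Euclidean distance.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(Z.shape)
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree (this needs matplotlib); the number of clusters is then read off by looking for the largest vertical gaps between merges.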
1.2.6 Clustering: Make Elbow plot (up to n=10) and identify optimum number of clusters
for k-means algorithm.
From the elbow plot and the Within Sum of Squares (WSS) values calculated, we can see that K = 5,
followed by K = 4, marks the optimum number of clusters.
Based on the various measures, we conclude that the optimum number of clusters is 5.
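The WSS values behind an elbow plot, together with the silhouette score, can be computed as in the sketch below, assuming scikit-learn. The data are four synthetic groups, so the best k found here applies to this toy example only and says nothing about the ads dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Four well-separated synthetic groups of 25 points each.
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(25, 2)) for c in centres])

wss = {}  # within-cluster sum of squares for each k
sil = {}  # silhouette score for each k (defined only for k >= 2)
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = km.inertia_
    if k >= 2:
        sil[k] = silhouette_score(X, km.labels_)

# k with the highest silhouette score.
best_k = max(sil, key=sil.get)

for k in wss:
    print(k, round(wss[k], 2), round(sil[k], 3) if k in sil else "n/a")
print("best k by silhouette:", best_k)
```

Plotting `wss` against k gives the elbow plot; the elbow is the k after which the drop in WSS flattens out, and the silhouette score offers a second opinion.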
Summary of the original data set along with the cluster information:
- The Digital Ads dataset has a total of 23066 rows and 19 columns.
- There are no duplicates.
The null values in CPC, CTR and CPM are treated using reusable user-defined
functions based on the provided formulas.
Once treated, the data had to be checked for outliers; we used the IQR treatment to
update the outliers.
The dataset contains various components such as Spend, number of Clicks and Impressions,
which are all measured on different scales. To perform a successful clustering of the
data, we had to standardize/scale the data to a common measure.
The dendrogram helped to visualize and identify the linkage used for computing the distances and
merging the clusters from n to 1. The linkage function specifies how the distance
between two clusters is computed. Under Ward's method, it is the increase in the “error sum of squares” (ESS) after
fusing two clusters into a single cluster. Ward's method seeks to choose the successive
clustering steps so as to minimize the increase in ESS at each step.
The elbow plot, drawn using the Within Sum of Squares (WSS) and the Silhouette score for each
number of clusters, leads to the observations below:
- We can see that the drop becomes less significant when we move from k = 6 to
k = 10.
- Hence we can conclude that the significant drop in the value stops as we move
from k=4 to k=5 and from k=5 to k=6. In other words, the measure used to determine the number
of clusters does not drop significantly beyond 5.
- In conclusion, k=5 is the optimal number of clusters into which the Digital
Ads data set can be grouped.