Data Mining Business Report Set


1 Part 1: Clustering

1.1 Problem Statement:


Digital Ads Data:

ads24x7 is a digital marketing company that has recently secured seed funding of $10 million and is
expanding into Marketing Analytics. The company collected data from its Marketing Intelligence team
and now wants you (its newly appointed data analyst) to segment the types of ads based on the features
provided. Use a clustering procedure to segment the ads into homogeneous groups.

The following three features are commonly used in digital marketing:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign
Spend refers to the 'Spend' Column in the dataset and the Number of Impressions refers to the 'Impressions'
Column in the dataset.

CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend) refers to the 'Spend'
Column in the dataset and the Number of Clicks refers to the 'Clicks' Column in the dataset.

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that the Total Measured
Clicks refers to the 'Clicks' Column in the dataset and the Total Measured Ad Impressions refers to the
'Impressions' Column in the dataset.

The Data Dictionary and the detailed description of the formulas for CPM, CPC and CTR are given
in the sheet 2 of the Clustering Clean ads_data Excel File.
1.2 Assumption & Solutions:
1.2.1 Clustering: Read the data and perform basic analysis such as printing a few rows
(head and tail), info, data summary, null values duplicate values, etc.
Summarizing the Digital Ads Data upon Exploratory Data Analysis

- Shape of the dataset

- A glimpse of the first 5 rows of the data

- Data summary

- Statistical description

- Duplicate checks

- Null value checks

1.2.2 Clustering: Treat missing values in CPC, CTR and CPM using the formula given.

Upon the null value analysis of the dataset, we can see that the columns CTR, CPM and CPC each
contain around 4736 null values. The missing values are treated using the formulas given below:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000

CPC = Total Cost (spend) / Number of Clicks

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100
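The formula-based imputation above can be sketched in pandas as follows. This is a minimal sketch on a toy frame; the column names 'Spend', 'Impressions', 'Clicks', 'CPM', 'CPC' and 'CTR' are assumed from the data dictionary described earlier.

```python
import pandas as pd

# Toy frame mimicking the ads data; real column names assumed per the data dictionary.
df = pd.DataFrame({
    "Spend":       [120.0, 80.0, 50.0],
    "Impressions": [40000, 20000, 10000],
    "Clicks":      [400, 250, 100],
    "CPM":         [None, 4.0, None],   # missing values to be derived
    "CPC":         [0.30, None, None],
    "CTR":         [None, None, 1.0],
})

# Fill each derived metric only where it is missing, using the given formulas.
df["CPM"] = df["CPM"].fillna(df["Spend"] / df["Impressions"] * 1000)
df["CPC"] = df["CPC"].fillna(df["Spend"] / df["Clicks"])
df["CTR"] = df["CTR"].fillna(df["Clicks"] / df["Impressions"] * 100)
```

Because `fillna` only touches nulls, any CPM/CPC/CTR values already present in the dataset are left unchanged.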

Updated Dataset statistical description,


Rechecking the null count over the dataset

1.2.3 Clustering: Check if there are any outliers. Do you think treating outliers is necessary
for K-Means clustering? Based on your judgement decide whether to treat outliers
and if yes, which method to employ. (As an analyst your judgement may be different
from another analyst).

An outlier is “an observation that deviates so much from other observations as to arouse suspicion
that it was generated by a different mechanism” (Hawkins 1980).

In the k-means based outlier detection technique, the data are partitioned into k groups by assigning
them to the closest cluster centres.

Once assigned, we can compute the distance or dissimilarity between each object and its cluster
centre, and pick those with the largest distances as outliers, since they will distort the cluster centroids.

We can use box plots over the numerical columns in the dataset to identify the presence of outliers.
Upon visualising the data with box plots, we can see that most of the columns do have outliers,
so we treat them using the Interquartile Range (IQR) method.

The IQR is used to measure variability by dividing a data set into quartiles. The data is sorted in
ascending order and split into 4 equal parts. Q1, Q2 and Q3, called the first, second and third
quartiles, are the values which separate the 4 equal parts.

- Q1 represents the 25th percentile of the data.

- Q2 represents the 50th percentile of the data.

- Q3 represents the 75th percentile of the data.
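The IQR treatment described above can be sketched as a small helper that caps values outside the usual 1.5 × IQR fences. The helper name `treat_outliers_iqr` and the toy series are illustrative, not from the original analysis.

```python
import pandas as pd

def treat_outliers_iqr(s: pd.Series) -> pd.Series:
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.clip(lower=lower, upper=upper)

s = pd.Series([1, 2, 3, 4, 5, 100])   # 100 is an obvious outlier
capped = treat_outliers_iqr(s)         # 100 is pulled down to the upper fence
```

Capping (rather than dropping) preserves the row count, which matters when the same rows later carry the cluster labels back to the original data.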

Post Treatment:
1.2.4 Clustering: Perform z-score scaling and discuss how it affects the speed of the
algorithm.
Standardization/normalization of data in statistics refers to the process of rescaling the values of the
variables in your data set so that all variables share a common scale.

There are two main scaling practices, standard scaling (StandardScaler) and min-max scaling
(MinMaxScaler), performed as a pre-processing step, particularly for cluster analysis. This
standardization is important if we are working with a dataset where each variable has a different
unit (e.g., inches, meters, tons and kilograms), or where the scales of the variables differ from
one another (e.g., 0-1 vs 0-1000).

The important reason behind this in cluster analysis is that the cluster groups are formed based on
the distance between points in mathematical space; scaling also helps k-means converge faster,
since no single large-scale feature dominates the distance computations.

Descriptive Summary before Scaling the Dataset:


Post applying z-score (StandardScaler) standardization:

Standardization (or Z-score normalization) is the process of rescaling the features so that they’ll have
the properties of a Gaussian distribution with

μ=0 and σ=1

where μ is the mean and σ is the standard deviation from the mean; standard scores (also called z
scores) of the samples are calculated as follows:

z = (x – μ) / σ
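The z-score formula above is what scikit-learn's `StandardScaler` applies per column. A minimal sketch on a toy matrix with two very differently scaled features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales, standing in for e.g. CPC vs Impressions.
X = np.array([[10.0, 1000.0],
              [20.0, 3000.0],
              [30.0, 5000.0]])

scaler = StandardScaler()          # z = (x - mean) / std, computed per column
X_scaled = scaler.fit_transform(X)
```

After scaling, every column has mean 0 and standard deviation 1, so both features contribute comparably to Euclidean distances.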

1.2.5 Clustering: Perform Hierarchical by constructing a Dendrogram using WARD and


Euclidean distance, and identify optimum number of clusters

Hierarchical clustering is a set of methods that recursively merge two clusters at a time. The
Euclidean distance is the straight-line distance between two vectors. A dendrogram is a diagram
that shows the hierarchical relationship between objects.

The linkage function specifies how the distance between two clusters is computed; with Ward linkage
it is the increase in the "error sum of squares" (ESS) after fusing two clusters into a single
cluster. Ward's method chooses the successive clustering steps so as to minimize the increase in
ESS at each step. Based on the dendrogram, we can assume that 5 clusters would be optimum for this dataset.
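The Ward/Euclidean hierarchical clustering described above can be sketched with SciPy. The two synthetic blobs here stand in for the scaled ads features; `scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree that this report inspects.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs stand in for the scaled ads features.
X = np.vstack([rng.normal(0, 0.5, (20, 3)),
               rng.normal(5, 0.5, (20, 3))])

# Ward linkage with Euclidean distance, as used in the report.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree at 2 clusters (the report cuts its dendrogram at 5).
labels = fcluster(Z, t=2, criterion="maxclust")
```

Ward linkage requires Euclidean distance, since the ESS criterion is defined in terms of squared Euclidean deviations from cluster centroids.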

1.2.6 Clustering: Make Elbow plot (up to n=10) and identify optimum number of clusters
for k-means algorithm.
From the elbow plot and the Within Sum of Squares (WSS) values calculated, we can see that K = 5,
followed by K = 4, marks the optimum value to be chosen for the number of clusters.

The WSS value for 1 cluster is 230660.00
The WSS value for 2 clusters is 128187.65
The WSS value for 3 clusters is 96112.61
The WSS value for 4 clusters is 68272.81
The WSS value for 5 clusters is 41872.74
The WSS value for 6 clusters is 32963.19
The WSS value for 7 clusters is 27158.69
The WSS value for 8 clusters is 22917.36
The WSS value for 9 clusters is 20244.29
The WSS value for 10 clusters is 17342.99
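The WSS sequence above comes from fitting k-means for k = 1..10 and reading off the inertia. A minimal sketch on synthetic data with five planted blobs (the real report runs this on the scaled ads features):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Five well-separated 2-D blobs, 30 points each.
X = np.vstack([rng.normal(i * 5, 0.5, (30, 2)) for i in range(5)])

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)   # within-cluster sum of squares for this k
```

Plotting `range(1, 11)` against `wss` (e.g. with matplotlib) gives the elbow plot; the "elbow" is where additional clusters stop buying a large WSS reduction.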
1.2.7 Clustering: Print silhouette scores for up to 10 clusters and identify optimum number
of clusters.
From the elbow plot and the silhouette scores obtained, we can see that K = 5, followed by K = 4,
marks the optimum value to be chosen for the number of clusters.

For n_clusters=2, the silhouette score is 0.4843
For n_clusters=3, the silhouette score is 0.4146
For n_clusters=4, the silhouette score is 0.5109
For n_clusters=5, the silhouette score is 0.5726
For n_clusters=6, the silhouette score is 0.5812
For n_clusters=7, the silhouette score is 0.5857
For n_clusters=8, the silhouette score is 0.5873
For n_clusters=9, the silhouette score is 0.5908
For n_clusters=10, the silhouette score is 0.5991
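The silhouette table above is produced by scoring a k-means fit for each candidate k. A minimal sketch on synthetic data with five planted blobs, where the silhouette score peaks at the true cluster count:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Five tight, well-separated 2-D blobs, 30 points each.
X = np.vstack([rng.normal(i * 6, 0.5, (30, 2)) for i in range(5)])

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)
```

Silhouette scores lie in [-1, 1]; values near 1 mean points sit well inside their own cluster and far from the next-nearest one.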
1.2.8 Clustering: Profile the ads based on optimum number of clusters using silhouette
score and your domain understanding [Hint: Group the data by clusters and take sum
or mean to identify trends in Clicks, spend, revenue, CPM, CTR, & CPC based on
Device Type. Make bar plots].

Upon the various measures, we have come to the conclusion that the optimum number of clusters is 5.

Summary of the original data set along with the cluster information:

Summary of the cluster based analysis:

Bar plots to analyse each of the components across Device Type:
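The grouping behind these summaries and bar plots can be sketched as below. The toy frame and the lowercase `cluster` column name are illustrative; the real frame carries all 19 columns plus the assigned cluster labels.

```python
import pandas as pd

# Toy version of the clustered ads data; the real frame has many more columns.
df = pd.DataFrame({
    "cluster":     [0, 0, 1, 1, 2],
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile", "Desktop"],
    "Clicks":      [100, 150, 400, 380, 50],
    "Spend":       [10.0, 12.0, 45.0, 40.0, 5.0],
    "CTR":         [1.0, 1.2, 2.5, 2.4, 0.5],
})

# Mean of each metric per cluster reveals how the segments differ.
profile = df.groupby("cluster")[["Clicks", "Spend", "CTR"]].mean()

# Same metrics split by cluster and device type, ready for bar plots,
# e.g. profile_device["Clicks"].unstack().plot(kind="bar").
profile_device = df.groupby(["cluster", "Device Type"]).mean(numeric_only=True)
```

Using `sum` instead of `mean` in the same `groupby` answers the "total spend/revenue per cluster" variant of the hint.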


1.2.9 Clustering: Conclude the project by providing summary of your learnings.

- Upon performing the EDA on the provided dataset:

  - The Digital Ads dataset has a total of 23066 rows and 19 columns.

  - The 19 columns are classified as float64 (6), int64 (7) and object (6).

  - There are no duplicates.

  - There are null values in the columns CPC, CTR and CPM.

- The null values in CPC, CTR and CPM are treated using reusable user-defined
  functions implementing the given formulas:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000


CPC = Total Cost (spend) / Number of Clicks

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100

- Once treated, the data had to be checked for outliers; we used the IQR method to
  cap the outliers.

- The dataset contains components such as Spend, number of clicks and Impressions,
  which are all measured on different scales; to perform a successful clustering of
  the data, we had to standardize/scale the data to a common measure.

- The dendrogram helped to visualize and identify the linkage used for computing
  distances and merging the clusters from n down to 1. With Ward linkage, the
  distance between two clusters is the increase in the "error sum of squares"
  (ESS) after fusing them into a single cluster; Ward's method chooses the
  successive clustering steps so as to minimize this increase at each step.

- From the elbow plot built using the Within Sum of Squares (WSS) and the silhouette
  score for each number of clusters, we can make the following observations:

  - We can see a significant drop when we move from k = 1 to k = 2.

  - The drop becomes much less significant as we move from k = 6 to k = 10.

  - The significant drops stop after the steps from k = 4 to k = 5 and k = 5 to
    k = 6; in other words, the measure used to determine the number of clusters
    does not drop significantly beyond 5.

  - In conclusion, k = 5 is the optimal number of clusters for grouping the
    Digital Ads dataset.
