Data Mining Business Report Set

Uploaded by

priyada16


1 Part 1: Clustering

1.1 Problem Statement:

Digital Ads Data:

ads24x7 is a digital marketing company that has received seed funding of $10 million. It is
expanding into marketing analytics. The company collected data from its Marketing Intelligence team
and now wants you (its newly appointed data analyst) to segment the types of ads based on the features
provided. Use a clustering procedure to segment the ads into homogeneous groups.

The following three features are commonly used in digital marketing:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign
Spend refers to the 'Spend' Column in the dataset and the Number of Impressions refers to the 'Impressions'
Column in the dataset.

CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend) refers to the 'Spend'
Column in the dataset and the Number of Clicks refers to the 'Clicks' Column in the dataset.

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that the Total Measured
Clicks refers to the 'Clicks' Column in the dataset and the Total Measured Ad Impressions refers to the
'Impressions' Column in the dataset.

The data dictionary and the detailed description of the formulas for CPM, CPC and CTR are given
in sheet 2 of the Clustering Clean ads_data Excel file.
1.2 Assumption & Solutions:
1.2.1 Clustering: Read the data and perform basic analysis such as printing a few rows
(head and tail), info, data summary, null values, duplicate values, etc.
Summary of the Digital Ads data from exploratory data analysis:

• Shape of the data set

• A glimpse of the first 5 rows of the data

• Data summary

• Statistical description

• Duplicate checks

• Null value checks
1.2.2 Clustering: Treat missing values in CPC, CTR and CPM using the formula given.

From the null value analysis of the dataset, we could see that the columns CTR, CPM and CPC
contain around 4736 null values. We treat the missing values using the formulas given below:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000

CPC = Total Cost (spend) / Number of Clicks

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100
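The three formulas above can be applied to the missing cells with a short helper. The following is a minimal sketch, assuming the data is held in a pandas DataFrame whose columns are named 'Spend', 'Impressions', 'Clicks', 'CPM', 'CPC' and 'CTR' as in the problem statement. The function name fill_ad_metrics and the toy rows are hypothetical and are not taken from the report.

```python
import numpy as np
import pandas as pd


def fill_ad_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CPM, CPC and CTR values from Spend, Impressions and Clicks."""
    out = df.copy()
    # Treat zero denominators as missing so that a division never yields inf.
    impressions = out["Impressions"].replace(0, np.nan)
    clicks = out["Clicks"].replace(0, np.nan)
    # fillna only touches cells that are null; existing values are kept.
    out["CPM"] = out["CPM"].fillna(out["Spend"] / impressions * 1000)
    out["CPC"] = out["CPC"].fillna(out["Spend"] / clicks)
    out["CTR"] = out["CTR"].fillna(out["Clicks"] / impressions * 100)
    return out


# Toy rows with made-up numbers (not from the ads dataset).
ads = pd.DataFrame({
    "Spend": [50.0, 20.0],
    "Impressions": [10000, 4000],
    "Clicks": [100, 40],
    "CPM": [np.nan, 5.0],
    "CPC": [np.nan, 0.5],
    "CTR": [np.nan, 1.0],
})
filled = fill_ad_metrics(ads)
print(filled[["CPM", "CPC", "CTR"]])
```

Rows with zero impressions or zero clicks stay null in this sketch, because the formulas are undefined for them; how to handle such rows is a separate judgement call.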

Updated dataset statistical description:

Rechecking the null count over the dataset:

1.2.3 Clustering: Check if there are any outliers. Do you think treating outliers is necessary
for K-Means clustering? Based on your judgement decide whether to treat outliers
and if yes, which method to employ. (As an analyst your judgement may be different
from another analyst).

An outlier is “an observation that deviates so much from other observations as to arouse suspicion
that it was generated by a different mechanism” (Hawkins 1980).

In the k-means based outlier detection technique, the data are partitioned into k groups by assigning
each point to the closest cluster centre.

Once the points are assigned, we can compute the distance or dissimilarity between each object and its
cluster centre and pick those with the largest distances as outliers. Because each centroid is the mean
of its cluster, such points pull the cluster centroids towards themselves.

We can draw box plots of the numerical columns in the data set to identify the presence
of outliers.

Upon visualising the data using box plots, we could see that most of the columns do have outliers,
so we treat them using the interquartile range (IQR).

The IQR measures variability by dividing a data set into quartiles. The data are sorted in ascending
order and split into 4 equal parts. Q1, Q2 and Q3, called the first, second and third quartiles, are the
values which separate the 4 equal parts. The IQR itself is Q3 - Q1.

- Q1 represents the 25th percentile of the data.

- Q2 represents the 50th percentile of the data.

- Q3 represents the 75th percentile of the data.
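The report does not state which IQR rule was applied. The sketch below assumes the conventional fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR and caps values at those fences rather than dropping rows. The function name cap_outliers_iqr and the sample numbers are hypothetical.

```python
import numpy as np


def cap_outliers_iqr(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences are replaced by the nearest fence.
    return np.clip(x, lower, upper)


# Made-up click counts with one extreme value.
clicks = [10, 12, 11, 13, 12, 11, 500]
capped = cap_outliers_iqr(clicks)
print(capped)
```

Capping keeps every row in the data set, which matters here because each row is an ad that still needs a cluster label.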

Post Treatment:
1.2.4 Clustering: Perform z-score scaling and discuss how it affects the speed of the
algorithm.
Standardization/normalization of data in statistics refers to the process of rescaling the values of the
variables in your data set so that all variables share a common scale.

There are two main scaling practices, the Standard Scaler and the Min-Max Scaler, performed as a
pre-processing step, particularly for cluster analysis. Standardization is important if we are working
with a dataset where each variable has a different unit (e.g., inches, meters, tons and kilograms), or
where the scales of the variables differ from one another (e.g., 0-1 vs 0-1000).

This matters in cluster analysis because the clusters are formed based on the distance between points
in mathematical space, so a variable with a large scale would otherwise dominate the distance.

Descriptive Summary before Scaling the Dataset:

After applying z-score (Standard Scaler) standardization:

Standardization (or z-score normalization) is the process of rescaling the features so that each has

μ=0 and σ=1

where μ is the mean and σ is the standard deviation of the feature. Rescaling shifts and stretches the
values but does not change the shape of their distribution. The standard scores (also called z
scores) of the samples are calculated as follows:

z = (x – μ) / σ
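As a small worked example of the z-score formula, the sketch below standardizes each column of a toy matrix with NumPy. The function name zscore and the numbers are hypothetical; the population standard deviation (ddof=0) is used, which is also what scikit-learn's StandardScaler does.

```python
import numpy as np


def zscore(X):
    """Standardize each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    # Note: a constant column has sigma == 0 and would need special handling.
    return (X - mu) / sigma


# Two made-up features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Z = zscore(X)
print(Z)
```

After scaling, both columns contain the same values, so neither dominates a Euclidean distance computed on Z.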

1.2.5 Clustering: Perform Hierarchical clustering by constructing a Dendrogram using WARD and
Euclidean distance, and identify optimum number of clusters

Hierarchical clustering is a set of methods that recursively merge the two closest clusters at each
step (in the agglomerative form used here). The Euclidean distance between two vectors is the square
root of the sum of the squared differences between their components. A dendrogram is a diagram that
shows the hierarchical relationship between objects.

The linkage function specifies how the distance between two clusters is computed. With Ward's method,
it is the increase in the “error sum of squares” (ESS) after fusing two clusters into a single cluster.
Ward's method seeks to choose the successive clustering steps so as to minimize the increase in ESS at
each step.

Based on the dendrogram, a total of 5 clusters appears to be optimal for this dataset.
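Ward linkage with Euclidean distance can be sketched with SciPy as follows. This is only an illustration on synthetic points: the three blob centres, the random seed and the cut at 3 clusters are assumptions made for the example and are unrelated to the ads data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data: three tight, well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centres])

# Each row of Z records one merge: the two clusters joined, the merge
# height, and the size of the new cluster.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree so that at most 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```

Passing Z to scipy.cluster.hierarchy.dendrogram draws the tree (with matplotlib installed); a large jump in merge height between successive merges is the usual visual cue for where to cut.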

1.2.6 Clustering: Make Elbow plot (up to n=10) and identify optimum number of clusters
for k-means algorithm.
From the elbow plot and the calculated within sum of squares (WSS) values, we can see that K = 5,
followed by K = 4, is the best candidate for the number of clusters.

The WSS value for 1 cluster is 230659.99999999988

The WSS value for 2 cluster is 128187.65114852798
The WSS value for 3 cluster is 96112.60577832462
The WSS value for 4 cluster is 68272.80552058032
The WSS value for 5 cluster is 41872.738370121864
The WSS value for 6 cluster is 32963.18634015866
The WSS value for 7 cluster is 27158.691372710695
The WSS value for 8 cluster is 22917.363538088815
The WSS value for 9 cluster is 20244.288511101742
The WSS value for 10 cluster is 17342.993623454804
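To make the elbow easier to read from the numbers, the sketch below computes how much the WSS falls each time one more cluster is added. It uses only the WSS values printed above; the variable names are hypothetical.

```python
# WSS values as reported above, keyed by the number of clusters.
wss = {
    1: 230659.99999999988,
    2: 128187.65114852798,
    3: 96112.60577832462,
    4: 68272.80552058032,
    5: 41872.738370121864,
    6: 32963.18634015866,
    7: 27158.691372710695,
    8: 22917.363538088815,
    9: 20244.288511101742,
    10: 17342.993623454804,
}

# Reduction in WSS gained by moving from k-1 clusters to k clusters.
drops = {k: wss[k - 1] - wss[k] for k in range(2, 11)}

for k, drop in drops.items():
    print(f"k={k - 1} -> k={k}: WSS falls by {drop:.2f}")
```

The drop from k=4 to k=5 is about 26,400, while the drop from k=5 to k=6 is about 8,900, roughly a third of the previous one. That flattening is what the elbow at k=5 refers to.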
1.2.7 Clustering: Print silhouette scores for up to 10 clusters and identify optimum number
of clusters.
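From the silhouette scores obtained, together with the elbow plot, K = 5 followed by K = 4 are the
best candidates for the number of clusters. The score rises from 0.511 at K = 4 to 0.573 at K = 5;
beyond K = 5 it keeps increasing, but by less than 0.01 for each additional cluster.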

For n_clusters=2, the silhouette score is 0.48432285544519216

For n_clusters=3, the silhouette score is 0.41459708788922367
For n_clusters=4, the silhouette score is 0.5108680429602303
For n_clusters=5, the silhouette score is 0.5726299596163354
For n_clusters=6, the silhouette score is 0.5812032600279103
For n_clusters=7, the silhouette score is 0.5856500853286127
For n_clusters=8, the silhouette score is 0.5873432561020318
For n_clusters=9, the silhouette score is 0.590765265059381
For n_clusters=10, the silhouette score is 0.5990651108401313
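A loop of the following shape would produce output like the lines above. This is a sketch on synthetic data, assuming scikit-learn's KMeans and silhouette_score; the blobs, the seed and the range of k are made up for the example, and the report's own code is not shown in the source.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

scores = {}
for k in range(2, 7):
    # The silhouette score needs at least 2 clusters, so k starts at 2.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"For n_clusters={k}, the silhouette score is {scores[k]}")

best_k = max(scores, key=scores.get)
print("Highest silhouette score at k =", best_k)
```

On the toy blobs the highest score identifies the number of clusters directly. When the score keeps creeping upward with k, as in the report's output, the size of each gain and domain understanding both feed into the choice.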
1.2.8 Clustering: Profile the ads based on optimum number of clusters using silhouette
score and your domain understanding [Hint: Group the data by clusters and take sum
or mean to identify trends in Clicks, spend, revenue, CPM, CTR, & CPC based on
Device Type. Make bar plots].

Based on the measures above (dendrogram, elbow plot and silhouette scores), we conclude that the
optimum number of clusters is 5.

Summary of the original data set along with the cluster information:

Summary of the cluster based analysis:

Bar plots to analyse each of the components across Device Type:
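The grouping suggested in the hint can be sketched with a pandas groupby. The column names 'Device Type', 'Clicks' and 'Spend' follow the names used in the task description, while the 'cluster' column name and all of the numbers are hypothetical.

```python
import pandas as pd

# Toy rows with made-up values; 'cluster' stands for the k-means label.
ads = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile"],
    "Clicks": [100, 50, 10, 30],
    "Spend": [20.0, 10.0, 5.0, 15.0],
})

# Mean of each metric per cluster and device type.
profile = ads.groupby(["cluster", "Device Type"])[["Clicks", "Spend"]].mean()
print(profile)
```

Calling profile.plot(kind="bar") on the result (with matplotlib installed) gives one bar group per cluster and device type, which is one way to produce the bar plots the hint asks for.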

1.2.9 Clustering: Conclude the project by providing a summary of your learnings.

• Upon performing the EDA on the provided dataset:

- The Digital Ads dataset has a total of 23066 rows and 19 columns.

- The 19 columns are typed as float64 (6), int64 (7) and object (6).

- There are no duplicates.

- There are null values in the columns CPC, CTR and CPM.

• The null values in CPC, CTR and CPM were treated using reusable user-defined functions
that apply the provided formulas:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000

CPC = Total Cost (spend) / Number of Clicks

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100

• Once the missing values were treated, the data had to be checked for outliers; we used the
IQR treatment to adjust the outliers.

• The dataset contains components such as Spend, Clicks and Impressions, which are measured
on different scales. To cluster the data successfully, we had to standardize/scale the data
to a common scale.

• The dendrogram helped to visualize the linkage used for computing the distances and
merging the clusters from n down to 1. The linkage function specifies how the distance
between two clusters is computed; with Ward's method it is the increase in the “error sum
of squares” (ESS) after fusing two clusters into a single cluster. Ward's method chooses the
successive clustering steps so as to minimize the increase in ESS at each step.

• From the elbow plot of the within sum of squares (WSS) and the silhouette score for each
number of clusters, we can make the observations below:

- We can see a significant drop in WSS when we move from k=1 to k=2.

- The drop becomes much less significant when we move from k=6 to k=10.

- The drop is still significant when we move from k=4 to k=5, but it stops being
significant from k=5 to k=6. In other words, the WSS does not drop significantly
beyond k=5.

- In conclusion, k=5 is the optimal number of clusters for the Digital Ads data set.
