Analytics Group Assignment
Customer Personality Analysis through Marketing Campaign dataset
Customer Personality Analysis is a detailed analysis of a company's ideal customers.
The business wants to understand its customers better so that it can tailor its
products to the specific needs, behaviors and concerns of different types of
customers.
Customer Personality Analysis will help the business to predict the total spending of a
customer based on various factors. The business wants to understand the association
between marketing campaigns and various attributes of customers, and how to design
successful marketing campaigns for them. It also wants to know customer
characteristics, such as the number of children they have, and how buying habits vary
with those characteristics.
The business wants to understand the spending patterns of customers based on various
factors like Education, Marital status, Number of children, Income, and so on. It also
wants to find the correlation between the number of visits to the store by a customer
and factors like income, number of purchases, etc.
● People
○ ID: Customer's unique identifier
○ Year_Birth: (Categorical) Customer's birth year
○ Education: (Categorical) Customer's education level
○ Marital_Status: (Categorical) Customer's marital status
○ Income: (Scale) Customer's yearly household income
○ Kidhome: (Scale) Number of children in customer's household
○ Teenhome: (Scale) Number of teenagers in customer's household
○ Dt_Customer: (Categorical) Date of customer's enrollment with the company
○ Recency: (Scale) Number of days since customer's last purchase
○ Complain: (Categorical) 1 if the customer complained in the last 2 years, 0
otherwise
● Products
○ MntWines: (Scale) Amount spent on wine in last 2 years
○ MntFruits: (Scale) Amount spent on fruits in last 2 years
○ MntMeatProducts: (Scale) Amount spent on meat in last 2 years
○ MntFishProducts: (Scale) Amount spent on fish in last 2 years
○ MntSweetProducts: (Scale) Amount spent on sweets in last 2 years
○ MntGoldProds: (Scale) Amount spent on gold in last 2 years
● Promotion
○ NumDealsPurchases: (Scale) Number of purchases made with a discount
○ AcceptedCmp1: (Categorical) 1 if customer accepted the offer in the 1st
campaign, 0 otherwise
○ AcceptedCmp2: (Categorical) 1 if customer accepted the offer in the 2nd
campaign, 0 otherwise
○ AcceptedCmp3: (Categorical) 1 if customer accepted the offer in the 3rd
campaign, 0 otherwise
○ AcceptedCmp4: (Categorical) 1 if customer accepted the offer in the 4th
campaign, 0 otherwise
○ AcceptedCmp5: (Categorical) 1 if customer accepted the offer in the 5th
campaign, 0 otherwise
○ Response: (Categorical) 1 if customer accepted the offer in the last campaign, 0
otherwise
● Place
○ NumWebPurchases: (Scale) Number of purchases made through the company’s
website
○ NumCatalogPurchases: (Scale) Number of purchases made using a catalogue
○ NumStorePurchases: (Scale) Number of purchases made directly in stores
○ NumWebVisitsMonth: (Scale) Number of visits to company’s website in the last
month
Description of dataset
● Number of attributes/columns = 29
● Number of instances/rows = 2240
● No column/attribute has missing data except the Income column
● There are 26 attributes with integer datatype & 3 attributes with factor datatype
● Education - 5 unique values. These will need to be reduced to a smaller set of more
meaningful levels. About half of the customers are graduates.
● Marital_Status - 8 unique values. These will need to be reduced to a smaller set of
more meaningful levels. Approximately 40% of the customers are married and 60% are
single.
● Income - The maximum and minimum values lie very far from the mean, which suggests
that there are outliers in Income. The standard deviation is also very high, which
means that the data is highly dispersed.
● Kidhome, Teenhome - Maximum value is 2
● MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts,
MntGoldProds - There is a gap between Q3 and the maximum value, which suggests that
there may be outliers in the data.
● NumDealsPurchases, NumWebPurchases, NumCatalogPurchases,
NumStorePurchases, NumWebVisitsMonth - There is a gap between Q3 and the maximum
value, which suggests that there may be outliers in the data.
● More statistical inferences can be added here
According to the screenshot below, only Income has missing data, with 24 missing
values.
As only 24 of the 2240 records have a missing Income value, we will delete those
rows.
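The row deletion described above can be sketched in Python with pandas. This is only an illustration: the report does not state which tool was used, and the small data frame below is made-up stand-in data, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the campaign data; the values are made up for illustration.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Income": [58138.0, np.nan, 71613.0, 26646.0, np.nan],
    "Kidhome": [0, 1, 0, 1, 0],
})

# Count missing values per column to confirm which columns are affected.
missing_per_column = df.isna().sum()

# Drop only the rows where Income is missing and renumber the index.
df_clean = df.dropna(subset=["Income"]).reset_index(drop=True)
```

Restricting `dropna` to the Income column keeps rows that are complete in every field the analysis actually needs.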
Outliers are present in the following attributes: Income, Age and Total_Spending.
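The report does not say how the outliers were identified. One common approach, assumed here purely for illustration, is the 1.5 x IQR rule that box plots use. The helper name `iqr_outlier_mask` and the income values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for entries outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up incomes with one extreme value.
income = pd.Series(
    [30000, 42000, 51000, 58000, 64000, 72000, 666666], name="Income"
)

mask = iqr_outlier_mask(income)
outliers = income[mask]
```

The same helper could be applied to Age and Total_Spending, since it only needs a numeric column.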
Feature Generation
Age - A new scale variable, Age, is generated from Year_Birth (Age = 2014 -
Year_Birth). We use 2014 as the reference year because this dataset is from 2014.
Total_Children - A new scale variable, Total_Children, is generated as the sum of
Kidhome and Teenhome.
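The two derived features can be sketched as follows in Python with pandas. The column names follow the data dictionary; the three rows are made-up values, and the constant name `REFERENCE_YEAR` is introduced here only for illustration.

```python
import pandas as pd

REFERENCE_YEAR = 2014  # the report treats the dataset as being from 2014

# Made-up rows using the column names from the data dictionary.
df = pd.DataFrame({
    "Year_Birth": [1957, 1984, 1965],
    "Kidhome": [0, 1, 2],
    "Teenhome": [1, 0, 1],
})

# Age in the reference year.
df["Age"] = REFERENCE_YEAR - df["Year_Birth"]

# Children and teenagers in the household, combined.
df["Total_Children"] = df["Kidhome"] + df["Teenhome"]
```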
Marital_Status - The levels of Marital_Status are reduced from 8 to 2, as shown
below.
Education - The levels of Education are reduced from 5 to 2, as shown
below.
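A recoding of this kind might look like the sketch below. The exact grouping used in the report is not reproduced in the text, so the mappings here are assumptions: the raw level names, the choice to treat only "Married" as married, and the Grad / Post Grad split are all hypothetical. The new column names `Marital_Group` and `Education_Group` are also introduced only for this example.

```python
import pandas as pd

# Made-up rows; the raw level names are assumed, not taken from the report.
df = pd.DataFrame({
    "Marital_Status": ["Married", "Single", "Together", "Divorced", "Widow"],
    "Education": ["Graduation", "PhD", "Master", "Basic", "2n Cycle"],
})

# Assumed grouping: only "Married" counts as Married, everything else as Single.
df["Marital_Group"] = df["Marital_Status"].map(
    lambda status: "Married" if status == "Married" else "Single"
)

# Assumed grouping of education levels into Grad and Post Grad.
EDUCATION_MAP = {
    "Basic": "Grad",
    "Graduation": "Grad",
    "2n Cycle": "Post Grad",
    "Master": "Post Grad",
    "PhD": "Post Grad",
}
df["Education_Group"] = df["Education"].map(EDUCATION_MAP)
```

Writing the result to new columns keeps the original levels available, which makes it easy to revisit the grouping later.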
Income - As the skewness and kurtosis of Income are very high, we can say that its
distribution is distorted (not normally distributed).
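Skewness and kurtosis can be computed as in the sketch below, using pandas. The income values are made up, and the report does not state which estimator was used. pandas reports sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data.

```python
import pandas as pd

# Made-up incomes with one extreme value, mimicking a heavy right tail.
income = pd.Series([30000, 42000, 51000, 58000, 64000, 72000, 666666])

# Sample skewness: positive values indicate a long right tail.
skewness = income.skew()

# Excess kurtosis: positive values indicate heavier tails than a normal curve.
excess_kurtosis = income.kurt()
```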
STATISTICAL ANALYSIS
1. Chi-Square
Problem 1 -
Conclusion - As the p-value is greater than the 5% level of significance, we do not
reject the null hypothesis.
Problem 2 -
Conclusion - As the p-value is greater than the 5% level of significance, we do not
reject the null hypothesis.
Hence, we can conclude that there is weak or no association between the marital
status of a customer and whether the customer accepts the offer in the first
campaign.
Problem 3 -
Conclusion - As the p-value is far smaller than the 5% level of significance, we
reject the null hypothesis.
Hence, we can conclude that there is some association between whether the customer
has a child and whether the customer accepts the offer in the first campaign.
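A chi-square test of independence like the ones in this section can be sketched with `scipy.stats.chi2_contingency`. The counts below are made up, and the column name `Has_Child` is hypothetical; only `AcceptedCmp1` comes from the data dictionary. Note that SciPy applies Yates' continuity correction to 2x2 tables by default, which may differ from the settings used in the report.

```python
import pandas as pd
from scipy.stats import chi2_contingency

ALPHA = 0.05  # 5% level of significance

# Made-up data: 60 customers with a child, 40 without.
df = pd.DataFrame({
    "Has_Child": ["Yes"] * 60 + ["No"] * 40,
    "AcceptedCmp1": [1] * 3 + [0] * 57 + [1] * 12 + [0] * 28,
})

# Contingency table of child status against campaign acceptance.
table = pd.crosstab(df["Has_Child"], df["AcceptedCmp1"])

# Null hypothesis: the two categorical variables are independent.
chi2, p_value, dof, expected = chi2_contingency(table)

reject_null = p_value < ALPHA
```

A p-value below `ALPHA` leads to rejecting the null hypothesis of independence; otherwise the null hypothesis is not rejected.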
2. T-test
Problem 1 -
Null hypothesis: The mean value of Total spending of Grad Students = The mean
value of Total spending of Post Grad Students
Alternate hypothesis: The mean value of Total spending of Grad Students ≠ The
mean value of Total spending of Post Grad Students
Since the ratio of the variance of spending of grad students to the variance of
spending of post-grad students is less than 4, we can use the equal-variance t-test.
Thus the mean value of Total spending of Grad Students ≠ The mean value of
Total spending of Post Grad Students
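The variance-ratio check and the two-sample t-test can be sketched with SciPy as follows. The spending figures are made up, and the threshold of 4 is the rule of thumb stated in the report.

```python
import numpy as np
from scipy.stats import ttest_ind

ALPHA = 0.05  # 5% level of significance

# Made-up total spending for two education groups.
grad = np.array([520, 610, 480, 700, 560, 640])
post_grad = np.array([650, 720, 590, 810, 700, 760])

# Ratio of the larger sample variance to the smaller one.
var_grad = grad.var(ddof=1)
var_post = post_grad.var(ddof=1)
var_ratio = max(var_grad, var_post) / min(var_grad, var_post)

# Rule of thumb from the report: a ratio below 4 permits the pooled test.
equal_var = bool(var_ratio < 4)

# Two-sided test of H0: the two group means are equal.
t_stat, p_value = ttest_ind(grad, post_grad, equal_var=equal_var)

reject_null = p_value < ALPHA
```

When the ratio is 4 or more, passing `equal_var=False` switches `ttest_ind` to Welch's t-test, which does not assume equal variances.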
Problem 2 -
Null hypothesis: The mean value of Total spending of a customer having children
= The mean value of Total spending of a customer not having children
Since the ratio of the variance of spending of customers having children to the
variance of spending of customers not having children is less than 4, we can use the
equal-variance t-test.
Thus the mean value of Total spending of customers having children ≠ The mean
value of Total spending of customers not having children.
Problem 3 -
Conclusion - As the p-value is far smaller than the 5% level of significance, we
reject the null hypothesis.
Thus the mean value of Total spending of single people ≠ The mean value of Total
spending of married people.
3. ANOVA
Problem 1 -
Conclusion - For degrees of freedom (1, 2200), the calculated F value is greater than
the critical F value. Hence we reject the null hypothesis.
Problem 2 -
Alternate hypothesis: At least one of the means is unequal
Conclusion - For degrees of freedom (1, 2200), the calculated F value is greater than
the critical F value. Hence we reject the null hypothesis.
Problem 3 -
Null hypothesis: The mean of total spending of Parents = The mean of total
spending of non parents
Conclusion - For degrees of freedom (1, 2200), the calculated F value is greater than
the critical F value. Hence we reject the null hypothesis.
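The comparison of a calculated F value against a critical F value can be sketched with SciPy as below. The spending figures are made up and the groups are tiny, so the degrees of freedom here are (1, 10) rather than the (1, 2200) quoted in the report.

```python
from scipy.stats import f, f_oneway

ALPHA = 0.05  # 5% level of significance

# Made-up total spending for the two groups.
parents = [310, 420, 280, 390, 350, 450]
non_parents = [980, 1120, 870, 1240, 1010, 1090]

# One-way ANOVA: H0 says the group means are equal.
f_stat, p_value = f_oneway(parents, non_parents)

# Degrees of freedom: (groups - 1, observations - groups).
df_between = 2 - 1
df_within = len(parents) + len(non_parents) - 2

# Critical value of the F distribution at the chosen significance level.
f_critical = f.ppf(1 - ALPHA, df_between, df_within)

reject_null = f_stat > f_critical
```

Comparing `f_stat` with `f_critical` is equivalent to comparing `p_value` with `ALPHA`, so either form of the decision rule gives the same conclusion.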
4. Correlation
Problem 1 -
Objective - To find out the correlation between Income and total spending of a
customer.
Null Hypothesis - Income is not correlated with the total spending of a customer.
Conclusion - At the 95% confidence level, we reject the null hypothesis and conclude
that Income is correlated with total spending.
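A correlation test of this kind can be sketched with `scipy.stats.pearsonr`, which returns the correlation coefficient together with a p-value for the null hypothesis of no correlation. The income and spending values are made up, and the report does not state which correlation coefficient was used, so Pearson's r is an assumption.

```python
from scipy.stats import pearsonr

ALPHA = 0.05  # corresponds to 95% confidence

# Made-up income and total spending for eight customers.
income = [22000, 35000, 41000, 52000, 60000, 68000, 75000, 83000]
total_spending = [60, 150, 240, 480, 700, 950, 1200, 1500]

# r measures the strength and direction of the linear relationship.
r, p_value = pearsonr(income, total_spending)

reject_null = p_value < ALPHA
```

The sign of `r` gives the direction of the relationship: positive for a direct relationship, negative for an inverse one.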
Problem 2 -
Null Hypothesis - Income is not correlated with the Number of visits per month of
a customer
Alternate Hypothesis - Income is correlated with the number of visits per month
of a customer
Problem 3 -
5. Regression
STATISTICAL CONCLUSION
2. T-test conclusions -
○ The mean value of Total spending of Grad Students ≠ The mean value of
Total spending of Post Grad Students.
○ The mean value of Total spending of customers having children ≠ The
mean value of Total spending of customers not having children.
○ The mean value of Total spending of single people ≠ The mean value of
Total spending of married people
3. ANOVA conclusions -
○ The mean value of total spending of Parents and non parents is unequal.
4. Correlation conclusions -
○ Income of the customers is directly related to total spending
○ Income of the customers is inversely related to total number of visits to
the store
○ Total number of store purchases is directly related to total spending
5. Regression analysis conclusions -