SMDM - Project Report - Lakshmi


SMDM Project

Report
2023

Author is licensed under CC BY-SA-NC

Authored by: Lakshmi P

Table of Contents
Problem 1
  A
  B
  C
  D
  E
  F
  G
  H
Problem 2

Problem 1
A. What is the important technical information about the dataset that a database
administrator would be interested in?

For a database administrator, the important technical information about the dataset can be
obtained by studying its basic components. This can be done with a basic exploratory analysis of the
data:

• head of the dataset
• shape of the dataset
• info of the dataset
• summary of the dataset

There are a total of 14 columns and 1581 rows in the dataset. Of the columns, 8 have the object
datatype, 5 have an integer datatype and 1 has a float datatype.
There are 6 variables with continuous data and 8 variables with categorical data.

We can observe that there are missing values in the Gender and Partner salary columns. We can also
note that the Salary and Partner salary columns have different datatypes despite being similar
kinds of variables.

Below is the screenshot of the top 5 rows, obtained with the head method of the data frame.

The describe method gives information about the spread of the numerical columns. We can observe
the minimum, maximum and mean values and the different percentile values for each variable. This
gives a basic understanding of the range of data in the dataset.
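
The inspection steps described above can be sketched with pandas as follows. This is a minimal illustration on a hypothetical four-row sample; the column names and values are assumptions, not the actual dataset. It also hints at one common reason why two similar salary columns can end up with different datatypes: with the default NumPy-backed types, a numeric column that contains missing values is stored as float.

```python
import pandas as pd

# Hypothetical miniature of the dataset; column names and values are assumptions.
df = pd.DataFrame({
    "Age": [53, 44, 30, 28],
    "Gender": ["Male", "Female", None, "Male"],
    "Salary": [99300, 95500, 62000, 58000],
    "Partner_salary": [70700.0, None, 0.0, 25000.0],
    "Make": ["SUV", "SUV", "Sedan", "Hatchback"],
})

print(df.head())       # first rows (at most 5)
print(df.shape)        # (number of rows, number of columns)
df.info()              # datatype and non-null count of every column
print(df.describe())   # min, max, mean and percentiles of numeric columns

# Number of missing values per column, showing only columns that have any.
missing = df.isnull().sum()
print(missing[missing > 0])
```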

B. Take a critical look at the data and do a preliminary analysis of the variables. Do a quality
check of the data so that the variables are consistent. Are there any discrepancies present in
the data? If yes, perform preliminary treatment of data.

To make the dataset ready for further analysis, we treat the data for noise that can affect our
outcome.

• Duplicates
We used the ‘duplicated’ method and confirmed that there are no duplicate rows in the dataset.

• Missing values
We observed previously that there are missing values in two columns: Gender and Partner salary.
Since Gender is a categorical variable, we impute its missing values with the mode.
For the Partner salary column, we check whether Partner working is Yes or No.
  • If Partner working is No, we impute the missing value with zero.
  • If Partner working is Yes, we take a different approach. As Partner salary is a continuous
    variable, we need to check for the presence of outliers. As no outliers can be observed in the
    box plot below, we can impute the missing values with the mean.

• Bad Values
We checked the unique values of all categorical variables to identify any bad data in the dataset.
We observe that the Gender column has some typographical errors, which we corrected using the
replace method.

• Anomalies or Outliers
We have already verified that there is no outlier in the Partner salary column. We can draw box
plots of the remaining continuous variables to check for outliers and treat them appropriately.

We can observe that the number of dependents and total salary columns have outliers. We will not
treat the number of dependents, as zero is not really an outlier: there can be people who have no
dependents. However, we do need to treat the total salary column. We do this with a user-defined
function that restricts the values to the upper limit.
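
The four treatment steps above can be sketched as below. This is only an illustration under stated assumptions: the sample data, the column names and the misspelled label "Femal" are hypothetical, the mean used for imputation is assumed to be the mean salary of working partners, and the upper limit is assumed to be the usual box-plot whisker, Q3 + 1.5 × IQR. The report does not state which definitions were actually used.

```python
import pandas as pd

# Hypothetical sample; column names and the misspelled label are assumptions.
df = pd.DataFrame({
    "Gender": ["Male", "Femal", None, "Male", "Female", "Male"],
    "Partner_working": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Partner_salary": [40000.0, None, None, 0.0, 60000.0, 0.0],
    "Total_salary": [90000, 60000, 110000, 70000, 100000, 400000],
})

# 1. Duplicates: count fully repeated rows.
n_duplicates = df.duplicated().sum()

# 2. Bad values: inspect the labels, then map misspellings to the intended label.
print(df["Gender"].unique())
df["Gender"] = df["Gender"].replace({"Femal": "Female"})

# 3a. Missing categorical values: fill with the mode.
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

# 3b. Missing partner salary: zero when the partner is not working,
#     otherwise the mean salary of working partners (assumed definition).
working = df["Partner_working"] == "Yes"
mean_working = df.loc[working, "Partner_salary"].mean()
df.loc[~working, "Partner_salary"] = df.loc[~working, "Partner_salary"].fillna(0)
df.loc[working, "Partner_salary"] = df.loc[working, "Partner_salary"].fillna(mean_working)

# 4. Outliers: cap values above Q3 + 1.5 * IQR at that upper limit (assumed rule).
def cap_upper(series: pd.Series) -> pd.Series:
    q1, q3 = series.quantile([0.25, 0.75])
    upper = q3 + 1.5 * (q3 - q1)
    return series.clip(upper=upper)

df["Total_salary"] = cap_upper(df["Total_salary"])
print(df)
```

In this sketch the misspelled labels are corrected before the mode is computed, so that the mode is taken over clean categories.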

C. Explore all the features of the data separately by using appropriate visualizations and
draw insights that can be utilized by the business.

Univariate Analysis:

Insights:
• Age is right-skewed, indicating that the dataset consists mostly of younger customers.
• Most people have 2 or 3 dependents, which can influence the size of the car bought.
• Price is right-skewed, indicating that people tend to opt for cheaper cars.
• The Total Salary column appears approximately normally distributed after treatment of the data.
• Partner salary has a large number of zero values, as the partner is not working in many cases.
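
The skewness read from the histograms can also be checked numerically. The sketch below uses a hypothetical set of ages, not the actual data: a positive skewness coefficient, or a mean above the median, suggests a longer right tail.

```python
import pandas as pd

# Hypothetical ages: mostly young customers with a few older ones.
age = pd.Series([22, 23, 24, 25, 25, 26, 27, 28, 30, 35, 45, 54])

skewness = age.skew()               # > 0 suggests a longer right tail
print(round(skewness, 2))
print(age.mean() > age.median())    # mean above median is typical of right skew
```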

Insights:
• Men account for a large proportion of the customer base.
• Of the total number of customers, the majority are married.
• Salaried customers dominate car sales compared to business customers.
• Among all customers, there are more postgraduates than graduates.
• Many customers do not have an existing home loan; however, almost 50% of all customers do have
a personal loan.
• A majority of customers have also stated that they have a working partner.
• Based on the make of car, we can observe a higher preference for sedans and hatchbacks than
for SUVs.
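
Proportions like those behind the count plots can be computed directly. A minimal sketch on a hypothetical sample of one categorical column (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample of the make column.
make = pd.Series([
    "Sedan", "Hatchback", "Sedan", "SUV", "Hatchback",
    "Sedan", "Sedan", "Hatchback", "SUV", "Sedan",
])

# Percentage share of each category, largest first.
share = make.value_counts(normalize=True).mul(100).round(1)
print(share)
```

The same call can be repeated for each categorical column, such as gender, marital status or profession.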

D. Understanding the relationships among the variables in the dataset is crucial for every
analytical project. Perform analysis on the data fields to gain deeper insights. Comment on
your understanding of the data. 

Bivariate Analysis:

We can perform bivariate analysis on different combinations of variables to determine the
relationships between them and their influence on the kind of car bought.

Insights that influence type of car:

• Men tend to prefer sedans and hatchbacks over SUVs, whereas SUVs are preferred more by women.
• Married customers opt for sedans and hatchbacks in comparison to SUVs.
• Sedans and SUVs are bought more by salaried customers, whereas for hatchbacks the preference
is almost equal between business and salaried customers.
• Based on education level, postgraduates account for the most purchases in all categories.
• Customers with a working partner are more likely to choose a sedan or hatchback.

Analysis based on price of car:
• Female customers usually opt to buy a higher-priced car.
• Salaried customers tend to spend marginally more on a car than business customers.
• Education level does not influence the price of the car to a great extent.

Analysis based on salary of customers:

• Customers with higher annual income prefer SUVs over other categories.
• Partner salary is highly skewed across all categories of cars.

Analysis based on correlation and heatmap:
Age is highly correlated with the price of the car and the annual income of the customer. Total
salary is derived as the sum of individual salary and partner salary, which explains its high
correlation with both.
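
A correlation matrix of this kind can be sketched as follows. The data and column names are hypothetical; the example only shows how a derived total column is built and how pairwise Pearson correlations are obtained.

```python
import pandas as pd

# Hypothetical numeric sample; column names are assumptions.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "Salary": [50000, 58000, 66000, 75000, 82000, 90000],
    "Partner_salary": [0, 30000, 0, 40000, 25000, 60000],
    "Price": [20000, 24000, 31000, 38000, 47000, 55000],
})

# Total salary is derived from its two components.
df["Total_salary"] = df["Salary"] + df["Partner_salary"]

corr = df.corr()   # pairwise Pearson correlation of the numeric columns
print(corr.round(2))
```

The resulting matrix can be passed to a plotting function such as seaborn's `heatmap` to draw the heatmap.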

E. Employees working on the existing marketing campaign have made the following remarks.
Based on the data and your analysis state whether you agree or disagree with their
observations. Justify your answer based on the data available.
E1) Steve Roger says “Men prefer SUV by a large margin, compared to the women”
Let us compare the gender and make of the car to test this assumption.

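SUVs account for about 18% of total car sales. Expressed as shares of all sales, about 11% were
SUVs sold to females and only about 8% were SUVs sold to males. The gap is wider still in relative
terms, since male customers form 79% of the customer base.
Hence, this disproves Steve's claim that men prefer SUVs by a large margin compared to women.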
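
A cross-tabulation of this kind can be sketched as below, on hypothetical data with assumed column names. Two normalisations are shown: shares of all sales, as quoted above, and shares within each gender, which compare preference directly when one gender makes up most of the customer base.

```python
import pandas as pd

# Hypothetical sample: 7 male and 3 female customers.
df = pd.DataFrame({
    "Gender": ["Male"] * 7 + ["Female"] * 3,
    "Make": ["Sedan", "Sedan", "Sedan", "Hatchback", "Hatchback",
             "Hatchback", "SUV", "SUV", "SUV", "Sedan"],
})

# Share of all sales in each Gender x Make cell.
overall = pd.crosstab(df["Gender"], df["Make"], normalize="all").mul(100)

# Share within each gender (every row sums to 100).
within_gender = pd.crosstab(df["Gender"], df["Make"], normalize="index").mul(100)

print(overall.round(1))
print(within_gender.round(1))
```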

E2) Ned Stark believes that a salaried person is more likely to buy a Sedan.
We analyse the make of car purchased by occupation of the customer.

Salaried customers account for 57% of total car sales. Expressed as shares of all sales, 25% were
sedans, 18% hatchbacks and 13% SUVs bought by salaried customers.
Hence, Ned's assumption that a salaried person is more likely to buy a sedan is true.

E3) Sheldon Cooper does not believe any of them; he claims that a salaried male is an easier
target for a SUV sale over a Sedan Sale.
To answer this question, we need to study the combined influence of occupation and gender on the
choice of car.

Based on the data in the table and the graph, we can determine that salaried males as a customer
group opted for a sedan (19%) more often than a hatchback (18%) or an SUV (6%). Hence, the SUV is
the least popular choice of car among salaried men. This confirms that the statement made by
Sheldon is false.
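
The combined effect of two variables can be tabulated by passing both as row keys to a cross-tabulation. A minimal sketch on hypothetical data (the column names and counts are assumptions):

```python
import pandas as pd

# Hypothetical sample of (profession, gender, make) records.
rows = (
    [("Salaried", "Male", "Sedan")] * 3
    + [("Salaried", "Male", "Hatchback")] * 2
    + [("Salaried", "Male", "SUV")]
    + [("Salaried", "Female", "SUV")]
    + [("Business", "Male", "Hatchback")] * 2
    + [("Business", "Female", "Sedan")]
)
df = pd.DataFrame(rows, columns=["Profession", "Gender", "Make"])

# Share of all sales for each Profession x Gender x Make combination.
table = pd.crosstab(
    [df["Profession"], df["Gender"]], df["Make"], normalize="all"
).mul(100).round(1)

salaried_male = table.loc[("Salaried", "Male")]
print(salaried_male)
print(salaried_male.idxmax(), salaried_male.idxmin())
```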

F. From the given data, comment on the amount spent on purchasing automobiles across the
following categories. Comment on how a business can utilize the results from this exercise.
Give justification along with presenting metrics/charts used for arriving at the conclusions.
F1) Gender
We can study the total amount spent and the average price of the car purchased by male vs female
customers. The total amount spent by males is significantly higher than that spent by females.
However, the average price of a car bought by a female customer is higher than that of a car bought
by a male customer.
Let us further study the total amount spent by make of car.

From the above data, we can see that females spend the most on SUVs (59%) and sedans (38%),
whereas males spend the most on sedans (45%) and hatchbacks (37%).
Combining these variables into purchasing groups, we find that males buying sedans and hatchbacks
contribute the highest portion of total sales. This buying group should be targeted with focused
marketing efforts by the company. For example, there could be more ads featuring women in SUVs
and men driving sedans or hatchbacks.
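
The spend metrics used in this comparison can be sketched as follows. The data and column names are hypothetical; the example computes the total and average price by gender, and then the split of each gender's total spend across makes.

```python
import pandas as pd

# Hypothetical purchases; column names and prices are assumptions.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Make": ["Sedan", "Hatchback", "Sedan", "SUV", "SUV", "Sedan"],
    "Price": [30000, 20000, 34000, 56000, 60000, 40000],
})

# Total amount spent and average price per gender.
by_gender = df.groupby("Gender")["Price"].agg(total="sum", average="mean")

# Amount spent per gender and make, then each gender's spend split in percent.
spend = df.pivot_table(index="Gender", columns="Make", values="Price",
                       aggfunc="sum", fill_value=0)
spend_share = spend.div(spend.sum(axis=1), axis=0).mul(100).round(1)

print(by_gender)
print(spend_share)
```

In this made-up sample the total is higher for males while the average is higher for females, the same pattern the report describes; a total and an average can point in different directions when group sizes differ.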

F2) Personal loan
We can perform a similar study to compare customers who have availed a personal loan with those
who have not, and the influence of this on the amount spent on a car.

From the above data, we can conclude that irrespective of whether customers have availed a personal
loan or not, they tend to opt for a sedan in comparison to an SUV or hatchback. However, there is
a higher preference for SUVs among customers who do not have a personal loan. The reason could be
that existing commitments due to a personal loan discourage the choice of a higher-priced car.
Hence, the company should gather this information during initial screening of walk-in customers and
suggest the most suitable car accordingly.
The company can also consider providing additional finance options if feasible. This will ensure
that an existing personal or home loan is not a deterrent to choosing a higher-priced car.

G. From the current data set comment if having a working partner leads to the purchase of a
higher-priced car.

To begin with, we determine the price range of the different categories of cars, which indicates
which cars are higher priced. The table below shows the maximum and minimum values, which indicate
that SUVs and sedans are comparatively higher priced than hatchbacks.

Of the total customer base, 55% have a working partner; this splits into 26% who opt for a sedan,
19% for a hatchback and the remainder for an SUV. We also study the total amount spent and the
average price paid by customers, by whether the partner is working. The total amount spent is only
marginally higher where the partner is working. However, the average price of the car is higher
where the partner is not working.

Hence, we can conclude that having a working partner does not necessarily lead to the purchase of
a higher-priced car. It is influenced by other factors as well.
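
The two tables used in this argument can be sketched as below, on hypothetical data with assumed column names: the price range per make, and the count, total and average price by whether the partner is working.

```python
import pandas as pd

# Hypothetical purchases; column names and prices are assumptions.
df = pd.DataFrame({
    "Make": ["Hatchback", "Hatchback", "Sedan", "Sedan", "Sedan", "SUV", "SUV"],
    "Partner_working": ["Yes", "No", "Yes", "Yes", "No", "Yes", "No"],
    "Price": [20000, 25000, 30000, 35000, 45000, 50000, 62000],
})

# Minimum and maximum price for each make.
price_range = df.groupby("Make")["Price"].agg(["min", "max"])

# Customers, total spend and average price by partner working status.
by_partner = df.groupby("Partner_working")["Price"].agg(
    customers="count", total="sum", average="mean"
)

print(price_range)
print(by_partner)
```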

H. The main objective of this analysis is to devise an improved marketing strategy to send
targeted information to different groups of potential buyers present in the data. For the current
analysis use the Gender and Marital status - fields to arrive at groups with similar purchase
history.

In order to analyse the preferences of different categories of customers, we study the distribution
of our customer base across two categories: gender and marital status. Married men dominate the
customer base to a large extent. Hence, their purchasing behaviour tends to drive sales.

For further analysis, we need to study the purchase history of the different categories of cars
across both these variables. Married customers account for more than 90% of the customer base and
have a high preference for sedans and hatchbacks. Male customers form 79% of total customers and
they also prefer sedans or hatchbacks.

In order to arrive at groups with similar purchase history, we combine all three variables. The
data table and graph above show that married men buying sedans and hatchbacks contribute ~65% of
all sales. The third-highest group consists of married females who opt for SUVs. We can see that
SUVs are least popular among unmarried males and females. Hence, there should be targeted ads
relating to these products and customer groups.
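
Groups of this kind can be ranked by their share of all purchases. A minimal sketch on hypothetical records (column names and counts are assumptions):

```python
import pandas as pd

# Hypothetical sample of (gender, marital status, make) records.
rows = (
    [("Male", "Married", "Sedan")] * 3
    + [("Male", "Married", "Hatchback")] * 3
    + [("Female", "Married", "SUV")] * 2
    + [("Male", "Single", "Hatchback")]
    + [("Female", "Married", "Sedan")]
)
df = pd.DataFrame(rows, columns=["Gender", "Marital_status", "Make"])

# Percentage of all purchases made by each group, largest first.
groups = (
    df.groupby(["Gender", "Marital_status", "Make"])
    .size()
    .div(len(df))
    .mul(100)
    .round(1)
    .sort_values(ascending=False)
)
print(groups)
```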

The company needs to study the market further to learn which features these customer groups look
for in a car, as these can influence the buying decision.

Problem 2
List down the top 5 important variables, along with the business justifications

Prior to commencing the analysis, we must first study the various elements and variables in the
data. We study the shape of the data, i.e., the number of rows and columns, and the datatypes of
all the variables. We also observe the key statistical measures of all variables. This gives us a
basic understanding of what kind of data the dataset contains and helps decide appropriate next
steps.
In order to prepare the dataset for further analysis, we treat the data for noise that can affect
our outcome.

• Duplicates
We used the ‘duplicated’ method and confirmed that there are no duplicate rows in the dataset.

• Missing values
We observe that there are missing values in only one column, Transactor revolver, which can be
imputed with the mode.

• Bad Values
We checked the unique values of the key categorical variables and confirmed that there is an issue
with only one column, Occupation at source. We can treat this column by replacing the zero values
with the mode.

• Anomalies or Outliers
Though there are several numeric variables, most of them encode Boolean values. Hence, we check
for outliers and treat them only for the key variables: average spends, income at source and
credit card limit.
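
Replacing a placeholder value with the mode can be sketched as follows. The labels are hypothetical, and the zero is assumed to be stored as the string "0"; the report does not say how it is stored. The mode is computed with the placeholder excluded, since the placeholder could otherwise turn out to be the most frequent value.

```python
import pandas as pd

# Hypothetical occupation labels; "0" stands in for the bad value (assumption).
occupation = pd.Series(["Salaried", "0", "Self Employed", "Salaried", "0", "Student"])

# Mode of the valid labels only, then substitute it for the placeholder.
valid_mode = occupation[occupation != "0"].mode()[0]
occupation = occupation.replace("0", valid_mode)
print(occupation.value_counts())
```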

Customer Profiling:
As we dive deeper into the data, it is important to understand the profile of our customer base.

From the above charts, we note that most of the customers are salaried or self-employed.
However, the customer base is spread almost equally in terms of net worth.

Continuing this exploration, we study the income range of the customers and their average credit
card spends.

We can see that the average spend of customers is highly skewed, indicating that there are more
customers whose average spend is on the lower side.

Correlation:

In order to study how the variables relate to each other, we use a heatmap of the correlations
between them. This throws light on three key attributes which are highly correlated with each
other:

1. Annual income and credit card limit – higher the annual income, higher the limit
2. Annual income and average spend – higher the annual income, higher the spend
3. Average spends and credit card limit – higher the spend, higher the credit card limit

These three variables are interlinked and strongly influence customer behaviour on card usage.
Analysis on card type and activity:
We identify that the top 5 card types (rewards, prosperity, edge, chartered and smartearn) account
for ~60% of the total customer base and 70% of average spends over the last 3 months.

We can study customer behaviour patterns in the usage of these different cards based on activity
in the last 30, 60 and 90 days.

We find that only 28% of the cards are active in the first 30 days. Further study of active cards
over 60 and 90 days reveals that this percentage improves to 48% and 63% respectively. When we
analyse the usage of the top 5 cards further, these continue to hold the highest share of active
cards across 30, 60 and 90 days.
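
Activity shares of this kind can be sketched as below, assuming the activity columns are 0/1 flags. The column names, the card type "other" and the values are hypothetical.

```python
import pandas as pd

# Hypothetical cards with 0/1 activity flags; names and values are assumptions.
df = pd.DataFrame({
    "card_type": ["rewards", "rewards", "rewards", "edge", "edge", "edge",
                  "other", "other"],
    "active_30": [1, 0, 0, 1, 0, 0, 0, 0],
    "active_60": [1, 1, 0, 1, 1, 0, 0, 0],
    "active_90": [1, 1, 1, 1, 1, 0, 0, 0],
})
flags = ["active_30", "active_60", "active_90"]

# The mean of a 0/1 flag is the share of active cards.
active_share = df[flags].mean().mul(100)

# The same share broken down by card type.
by_card = df.groupby("card_type")[flags].mean().mul(100).round(1)

print(active_share)
print(by_card)
```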

If a card is not activated and used within the first 30 days, the customer is unlikely to form a
habit of repeated usage. This can be one of the key reasons for high attrition. To address this,
one strategy is to present specific discounts or offers when the card is issued, with an expiry
of one month. This could prompt the customer to activate and use the card to avail the offer. The
team must also focus on providing relevant offers to holders of the top 5 card types, as they form
the largest part of the base.

Customer Profiles and Average spends:
We initially determined that salaried and self-employed people dominate our customer base, while
customers are distributed equally across all categories of net worth. Hence, further study was
done to establish the relationship of these variables with other attributes.

From the graphical representation below, we can derive the following:

• The two occupation categories, salaried and self-employed, have higher average spends on the card.
• Net worth categories A and B have the higher average spends on the card.
• Salaried and self-employed customers make up most of net worth categories A and B.
• Hence, the subset of customers who are in category A/B and are salaried/self-employed needs
to be the focus, in order to ensure high spends, which in turn will lower attrition.
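
The segment comparison above can be sketched as follows. The data, the column names and the occupation labels other than salaried and self-employed are hypothetical; the example tabulates average spends by occupation and net worth category and compares the focus segment with the rest.

```python
import pandas as pd

# Hypothetical customers; column names, labels and values are assumptions.
df = pd.DataFrame({
    "occupation": ["Salaried", "Salaried", "Self Employed", "Self Employed",
                   "Student", "Retired", "Salaried"],
    "networth_category": ["A", "B", "A", "B", "C", "C", "C"],
    "avg_spends_l3m": [50000, 40000, 45000, 35000, 10000, 15000, 20000],
})

# Average spend for every occupation x net worth combination.
pivot = df.pivot_table(index="occupation", columns="networth_category",
                       values="avg_spends_l3m", aggfunc="mean")

# Focus segment: salaried or self-employed customers in category A or B.
focus = (df["occupation"].isin(["Salaried", "Self Employed"])
         & df["networth_category"].isin(["A", "B"]))
focus_mean = df.loc[focus, "avg_spends_l3m"].mean()
other_mean = df.loc[~focus, "avg_spends_l3m"].mean()

print(pivot)
print(focus_mean, other_mean)
```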

Conclusion:

From the various analyses we have performed, we can conclude that the 5 variables below are key
to deriving a solution for the business problem.

• Credit card type
• Credit card activity in last 30 days
• Customer category based on net worth
• Occupation at source
• Average credit card spends in last 3 months
