Detail Project Report SMDM
Problem 1
A. What is the important technical information about the dataset that a database
administrator would be interested in? (Hint: Information about the size of the dataset and the
nature of the variables)
Comments:
Data Frame Provided ----- Austo Motor Company
The above table shows that the dataset has 14 columns and 1581 rows. It also
shows that two columns, ‘Gender’ and ‘Partner_salary’, have missing values:
column ‘Gender’ has 53 null values and column ‘Partner_salary’ has 106 null values.
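The size, variable types and null counts quoted above can be read off with a few pandas calls. The sketch below uses a small made-up frame as a stand-in for the real data (the four rows and their values are hypothetical; only the column names ‘Gender’ and ‘Partner_salary’ come from the report).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the real file has 1581 rows and 14 columns.
df = pd.DataFrame({
    "Age": [25, 32, 41, 53],
    "Gender": ["Male", None, "Female", "Male"],
    "Salary": [60000, 72000, 81000, 90000],
    "Partner_salary": [20000.0, np.nan, 0.0, 40000.0],
})

n_rows, n_cols = df.shape        # size of the dataset
dtypes = df.dtypes               # nature (type) of each variable
null_counts = df.isnull().sum()  # missing values per column

print(f"rows={n_rows}, columns={n_cols}")
print(dtypes)
print(null_counts[null_counts > 0])  # only the columns that have nulls
```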
B. Take a critical look at the data and do a preliminary analysis of the variables. Do a quality
check of the data so that the variables are consistent? Are there any discrepancies present
in the data?
Column ‘Gender’ has 53 null values and column ‘Partner_salary’ has 106 null values.
No duplicate rows were found.
Two spelling errors were found in the column ‘Gender’: ‘Femal’ and ‘Femle’.
Both were corrected to ‘Female’; the result of the correction in the data values is
shown below.
The mode of the ‘Gender’ data is ‘Male’, as per the plot below.
After treating the ‘Gender’ and ‘Partner_salary’ columns, there are no null values
left in the dataset, as per the above table.
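A minimal sketch of these quality checks is given below on made-up rows. Two steps are assumptions rather than something stated in the report: that missing ‘Gender’ values are filled with the mode, and that missing ‘Partner_salary’ is recovered as Total_salary minus Salary (the column names ‘Salary’ and ‘Total_salary’ are taken from the later analysis).

```python
import numpy as np
import pandas as pd

# Hypothetical rows reproducing the kinds of issues described above.
df = pd.DataFrame({
    "Gender": ["Male", "Femal", "Femle", None, "Male", "Male"],
    "Salary": [60000, 50000, 70000, 65000, 80000, 55000],
    "Partner_salary": [20000.0, np.nan, 0.0, 30000.0, np.nan, 25000.0],
    "Total_salary": [80000, 90000, 70000, 95000, 80000, 80000],
})

# 1. Duplicate check.
n_duplicates = int(df.duplicated().sum())

# 2. Correct the two misspellings found in 'Gender'.
df["Gender"] = df["Gender"].replace({"Femal": "Female", "Femle": "Female"})

# 3. ASSUMPTION: fill missing 'Gender' with the most frequent value (the mode).
gender_mode = df["Gender"].mode()[0]
df["Gender"] = df["Gender"].fillna(gender_mode)

# 4. ASSUMPTION: a missing partner salary equals Total_salary - Salary.
df["Partner_salary"] = df["Partner_salary"].fillna(df["Total_salary"] - df["Salary"])

remaining_nulls = int(df.isnull().sum().sum())
print(n_duplicates, gender_mode, remaining_nulls)
```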
No of Dependents Boxplot
Price Boxplot
Total Salary Boxplot
There are outliers in the ‘No_of Dependants’ column as well as in ‘Total_salary’, as per the above boxplots.
I will treat the outliers for ‘Total_salary’ only, because zero is a plausible number of dependents
and treating that column could mislead the analysis.
The mean of Total salary is also taken into account, so that the treatment does not distort the analysis
and the data remain an overall correct representation.
Formula used:
Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1
As we can see from the above plot, the outliers have been treated; there are now no outliers in
Total salary.
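The report gives Q1, Q3 and the IQR but not the exact fences or replacement rule. The sketch below therefore assumes the common 1.5 x IQR rule and caps values at the fences; both choices are assumptions, and the salary figures are made up.

```python
import pandas as pd

# Hypothetical total salaries with one extreme value.
total_salary = pd.Series(
    [40000, 60000, 70000, 80000, 90000, 100000, 120000, 400000],
    name="Total_salary",
)

q1 = total_salary.quantile(0.25)  # 25th percentile
q3 = total_salary.quantile(0.75)  # 75th percentile
iqr = q3 - q1

# ASSUMPTION: Tukey fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# ASSUMPTION: outliers are capped at the fences rather than dropped.
treated = total_salary.clip(lower=lower, upper=upper)

n_outliers = int(((total_salary < lower) | (total_salary > upper)).sum())
print(q1, q3, iqr, lower, upper, n_outliers)
```

Capping keeps all rows in the dataset, which matters here because the other columns of an outlying row are still valid.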
Comments:
Statistical analysis of the data, which helps to summarize the numeric variables.
Analyzing the Age Variable
With reference to the above boxplot for Austo Motor Company, we can say that the buying
pattern for cars varies with age group. The younger age group (20-30) tends to buy more cars than
the middle-aged (31-45) and older (46-55) age groups. There is also a
fluctuation in the buying pattern between ages 35 and 40: car sales in this age
group are slightly better than in the remaining age groups, second only to the young age group.
Analyzing on No of Dependents
If we look at the histogram of number of dependents in the dataset, we can see that the data
is bimodal, i.e. there are two modes in the number of dependents: 3 and 2.
I have also plotted the boxplot to better understand the number of dependents data.
No median line is displayed in the box plot because the 25th and 50th percentiles of the
observations have the same value, i.e. 2; as a result, the median and lower quartile
overlap. In addition, from the above box plot we can infer that there is an outlier in the
number of dependents column, which may or may not be treated, depending on business context.
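The two observations above (two modes, and a median that coincides with the lower quartile) can be checked numerically rather than read off the plots. The values below are made up to reproduce the same pattern; they are not the report's data.

```python
import pandas as pd

# Hypothetical dependents data in which 2 and 3 are equally frequent.
dependents = pd.Series([0, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4], name="No_of_Dependents")

modes = dependents.mode().tolist()  # all values tied for highest frequency
q1, median, q3 = dependents.quantile([0.25, 0.5, 0.75])

# When Q1 equals the median, the box plot draws both on the same line,
# so no separate median line is visible inside the box.
median_hidden = q1 == median
print(modes, q1, median, q3, median_hidden)
```

Note that `mode()` returns more than one value only on an exact tie; on real data a histogram can look bimodal even when one peak is slightly taller.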
From the above graph, we can see that the histogram for salary is right skewed.
In addition, the box plot of the salary variable reconfirms that the salary data is
right skewed and that there are no outliers in the data.
Analyzing the Partner_salary Variable
From the above histogram plot, we can see that the partner salary data is right skewed.
From the above histogram, we can see that the total salary data is slightly right skewed.
Post outlier treatment, the boxplot doesn’t show any outliers, and the median is 80,000.
However, on a closer look, the boxplot also conveys that the total salary data is
slightly right skewed.
Analyzing the Price Variable
From the above graph, we can see that the Prices of cars are right skewed.
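Right skew can also be confirmed numerically: the sample skewness is positive and the mean sits above the median. The prices below are made up for illustration.

```python
import pandas as pd

# Hypothetical car prices with a long right tail.
price = pd.Series([18000, 20000, 22000, 25000, 27000, 30000, 35000, 70000], name="Price")

skewness = price.skew()  # > 0 indicates a right (positive) skew
mean_price = price.mean()
median_price = price.median()

print(round(skewness, 2), mean_price, median_price)
```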
From the above graph we can see that the majority of observations in the Gender category
are Male, at 1252 (79.2% of the gender data), while Female stands at 329
(20.8% of the gender data).
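As a quick arithmetic check, the percentages follow directly from the two counts reported above. On the full data the same figures would come from something like `df["Gender"].value_counts(normalize=True)`, where `df` is an assumed name for the cleaned frame.

```python
import pandas as pd

# Counts as reported in the text.
gender_counts = pd.Series({"Male": 1252, "Female": 329})

total = int(gender_counts.sum())
gender_pct = (gender_counts / total * 100).round(1)

print(total)       # total number of customers
print(gender_pct)  # share of each gender in percent
```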
From the above countplot, it can be seen that married customers outnumber single
customers in the dataset.
From the above countplot graph, we can see that Post Graduate customers are more than
Graduate customers in the dataset. In the dataset given, there are 985 Post Graduates and
596 Graduates.
From the below countplot, we can see that the numbers of customers with and without a
personal loan are more or less the same, with only a slight difference: 792 customers have a
personal loan and 789 do not.
Analyzing the House Loan Category Variable
From the above graph, we can see that the number of customers not availing the house loan
(1054) is greater than the number of customers availing the house loan (527).
From the above countplot, we can see that working partners outnumber non-working partners
in the dataset. The number of partners working stands at 868 (55% of the data),
while partners not working number 713 (45% of the data).
Analyzing the Make Category Variable
From the above graph, we can see that preference for Sedan (702) amongst the customers
is high, followed by Hatchback (582) and thereafter SUV (297) in the given dataset.
Conclusion: All the features of the data (both categorical and numeric) can be analysed
separately by univariate analysis.
To understand the relationships between the variables and the dataset better, we need to do
bivariate analysis.
BIVARIATE ANALYSIS:
By above graph we can see that majority of population in the dataset prefer to have
From the above bar graph of Make and Marital status, we can see that there is a higher preference for
Sedan overall.
From the above graph, we can see that, of the customers availing a house
loan, more than 50% prefer Sedan, followed by Hatchback and SUV, while of the
customers not availing a house loan, more than 41% prefer Sedan, followed by Hatchback and SUV.
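Shares of this kind ("of those with a house loan, what percentage prefer each make") are row-normalised cross-tabulations. The sketch below uses made-up rows; the column name ‘House_loan’ and its Yes/No coding are assumptions.

```python
import pandas as pd

# Hypothetical customers; 'House_loan' and its Yes/No coding are assumed names.
df = pd.DataFrame({
    "House_loan": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Make": ["Sedan", "Sedan", "Sedan", "Hatchback",
             "Sedan", "Sedan", "Sedan", "Hatchback", "Hatchback", "SUV"],
})

# normalize="index" makes each row (loan status) sum to 1; scale to percent.
share = pd.crosstab(df["House_loan"], df["Make"], normalize="index") * 100

print(share.round(1))
```

Normalising by row rather than over the whole table is what allows the two loan groups to be compared even though they differ in size.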
6. Relationship between Salary and Type of car
From the above bar plot, we can see that the average salary of the customers who prefer SUV is
greater than that of customers who prefer Sedan or Hatchback, which indirectly implies that SUV is
a high-range car.
From the above bar plot, we can see that the average total salary of the customers (which includes
their partner's salary) who prefer SUV is greater than that of customers who prefer Sedan or
Hatchback, which again indirectly implies that SUV is a high-range car.
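The averages behind such bar plots are group means per car type. A minimal sketch on made-up figures (the numbers are hypothetical; the column names follow the report):

```python
import pandas as pd

# Hypothetical salaries per car type.
df = pd.DataFrame({
    "Make": ["SUV", "SUV", "Sedan", "Sedan", "Hatchback", "Hatchback"],
    "Salary": [90000, 100000, 70000, 80000, 50000, 60000],
    "Total_salary": [120000, 130000, 90000, 100000, 60000, 80000],
})

# Mean Salary and Total_salary for each Make.
avg_by_make = df.groupby("Make")[["Salary", "Total_salary"]].mean()

print(avg_by_make.sort_values("Salary", ascending=False))
```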
9) Bivariate Analysis using Pairplot (please refer to the Jupyter file, as the full image could not be captured):
The pair plot displays the pairwise relationships between the variables in the dataset. From the above
pair plot we can see that, for most pairs of variables, there is
weak or no correlation. However, there is a correlation between the data points for
Salary and Total Salary, Total Salary and Age, etc. Each diagonal graph refers to the same
variable on both the x and y axes. The graphs above the diagonal are mirror
images of the graphs below the diagonal. The extent of correlation, however, is not
depicted in the pair plot; for that, a correlation function needs to be applied.
From the below table and heatmap, we can see that there is some (although weak) correlation
between Total Salary and Age and between Salary and Price, meaning that as Age increases, total
salary also increases; the same holds for Salary and Price (of the car). We also see a very strong
correlation between Age and Price of the car, which implies that, as age
increases, the customer’s capacity to buy a higher-priced car increases. There is also high
correlation between Salary and Total Salary and, similarly, between Partner salary
and Total Salary, which is understandable.
Between the rest of the variables, the correlation is either very weak or negative.
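The correlation table behind a heatmap like this is typically the Pearson correlation matrix of the numeric columns. The sketch below computes it on made-up values; the figures are hypothetical and only illustrate the shape of the output.

```python
import pandas as pd

# Hypothetical numeric columns.
df = pd.DataFrame({
    "Age": [22, 26, 30, 35, 40, 46, 52],
    "Salary": [50000, 58000, 61000, 70000, 66000, 80000, 85000],
    "Price": [20000, 24000, 26000, 33000, 38000, 47000, 55000],
})

# Pearson correlation (the pandas default) between every pair of columns.
corr = df.corr()

print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, which is the numeric counterpart of the mirror-image layout noted for the pair plot.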
E. Employees working on the existing marketing campaign have made the following
remarks. Based on the data and your analysis, state whether you agree or disagree
with their observations. Justify your answer based on the data available.
E1) Steve Roger says “Men prefer SUV by a large margin, compared to the women”
To analyse the above statement, the following bar plot was plotted.
From the above graph and table, we can see that statement E1, i.e. Steve Roger saying “Men
prefer SUV by a large margin, compared to the women”, does not hold true.
E2) Ned Stark believes that a salaried person is more likely to buy a Sedan.
From the above graph, we can conclude that statement E2 holds true.
If we compare the salaried class's preference across SUV, Sedan and Hatchback, we see that
Sedan takes the largest share of the salaried data.
Hence, the probability of owning a Sedan amongst the salaried class is high.
E3) Sheldon Cooper does not believe any of them; he claims that a salaried male is an easier target
for a SUV sale over a Sedan Sale.
From the above graph and table analysis, we can conclude that statement E3 doesn’t hold
true. A salaried male is an easier target for a Sedan sale than for an SUV sale.
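The three statements can each be tested with a count or proportion table. The sketch below uses made-up rows arranged to mirror the conclusions above; the column name ‘Profession’ and its values are assumptions.

```python
import pandas as pd

# Hypothetical customers; 'Profession' is an assumed column name.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Profession": ["Salaried", "Salaried", "Salaried", "Salaried", "Business",
                   "Salaried", "Salaried", "Business"],
    "Make": ["Sedan", "Sedan", "Hatchback", "SUV", "Hatchback", "SUV", "Sedan", "SUV"],
})

# E1: share of each make within each gender.
by_gender = pd.crosstab(df["Gender"], df["Make"], normalize="index")

# E2: make counts among salaried customers.
salaried = df[df["Profession"] == "Salaried"]
salaried_counts = salaried["Make"].value_counts()

# E3: make counts among salaried males only.
salaried_male_counts = salaried[salaried["Gender"] == "Male"]["Make"].value_counts()

print(by_gender.round(2))
print(salaried_counts)
print(salaried_male_counts)
```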
F. From the given data, comment on the amount spent on purchasing automobiles across
the following categories. Comment on how a Business can utilize the results from this
exercise. Give justification along with presenting metrics/charts used for arriving at the
conclusions.
F1) Gender
As per the above graph, we can say that female customers have, overall, bought more expensive cars
than male customers.
F2) Personal_loan
With reference to the above plot, customers who do not take a personal loan buy more expensive cars
than those who avail a personal loan.
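"More expensive" can mean a higher average price or a higher total spend, so it helps to tabulate count, mean and sum side by side for each category. The prices below are made up; they are chosen so that the two measures disagree for Gender, which shows why the distinction matters.

```python
import pandas as pd

# Hypothetical purchases.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Personal_loan": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Price": [30000, 40000, 25000, 35000, 45000, 55000],
})

# Number of purchases, average price and total spend per category.
by_gender = df.groupby("Gender")["Price"].agg(["count", "mean", "sum"])
by_loan = df.groupby("Personal_loan")["Price"].agg(["count", "mean", "sum"])

print(by_gender)
print(by_loan)
```

In this toy data, females have the higher mean price while males have the higher total spend, simply because there are more male buyers.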
G. From the current data set comment if having a working partner leads to the
purchase of a higher-priced car.
From the above graph we can conclude that whether the partner is working or not makes little difference:
customers buy their preferred car either way. There is, however, a marginal difference between the
two groups, which suggests that customers whose partner is not working buy slightly more expensive cars.
H. The main objective of this analysis is to devise an improved marketing strategy to send targeted
information to different groups of potential buyers present in the data. For the current analysis
use the Gender and Marital_status - fields to arrive at groups with similar purchase history.
❖ Female customers buy more SUVs than Sedans; Hatchback takes the last position in
the buying preference list for females.
Counts taken by Gender and Make
Married customers buy more Sedans than Hatchbacks, and SUV is their last
choice.
Single customers buy more Hatchbacks than Sedans, and SUV takes the last position
among their choices.
Single business professionals tend to buy more Hatchbacks than Sedans, with very few
preferring to buy an SUV.
Salaried and married customers prefer Sedan over Hatchback, and SUV is their last
choice.
Salaried and single customers prefer Hatchback, followed by Sedan, with fewer choosing SUV.
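Segment-level preferences like these can be derived by counting makes within each Gender and Marital_status group and taking the most frequent one. The rows below are made up to mirror the pattern described; they are not the report's data.

```python
import pandas as pd

# Hypothetical customers.
df = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "Marital_status": ["Married", "Married", "Married", "Single", "Single", "Single",
                       "Married", "Married", "Married", "Single"],
    "Make": ["Sedan", "Sedan", "Hatchback", "Hatchback", "Hatchback", "Sedan",
             "SUV", "SUV", "Sedan", "SUV"],
})

# Rows: (Gender, Marital_status) segments; columns: purchases of each Make.
segment_counts = (
    df.groupby(["Gender", "Marital_status", "Make"]).size().unstack(fill_value=0)
)

# Most purchased make in each segment.
top_make = segment_counts.idxmax(axis=1)

print(segment_counts)
print(top_make)
```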
Overall, we can see that SUVs have very low popularity amongst Male segment (both single and
married). The company can further investigate and try to find out reasons behind the same in order
to improve their topline for SUVs in Male category. Similarly, Sedan seems to be the top choice for
married males and Hatchback enjoys popularity amongst the single males. Based on the information
and analysis, the company can customise and provide targeted information
regarding festive offers, deals, new launches to the identified segments for their preferred car make,
in order to boost their topline and margins.