
Detail Project Report SMDM

Uploaded by

Deepak Padiyar

Graded Project SMDM

Problem 1

A. What is the important technical information about the dataset that a database
administrator would be interested in? (Hint: Information about the size of the dataset and the
nature of the variables)

Comments:
Data Frame Provided: Austo Motor Company

 Total Rows = 1581
 Total Columns = 14
 Float Type Data = 1
 Integer Type Data = 5
 Object Type Data = 8
 Total Memory used = 173 KB

Jupyter snapshot

The above table shows that the dataset has 14 columns and 1581 rows in all. It also
shows that there are missing values in a few of the columns, namely Gender and
Partner_salary.
 Column ‘Gender’ has 53 null values, whereas column ‘Partner_salary’ has 106 null values
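
The figures above (rows, columns, data types, memory, null counts) can be collected in one place. The following is a minimal sketch using pandas; the helper name `summarize_dataset` and the small sample frame are assumptions made for illustration and are not taken from the project notebook.

```python
import pandas as pd

def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect size, dtype counts, memory use and null counts of a DataFrame."""
    null_counts = df.isnull().sum()
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # Number of columns per data type
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        # Shallow memory estimate in KB
        "memory_kb": df.memory_usage(deep=False).sum() / 1024,
        # Keep only the columns that actually contain missing values
        "nulls": null_counts[null_counts > 0].to_dict(),
    }

# Tiny illustrative frame (hypothetical values, not the Austo data)
sample = pd.DataFrame({
    "Age": [25, 32, 41],
    "Gender": ["Male", None, "Female"],
    "Partner_salary": [0.0, None, 30000.0],
})
summary = summarize_dataset(sample)
print(summary["rows"], summary["columns"], summary["nulls"])
```

On the real data frame, `df.info()` prints the same kind of information in a single call.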

B. Take a critical look at the data and do a preliminary analysis of the variables. Do a quality
check of the data so that the variables are consistent? Are there any discrepancies present
in the data?

 Statistical analysis of the data is shown below

 Column ‘Gender’ has 53 null values, whereas column ‘Partner_salary’ has 106
null values
 No duplicate rows were found

 The unique values identified in column “Gender” are shown below.

 Two spelling errors were found in the column “Gender”: ‘Femal’ and ‘Femle’

Both spelling errors have been corrected and replaced; the result of the
correction in the data values is shown below

 On checking, the mode of the ‘Gender’ data is ‘Male’, as per the plot below.

 Null values (total 53) are replaced with the mode of the ‘Gender’ column


 Replaced NaN value with – Yes in ‘Partner_salary’ with below calculation

 There are now no null values in the data set after treatment of the ‘Gender’ and
‘Partner_salary’ columns, as per the above table
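
The Gender clean-up described above (fixing the two misspellings, then filling nulls with the mode) can be sketched as follows. The function name `clean_gender` and the sample values are hypothetical; only the misspellings ‘Femal’ and ‘Femle’ and the use of the mode come from the report.

```python
import pandas as pd

def clean_gender(gender: pd.Series) -> pd.Series:
    """Fix the two misspellings, then fill nulls with the most frequent value."""
    fixed = gender.replace({"Femal": "Female", "Femle": "Female"})
    most_frequent = fixed.mode()[0]  # mode() ignores missing values by default
    return fixed.fillna(most_frequent)

# Hypothetical sample containing both misspellings and one missing value
raw = pd.Series(["Male", "Femal", None, "Male", "Femle", "Male", "Male"])
cleaned = clean_gender(raw)
print(cleaned.value_counts().to_dict())
```

On a full data frame, `df.duplicated().sum()` gives the number of duplicate rows, which is the check behind the "no duplicate rows" remark.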

Imputation for missing data

Since we know that there are missing values in Partner_salary and Gender, we need to
impute them to make the data consistent.
There are several techniques for treating missing values in a data set, and the choice depends on
the type of data being dealt with. They are as below:
 Drop the missing values: In this case, the rows with missing values are dropped. This is
suitable when there are very few missing values.
 Impute with mean value: For a numerical column, the missing values can be imputed with
the mean. Before replacing with the mean, it is advisable to check that the variable
does not have extreme values, i.e. outliers.
 Impute with median value: For a numerical column, you can also replace the missing values
with the median. If the column has extreme values such as outliers, it is advisable to use
the median approach.
 Impute with mode value: For a categorical column, you can replace the missing values with
the mode, i.e. the most frequent value.
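
The four options listed above can be compared side by side. This is a small sketch under stated assumptions: the helper `impute` and the sample salaries are hypothetical, chosen so that one extreme value shows why the median is preferred when outliers are present.

```python
import pandas as pd

def impute(series: pd.Series, strategy: str) -> pd.Series:
    """Treat missing values using one of: drop, mean, median, mode."""
    if strategy == "drop":
        return series.dropna()
    if strategy == "mean":
        return series.fillna(series.mean())
    if strategy == "median":
        return series.fillna(series.median())
    if strategy == "mode":
        return series.fillna(series.mode()[0])
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical numeric column with one missing value and one extreme value
salaries = pd.Series([30000.0, 32000.0, None, 31000.0, 250000.0])
mean_filled = impute(salaries, "mean")      # pulled upwards by the extreme value
median_filled = impute(salaries, "median")  # stays close to the typical salary

# Hypothetical categorical column
answers = pd.Series(["Yes", "No", None, "Yes"])
mode_filled = impute(answers, "mode")
print(mean_filled[2], median_filled[2], mode_filled[2])
```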

 Checking outliers in the data using the boxplots below

[Boxplots: Age, Salary, No of Dependents, Price, Total Salary, Partner Salary]

 There are outliers in the ‘No_of Dependants’ column as well as in ‘Total_salary’, as per the above boxplots.
 I will treat the outliers for ‘Total_salary’ only, because a dependents value of 0 is plausible and
treating the dependents column could mislead the analysis.
 The mean of Total_salary is also taken, so as not to distort the analysis, since the mean
gives an overall representation of the data.

Mean of Total_salary is 79625.99620493359

Treating outliers (Total_salary)

Upper range = 149000

Lower range = 7400

Q1 = 25th percentile

Q3 = 75th percentile

Formulas used:
IQR = Q3 - Q1

lower range = Q1 - (1.5 * IQR)

upper range = Q3 + (1.5 * IQR)

As we can see from the above plot, the outliers have been treated and there are now no outliers in
Total_salary.
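
The IQR formulas above translate directly into code. The report does not say whether the outliers were capped or removed, so the sketch below assumes capping at the lower and upper range; the function name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(values: pd.Series) -> pd.Series:
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (capping is an assumption)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return values.clip(lower=lower, upper=upper)

# Hypothetical sample: the last value lies far above the rest
sample = pd.Series([10, 12, 14, 16, 18, 100])
capped = cap_outliers_iqr(sample)
print(capped.tolist())
```

For this sample Q1 = 12.5 and Q3 = 17.5, so IQR = 5, the lower range is 5 and the upper range is 25; the value 100 is therefore replaced by 25 and the other values are unchanged.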

C. Explore all the features of the data separately by using appropriate
visualizations and draw insights that can be utilized by the business.

Comments:

 Statistical summary of the numeric variables in the data.
 Analyzing the Age Variable

With reference to the above boxplot for Austo Motor Company, we can say that the buying
pattern for cars varies with age group. The younger age group (20-30) tends to buy more cars
than the middle-aged (31-45) and older (46-55) age groups. There is also a fluctuation in the
buying pattern for the 35-40 age group: sales in this group are slightly better than in the
remaining age groups, second only to the young age group.

 Analyzing the No of Dependents Variable

If we look at the histogram of the number of dependents in the dataset, we can see that the data
is bimodal, i.e. there are two modes in the number of dependents data, namely 3 and 2.
A boxplot was also plotted in order to better understand the number of dependents data. No
median line is displayed in the box plot because the 25th and 50th percentiles of the
observations have the same value, i.e. 2; as a result, the median and lower quartile overlap.
In addition, from the above box plot we can infer that there is an outlier in the number of
dependents column, which may or may not be treated, depending on the business context.
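
Both observations (two modes, and a median that coincides with the lower quartile) can be checked numerically. The values below are a hypothetical sample constructed to reproduce the pattern, not the actual dependents column.

```python
import pandas as pd

# Hypothetical dependents counts: 2 and 3 each occur four times
dependents = pd.Series([0, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4])

modes = dependents.mode().tolist()        # all values tied for most frequent
lower_quartile = dependents.quantile(0.25)
median = dependents.median()

# When the lower quartile equals the median, the median line sits on the
# bottom edge of the box and is not visible as a separate line.
print(modes, lower_quartile, median)
```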

 Analyzing the Salary Variable

From the above graph, we can see that the histogram for salary is right skewed.

In addition, the box plot of the salary variable reconfirms that the salary data is
right skewed and that there are no outliers in the data.

 Analyzing the Partner_salary Variable

From the above histogram plot, we can see that the partner salary data is right skewed.

 Analyzing the Total_salary Variable

From the above histogram, we can see that the total salary data is slightly right skewed.

Post outlier treatment, we can see that the boxplot doesn’t show any outliers, and the
median is 80,000. However, on a closer look, the boxplot also conveys that the total salary
data is slightly right skewed.

 Analyzing the Price Variable

From the above graph, we can see that the Prices of cars are right skewed.

 Analyzing the Gender Category Variable

From the above graph we can see that the majority of the observations in the Gender category
belong to Male, which stands at 1252 (79.2% of the gender data), while the Female count
stands at 329 (20.8% of the gender data).
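
The percentages quoted above follow from the counts. A short check in plain Python, using only the two counts given in the report:

```python
# Counts as reported for the Gender column after cleaning
counts = {"Male": 1252, "Female": 329}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 1581
print(shares)  # {'Male': 79.2, 'Female': 20.8}
```

With a pandas column, `value_counts(normalize=True)` returns the same shares as fractions.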

 Analyzing the Profession Category Variable


From the above countplot graph, we can see that Salaried customers are more than Business
customers in the dataset. In the dataset given, there are 896 salaried and 685 business
customers

 Analyzing the Marital_status Category Variable

From the above countplot graph, it can be seen that married customers outnumber single
customers in the dataset.

 Analyzing the Education Category Variable

From the above countplot graph, we can see that Post Graduate customers are more than
Graduate customers in the dataset. In the dataset given, there are 985 Post Graduates and
596 Graduates.

 Analyzing the Personal Loan Category Variable

From the below countplot, we can see that the numbers of customers with and without a personal
loan are more or less the same, with only a slight difference: 792 customers have a personal
loan and 789 do not.

 Analyzing the House Loan Category Variable

From the above graph, we can see that the number of customers not availing the house loan
(1054) is greater than the number of customers availing the house loan (527).

 Analyzing the Partner Working Category Variable

From the above countplot, we can see that partners working outnumber partners not
working in the dataset. The number of partners working stands at 868 (55% of the data),
while partners not working number 713 (45% of the data).

 Analyzing the Make Category Variable

From the above graph, we can see that preference for Sedan (702) amongst the customers
is high, followed by Hatchback (582) and thereafter SUV (297) in the given dataset.

Conclusion: All the features of the data (both categorical and numeric) can be analysed
separately by univariate analysis.

D. Understanding the relationships among the variables in the dataset is
crucial for every analytical project. Perform analysis on the data fields to gain
deeper insights. Comment on your understanding of the data.

For understanding the relationship between the variables, we need to do bivariate analysis,
to better understand the dataset

BIVARIATE ANALYSIS:

1) Relationship between the level of education and type of car

From the above graph, the level of education does not have a significant impact on the type of
car owned by the individual.

2) Relationship between profession and type of car

From the above graph, we can see that the majority of customers in the dataset prefer
Sedan, in both the Business and Salaried classes.

3) Relationship between marital status and type of car

From the above bar graph of Make against Marital status, we can see that there is a higher
preference for Sedan overall.

4) Relationship between working partner and type of car


From the above graph and corresponding cross tab, we can see that in general the preference for
Sedan is on the higher side, whether the partner is working or not.

5. Relationship between House loan and type of car

From the above graph, we can see that among customers availing a house loan, more than 50%
prefer Sedan, followed by Hatchback and SUV, while among customers not availing a house loan,
more than 41% prefer Sedan, followed by Hatchback and SUV.
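
Shares such as "more than 50% of house-loan customers prefer Sedan" are row-wise percentages of a cross tab. The sketch below shows the calculation with pandas on a hypothetical sample; the column names `House_loan` and `Make` are assumed, and the resulting numbers are illustrative only.

```python
import pandas as pd

# Hypothetical sample, not the Austo data
sample = pd.DataFrame({
    "House_loan": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Make": ["Sedan", "Sedan", "Sedan", "Hatchback",
             "Sedan", "Sedan", "Hatchback", "Hatchback", "SUV"],
})

# normalize="index" divides each row by its row total, so every row sums to 100
shares = pd.crosstab(sample["House_loan"], sample["Make"], normalize="index") * 100
print(shares.round(1))
```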
6. Relationship between Salary and Type of car

From the above bar plot, we can see that the average salary of the customers who prefer SUV is
greater than that of customers who prefer Sedan or Hatchback. This indirectly implies that SUV
is a high-range car.

7. Relationship between Total Salary and type of car

From the above bar plot, we can see that the average total salary of the customers (which
includes their partner's salary) who prefer SUV is greater than that of customers who prefer
Sedan or Hatchback. This again indirectly implies that SUV is a high-range car.

8. Relationship between age of customers and type of car purchased

From the above bar plot, we can see that the average age of customers who buy SUV is the highest,
followed by the average age of customers purchasing Sedan and thereafter Hatchback.
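
The bar plots in items 6 to 8 compare group averages by type of car. A minimal sketch of that calculation, assuming columns named `Make`, `Salary` and `Age` and using hypothetical values:

```python
import pandas as pd

# Hypothetical sample, not the Austo data
sample = pd.DataFrame({
    "Make": ["SUV", "SUV", "Sedan", "Sedan", "Hatchback", "Hatchback"],
    "Salary": [90000, 80000, 60000, 70000, 50000, 40000],
    "Age": [50, 46, 35, 37, 25, 29],
})

# Average salary and age for each type of car
means = sample.groupby("Make")[["Salary", "Age"]].mean()
highest_salary_make = means["Salary"].idxmax()
print(means)
print(highest_salary_make)
```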

9) Bivariate Analysis using Pairplot (please refer to the Jupyter file, as the full image could not be captured):

The pair plot displays the pairwise relationships between the variables in the dataset. From the
above pair plot we can see that for most pairs of variables there is weak or no
correlation. However, there is a correlation between Salary and Total Salary, Total Salary
and Age, etc. Each diagonal graph shows the same variable on both the x and y axes. The
graphs above the diagonal are mirror images of the graphs below it. The extent of
correlation, however, is not quantified in the pair plot; for that, a correlation function
needs to be applied.

10. Bivariate Analysis using Correlation and Heatmap:

For the multivariate analysis, correlation was computed and a heatmap plotted, in order to
understand the extent of correlation between the numeric variables. The correlation function
in Python gave the following result:

From the below table and heatmap, we can see that there is some correlation (although weak)
between Total Salary and Age and between Salary and Price, meaning that as Age increases, total
salary also increases; the same holds for Salary and Price (of the car). We also see a very
strong correlation between Age and Price of the car, which implies that as age
increases, the customer’s capacity to buy a higher-priced car increases. There is also a high
correlation between Salary and Total Salary and, similarly, between Partner salary
and Total Salary, which is understandable since total salary includes both.
Between the remaining pairs of variables, the correlation is either very weak or negative.
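
The correlation table behind the heatmap can be produced as sketched below. The column names and values are hypothetical; the only point carried over from the report is that correlation is computed on the numeric variables.

```python
import pandas as pd

# Hypothetical sample, not the Austo data
sample = pd.DataFrame({
    "Make": ["SUV", "Sedan", "Sedan", "Hatchback"],
    "Salary": [50000, 60000, 70000, 80000],
    "Partner_salary": [0, 20000, 0, 30000],
})
sample["Total_salary"] = sample["Salary"] + sample["Partner_salary"]

# Keep numeric columns only, then compute pairwise Pearson correlation
corr = sample.select_dtypes(include="number").corr()
print(corr.round(2))
```

A heatmap can then be drawn from `corr`, for example with seaborn's `heatmap` function.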
E. Employees working on the existing marketing campaign have made the following
remarks. Based on the data and your analysis, state whether you agree or disagree
with their observations. Justify your answer based on the data available.
E1) Steve Roger says “Men prefer SUV by a large margin, compared to the women”
To analyse the above statement, following bar plot was plotted

From the above graph and table, we can see that the E1 statement i.e. Steve Roger saying “Men
prefer SUV by a large margin, compared to the women”, does not hold true.
E2) Ned Stark believes that a salaried person is more likely to buy a Sedan.

Data visualization for make of the car and profession

From the above graph, we can conclude that the statement E2 holds true.
If we compare the salaried class's preference across SUV, Sedan and Hatchback, Sedan is the
most preferred type of car.

Hence, the probability of owning a Sedan amongst the Salaried class is high.

E3) Sheldon Cooper does not believe any of them; he claims that a salaried male is an easier target
for an SUV sale than a Sedan sale.

From the above graph and table analysis, we can conclude that the statement E3 doesn’t hold
true. A salaried male is an easier target for a Sedan sale than for an SUV sale.
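
Statements such as E1 to E3 can be tested by filtering to the group in question and comparing the shares of each type of car. The helper `make_shares`, the column names and the sample rows below are hypothetical, so the output illustrates the method rather than the report's results.

```python
import pandas as pd

def make_shares(df: pd.DataFrame, **conditions) -> pd.Series:
    """Share of each Make among the rows matching all column=value conditions."""
    subset = df
    for column, value in conditions.items():
        subset = subset[subset[column] == value]
    return subset["Make"].value_counts(normalize=True)

# Hypothetical sample, not the Austo data
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Profession": ["Salaried", "Salaried", "Salaried", "Business",
                   "Salaried", "Business", "Business"],
    "Make": ["Sedan", "Sedan", "SUV", "Hatchback", "SUV", "SUV", "Sedan"],
})

salaried_male = make_shares(sample, Gender="Male", Profession="Salaried")
female = make_shares(sample, Gender="Female")
print(salaried_male)
print(female)
```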
F. From the given data, comment on the amount spent on purchasing automobiles across
the following categories. Comment on how a Business can utilize the results from this
exercise. Give justification along with presenting metrics/charts used for arriving at the
conclusions.
F1) Gender

As per the above graph, female customers have, overall, bought more expensive cars than
male customers.

F2) Personal_loan

With reference to the above plot, customers who do not take a personal loan buy more expensive
cars than those who avail a personal loan.
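
When comparing the amount spent across categories, it helps to look at the count, the average and the total together, because a group with fewer customers can have a higher average price and a lower total. A sketch with hypothetical values, assuming columns named `Gender` and `Price`:

```python
import pandas as pd

# Hypothetical sample, not the Austo data
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Price": [50000, 40000, 30000, 35000, 40000],
})

# Count, average price and total spend per category
spend = sample.groupby("Gender")["Price"].agg(["count", "mean", "sum"])
print(spend)
```

The same call works for Personal_loan or any other categorical column by changing the grouping column.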
G. From the current data set comment if having a working partner leads to the
purchase of a higher-priced car.

From the above graph we can conclude that whether or not the partner is working makes little
difference: customers buy their preferred car either way. There is, however, a marginal
difference, which shows that customers whose partner is not working buy slightly more expensive cars.

H. The main objective of this analysis is to devise an improved marketing strategy to send targeted
information to different groups of potential buyers present in the data. For the current analysis
use the Gender and Marital_status - fields to arrive at groups with similar purchase history.

❖ Female customers buy more SUVs than Sedans; Hatchback takes the last position in
the buying preference list for females.
Counts taken by Gender and Make

 Married customers buy more Sedans than Hatchbacks, and SUV is their last
choice.
 Single customers buy more Hatchbacks than Sedans, and SUV takes the last position
among their choices.

The insights below can be derived from the data and graph below


✓ There are 1443 married and 138 single customers in total; hence there are more married
customers in the company's records.
✓ Married business professionals prefer to buy Sedan, followed by Hatchback, with SUV
as their last choice.

 Single business professionals tend to buy more Hatchbacks than Sedans, with very few
preferring to buy SUV.
 Salaried and married customers prefer to buy Sedan over Hatchback, and SUV is their last
choice.
 Salaried and single customers prefer to buy Hatchback, followed by Sedan, with fewer choosing SUV

Overall, we can see that SUVs have very low popularity amongst the male segment (both single and
married). The company can investigate further to find the reasons behind this, in order
to improve its topline for SUVs in the male category. Similarly, Sedan seems to be the top choice
for married males, and Hatchback enjoys popularity amongst single males. Based on this
information and analysis, the company can customise and provide targeted information
regarding festive offers, deals and new launches to the identified segments for their preferred
car make, in order to boost its topline and margins.
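
The grouping by Gender and Marital_status can be sketched as a two-level cross tab, from which the most purchased type of car per segment is read off. The sample rows are hypothetical and the column names are assumed; the real counts come from the project notebook.

```python
import pandas as pd

# Hypothetical sample, not the Austo data
sample = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "Marital_status": ["Married", "Married", "Married", "Single", "Single", "Single",
                       "Married", "Married", "Married", "Single"],
    "Make": ["Sedan", "Sedan", "Hatchback", "Hatchback", "Hatchback", "Sedan",
             "SUV", "SUV", "Sedan", "Hatchback"],
})

# Purchase counts per (Gender, Marital_status) segment and type of car
segments = pd.crosstab([sample["Gender"], sample["Marital_status"]], sample["Make"])

# Most purchased type of car in each segment
top_make = segments.idxmax(axis=1)
print(segments)
print(top_make)
```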
