Exploratory Data Analysis
1.1 Objectives
• Explore features or characteristics to predict price of car
• Analyze patterns and run descriptive statistical analysis
• Group data based on identified parameters and create pivot tables
• Identify the effect of independent attributes on price of cars
Table of Contents
Import Data from Module
Analyzing Individual Feature Patterns using Visualization
Descriptive Statistical Analysis
Basics of Grouping
Correlation and Causation
What are the main characteristics that have the most impact on the car price?
The code below loads the dataset and stores it in the dataframe df. This dataset was hosted on IBM Cloud Object Storage.
[4]: import pandas as pd
[5]: file_name = "usedcars.csv"
[6]: df = pd.read_csv(file_name)
[7]: #filepath='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'
[8]: df.head()
0 5000.0 21 8.703704 13495.0 11.190476 Low
1 5000.0 21 8.703704 16500.0 11.190476 Low
2 5000.0 19 9.038462 16500.0 12.368421 Medium
3 5500.0 24 7.833333 13950.0 9.791667 Low
4 5500.0 18 10.681818 17450.0 13.055556 Low
[5 rows x 31 columns]
Unnamed: 0 int64
symboling int64
normalized-losses int64
make object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower int64
peak-rpm float64
city-mpg int64
highway-mpg float64
price float64
city-L/100km float64
horsepower-binned object
fuel-type-diesel bool
fuel-type-gas bool
aspiration-std bool
aspiration-turbo bool
dtype: object
[11]: df['peak-rpm'].dtypes
[11]: dtype('float64')
For example, we can calculate the correlation between variables of type “int64” or “float64” using
the method “corr”:
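A minimal sketch of that call, using a small hypothetical frame in place of the full dataset (`numeric_only=True` is needed in pandas ≥ 2.0 when object columns are present):

```python
import pandas as pd

# Small hypothetical frame standing in for the car dataset (illustrative values only)
sample = pd.DataFrame({
    "engine-size": [130, 152, 109, 136],
    "highway-mpg": [27, 26, 30, 22],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# corr() computes pairwise Pearson correlations between the numeric columns;
# the diagonal is always 1 because each variable correlates perfectly with itself
corr_matrix = sample.corr(numeric_only=True)
print(corr_matrix)
```

The result is a square, symmetric matrix like the one shown below, with one row and one column per numeric variable.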
length width height curb-weight engine-size \
Unnamed: 0 0.161848 0.043976 0.252015 0.064820 -0.047764
symboling -0.365404 -0.242423 -0.550160 -0.233118 -0.110581
normalized-losses 0.019424 0.086802 -0.373737 0.099404 0.112360
wheel-base 0.876024 0.814507 0.590742 0.782097 0.572027
length 1.000000 0.857170 0.492063 0.880665 0.685025
width 0.857170 1.000000 0.306002 0.866201 0.729436
height 0.492063 0.306002 1.000000 0.307581 0.074694
curb-weight 0.880665 0.866201 0.307581 1.000000 0.849072
engine-size 0.685025 0.729436 0.074694 0.849072 1.000000
bore 0.608971 0.544885 0.180449 0.644060 0.572609
stroke 0.123952 0.188822 -0.060663 0.167438 0.205928
compression-ratio 0.159733 0.189867 0.259737 0.156433 0.028889
horsepower 0.579795 0.615056 -0.087001 0.757981 0.822668
peak-rpm -0.285970 -0.245800 -0.309974 -0.279361 -0.256733
city-mpg -0.665192 -0.633531 -0.049800 -0.749543 -0.650546
highway-mpg 0.707108 0.736728 0.084301 0.836921 0.783465
price 0.690628 0.751265 0.135486 0.834415 0.872335
city-L/100km 0.657373 0.673363 0.003811 0.785353 0.745059
peak-rpm city-mpg highway-mpg price city-L/100km
curb-weight -0.279361 -0.749543 0.836921 0.834415 0.785353
engine-size -0.256733 -0.650546 0.783465 0.872335 0.745059
bore -0.267392 -0.582027 0.559112 0.543155 0.554610
stroke -0.063561 -0.033956 0.047089 0.082269 0.036133
compression-ratio -0.435780 0.331425 -0.223361 0.071107 -0.299372
horsepower 0.107884 -0.822192 0.840627 0.809607 0.889482
peak-rpm 1.000000 -0.115413 0.017694 -0.101616 0.115830
city-mpg -0.115413 1.000000 -0.909024 -0.686571 -0.949713
highway-mpg 0.017694 -0.909024 1.000000 0.801118 0.958306
price -0.101616 -0.686571 0.801118 1.000000 0.789898
city-L/100km 0.115830 -0.949713 0.958306 0.789898 1.000000
The diagonal elements are always one; we will study correlation more precisely, using the Pearson correlation, in depth at the end of the notebook.
As the engine-size goes up, the price goes up: this indicates a positive direct correlation between
these two variables. Engine size seems like a pretty good predictor of price since the regression line
is almost a perfect diagonal line.
We can examine the correlation between ‘engine-size’ and ‘price’ and see that it’s approximately
0.87.
Highway mpg is a potential predictor variable of price. Let’s find the scatterplot of “highway-mpg”
and “price”.
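The direction of such a relationship can also be checked numerically by fitting the least-squares line that a regression plot draws over the scatter. This sketch uses hypothetical values in place of the real columns:

```python
import numpy as np

# Hypothetical highway-mpg and price values; the lab reads these from usedcars.csv
highway_mpg = np.array([27.0, 26.0, 30.0, 22.0, 25.0])
price = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0])

# Fit price = slope * mpg + intercept; a negative slope means price
# falls as highway-mpg rises, i.e. an inverse relationship
slope, intercept = np.polyfit(highway_mpg, price, 1)
print(slope)
```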
As highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship
between these two variables. Highway mpg could potentially be a predictor of price.
We can examine the correlation between ‘highway-mpg’ and ‘price’ and see it’s approximately
-0.704.
Peak rpm does not seem like a good predictor of the price at all since the regression line is close
to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of
variability. Therefore, it’s not a reliable variable.
We can examine the correlation between ‘peak-rpm’ and ‘price’ and see it’s approximately -0.101616.
[19]: df[['peak-rpm','price']].corr()
[21]: <Axes: xlabel='stroke', ylabel='price'>
Categorical Variables
These are variables that describe a ‘characteristic’ of a data unit, and are selected from a small
group of categories. The categorical variables can have the type “object” or “int64”. A good way
to visualize categorical variables is by using boxplots.
Let’s look at the relationship between “body-style” and “price”.
We see that the distributions of price between the different body-style categories overlap significantly, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":
Here we see that the distributions of price between the two engine-location categories, front and rear, are distinct enough to take engine-location as a potentially good predictor of price.
Let’s examine “drive-wheels” and “price”.
[24]: # drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)
Here we see that the distribution of price between the different drive-wheels categories differs. As
such, drive-wheels could potentially be a predictor of price.
[25]: df.describe()
The default setting of “describe” skips variables of type object. We can apply the method “describe”
on the variables of type ‘object’ as follows:
[26]: df.describe(include=['object'])
[26]: make num-of-doors body-style drive-wheels engine-location \
count 201 201 201 201 201
unique 22 2 5 3 2
top toyota four sedan fwd front
freq 32 115 94 118 198
Value Counts
Value counts is a good way of understanding how many units of each characteristic/variable we
have. We can apply the “value_counts” method on the column “drive-wheels”. Don’t forget the
method “value_counts” only works on pandas series, not pandas dataframes. As a result, we only
include one bracket df[‘drive-wheels’], not two brackets df[[‘drive-wheels’]].
[27]: df['drive-wheels'].value_counts()
[27]: drive-wheels
fwd 118
rwd 75
4wd 8
Name: count, dtype: int64
[28]: df['drive-wheels'].value_counts().to_frame()
[28]: count
drive-wheels
fwd 118
rwd 75
4wd 8
Let’s repeat the above steps but save the results to the dataframe “drive_wheels_counts” and
rename the column ‘drive-wheels’ to ‘value_counts’.
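Those steps can be sketched as follows, with a small hypothetical series standing in for df['drive-wheels']. Note that pandas ≥ 2.0 names the value_counts column 'count' by default, so the rename below assigns the column name directly rather than relying on the old default:

```python
import pandas as pd

# Hypothetical drive-wheels values standing in for the full column
s = pd.Series(["fwd", "fwd", "rwd", "4wd", "fwd", "rwd"])

drive_wheels_counts = s.value_counts().to_frame()
drive_wheels_counts.columns = ["value_counts"]   # rename the single count column
drive_wheels_counts.index.name = "drive-wheels"
print(drive_wheels_counts)
```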
drive_wheels_counts
[29]: count
drive-wheels
fwd 118
rwd 75
4wd 8
[30]: count
drive-wheels
fwd 118
rwd 75
4wd 8
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
[31]: count
engine-location
front 198
rear 3
After examining the value counts of the engine location, we see that engine location would not be
a good predictor variable for the price. This is because we only have three cars with a rear engine
and 198 with an engine in the front, so this result is skewed. Thus, we are not able to draw any
conclusions about the engine location.
[32]: df['drive-wheels'].unique()
If we want to know, on average, which type of drive wheel is most valuable, we can group “drive-
wheels” and then average them.
We can select the columns ‘drive-wheels’, ‘body-style’ and ‘price’, then assign it to the variable
“df_group_one”.
We can then calculate the average price for each of the different categories of data.
From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel
and front-wheel are approximately the same in price.
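That grouping can be sketched as follows, with a hypothetical mini-frame in place of the real df[['drive-wheels', 'body-style', 'price']] selection:

```python
import pandas as pd

# Hypothetical subset; the real values come from the dataset
df_group_one = pd.DataFrame({
    "drive-wheels": ["rwd", "fwd", "fwd", "4wd", "rwd"],
    "price": [23949.6, 9811.8, 8396.4, 9095.8, 16994.2],
})

# Group by drive-wheels and average the price within each category
avg_price = df_group_one.groupby("drive-wheels", as_index=False)["price"].mean()
print(avg_price)
```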
You can also group by multiple variables. For example, let’s group by both ‘drive-wheels’ and ‘body-
style’. This groups the dataframe by the unique combination of ‘drive-wheels’ and ‘body-style’. We
can store the results in the variable ‘grouped_test1’.
grouped_test1
This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is
like an Excel spreadsheet, with one variable along the column and another along the row. We can
convert the dataframe to a pivot table using the method “pivot” to create a pivot table from the
groups.
In this case, we will leave the drive-wheels variable as the rows of the table, and pivot body-style
to become the columns of the table:
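A sketch of the pivot step, assuming grouped_test1 holds the grouped averages (hypothetical values shown):

```python
import pandas as pd

# Hypothetical grouped averages for a few drive-wheels / body-style combinations
grouped_test1 = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body-style": ["sedan", "wagon", "sedan", "wagon"],
    "price": [9811.8, 9997.3, 21711.8, 16994.2],
})

# drive-wheels stays as the rows; the body-style categories become columns
grouped_pivot = grouped_test1.pivot(index="drive-wheels", columns="body-style")
print(grouped_pivot)
```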
[36]: price \
body-style convertible hardtop hatchback sedan
drive-wheels
4wd NaN NaN 7603.000000 12647.333333
fwd 11595.0 8249.000000 8396.387755 9811.800000
rwd 23949.6 24202.714286 14337.777778 21711.833333
body-style wagon
drive-wheels
4wd 9095.750000
fwd 9997.333333
rwd 16994.222222
Often, we won’t have data for some of the pivot cells. We can fill these missing cells with the value
0, but any other value could potentially be used as well. It should be mentioned that missing data
is quite a complex subject and is an entire course on its own.
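Filling those missing cells can be sketched with a tiny hypothetical pivot fragment containing one empty cell:

```python
import numpy as np
import pandas as pd

# Hypothetical pivot fragment: no 4wd convertibles occur, so that cell is NaN
pivot = pd.DataFrame(
    {"convertible": [np.nan, 11595.0], "hatchback": [7603.0, 8396.4]},
    index=["4wd", "fwd"],
)

grouped_pivot = pivot.fillna(0)   # replace missing cells with 0
print(grouped_pivot)
```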
[37]: price \
body-style convertible hardtop hatchback sedan
drive-wheels
4wd 0.0 0.000000 7603.000000 12647.333333
fwd 11595.0 8249.000000 8396.387755 9811.800000
rwd 23949.6 24202.714286 14337.777778 21711.833333
body-style wagon
drive-wheels
4wd 9095.750000
fwd 9997.333333
rwd 16994.222222
[39]: import matplotlib.pyplot as plt
%matplotlib inline
The heatmap plots the target variable (price) proportional to colour, with the variables ‘drive-wheels’ and ‘body-style’ on the vertical and horizontal axes, respectively. This allows us to visualize how the price is related to ‘drive-wheels’ and ‘body-style’.
The default labels convey no useful information to us. Let’s change that:
import numpy as np

fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

fig.colorbar(im)
plt.show()
Visualization is very important in data science, and Python visualization packages provide great
freedom. We will go more in-depth in a separate Python visualizations course.
The main question we want to answer in this module is, “What are the main characteristics which
have the most impact on the car price?”.
To get a better measure of the important characteristics, we look at the correlation of these variables
with the car price. In other words: how is the car price dependent on this variable?
length width height curb-weight engine-size \
Unnamed: 0 0.161848 0.043976 0.252015 0.064820 -0.047764
symboling -0.365404 -0.242423 -0.550160 -0.233118 -0.110581
normalized-losses 0.019424 0.086802 -0.373737 0.099404 0.112360
wheel-base 0.876024 0.814507 0.590742 0.782097 0.572027
length 1.000000 0.857170 0.492063 0.880665 0.685025
width 0.857170 1.000000 0.306002 0.866201 0.729436
height 0.492063 0.306002 1.000000 0.307581 0.074694
curb-weight 0.880665 0.866201 0.307581 1.000000 0.849072
engine-size 0.685025 0.729436 0.074694 0.849072 1.000000
bore 0.608971 0.544885 0.180449 0.644060 0.572609
stroke 0.123952 0.188822 -0.060663 0.167438 0.205928
compression-ratio 0.159733 0.189867 0.259737 0.156433 0.028889
horsepower 0.579795 0.615056 -0.087001 0.757981 0.822668
peak-rpm -0.285970 -0.245800 -0.309974 -0.279361 -0.256733
city-mpg -0.665192 -0.633531 -0.049800 -0.749543 -0.650546
highway-mpg 0.707108 0.736728 0.084301 0.836921 0.783465
price 0.690628 0.751265 0.135486 0.834415 0.872335
city-L/100km 0.657373 0.673363 0.003811 0.785353 0.745059
peak-rpm city-mpg highway-mpg price city-L/100km
width -0.245800 -0.633531 0.736728 0.751265 0.673363
height -0.309974 -0.049800 0.084301 0.135486 0.003811
curb-weight -0.279361 -0.749543 0.836921 0.834415 0.785353
engine-size -0.256733 -0.650546 0.783465 0.872335 0.745059
bore -0.267392 -0.582027 0.559112 0.543155 0.554610
stroke -0.063561 -0.033956 0.047089 0.082269 0.036133
compression-ratio -0.435780 0.331425 -0.223361 0.071107 -0.299372
horsepower 0.107884 -0.822192 0.840627 0.809607 0.889482
peak-rpm 1.000000 -0.115413 0.017694 -0.101616 0.115830
city-mpg -0.115413 1.000000 -0.909024 -0.686571 -0.949713
highway-mpg 0.017694 -0.909024 1.000000 0.801118 0.958306
price -0.101616 -0.686571 0.801118 1.000000 0.789898
city-L/100km 0.115830 -0.949713 0.958306 0.789898 1.000000
[45]: from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
Conclusion: Since the p-value is < 0.001, the correlation between width and price is statistically
significant, and the linear relationship is quite strong (~0.751).
Conclusion:
Since the p-value is < 0.001, the correlation between curb-weight and price is statistically significant,
and the linear relationship is quite strong (~0.834).
Engine-Size vs. Price
Let’s calculate the Pearson Correlation Coefficient and P-value of ‘engine-size’ and ‘price’:
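A sketch of that cell, mirroring the horsepower and highway-mpg cells (hypothetical values stand in for df['engine-size'] and df['price']):

```python
from scipy import stats

# Hypothetical engine sizes and prices; the lab passes the full dataset columns
engine_size = [130, 152, 109, 136, 140]
price = [13495.0, 16500.0, 13950.0, 17450.0, 15250.0]

# pearsonr returns the correlation coefficient and the two-sided p-value
pearson_coef, p_value = stats.pearsonr(engine_size, price)
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
```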
[52]: pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
Conclusion: Since the p-value is < 0.001, the correlation between highway-mpg and price is
statistically significant, and the coefficient of about -0.705 shows that the relationship is negative
and moderately strong.
Conclusion: Important Variables
We now have a better idea of what our data looks like and which variables are important to take
into account when predicting the car price. We have narrowed it down to the following variables:
Continuous numerical variables:
• Length
• Width
• Curb-weight
• Engine-size
• Horsepower
• City-mpg
• Highway-mpg
• Wheel-base
• Bore
Categorical variables:
• Drive-wheels
As we now move into building machine learning models to automate our analysis, feeding the model
with variables that meaningfully affect our target variable will improve our model’s prediction
performance.