EDA 1
Estimated time needed: 15 minutes
Objectives
After completing this lab you will be able to:
Identify the main characteristics that have the most impact on the car price.
What are the main characteristics that have the most impact on the car price?
Import libraries:
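The import cell did not survive in this copy; a minimal sketch of what the cells below need:

In [ ]: import pandas as pd
        import numpy as np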
This dataset was hosted on IBM Cloud Object Storage.
In [3]: path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDevelop
df = pd.read_csv(path)
df.head()
Out[3]: (first columns shown)
   symboling  normalized-losses         make  aspiration  num-of-doors   body-style  drive-wheels  engine-location  wheel-base
0          3                122  alfa-romero         std           two  convertible           rwd            front        88.6
1          3                122  alfa-romero         std           two  convertible           rwd            front        88.6
2          1                122  alfa-romero         std           two    hatchback           rwd            front        94.5
...
5 rows × 29 columns
2. Analyzing Individual Feature Patterns Using Visualization
To install Seaborn we use pip, the Python package manager.
Import visualization packages "Matplotlib" and "Seaborn". Don't forget about "%matplotlib inline" to plot in a Jupyter notebook.
symboling int64
normalized-losses int64
make object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower float64
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
city-L/100km float64
horsepower-binned object
diesel int64
gas int64
dtype: object
Question #1:
What is the data type of the column "peak-rpm"?
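One possible solution (Out[56] below is its result; the exact cell is not shown in this copy):

In [56]: df['peak-rpm'].dtype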
Out[56]: dtype('float64')
For example, we can calculate the correlation between variables of type "int64" or "float64"
using the method "corr":
In [7]: df.corr()
Out[7]: (subset of rows and columns shown)
                   symboling  normalized-losses  wheel-base    length     width    height  curb-weight
normalized-losses   0.466264           1.000000   -0.056661  0.019424  0.086802 -0.373737     0.099404
compression-ratio  -0.182196          -0.114713    0.250313  0.159733  0.189867  0.259737     0.156433
highway-mpg         0.036233          -0.181877   -0.543304 -0.698142 -0.680635 -0.104812    -0.794889
The diagonal elements are always one; we will study correlation more precisely, using the Pearson correlation, at the end of the notebook.
Question #2:
Find the correlation between the following columns: bore, stroke, compression-ratio, and
horsepower.
Hint: if you would like to select those columns, use the following syntax:
df[['bore','stroke','compression-ratio','horsepower']]
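One possible solution, following the hint:

In [ ]: df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()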
In order to start understanding the (linear) relationship between an individual variable and
the price, we can use "regplot" which plots the scatterplot plus the fitted regression line for
the data. This will be useful later on for visualizing the fit of the simple linear regression
model as well.
We can examine the correlation between 'engine-size' and 'price' and see that it's
approximately 0.87.
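Sketches of the corresponding cells (the originals are not shown in this copy):

In [ ]: sns.regplot(x="engine-size", y="price", data=df)
        plt.ylim(0,)

In [ ]: df[['engine-size', 'price']].corr()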
Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-
mpg" and "price".
We can examine the correlation between 'highway-mpg' and 'price' and see it's
approximately -0.704.
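Again, sketches of the corresponding cells:

In [ ]: sns.regplot(x="highway-mpg", y="price", data=df)

In [ ]: df[['highway-mpg', 'price']].corr()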
We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately
-0.101616.
In [14]: df[['peak-rpm','price']].corr()
Question 3 a):
Find the correlation between x="stroke" and y="price".
Hint: if you would like to select those columns, use the following syntax:
df[["stroke","price"]].
In [58]: # Write your code below and press Shift+Enter to execute
df[["stroke","price"]].corr()
Question 3 b):
Given the correlation results between "price" and "stroke", do you expect a linear
relationship?
Categorical Variables
These are variables that describe a 'characteristic' of a data unit, and are selected from a
small group of categories. The categorical variables can have the type "object" or "int64". A
good way to visualize categorical variables is by using boxplots.
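For example, we can look at the relationship between "body-style" and "price" (a sketch of the cell):

In [ ]: sns.boxplot(x="body-style", y="price", data=df)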
We see that the distributions of price between the different body-style categories have a significant overlap, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":
In [19]: # drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)
The describe function automatically computes basic statistics for all continuous variables.
Any NaN values are automatically skipped in these statistics.
In [20]: df.describe()
Out[20]: (summary statistics of the numeric columns: symboling, normalized-losses, wheel-base, length, width, height, curb-weight, ...; output truncated)
The default setting of "describe" skips variables of type object. We can apply the method
"describe" on the variables of type 'object' as follows:
In [21]: df.describe(include=['object'])
Out[21]: (freq row and trailing columns truncated)
          make  aspiration  num-of-doors  body-style  drive-wheels  engine-location  engine-type  num-of-cylinders  fuel-system
count      201         201           201         201           201              201          201               201          201
unique      22           2             2           5             3                2            6                 7            8
top     toyota         std          four       sedan           fwd            front          ohc              four         mpfi
Value Counts
Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "drive-wheels". Don't forget that the method "value_counts" only works on a pandas Series, not a pandas dataframe. As a result, we only include one bracket, df['drive-wheels'], not two brackets, df[['drive-wheels']].
In [22]: df['drive-wheels'].value_counts()
In [23]: df['drive-wheels'].value_counts().to_frame()
Out[23]: drive-wheels
fwd 118
rwd 75
4wd 8
Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and
rename the column 'drive-wheels' to 'value_counts'.
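A sketch of the corresponding cell (in newer pandas the counts column may be named 'count' rather than 'drive-wheels'):

In [24]: drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
         drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
         drive_wheels_counts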
Out[24]: value_counts
fwd 118
rwd 75
4wd 8
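Now let's rename the index to 'drive-wheels' (a sketch matching Out[25] below):

In [25]: drive_wheels_counts.index.name = 'drive-wheels'
         drive_wheels_counts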
Out[25]: value_counts
drive-wheels
fwd 118
rwd 75
4wd 8
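We can repeat the same steps for the variable 'engine-location' (a sketch; the variable name engine_loc_counts is an assumption):

In [ ]: engine_loc_counts = df['engine-location'].value_counts().to_frame()
        engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
        engine_loc_counts.index.name = 'engine-location'
        engine_loc_counts

which yields: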
                 value_counts
engine-location
front                     198
rear                        3
After examining the value counts of the engine location, we see that engine location would
not be a good predictor variable for the price. This is because we only have three cars with a
rear engine and 198 with an engine in the front, so this result is skewed. Thus, we are not
able to draw any conclusions about the engine location.
4. Basics of Grouping
The "groupby" method groups data by different categories. The data is grouped based on
one or several variables, and analysis is performed on the individual groups.
For example, let's group by the variable "drive-wheels". We see that there are 3 different
categories of drive wheels.
In [27]: df['drive-wheels'].unique()
If we want to know, on average, which type of drive wheel is most valuable, we can group
"drive-wheels" and then average them.
We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the
variable "df_group_one".
We can then calculate the average price for each of the different categories of data.
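A sketch of the corresponding cell (note: older pandas versions silently drop the non-numeric 'body-style' column when averaging; newer versions need numeric_only=True):

In [ ]: df_group_one = df[['drive-wheels', 'body-style', 'price']]
        df_group_one = df_group_one.groupby(['drive-wheels'], as_index=False).mean(numeric_only=True)
        df_group_one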
  drive-wheels         price
0          4wd  10241.000000
1          fwd   9244.779661
2          rwd  19757.613333
From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while
4-wheel and front-wheel are approximately the same in price.
You can also group by multiple variables. For example, let's group by both 'drive-wheels' and
'body-style'. This groups the dataframe by the unique combination of 'drive-wheels' and
'body-style'. We can store the results in the variable 'grouped_test1'.
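A sketch of the corresponding cell (the intermediate name df_gptest is an assumption):

In [ ]: df_gptest = df[['drive-wheels', 'body-style', 'price']]
        grouped_test1 = df_gptest.groupby(['drive-wheels', 'body-style'], as_index=False).mean(numeric_only=True)
        grouped_test1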
This grouped data is much easier to visualize when it is made into a pivot table. A pivot table
is like an Excel spreadsheet, with one variable along the column and another along the row.
We can convert the dataframe to a pivot table using the method "pivot" to create a pivot
table from the groups.
In this case, we will leave the drive-wheels variable as the rows of the table, and pivot body-
style to become the columns of the table:
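A sketch of the corresponding cell:

In [ ]: grouped_pivot = grouped_test1.pivot(index='drive-wheels', columns='body-style')
        grouped_pivot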
(Output: pivot table of average price with drive-wheels as the rows and body-style as the columns; values truncated.)
Often, we won't have data for some of the pivot cells. We can fill these missing cells with the
value 0, but any other value could potentially be used as well. It should be mentioned that
missing data is quite a complex subject and is an entire course on its own.
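A sketch:

In [32]: grouped_pivot = grouped_pivot.fillna(0)  # fill missing values with 0
         grouped_pivot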
Out[32]: (the pivot table of average price by drive-wheels, with missing cells filled with 0; output truncated)
Question 4:
Use the "groupby" function to find the average "price" of each car based on "body-style".
    body-style         price
0  convertible  21890.500000
1      hardtop  22208.500000
2    hatchback   9957.441176
3        sedan  14459.755319
4        wagon  12371.960000
Let's use a heat map to visualize the relationship between body style and price.
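A sketch of the heat-map cell, using the pivot table built above:

In [ ]: plt.pcolor(grouped_pivot, cmap='RdBu')
        plt.colorbar()
        plt.show()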
The default labels convey no useful information to us. Let's change that:
In [ ]: fig, ax = plt.subplots()
        im = ax.pcolor(grouped_pivot, cmap='RdBu')
        #label names
        row_labels = grouped_pivot.columns.levels[1]
        col_labels = grouped_pivot.index
        #move ticks to the centre of each cell
        ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
        ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
        #insert labels
        ax.set_xticklabels(row_labels, minor=False)
        ax.set_yticklabels(col_labels, minor=False)
        fig.colorbar(im)
        plt.show()
Visualization is very important in data science, and Python visualization packages provide
great freedom. We will go more in-depth in a separate Python visualizations course.
The main question we want to answer in this module is, "What are the main characteristics
which have the most impact on the car price?".
To get a better measure of the important characteristics, we look at the correlation of these
variables with the car price. In other words: how is the car price dependent on this variable?
Correlation: a measure of the extent of interdependence between variables.
Causation: the relationship of cause and effect between two variables.
It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.
Pearson Correlation
The Pearson Correlation measures the linear dependence between two variables X and Y. The resulting coefficient is a value between -1 and 1 inclusive, where:
1: perfect positive linear correlation.
0: no linear correlation.
-1: perfect negative linear correlation.
Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.
In [37]: df.corr()
Out[37]: (subset of rows and columns shown)
                   symboling  normalized-losses  wheel-base    length     width    height  curb-weight
normalized-losses   0.466264           1.000000   -0.056661  0.019424  0.086802 -0.373737     0.099404
compression-ratio  -0.182196          -0.114713    0.250313  0.159733  0.189867  0.259737     0.156433
highway-mpg         0.036233          -0.181877   -0.543304 -0.698142 -0.680635 -0.104812    -0.794889
What is this P-value? The P-value is the probability that we would observe a correlation at least this strong purely by chance if the two variables were actually unrelated. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.
p-value < 0.001: strong evidence that the correlation is significant.
p-value < 0.05: moderate evidence that the correlation is significant.
p-value < 0.1: weak evidence that the correlation is significant.
p-value > 0.1: no evidence that the correlation is significant.
We can obtain this information using the "stats" module in the "scipy" library.
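Wheel-Base vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'. A sketch of the cells (the printed values are summarized in the conclusion below):

In [ ]: from scipy import stats

In [ ]: pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
        print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)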
Conclusion:
Since the p-value is < 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585).
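Horsepower vs. Price
The same pattern applies to horsepower, and to the remaining variables below (length, width, curb-weight, engine-size, bore, city-mpg); only the column name changes:

In [ ]: pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
        print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)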
Conclusion:
Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).
Length vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.
Conclusion:
Since the p-value is < 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).
Width vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).
Curb-Weight vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).
Engine-Size vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).
Bore vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).
City-mpg vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.
6. ANOVA: Analysis of Variance
The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:
F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.
P-value: The P-value tells us how statistically significant our calculated score value is.
If our price variable is strongly correlated with the variable we are analyzing, we expect
ANOVA to return a sizeable F-test score and a small p-value.
Drive Wheels
Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.
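A sketch of the corresponding cell (df_gptest as defined earlier; head(2) shows the first couple of rows per group):

In [ ]: grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])
        grouped_test2.head(2)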
  drive-wheels    price
0          rwd  13495.0
1          rwd  16500.0
3          fwd  13950.0
4          4wd  17450.0
5          fwd  15250.0
We can obtain the values of each group using the method "get_group".
In [50]: grouped_test2.get_group('4wd')['price']
Out[50]: 4 17450.0
136 7603.0
140 9233.0
141 11259.0
144 8013.0
145 11694.0
150 7898.0
151 8778.0
Name: price, dtype: float64
We can use the function 'f_oneway' in the module 'stats' to obtain the F-test score and P-
value.
In [51]: # ANOVA: compare the price means of all three drive-wheels groups
         f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])
         print("ANOVA results: F=", f_val, ", P =", p_val)
This is a great result: the large F-test score shows a strong correlation, and a P-value of almost 0 implies near-certain statistical significance. But does this mean all three tested groups are this highly correlated?
Let's examine them separately.
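Sketches of the pairwise tests:

In [ ]: # fwd and rwd
        f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])
        print("ANOVA results: F=", f_val, ", P =", p_val)

In [ ]: # 4wd and fwd
        f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])
        print("ANOVA results: F=", f_val, ", P =", p_val)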
We notice that the ANOVA for the categories 4wd and fwd yields a high p-value > 0.1, so the calculated F-test score is not statistically significant. This suggests we can't reject the assumption that the means of these two groups are the same; in other words, we can't conclude that the difference in average price between 4wd and fwd is significant.
Conclusion: Important Variables
We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. Continuous numerical variables:
Length
Width
Curb-weight
Engine-size
Horsepower
City-mpg
Highway-mpg
Wheel-base
Bore
Categorical variables:
Drive-wheels
Author
Dev Agnihotri