Lab: Exploratory Data Analysis
Objectives
After completing this lab you will be able to:
Import libraries:
import pandas as pd
import numpy as np

# load the dataset into a pandas dataframe
filename = "automobileEDA.csv"
df = pd.read_csv(filename)
df.head()
(output: the first 5 rows of the dataframe; 29 columns)
Import visualization packages "Matplotlib" and "Seaborn". Don't forget about "%matplotlib inline" to plot in a Jupyter notebook.
symboling int64
normalized-losses int64
make object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower float64
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
city-L/100km float64
horsepower-binned object
diesel int64
gas int64
dtype: object
For example, we can calculate the correlation between variables of type "int64" or "float64"
using the method "corr":
df.corr(numeric_only=True)
(output truncated to its last column, "gas")
                          gas
symboling            0.196735
normalized-losses    0.101546
wheel-base          -0.307237
length              -0.211187
width               -0.244356
height              -0.281578
curb-weight         -0.221046
engine-size         -0.070779
bore                -0.054458
stroke              -0.241303
compression-ratio   -0.985231
horsepower           0.169053
peak-rpm             0.475812
city-mpg            -0.265676
highway-mpg         -0.198690
price               -0.110326
city-L/100km         0.241282
diesel              -1.000000
gas                  1.000000
The diagonal elements are always one; we will study Pearson correlation in more depth at the end of the notebook.
We can examine the correlation between 'engine-size' and 'price' and see that it's approximately
0.87.
df[["engine-size", "price"]].corr()
engine-size price
engine-size 1.000000 0.872335
price 0.872335 1.000000
Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-mpg" and "price".
df[['highway-mpg', 'price']].corr()
highway-mpg price
highway-mpg 1.000000 -0.704692
price -0.704692 1.000000
df[['peak-rpm','price']].corr()
peak-rpm price
peak-rpm 1.000000 -0.101616
price -0.101616 1.000000
Question 3 a):
# drive-wheels: price distribution for each drive-wheels category
sns.boxplot(x="drive-wheels", y="price", data=df)
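The other categorical variables can be examined the same way; a minimal sketch for "body-style" and "engine-location", assuming seaborn is imported as sns and matplotlib.pyplot as plt:

# price distribution for each body style
sns.boxplot(x="body-style", y="price", data=df)
plt.show()

# price distribution for front vs rear engine location
sns.boxplot(x="engine-location", y="price", data=df)
plt.show()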
df.describe()
The default setting of "describe" skips variables of type object. We can apply the method
"describe" on the variables of type 'object' as follows:
df.describe(include=['object'])
(output truncated to its last column, "horsepower-binned")
        horsepower-binned
count                 200
unique                  3
top                   Low
freq                  115
df['drive-wheels'].value_counts()
fwd 118
rwd 75
4wd 8
Name: drive-wheels, dtype: int64
df['drive-wheels'].value_counts().to_frame()
drive-wheels
fwd 118
rwd 75
4wd 8
Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and
rename the column 'drive-wheels' to 'value_counts'.
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts
value_counts
fwd 118
rwd 75
4wd 8
Now let's rename the index to 'drive-wheels':

drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts

              value_counts
drive-wheels
fwd                    118
rwd                     75
4wd                      8
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
value_counts
engine-location
front 198
rear 3
df['drive-wheels'].unique()
df_group_one = df[['drive-wheels','body-style','price']]
We can then calculate the average price for each of the different categories of data.
# group by drive-wheels and average the numeric columns
df_group_one = df_group_one.groupby(['drive-wheels'], as_index=False).mean(numeric_only=True)
df_group_one
drive-wheels price
0 4wd 10241.000000
1 fwd 9244.779661
2 rwd 19757.613333
# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
grouped_test1
grouped_pivot = grouped_test1.pivot(index='drive-wheels', columns='body-style')
grouped_pivot
                    price
body-style    convertible       hardtop     hatchback         sedan         wagon
drive-wheels
4wd                   NaN           NaN   7603.000000  12647.333333   9095.750000
fwd               11595.0   8249.000000   8396.387755   9811.800000   9997.333333
rwd               23949.6  24202.714286  14337.777778  21711.833333  16994.222222
Fill the missing cells of the pivot table (drive-wheels/body-style combinations with no cars) with the value 0:

grouped_pivot = grouped_pivot.fillna(0)
grouped_pivot

                    price
body-style    convertible       hardtop     hatchback         sedan         wagon
drive-wheels
4wd                   0.0      0.000000   7603.000000  12647.333333   9095.750000
fwd               11595.0   8249.000000   8396.387755   9811.800000   9997.333333
rwd               23949.6  24202.714286  14337.777778  21711.833333  16994.222222
Let's use a heat map to visualize the relationship between body style and price.
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

# label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

# move the ticks to the centre of each cell so the labels line up
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

# insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

fig.colorbar(im)
plt.show()
Pearson correlation is the default method of "corr". As before, we can calculate it for the numeric ('int64' and 'float64') variables:

df.corr(numeric_only=True)
(output: the same correlation matrix as shown earlier)
P-value
The P-value is the probability of observing a correlation this strong purely by chance; by convention, a P-value below 0.05 indicates that the correlation is statistically significant. We can obtain both the correlation coefficient and the P-value using the "stats" module in the "scipy" library.
Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.
from scipy import stats

pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)
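The same test can be repeated for the other numeric predictors of price; a minimal sketch (the column list below is illustrative and assumes those columns contain no missing values, as in the cleaned dataset used here):

# Pearson correlation and P-value of several numeric columns against price
for col in ['horsepower', 'length', 'width', 'curb-weight', 'engine-size', 'highway-mpg']:
    pearson_coef, p_value = stats.pearsonr(df[col], df['price'])
    print(col, ": coefficient =", round(pearson_coef, 3), ", P-value =", p_value)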
grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])
grouped_test2.head(2)
df_gptest
We can obtain the values for one of the groups using the method "get_group".
grouped_test2.get_group('4wd')['price']
We can use the function 'f_oneway' in the module 'stats' to obtain the F-test score and P-value.
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'],
grouped_test2.get_group('rwd')['price'],
grouped_test2.get_group('4wd')['price'])
print( "ANOVA results: F=", f_val, ", P =", p_val)
This is a great result: the large F-test score shows a strong correlation, and the P-value of almost 0 implies almost certain statistical significance. But does this mean all three tested groups are this highly correlated?
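One way to check is to run the same F-test on the drive-wheels groups two at a time; a minimal sketch, reusing "grouped_test2" from above:

# fwd and rwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'],
                              grouped_test2.get_group('rwd')['price'])
print("fwd vs rwd: F =", f_val, ", P =", p_val)

# 4wd and rwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'],
                              grouped_test2.get_group('rwd')['price'])
print("4wd vs rwd: F =", f_val, ", P =", p_val)

# 4wd and fwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'],
                              grouped_test2.get_group('fwd')['price'])
print("4wd vs fwd: F =", f_val, ", P =", p_val)

A much smaller F-score for a pair would indicate that the price difference between those two groups is less pronounced.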
Categorical variables: