Exploratory Data Analysis
1.1 Objectives
• Explore features or characteristics to predict price of car
• Analyze patterns and run descriptive statistical analysis
• Group data based on identified parameters and create pivot tables
• Identify the effect of independent attributes on price of cars
Table of Contents
Import Data from Module
Analyzing Individual Feature Patterns using Visualization
Descriptive Statistical Analysis
Basics of Grouping
Correlation and Causation
What are the main characteristics that have the most impact on the car price?
The code below loads the dataset and stores it in the dataframe df. This dataset was hosted on IBM Cloud Object Storage.
[4]: import pandas as pd
[5]: file_name = "usedcars.csv"
[6]: df = pd.read_csv(file_name)
[7]: #filepath='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'
[8]: df.head()
0 5000.0 21 8.703704 13495.0 11.190476 Low
1 5000.0 21 8.703704 16500.0 11.190476 Low
2 5000.0 19 9.038462 16500.0 12.368421 Medium
3 5500.0 24 7.833333 13950.0 9.791667 Low
4 5500.0 18 10.681818 17450.0 13.055556 Low
[5 rows x 31 columns]
Unnamed: 0 int64
symboling int64
normalized-losses int64
make object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower int64
peak-rpm float64
city-mpg int64
highway-mpg float64
price float64
city-L/100km float64
horsepower-binned object
fuel-type-diesel bool
fuel-type-gas bool
aspiration-std bool
aspiration-turbo bool
dtype: object
[11]: df['peak-rpm'].dtypes
[11]: dtype('float64')
For example, we can calculate the correlation between variables of type “int64” or “float64” using
the method “corr”:
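A minimal sketch of that call, using a small hypothetical frame in place of the full dataset (`numeric_only=True` is needed in pandas ≥ 2.0 when object columns are present):

```python
import pandas as pd

# Small hypothetical frame standing in for the car dataset (illustrative values only)
sample = pd.DataFrame({
    "engine-size": [130, 152, 109, 136],
    "highway-mpg": [27, 26, 30, 22],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# corr() computes pairwise Pearson correlations between the numeric columns;
# the diagonal is always 1 because each variable correlates perfectly with itself
corr_matrix = sample.corr(numeric_only=True)
print(corr_matrix)
```

The result is a square, symmetric matrix like the one shown below, with one row and one column per numeric variable.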
length width height curb-weight engine-size \
Unnamed: 0 0.161848 0.043976 0.252015 0.064820 -0.047764
symboling -0.365404 -0.242423 -0.550160 -0.233118 -0.110581
normalized-losses 0.019424 0.086802 -0.373737 0.099404 0.112360
wheel-base 0.876024 0.814507 0.590742 0.782097 0.572027
length 1.000000 0.857170 0.492063 0.880665 0.685025
width 0.857170 1.000000 0.306002 0.866201 0.729436
height 0.492063 0.306002 1.000000 0.307581 0.074694
curb-weight 0.880665 0.866201 0.307581 1.000000 0.849072
engine-size 0.685025 0.729436 0.074694 0.849072 1.000000
bore 0.608971 0.544885 0.180449 0.644060 0.572609
stroke 0.123952 0.188822 -0.060663 0.167438 0.205928
compression-ratio 0.159733 0.189867 0.259737 0.156433 0.028889
horsepower 0.579795 0.615056 -0.087001 0.757981 0.822668
peak-rpm -0.285970 -0.245800 -0.309974 -0.279361 -0.256733
city-mpg -0.665192 -0.633531 -0.049800 -0.749543 -0.650546
highway-mpg 0.707108 0.736728 0.084301 0.836921 0.783465
price 0.690628 0.751265 0.135486 0.834415 0.872335
city-L/100km 0.657373 0.673363 0.003811 0.785353 0.745059
peak-rpm city-mpg highway-mpg price city-L/100km
curb-weight -0.279361 -0.749543 0.836921 0.834415 0.785353
engine-size -0.256733 -0.650546 0.783465 0.872335 0.745059
bore -0.267392 -0.582027 0.559112 0.543155 0.554610
stroke -0.063561 -0.033956 0.047089 0.082269 0.036133
compression-ratio -0.435780 0.331425 -0.223361 0.071107 -0.299372
horsepower 0.107884 -0.822192 0.840627 0.809607 0.889482
peak-rpm 1.000000 -0.115413 0.017694 -0.101616 0.115830
city-mpg -0.115413 1.000000 -0.909024 -0.686571 -0.949713
highway-mpg 0.017694 -0.909024 1.000000 0.801118 0.958306
price -0.101616 -0.686571 0.801118 1.000000 0.789898
city-L/100km 0.115830 -0.949713 0.958306 0.789898 1.000000
The diagonal elements are always one; we will study correlation more precisely, using the Pearson correlation, in depth at the end of the notebook.
As the engine-size goes up, the price goes up: this indicates a positive direct correlation between
these two variables. Engine size seems like a pretty good predictor of price since the regression line
is almost a perfect diagonal line.
We can examine the correlation between ‘engine-size’ and ‘price’ and see that it’s approximately
0.87.
Highway mpg is a potential predictor variable of price. Let’s find the scatterplot of “highway-mpg”
and “price”.
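The direction of such a relationship can also be checked numerically by fitting the least-squares line that a regression plot draws over the scatter. This sketch uses hypothetical values in place of the real columns:

```python
import numpy as np

# Hypothetical highway-mpg and price values; the lab reads these from usedcars.csv
highway_mpg = np.array([27.0, 26.0, 30.0, 22.0, 25.0])
price = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0])

# Fit price = slope * mpg + intercept; a negative slope means price
# falls as highway-mpg rises, i.e. an inverse relationship
slope, intercept = np.polyfit(highway_mpg, price, 1)
print(slope)
```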
As highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship
between these two variables. Highway mpg could potentially be a predictor of price.
We can examine the correlation between ‘highway-mpg’ and ‘price’ and see it’s approximately
-0.704.
Peak rpm does not seem like a good predictor of the price at all since the regression line is close
to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of
variability. Therefore, it’s not a reliable variable.
We can examine the correlation between ‘peak-rpm’ and ‘price’ and see it’s approximately -0.101616.
[19]: df[['peak-rpm','price']].corr()
[21]: <Axes: xlabel='stroke', ylabel='price'>
Categorical Variables
These are variables that describe a ‘characteristic’ of a data unit, and are selected from a small
group of categories. The categorical variables can have the type “object” or “int64”. A good way
to visualize categorical variables is by using boxplots.
Let’s look at the relationship between “body-style” and “price”.
We see that the distributions of price between the different body-style categories overlap significantly, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":
Here we see that the distributions of price between the two engine-location categories, front and rear, are distinct enough to take engine-location as a potentially good predictor of price.
Let’s examine “drive-wheels” and “price”.
[24]: # drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)
Here we see that the distribution of price between the different drive-wheels categories differs. As
such, drive-wheels could potentially be a predictor of price.
[25]: df.describe()
The default setting of “describe” skips variables of type object. We can apply the method “describe”
on the variables of type ‘object’ as follows:
[26]: df.describe(include=['object'])
[26]: make num-of-doors body-style drive-wheels engine-location \
count 201 201 201 201 201
unique 22 2 5 3 2
top toyota four sedan fwd front
freq 32 115 94 118 198
Value Counts
Value counts is a good way of understanding how many units of each characteristic/variable we
have. We can apply the “value_counts” method on the column “drive-wheels”. Don’t forget the
method “value_counts” only works on pandas series, not pandas dataframes. As a result, we only
include one bracket df[‘drive-wheels’], not two brackets df[[‘drive-wheels’]].
[27]: df['drive-wheels'].value_counts()
[27]: drive-wheels
fwd 118
rwd 75
4wd 8
Name: count, dtype: int64
[28]: df['drive-wheels'].value_counts().to_frame()
[28]: count
drive-wheels
fwd 118
rwd 75
4wd 8
Let’s repeat the above steps but save the results to the dataframe “drive_wheels_counts” and
rename the column ‘drive-wheels’ to ‘value_counts’.
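Those steps can be sketched as follows, with a small hypothetical series standing in for df['drive-wheels']. Note that pandas ≥ 2.0 names the value_counts column 'count' by default, so the rename below assigns the column name directly rather than relying on the old default:

```python
import pandas as pd

# Hypothetical drive-wheels values standing in for the full column
s = pd.Series(["fwd", "fwd", "rwd", "4wd", "fwd", "rwd"])

drive_wheels_counts = s.value_counts().to_frame()
drive_wheels_counts.columns = ["value_counts"]   # rename the single count column
drive_wheels_counts.index.name = "drive-wheels"
print(drive_wheels_counts)
```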
drive_wheels_counts
[29]: count
drive-wheels
fwd 118
rwd 75
4wd 8
[30]: count
drive-wheels
fwd 118
rwd 75
4wd 8
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
[31]: count
engine-location
front 198
rear 3
After examining the value counts of the engine location, we see that engine location would not be
a good predictor variable for the price. This is because we only have three cars with a rear engine
and 198 with an engine in the front, so this result is skewed. Thus, we are not able to draw any
conclusions about the engine location.
[32]: df['drive-wheels'].unique()
If we want to know, on average, which type of drive wheel is most valuable, we can group “drive-
wheels” and then average them.
We can select the columns ‘drive-wheels’, ‘body-style’ and ‘price’, then assign it to the variable
“df_group_one”.
We can then calculate the average price for each of the different categories of data.
From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel
and front-wheel are approximately the same in price.
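That grouping can be sketched as follows, with a hypothetical mini-frame in place of the real df[['drive-wheels', 'body-style', 'price']] selection:

```python
import pandas as pd

# Hypothetical subset; the real values come from the dataset
df_group_one = pd.DataFrame({
    "drive-wheels": ["rwd", "fwd", "fwd", "4wd", "rwd"],
    "price": [23949.6, 9811.8, 8396.4, 9095.8, 16994.2],
})

# Group by drive-wheels and average the price within each category
avg_price = df_group_one.groupby("drive-wheels", as_index=False)["price"].mean()
print(avg_price)
```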
You can also group by multiple variables. For example, let’s group by both ‘drive-wheels’ and ‘body-
style’. This groups the dataframe by the unique combination of ‘drive-wheels’ and ‘body-style’. We
can store the results in the variable ‘grouped_test1’.
grouped_test1
This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is
like an Excel spreadsheet, with one variable along the column and another along the row. We can
convert the dataframe to a pivot table using the method “pivot” to create a pivot table from the
groups.
In this case, we will leave the drive-wheels variable as the rows of the table, and pivot body-style
to become the columns of the table:
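A sketch of the pivot step, assuming grouped_test1 holds the grouped averages (hypothetical values shown):

```python
import pandas as pd

# Hypothetical grouped averages for a few drive-wheels / body-style combinations
grouped_test1 = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body-style": ["sedan", "wagon", "sedan", "wagon"],
    "price": [9811.8, 9997.3, 21711.8, 16994.2],
})

# drive-wheels stays as the rows; the body-style categories become columns
grouped_pivot = grouped_test1.pivot(index="drive-wheels", columns="body-style")
print(grouped_pivot)
```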
[36]: price \
body-style convertible hardtop hatchback sedan
drive-wheels
4wd NaN NaN 7603.000000 12647.333333
fwd 11595.0 8249.000000 8396.387755 9811.800000
rwd 23949.6 24202.714286 14337.777778 21711.833333
body-style wagon
drive-wheels
4wd 9095.750000
fwd 9997.333333
rwd 16994.222222
Often, we won’t have data for some of the pivot cells. We can fill these missing cells with the value
0, but any other value could potentially be used as well. It should be mentioned that missing data
is quite a complex subject and is an entire course on its own.
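Filling those missing cells can be sketched with a tiny hypothetical pivot fragment containing one empty cell:

```python
import numpy as np
import pandas as pd

# Hypothetical pivot fragment: no 4wd convertibles occur, so that cell is NaN
pivot = pd.DataFrame(
    {"convertible": [np.nan, 11595.0], "hatchback": [7603.0, 8396.4]},
    index=["4wd", "fwd"],
)

grouped_pivot = pivot.fillna(0)   # replace missing cells with 0
print(grouped_pivot)
```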
[37]: price \
body-style convertible hardtop hatchback sedan
drive-wheels
4wd 0.0 0.000000 7603.000000 12647.333333
fwd 11595.0 8249.000000 8396.387755 9811.800000
rwd 23949.6 24202.714286 14337.777778 21711.833333
body-style wagon
drive-wheels
4wd 9095.750000
fwd 9997.333333
rwd 16994.222222
[39]: import matplotlib.pyplot as plt
%matplotlib inline
The heatmap plots the target variable (price) proportional to colour, with the variables ‘drive-wheels’ and ‘body-style’ on the vertical and horizontal axes, respectively. This allows us to visualize how the price is related to ‘drive-wheels’ and ‘body-style’.
The default labels convey no useful information to us. Let’s change that:
import numpy as np

fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

fig.colorbar(im)
plt.show()
Visualization is very important in data science, and Python visualization packages provide great
freedom. We will go more in-depth in a separate Python visualizations course.
The main question we want to answer in this module is, “What are the main characteristics which
have the most impact on the car price?”.
To get a better measure of the important characteristics, we look at the correlation of these variables
with the car price. In other words: how is the car price dependent on this variable?
length width height curb-weight engine-size \
Unnamed: 0 0.161848 0.043976 0.252015 0.064820 -0.047764
symboling -0.365404 -0.242423 -0.550160 -0.233118 -0.110581
normalized-losses 0.019424 0.086802 -0.373737 0.099404 0.112360
wheel-base 0.876024 0.814507 0.590742 0.782097 0.572027
length 1.000000 0.857170 0.492063 0.880665 0.685025
width 0.857170 1.000000 0.306002 0.866201 0.729436
height 0.492063 0.306002 1.000000 0.307581 0.074694
curb-weight 0.880665 0.866201 0.307581 1.000000 0.849072
engine-size 0.685025 0.729436 0.074694 0.849072 1.000000
bore 0.608971 0.544885 0.180449 0.644060 0.572609
stroke 0.123952 0.188822 -0.060663 0.167438 0.205928
compression-ratio 0.159733 0.189867 0.259737 0.156433 0.028889
horsepower 0.579795 0.615056 -0.087001 0.757981 0.822668
peak-rpm -0.285970 -0.245800 -0.309974 -0.279361 -0.256733
city-mpg -0.665192 -0.633531 -0.049800 -0.749543 -0.650546
highway-mpg 0.707108 0.736728 0.084301 0.836921 0.783465
price 0.690628 0.751265 0.135486 0.834415 0.872335
city-L/100km 0.657373 0.673363 0.003811 0.785353 0.745059
peak-rpm city-mpg highway-mpg price city-L/100km
width -0.245800 -0.633531 0.736728 0.751265 0.673363
height -0.309974 -0.049800 0.084301 0.135486 0.003811
curb-weight -0.279361 -0.749543 0.836921 0.834415 0.785353
engine-size -0.256733 -0.650546 0.783465 0.872335 0.745059
bore -0.267392 -0.582027 0.559112 0.543155 0.554610
stroke -0.063561 -0.033956 0.047089 0.082269 0.036133
compression-ratio -0.435780 0.331425 -0.223361 0.071107 -0.299372
horsepower 0.107884 -0.822192 0.840627 0.809607 0.889482
peak-rpm 1.000000 -0.115413 0.017694 -0.101616 0.115830
city-mpg -0.115413 1.000000 -0.909024 -0.686571 -0.949713
highway-mpg 0.017694 -0.909024 1.000000 0.801118 0.958306
price -0.101616 -0.686571 0.801118 1.000000 0.789898
city-L/100km 0.115830 -0.949713 0.958306 0.789898 1.000000
[45]: from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
Conclusion: Since the p-value is < 0.001, the correlation between width and price is statistically
significant, and the linear relationship is quite strong (~0.751).
Conclusion:
Since the p-value is < 0.001, the correlation between curb-weight and price is statistically significant,
and the linear relationship is quite strong (~0.834).
Engine-Size vs. Price
Let’s calculate the Pearson Correlation Coefficient and P-value of ‘engine-size’ and ‘price’:
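A sketch of that cell, mirroring the horsepower and highway-mpg cells (hypothetical values stand in for df['engine-size'] and df['price']):

```python
from scipy import stats

# Hypothetical engine sizes and prices; the lab passes the full dataset columns
engine_size = [130, 152, 109, 136, 140]
price = [13495.0, 16500.0, 13950.0, 17450.0, 15250.0]

# pearsonr returns the correlation coefficient and the two-sided p-value
pearson_coef, p_value = stats.pearsonr(engine_size, price)
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
```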
[52]: pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
Conclusion: Since the p-value is < 0.001, the correlation between highway-mpg and price is
statistically significant, and the coefficient of about -0.705 shows that the relationship is negative
and moderately strong.
Conclusion: Important Variables
We now have a better idea of what our data looks like and which variables are important to take
into account when predicting the car price. We have narrowed it down to the following variables:
Continuous numerical variables:
• Length
• Width
• Curb-weight
• Engine-size
• Horsepower
• City-mpg
• Highway-mpg
• Wheel-base
• Bore
Categorical variables:
• Drive-wheels
As we now move into building machine learning models to automate our analysis, feeding the model
with variables that meaningfully affect our target variable will improve our model’s prediction
performance.