EDA 1
Estimated time needed: 15 minutes
Objectives
After completing this lab you will be able to:
Identify the main characteristics that have the most impact on the car price.
What are the main characteristics that have the most impact on the car price?
Import libraries:
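The import cell did not survive in this copy; a minimal sketch of what the cells below need:

In [ ]: import pandas as pd
        import numpy as np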
This dataset was hosted on IBM Cloud Object Storage.
In [3]: path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDevelop
df = pd.read_csv(path)
df.head()
Out[3]: (first columns shown)
   symboling  normalized-losses         make  aspiration  num-of-doors   body-style  drive-wheels  engine-location  wheel-base
0          3                122  alfa-romero         std           two  convertible           rwd            front        88.6
1          3                122  alfa-romero         std           two  convertible           rwd            front        88.6
2          1                122  alfa-romero         std           two    hatchback           rwd            front        94.5
...
5 rows × 29 columns
2. Analyzing Individual Feature Patterns Using Visualization
To install Seaborn we use pip, the Python package manager.
Import visualization packages "Matplotlib" and "Seaborn". Don't forget about "%matplotlib inline" to plot in a Jupyter notebook.
symboling int64
normalized-losses int64
make object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower float64
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
city-L/100km float64
horsepower-binned object
diesel int64
gas int64
dtype: object
Question #1:
What is the data type of the column "peak-rpm"?
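One possible solution (Out[56] below is its result; the exact cell is not shown in this copy):

In [56]: df['peak-rpm'].dtype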
Out[56]: dtype('float64')
For example, we can calculate the correlation between variables of type "int64" or "float64"
using the method "corr":
In [7]: df.corr()
Out[7]: (subset of rows and columns shown)
                   symboling  normalized-losses  wheel-base    length     width    height  curb-weight
normalized-losses   0.466264           1.000000   -0.056661  0.019424  0.086802 -0.373737     0.099404
compression-ratio  -0.182196          -0.114713    0.250313  0.159733  0.189867  0.259737     0.156433
highway-mpg         0.036233          -0.181877   -0.543304 -0.698142 -0.680635 -0.104812    -0.794889
The diagonal elements are always one; we will study correlation more precisely, using the Pearson correlation, at the end of the notebook.
Question #2:
Find the correlation between the following columns: bore, stroke, compression-ratio, and
horsepower.
Hint: if you would like to select those columns, use the following syntax:
df[['bore','stroke','compression-ratio','horsepower']]
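One possible solution, following the hint:

In [ ]: df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()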
In order to start understanding the (linear) relationship between an individual variable and
the price, we can use "regplot" which plots the scatterplot plus the fitted regression line for
the data. This will be useful later on for visualizing the fit of the simple linear regression
model as well.
We can examine the correlation between 'engine-size' and 'price' and see that it's
approximately 0.87.
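Sketches of the corresponding cells (the originals are not shown in this copy):

In [ ]: sns.regplot(x="engine-size", y="price", data=df)
        plt.ylim(0,)

In [ ]: df[['engine-size', 'price']].corr()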
Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-
mpg" and "price".
We can examine the correlation between 'highway-mpg' and 'price' and see it's
approximately -0.704.
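Again, sketches of the corresponding cells:

In [ ]: sns.regplot(x="highway-mpg", y="price", data=df)

In [ ]: df[['highway-mpg', 'price']].corr()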
We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately
-0.101616.
In [14]: df[['peak-rpm','price']].corr()
Question 3 a):
Find the correlation between x="stroke" and y="price".
Hint: if you would like to select those columns, use the following syntax:
df[["stroke","price"]].
In [58]: # Write your code below and press Shift+Enter to execute
df[["stroke","price"]].corr()
Question 3 b):
Given the correlation results between "price" and "stroke", do you expect a linear
relationship?
Categorical Variables
These are variables that describe a 'characteristic' of a data unit, and are selected from a
small group of categories. The categorical variables can have the type "object" or "int64". A
good way to visualize categorical variables is by using boxplots.
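For example, we can look at the relationship between "body-style" and "price" (a sketch of the cell):

In [ ]: sns.boxplot(x="body-style", y="price", data=df)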
We see that the distributions of price between the different body-style categories have a significant overlap, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":
In [19]: # drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)
The describe function automatically computes basic statistics for all continuous variables.
Any NaN values are automatically skipped in these statistics.
In [20]: df.describe()
Out[20]: (summary statistics of the numeric columns: symboling, normalized-losses, wheel-base, length, width, height, curb-weight, ...; output truncated)
The default setting of "describe" skips variables of type object. We can apply the method
"describe" on the variables of type 'object' as follows:
In [21]: df.describe(include=['object'])
Out[21]: (freq row and trailing columns truncated)
          make  aspiration  num-of-doors  body-style  drive-wheels  engine-location  engine-type  num-of-cylinders  fuel-system
count      201         201           201         201           201              201          201               201          201
unique      22           2             2           5             3                2            6                 7            8
top     toyota         std          four       sedan           fwd            front          ohc              four         mpfi
Value Counts
Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "drive-wheels". Don't forget that the method "value_counts" only works on a pandas Series, not a pandas dataframe. As a result, we only include one bracket, df['drive-wheels'], not two brackets, df[['drive-wheels']].
In [22]: df['drive-wheels'].value_counts()
In [23]: df['drive-wheels'].value_counts().to_frame()
Out[23]: drive-wheels
fwd 118
rwd 75
4wd 8
Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and
rename the column 'drive-wheels' to 'value_counts'.
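A sketch of the corresponding cell (in newer pandas the counts column may be named 'count' rather than 'drive-wheels'):

In [24]: drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
         drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
         drive_wheels_counts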
Out[24]: value_counts
fwd 118
rwd 75
4wd 8
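Now let's rename the index to 'drive-wheels' (a sketch matching Out[25] below):

In [25]: drive_wheels_counts.index.name = 'drive-wheels'
         drive_wheels_counts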
Out[25]: value_counts
drive-wheels
fwd 118
rwd 75
4wd 8
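We can repeat the same steps for the variable 'engine-location' (a sketch; the variable name engine_loc_counts is an assumption):

In [ ]: engine_loc_counts = df['engine-location'].value_counts().to_frame()
        engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
        engine_loc_counts.index.name = 'engine-location'
        engine_loc_counts

which yields: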
                 value_counts
engine-location
front                     198
rear                        3
After examining the value counts of the engine location, we see that engine location would
not be a good predictor variable for the price. This is because we only have three cars with a
rear engine and 198 with an engine in the front, so this result is skewed. Thus, we are not
able to draw any conclusions about the engine location.
4. Basics of Grouping
The "groupby" method groups data by different categories. The data is grouped based on
one or several variables, and analysis is performed on the individual groups.
For example, let's group by the variable "drive-wheels". We see that there are 3 different
categories of drive wheels.
In [27]: df['drive-wheels'].unique()
If we want to know, on average, which type of drive wheel is most valuable, we can group
"drive-wheels" and then average them.
We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the
variable "df_group_one".
We can then calculate the average price for each of the different categories of data.
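A sketch of the corresponding cell (note: older pandas versions silently drop the non-numeric 'body-style' column when averaging; newer versions need numeric_only=True):

In [ ]: df_group_one = df[['drive-wheels', 'body-style', 'price']]
        df_group_one = df_group_one.groupby(['drive-wheels'], as_index=False).mean(numeric_only=True)
        df_group_one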
  drive-wheels         price
0          4wd  10241.000000
1          fwd   9244.779661
2          rwd  19757.613333
From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while
4-wheel and front-wheel are approximately the same in price.
You can also group by multiple variables. For example, let's group by both 'drive-wheels' and
'body-style'. This groups the dataframe by the unique combination of 'drive-wheels' and
'body-style'. We can store the results in the variable 'grouped_test1'.
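A sketch of the corresponding cell (the intermediate name df_gptest is an assumption):

In [ ]: df_gptest = df[['drive-wheels', 'body-style', 'price']]
        grouped_test1 = df_gptest.groupby(['drive-wheels', 'body-style'], as_index=False).mean(numeric_only=True)
        grouped_test1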
This grouped data is much easier to visualize when it is made into a pivot table. A pivot table
is like an Excel spreadsheet, with one variable along the column and another along the row.
We can convert the dataframe to a pivot table using the method "pivot" to create a pivot
table from the groups.
In this case, we will leave the drive-wheels variable as the rows of the table, and pivot body-
style to become the columns of the table:
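A sketch of the corresponding cell:

In [ ]: grouped_pivot = grouped_test1.pivot(index='drive-wheels', columns='body-style')
        grouped_pivot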
(Output: pivot table of average price with drive-wheels as the rows and body-style as the columns; values truncated.)
Often, we won't have data for some of the pivot cells. We can fill these missing cells with the
value 0, but any other value could potentially be used as well. It should be mentioned that
missing data is quite a complex subject and is an entire course on its own.
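A sketch:

In [32]: grouped_pivot = grouped_pivot.fillna(0)  # fill missing values with 0
         grouped_pivot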
Out[32]: (the pivot table of average price by drive-wheels, with missing cells filled with 0; output truncated)
Question 4:
Use the "groupby" function to find the average "price" of each car based on "body-style".
    body-style         price
0  convertible  21890.500000
1      hardtop  22208.500000
2    hatchback   9957.441176
3        sedan  14459.755319
4        wagon  12371.960000
Let's use a heat map to visualize the relationship between body style and price.
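A sketch of the heat-map cell, using the pivot table built above:

In [ ]: plt.pcolor(grouped_pivot, cmap='RdBu')
        plt.colorbar()
        plt.show()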
The default labels convey no useful information to us. Let's change that:
In [ ]: fig, ax = plt.subplots()
        im = ax.pcolor(grouped_pivot, cmap='RdBu')
        #label names
        row_labels = grouped_pivot.columns.levels[1]
        col_labels = grouped_pivot.index
        #move ticks to the centre of each cell
        ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
        ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
        #insert labels
        ax.set_xticklabels(row_labels, minor=False)
        ax.set_yticklabels(col_labels, minor=False)
        fig.colorbar(im)
        plt.show()
Visualization is very important in data science, and Python visualization packages provide
great freedom. We will go more in-depth in a separate Python visualizations course.
The main question we want to answer in this module is, "What are the main characteristics
which have the most impact on the car price?".
To get a better measure of the important characteristics, we look at the correlation of these
variables with the car price. In other words: how is the car price dependent on this variable?
Correlation: a measure of the extent of interdependence between variables.
Causation: the relationship of cause and effect between two variables.
It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.
Pearson Correlation
The Pearson Correlation measures the linear dependence between two variables X and Y. The resulting coefficient is a value between -1 and 1 inclusive, where:
1: perfect positive linear correlation.
0: no linear correlation.
-1: perfect negative linear correlation.
Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.
In [37]: df.corr()
Out[37]: (subset of rows and columns shown)
                   symboling  normalized-losses  wheel-base    length     width    height  curb-weight
normalized-losses   0.466264           1.000000   -0.056661  0.019424  0.086802 -0.373737     0.099404
compression-ratio  -0.182196          -0.114713    0.250313  0.159733  0.189867  0.259737     0.156433
highway-mpg         0.036233          -0.181877   -0.543304 -0.698142 -0.680635 -0.104812    -0.794889
What is this P-value? The P-value is the probability that we would observe a correlation at least this strong purely by chance if the two variables were actually unrelated. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.
p-value < 0.001: strong evidence that the correlation is significant.
p-value < 0.05: moderate evidence that the correlation is significant.
p-value < 0.1: weak evidence that the correlation is significant.
p-value > 0.1: no evidence that the correlation is significant.
We can obtain this information using the "stats" module in the "scipy" library.
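Wheel-Base vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'. A sketch of the cells (the printed values are summarized in the conclusion below):

In [ ]: from scipy import stats

In [ ]: pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
        print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)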
Conclusion:
Since the p-value is < 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585).
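Horsepower vs. Price
The same pattern applies to horsepower, and to the remaining variables below (length, width, curb-weight, engine-size, bore, city-mpg); only the column name changes:

In [ ]: pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
        print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)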
Conclusion:
Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).
Length vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.
Conclusion:
Since the p-value is < 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).
Width vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).
Curb-Weight vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).
Engine-Size vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).
Bore vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).
City-mpg vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.
6. ANOVA: Analysis of Variance
The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:
F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.
P-value: The P-value tells us how statistically significant our calculated score value is.
If our price variable is strongly correlated with the variable we are analyzing, we expect
ANOVA to return a sizeable F-test score and a small p-value.
Drive Wheels
Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.
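A sketch of the corresponding cell (df_gptest as defined earlier; head(2) shows the first couple of rows per group):

In [ ]: grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])
        grouped_test2.head(2)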
  drive-wheels    price
0          rwd  13495.0
1          rwd  16500.0
3          fwd  13950.0
4          4wd  17450.0
5          fwd  15250.0
We can obtain the values of each group using the method "get_group".
In [50]: grouped_test2.get_group('4wd')['price']
Out[50]: 4 17450.0
136 7603.0
140 9233.0
141 11259.0
144 8013.0
145 11694.0
150 7898.0
151 8778.0
Name: price, dtype: float64
We can use the function 'f_oneway' in the module 'stats' to obtain the F-test score and P-
value.
In [51]: # ANOVA: compare the price means of all three drive-wheels groups
         f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])
         print("ANOVA results: F=", f_val, ", P =", p_val)
This is a great result: the large F-test score shows a strong correlation, and a P-value of almost 0 implies near-certain statistical significance. But does this mean all three tested groups are this highly correlated?
Let's examine them separately.
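Sketches of the pairwise tests:

In [ ]: # fwd and rwd
        f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])
        print("ANOVA results: F=", f_val, ", P =", p_val)

In [ ]: # 4wd and fwd
        f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])
        print("ANOVA results: F=", f_val, ", P =", p_val)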
We notice that the ANOVA for the categories 4wd and fwd yields a high p-value > 0.1, so the calculated F-test score is not statistically significant. This suggests we can't reject the assumption that the means of these two groups are the same; in other words, we can't conclude that the difference in average price between 4wd and fwd is significant.
Conclusion: Important Variables
We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. Continuous numerical variables:
Length
Width
Curb-weight
Engine-size
Horsepower
City-mpg
Highway-mpg
Wheel-base
Bore
Categorical variables:
Drive-wheels
Author
Dev Agnihotri