Data Analysis

The document analyzes automobile data from a dataset containing 205 observations and 26 variables. It loads the data, cleans it by adding headers, and examines the data types of each column. Summary statistics are generated for all numeric columns using the describe() method. This provides measures like count, mean, standard deviation, minimum, and percentile values for each column. The describe() method is also run with include='all' to view descriptions of object type columns.


DATA ACQUISITION

BUT FIRST WE NEED TO IMPORT SOME LIBRARIES

In [1]: import pandas as pd


import numpy as np
import seaborn as sas
import matplotlib.pylab as plt

In [2]: path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDevelope

df = pd.read_csv(path, header = None)

In [3]: df.head(5)

Out[3]:    0    1            2    3    4     5            6    7      8     9  ...   16    17    18    19    20   21
      0    3    ?  alfa-romero  gas  std   two  convertible  rwd  front  88.6  ...  130  mpfi  3.47  2.68   9.0  111
      1    3    ?  alfa-romero  gas  std   two  convertible  rwd  front  88.6  ...  130  mpfi  3.47  2.68   9.0  111
      2    1    ?  alfa-romero  gas  std   two    hatchback  rwd  front  94.5  ...  152  mpfi  2.68  3.47   9.0  154
      3    2  164         audi  gas  std  four        sedan  fwd  front  99.8  ...  109  mpfi  3.19  3.40  10.0  102
      4    2  164         audi  gas  std  four        sedan  4wd  front  99.4  ...  136  mpfi  3.19  3.40   8.0  115

5 rows × 26 columns

We will name the headers


In [4]: headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-do
"drive-wheels","engine-location","wheel-base", "length","width","height","cur
"num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-
"peak-rpm","city-mpg","highway-mpg","price"]

df.columns = headers
df.head(5)


Out[4]:   symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
        0         3                 ?  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        1         3                 ?  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        2         1                 ?  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
        3         2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
        4         2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 26 columns

In [5]: df.dtypes

Out[5]:
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object

If we would like to get a statistical summary of each column, e.g. count,
column mean value, column standard deviation, etc., we use the describe
method:

In [6]: df.describe()


Out[6]:       symboling  wheel-base      length       width      height  curb-weight  engine-size  compression-ratio ...
count        205.000000  205.000000  205.000000  205.000000  205.000000   205.000000   205.000000            205.000 ...
mean           0.834146   98.756585  174.049268   65.907805   53.724878  2555.565854   126.907317             10.142 ...
std            1.245307    6.021776   12.337289    2.145204    2.443522   520.680204    41.642693              3.972 ...
min           -2.000000   86.600000  141.100000   60.300000   47.800000  1488.000000    61.000000              7.000 ...
25%            0.000000   94.500000  166.300000   64.100000   52.000000  2145.000000    97.000000              8.600 ...
50%            1.000000   97.000000  173.200000   65.500000   54.100000  2414.000000   120.000000              9.000 ...
75%            2.000000  102.400000  183.100000   66.900000   55.500000  2935.000000   141.000000              9.400 ...
max            3.000000  120.900000  208.100000   72.300000   59.800000  4066.000000   326.000000             23.000 ...

What if we would also like to check all the columns, including those that are of
type object?

In [7]: df.describe( include = "all")

Out[7]:       symboling normalized-losses    make fuel-type aspiration num-of-doors body-style drive-wheels engine-location  wheel-base ...
count        205.000000               205     205       205        205          205        205          205             205  205.000000 ...
unique              NaN                52      22         2          2            3          5            3               2         NaN ...
top                 NaN                 ?  toyota       gas        std         four      sedan          fwd           front         NaN ...
freq                NaN                41      32       185        168          114         96          120             202         NaN ...
mean           0.834146               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN   98.756585 ...
std            1.245307               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN    6.021776 ...
min           -2.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN   86.600000 ...
25%            0.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN   94.500000 ...
50%            1.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN   97.000000 ...
75%            2.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN  102.400000 ...
max            3.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN  120.900000 ...

11 rows × 26 columns

Another method you can use to check your dataset is:

In [8]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 205 non-null int64
1 normalized-losses 205 non-null object
2 make 205 non-null object
3 fuel-type 205 non-null object
4 aspiration 205 non-null object
5 num-of-doors 205 non-null object
6 body-style 205 non-null object
7 drive-wheels 205 non-null object
8 engine-location 205 non-null object
9 wheel-base 205 non-null float64
10 length 205 non-null float64
11 width 205 non-null float64
12 height 205 non-null float64
13 curb-weight 205 non-null int64
14 engine-type 205 non-null object
15 num-of-cylinders 205 non-null object
16 engine-size 205 non-null int64
17 fuel-system 205 non-null object
18 bore 205 non-null object
19 stroke 205 non-null object
20 compression-ratio 205 non-null float64
21 horsepower 205 non-null object
22 peak-rpm 205 non-null object
23 city-mpg 205 non-null int64
24 highway-mpg 205 non-null int64
25 price 205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB

Identify and handle missing values


We need to replace the "?" symbol with NaN so the dropna() can remove the
missing values:

In [9]: df.replace("?", np.nan, inplace = True)


df.head(5)


Out[9]:   symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
        0         3               NaN  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        1         3               NaN  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        2         1               NaN  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
        3         2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
        4         2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 26 columns

Evaluating for Missing Data

The missing values have now been converted to NaN. There are two methods
to detect missing data:

.isnull()

.notnull()

The output is a boolean value indicating whether the value that is passed into
the argument is in fact missing data.

"True" means the value is a missing value while "False" means the value is not
a missing value.

In [10]: missing_data = df.isnull()

missing_data.head(5)


Out[10]:  symboling normalized-losses   make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ...
        0     False              True  False     False      False        False      False        False           False      False ...
        1     False              True  False     False      False        False      False        False           False      False ...
        2     False              True  False     False      False        False      False        False           False      False ...
        3     False             False  False     False      False        False      False        False           False      False ...
        4     False             False  False     False      False        False      False        False           False      False ...

5 rows × 26 columns

Count Missing Values in each column


In [11]: for column in missing_data.columns.values.tolist():
print(column)
print (missing_data[column].value_counts())
print("")

symboling
False 205
Name: symboling, dtype: int64

normalized-losses
False 164
True 41
Name: normalized-losses, dtype: int64

make
False 205
Name: make, dtype: int64

fuel-type
False 205
Name: fuel-type, dtype: int64

aspiration
False 205
Name: aspiration, dtype: int64

num-of-doors
False 203
True 2
Name: num-of-doors, dtype: int64

body-style
False 205
Name: body-style, dtype: int64

drive-wheels
False 205
Name: drive-wheels, dtype: int64

engine-location
False 205
Name: engine-location, dtype: int64

wheel-base
False 205
Name: wheel-base, dtype: int64

length
False 205
Name: length, dtype: int64

width
False 205
Name: width, dtype: int64

height
False 205
Name: height, dtype: int64

curb-weight
False 205
Name: curb-weight, dtype: int64

engine-type
False 205
Name: engine-type, dtype: int64

num-of-cylinders
False 205
Name: num-of-cylinders, dtype: int64

engine-size
False 205
Name: engine-size, dtype: int64

fuel-system
False 205
Name: fuel-system, dtype: int64

bore
False 201
True 4
Name: bore, dtype: int64

stroke
False 201
True 4
Name: stroke, dtype: int64

compression-ratio
False 205
Name: compression-ratio, dtype: int64

horsepower
False 203
True 2
Name: horsepower, dtype: int64

peak-rpm
False 203
True 2
Name: peak-rpm, dtype: int64

city-mpg
False 205
Name: city-mpg, dtype: int64

highway-mpg
False 205
Name: highway-mpg, dtype: int64

price
False 201
True 4
Name: price, dtype: int64

Dealing with Missing Data

How to deal with missing data?


Drop data:
a. Drop the whole row
b. Drop the whole column

Replace data:
a. Replace it by mean
b. Replace it by frequency
c. Replace it based on other functions

Replace by mean:
"normalized-losses": 41 missing values, replace them with the mean
"stroke": 4 missing values, replace them with the mean
"bore": 4 missing values, replace them with the mean
"horsepower": 2 missing values, replace them with the mean
"peak-rpm": 2 missing values, replace them with the mean

Replace by frequency:
"num-of-doors": 2 missing values, replace them with "four". Reason: about 84% of sedans
have four doors. Since four doors is the most frequent value, it is the most likely to occur.

Drop the whole row:
"price": 4 missing values, simply delete the whole row. Reason: price is what we want to predict.
Any data entry without price data cannot be used for prediction; therefore any row without
price data is not useful to us.

Calculating the mean value for "normalized-losses" and replacing NaN with it
In [12]: mean_nl = df["normalized-losses"].astype("float").mean(axis = 0)

print("Avg loss", mean_nl)

Avg loss 122.0

In [13]: df["normalized-losses"].replace(np.nan, mean_nl, inplace = True)

In [14]: mean_bore = df["bore"].astype("float").mean(axis=0)

print("mean :", mean_bore)

mean : 3.3297512437810957

In [15]: df["bore"].replace(np.nan, mean_bore, inplace = True)

In [16]: mean_stroke = df["stroke"].astype("float").mean(axis =0)

print("mean :", mean_stroke)


mean : 3.2554228855721337

In [17]: df["stroke"].replace(np.nan, mean_stroke, inplace = True)

In [18]: mean_horsepower = df["horsepower"].astype("float").mean(axis = 0)

print("mean :", mean_horsepower)

mean : 104.25615763546799

In [19]: df["horsepower"].replace(np.nan, mean_horsepower, inplace = True)

In [20]: mean_peak_rpm = df["peak-rpm"].astype("float").mean(axis = 0)

print("mean :", mean_peak_rpm)

mean : 5125.369458128079

In [21]: df["peak-rpm"].replace(np.nan, mean_peak_rpm, inplace = True)

To see which values are present in a particular column, we can
use the ".value_counts()" method:
In [22]: df["num-of-doors"].value_counts()

Out[22]:
four 114
two 89
Name: num-of-doors, dtype: int64

We can see that four doors are the most common type. We can also use the
".idxmax()" method to calculate the most common type automatically:

In [23]: df['num-of-doors'].value_counts().idxmax()

Out[23]: 'four'

The replacement procedure is very similar to what we have seen previously:

In [24]: df['num-of-doors'].replace(np.nan, "four", inplace = True)

Finally, let's drop all rows that do not have price data:

In [25]: df.dropna(subset = ['price'], axis = 0, inplace = True)

Reset the index, because we dropped four rows:

In [26]: df.reset_index(drop=True, inplace=True)

In [27]: df.head()


Out[27]:  symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
        0         3             122.0  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        1         3             122.0  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        2         1             122.0  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
        3         2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
        4         2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 26 columns

Now we have a dataset with no missing values.

Now we check the data types:

In [28]: df.dtypes

Out[28]:
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object

As we can see, there are some columns that should be of type float or int.

Convert data types to proper format



In [29]: df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")


df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")

In [30]: df.dtypes

Out[30]:
symboling int64
normalized-losses int32
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower object
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
dtype: object

Data Standardization
Data is usually collected from different agencies in different formats. (Data standardization is
also a term for a particular type of data normalization where we subtract the mean and divide
by the standard deviation.)

What is standardization?
Standardization is the process of transforming data into a common format, allowing the
researcher to make the meaningful comparison.

Example
Transform mpg to L/100km:

In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are represented
by mpg (miles per gallon) unit. Assume we are developing an application in a country that
accepts the fuel consumption with L/100km standard.


We will need to apply data transformation to transform mpg into L/100km.

In [31]: df.head()

Out[31]:  symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
        0         3               122  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        1         3               122  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        2         1               122  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
        3         2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
        4         2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 26 columns

In [32]: df['city-L/100km'] = 235/df["city-mpg"]

# check your transformed data


df.head()

Out[32]:  symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
        0         3               122  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        1         3               122  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
        2         1               122  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
        3         2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
        4         2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 27 columns
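The same transformation applies on the highway side; a minimal sketch (the column name "highway-L/100km" is our own choice, not from the original notebook):

# transform highway-mpg to L/100km with the same 235/mpg formula used for city-mpg
df['highway-L/100km'] = 235/df["highway-mpg"]

# check the transformed data
df[['highway-mpg', 'highway-L/100km']].head()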

Data Normalization

Why normalization?
Normalization is the process of transforming values of several variables into a similar range.
Typical normalizations include scaling the variable so the variable average is 0, scaling the
variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.

Sample Code
replace (original value) by (original value)/(maximum value)

df['length'] = df['length']/df['length'].max()

df['width'] = df['width']/df['width'].max()
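The same max-scaling could be applied to "height" as well; a sketch, with a quick sanity check that the scaled columns now lie between 0 and 1:

# scale 'height' by its maximum, mirroring 'length' and 'width' above
df['height'] = df['height']/df['height'].max()

# sanity check: each scaled column's maximum should now be 1.0
print(df[['length', 'width', 'height']].max())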

Binning
Why binning?
Binning is a process of transforming continuous numerical variables into discrete categorical
'bins' for grouped analysis.

Convert data to correct format:

In [33]: df["horsepower"]=df["horsepower"].astype(int, copy=True)

Let's see the graph:


In [34]: %matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
plt.pyplot.hist(df["horsepower"])

# set x/y labels and plot title


plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")

Out[34]: Text(0.5, 1.0, 'horsepower bins')


We would like 3 bins of equal bandwidth, so we use NumPy's linspace function.

Since we want to include the minimum value of horsepower, we want to set start_value =
min(df["horsepower"]).

Since we want to include the maximum value of horsepower, we want to set end_value =
max(df["horsepower"]).

Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated
= 4.

In [35]: bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)


bins

Out[35]: array([ 48.        , 119.33333333, 190.66666667, 262.        ])

Now we set the group names:

In [36]: group_names = ['Low', 'Medium', 'High']

We apply the function "cut" to determine what each value of df['horsepower']


belongs to.

In [37]: df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)


df[['horsepower','horsepower-binned']].head(20)


Out[37]:    horsepower horsepower-binned
         0         111               Low
         1         111               Low
         2         154            Medium
         3         102               Low
         4         115               Low
         5         110               Low
         6         110               Low
         7         110               Low
         8         140            Medium
         9         101               Low
        10         101               Low
        11         121            Medium
        12         121            Medium
        13         121            Medium
        14         182            Medium
        15         182            Medium
        16         182            Medium
        17          48               Low
        18          70               Low
        19          70               Low

In [38]: df["horsepower-binned"].value_counts()

Out[38]:
Low 153
Medium 43
High 5
Name: horsepower-binned, dtype: int64

Let's plot the distribution of each bin:

In [39]: %matplotlib inline


import matplotlib as plt
from matplotlib import pyplot
pyplot.bar(group_names, df["horsepower-binned"].value_counts())

# set x/y labels and plot title


plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")

Out[39]: Text(0.5, 1.0, 'horsepower bins')


Bins Visualization
Normally, a histogram is used to visualize the distribution of bins we created above.

In [40]: plt.pyplot.hist(df["horsepower"], bins = 3)

# set x/y labels and plot title


plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")

Out[40]: Text(0.5, 1.0, 'horsepower bins')


The plot above shows the binning result for the attribute "horsepower".

Indicator Variable (or Dummy Variable)


What is an indicator variable?
An indicator variable (or dummy variable) is a numerical variable used to label categories. They
are called 'dummies' because the numbers themselves don't have inherent meaning.

Why we use them


We use indicator variables so we can use categorical variables for regression analysis in the later
modules.

Example
We see the column "fuel-type" has two unique values: "gas" or "diesel". Regression doesn't
understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-
type" to indicator variables.

We will use pandas' method 'get_dummies' to assign numerical values to different categories of
fuel type.


In [41]: df.columns

Out[41]: Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price', 'city-L/100km', 'horsepower-binned'],
      dtype='object')

In [42]: dummy_variable_1 = pd.get_dummies(df["fuel-type"])

dummy_variable_1.head()

Out[42]:   diesel  gas
        0       0    1
        1       0    1
        2       0    1
        3       0    1
        4       0    1

In [43]: dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)


dummy_variable_1.head()

Out[43]:   fuel-type-diesel  fuel-type-gas
        0                 0              1
        1                 0              1
        2                 0              1
        3                 0              1
        4                 0              1

The dataframe "dummy_variable_1" now encodes the 'fuel-type' values 'gas' and 'diesel' as 0s and 1s.

merge data frame "df" and "dummy_variable_1"


In [44]: df = pd.concat([df, dummy_variable_1], axis=1)

drop original column "fuel-type" from "df"


In [45]: df.drop("fuel-type", axis = 1, inplace=True)

In [46]: df.head()


Out[46]:  symboling normalized-losses         make aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  length  ...
        0         3               122  alfa-romero        std          two  convertible          rwd           front        88.6   168.8  ...
        1         3               122  alfa-romero        std          two  convertible          rwd           front        88.6   168.8  ...
        2         1               122  alfa-romero        std          two    hatchback          rwd           front        94.5   171.2  ...
        3         2               164         audi        std         four        sedan          fwd           front        99.8   176.6  ...
        4         2               164         audi        std         four        sedan          4wd           front        99.4   176.6  ...

5 rows × 29 columns

Now to save the file


In [47]: df.to_csv('F:/Coursera/clean_df.csv')

Now our Data is Cleaned


In [48]: df.corr()


Out[48]:           symboling  normalized-losses  wheel-base    length     width    height  curb-weight  engine-size ...
symboling           1.000000           0.466264   -0.535987 -0.365404 -0.242423 -0.550160    -0.233118    -0.110581 ...
normalized-losses   0.466264           1.000000   -0.056661  0.019424  0.086802 -0.373737     0.099404     0.112360 ...
wheel-base         -0.535987          -0.056661    1.000000  0.876024  0.814507  0.590742     0.782097     0.572027 ...
length             -0.365404           0.019424    0.876024  1.000000  0.857170  0.492063     0.880665     0.685025 ...
width              -0.242423           0.086802    0.814507  0.857170  1.000000  0.306002     0.866201     0.729436 ...
height             -0.550160          -0.373737    0.590742  0.492063  0.306002  1.000000     0.307581     0.074694 ...
curb-weight        -0.233118           0.099404    0.782097  0.880665  0.866201  0.307581     1.000000     0.849072 ...
engine-size        -0.110581           0.112360    0.572027  0.685025  0.729436  0.074694     0.849072     1.000000 ...
bore               -0.140019          -0.029862    0.493244  0.608971  0.544885  0.180449     0.644060     0.572609 ...
stroke             -0.008153           0.055045    0.158018  0.123952  0.188822 -0.060663     0.167438     0.205928 ...
compression-ratio  -0.182196          -0.114713    0.250313  0.159733  0.189867  0.259737     0.156433     0.028889 ...
horsepower          0.075810           0.217300    0.371178  0.579795  0.615056 -0.087001     0.757981     0.822668 ...
peak-rpm            0.279740           0.239543   -0.360305 -0.285970 -0.245800 -0.309974    -0.279361    -0.256733 ...
city-mpg           -0.035527          -0.225016   -0.470606 -0.665192 -0.633531 -0.049800    -0.749543    -0.650546 ...
highway-mpg         0.036233          -0.181877   -0.543304 -0.698142 -0.680635 -0.104812    -0.794889    -0.679571 ...
price              -0.082391           0.133999    0.584642  0.690628  0.751265  0.135486     0.834415     0.872335 ...
city-L/100km        0.066171           0.238567    0.476153  0.657373  0.673363  0.003811     0.785353     0.745059 ...
fuel-type-diesel   -0.196735          -0.101546    0.307237  0.211187  0.244356  0.281578     0.221046     0.070779 ...
fuel-type-gas       0.196735           0.101546   -0.307237 -0.211187 -0.244356 -0.281578    -0.221046    -0.070779 ...

This shows us the pairwise correlations between the numeric variables.

Continuous Numerical Variables:


Continuous numerical variables are variables that may contain any value within some range.
They can be of type "int64" or "float64". A great way to visualize these variables is by using
scatterplots with fitted lines.

In order to start understanding the (linear) relationship between an individual variable and the
price, we can use "regplot" which plots the scatterplot plus the fitted regression line for the
data.


Let's see several examples of different linear relationships:

Positive Linear Relationship

In [49]: sas.regplot(x = "engine-size", y = "price" , data = df)

Out[49]: <AxesSubplot:xlabel='engine-size', ylabel='price'>

As the engine-size goes up, the price goes up: this indicates a positive direct correlation
between these two variables. Engine size seems like a pretty good predictor of price since the
regression line is almost a perfect diagonal line.

In [50]: df[["engine-size", "price"]].corr()

Out[50]: engine-size price

engine-size 1.000000 0.872335

price 0.872335 1.000000

Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-
mpg" and "price".

In [51]: sas.regplot(x="highway-mpg", y="price", data=df)

Out[51]: <AxesSubplot:xlabel='highway-mpg', ylabel='price'>


As highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship
between these two variables. Highway mpg could potentially be a predictor of price.

In [52]: df[["highway-mpg", "price"]].corr()

Out[52]: highway-mpg price

highway-mpg 1.000000 -0.704692

price -0.704692 1.000000

Let's see if "peak-rpm" is a predictor variable of "price".

In [53]: sas.regplot(x="peak-rpm", y="price", data=df)

Out[53]: <AxesSubplot:xlabel='peak-rpm', ylabel='price'>


Peak rpm does not seem like a good predictor of the price at all since the regression line is
close to horizontal. Also, the data points are very scattered and far from the fitted line, showing
lots of variability. Therefore, it's not a reliable variable.

In [54]: df[['peak-rpm','price']].corr()

Out[54]: peak-rpm price

peak-rpm 1.000000 -0.101616

price -0.101616 1.000000

Categorical Variables
These are variables that describe a 'characteristic' of a data unit, and are selected from a small
group of categories. The categorical variables can have the type "object" or "int64". A good way
to visualize categorical variables is by using boxplots.

In [55]: sas.boxplot(x="body-style", y="price", data=df)

Out[55]: <AxesSubplot:xlabel='body-style', ylabel='price'>


We see that the distributions of price between the different body-style categories have a
significant overlap, so body-style would not be a good predictor of price. Let's examine
"engine-location" and "price":

In [56]: sas.boxplot(x="engine-location", y="price", data=df)

Out[56]: <AxesSubplot:xlabel='engine-location', ylabel='price'>


Here we see that the distribution of price between these two engine-location categories, front
and rear, are distinct enough to take engine-location as a potential good predictor of price.

Let's examine "drive-wheels" and "price".

In [57]: sas.boxplot(x="drive-wheels", y="price", data=df)

Out[57]: <AxesSubplot:xlabel='drive-wheels', ylabel='price'>


Here we see that the distribution of price between the different drive-wheels categories differs.
As such, drive-wheels could potentially be a predictor of price.

Descriptive Statistical Analysis


The describe function automatically computes basic statistics for all continuous variables. Any
NaN values are automatically skipped in these statistics.

In [58]: df.describe()

Out[58]:      symboling  normalized-losses  wheel-base      length       width      height  curb-weight  engine-size ...
count        201.000000          201.00000  201.000000  201.000000  201.000000  201.000000   201.000000     201.0000 ...
mean           0.840796          122.00000   98.797015  174.200995   65.889055   53.766667  2555.666667     126.8756 ...
std            1.254802           31.99625    6.066366   12.322175    2.101471    2.447822   517.296727      41.5468 ...
min           -2.000000           65.00000   86.600000  141.100000   60.300000   47.800000  1488.000000      61.0000 ...
25%            0.000000          101.00000   94.500000  166.800000   64.100000   52.000000  2169.000000      98.0000 ...
50%            1.000000          122.00000   97.000000  173.200000   65.500000   54.100000  2414.000000     120.0000 ...
75%            2.000000          137.00000  102.400000  183.500000   66.600000   55.500000  2926.000000     141.0000 ...
max            3.000000          256.00000  120.900000  208.100000   72.000000   59.800000  4066.000000     326.0000 ...


In [59]: df.describe(include=['object'])

Out[59]:   make aspiration num-of-doors body-style drive-wheels engine-location engine-type num-of-cylinders fuel-system
count       201        201          201        201          201             201         201              201         201
unique       22          2            2          5            3               2           6                7           8
top      toyota        std         four      sedan          fwd           front         ohc             four        mpfi
freq         32        165          115         94          118             198         145              157          92

Value Counts
Value counts is a good way of understanding how many units of each characteristic/variable we
have. We can apply the "value_counts" method on the column "drive-wheels". Don't forget the
method "value_counts" only works on pandas Series, not pandas dataframes. As a result, we
only include one bracket, df['drive-wheels'], not two brackets, df[['drive-wheels']].
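A quick sketch of the difference between the two bracket styles:

# single brackets return a pandas Series, double brackets return a DataFrame
print(type(df['drive-wheels']))    # <class 'pandas.core.series.Series'>
print(type(df[['drive-wheels']]))  # <class 'pandas.core.frame.DataFrame'>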

In [60]: df['drive-wheels'].value_counts()

Out[60]:
fwd 118
rwd 75
4wd 8
Name: drive-wheels, dtype: int64

We can convert the series to a dataframe as follows:

In [61]: df['drive-wheels'].value_counts().to_frame()

Out[61]: drive-wheels

fwd 118

rwd 75

4wd 8

Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and
rename the column 'drive-wheels' to 'value_counts'.

In [62]: drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()


drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts

Out[62]: value_counts

fwd 118

rwd 75

4wd 8


Now let's rename the index to 'drive-wheels':

In [63]: drive_wheels_counts.index.name = 'drive-wheels'


drive_wheels_counts

Out[63]: value_counts

drive-wheels

fwd 118

rwd 75

4wd 8

We can repeat the above process for the variable 'engine-location'.

In [64]: engine_loc_counts = df['engine-location'].value_counts().to_frame()


engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)

Out[64]: value_counts

engine-location

front 198

rear 3

Basics of Grouping
The "groupby" method groups data by different categories. The data is grouped based on one
or several variables, and analysis is performed on the individual groups.

For example, let's group by the variable "drive-wheels". We see that there are 3 different
categories of drive wheels.

In [65]: df['drive-wheels'].unique()

Out[65]: array(['rwd', 'fwd', '4wd'], dtype=object)

If we want to know, on average, which type of drive wheel is most valuable,
we can group "drive-wheels" and then average them. We can select the
columns 'drive-wheels', 'body-style' and 'price', then assign it to the variable
"df_group_one".

In [66]: df_group_one = df[['drive-wheels','body-style','price']]

We can then calculate the average price for each of the different categories of
data.


In [67]: df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()


df_group_one

Out[67]: drive-wheels price

0 4wd 10241.000000

1 fwd 9244.779661

2 rwd 19757.613333

From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-
wheel and front-wheel are approximately the same in price.

You can also group by multiple variables. For example, let's group by both 'drive-wheels' and
'body-style'. This groups the dataframe by the unique combination of 'drive-wheels' and 'body-
style'. We can store the results in the variable 'grouped_test1'.

In [68]: df_gptest = df[['drive-wheels','body-style','price']]


grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1

Out[68]: drive-wheels body-style price

0 4wd hatchback 7603.000000

1 4wd sedan 12647.333333

2 4wd wagon 9095.750000

3 fwd convertible 11595.000000

4 fwd hardtop 8249.000000

5 fwd hatchback 8396.387755

6 fwd sedan 9811.800000

7 fwd wagon 9997.333333

8 rwd convertible 23949.600000

9 rwd hardtop 24202.714286

10 rwd hatchback 14337.777778

11 rwd sedan 21711.833333

12 rwd wagon 16994.222222

This grouped data is much easier to visualize when it is made into a pivot table

In [69]: grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')


grouped_pivot


Out[69]: price

body-style convertible hardtop hatchback sedan wagon

drive-wheels

4wd NaN NaN 7603.000000 12647.333333 9095.750000

fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333

rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222

Often, we won't have data for some of the pivot cells. We can fill these missing cells with the
value 0, but any other value could potentially be used as well. It should be mentioned that
missing data is quite a complex subject and is an entire course on its own.

In [70]: grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0


grouped_pivot

Out[70]: price

body-style convertible hardtop hatchback sedan wagon

drive-wheels

4wd 0.0 0.000000 7603.000000 12647.333333 9095.750000

fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333

rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222

In [71]: import matplotlib.pyplot as plt


%matplotlib inline

Variables: Drive Wheels and Body Style vs. Price


Let's use a heat map to visualize the relationship between body style and price.

In [72]: #use the grouped results


plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()


The heatmap plots the target variable (price) proportional to colour with respect to the variables
'drive-wheel' and 'body-style' on the vertical and horizontal axis, respectively. This allows us to
visualize how the price is related to 'drive-wheel' and 'body-style'.

The default labels convey no useful information to us. Let's change that:

In [73]: fig, ax = plt.subplots()


im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center


ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long


plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()


Correlation and Causation


Correlation: a measure of the extent of interdependence between variables.

Causation: the relationship between cause and effect between two variables.

It is important to know the difference between these two. Correlation does not imply causation.
Determining correlation is much simpler than determining causation, as causation may require
independent experimentation.

Pearson Correlation
The Pearson Correlation measures the linear dependence between two variables X and Y.

The resulting coefficient is a value between -1 and 1 inclusive, where:

1: Perfect positive linear correlation.
0: No linear correlation; the two variables most likely do not affect each other.
-1: Perfect negative linear correlation.

In [74]: from scipy import stats

Wheel-Base vs. Price


Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.

In [75]: pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.5846418222655081 with a P-value of P = 8.076488270732989e-20

Since the p-value is < 0.001, the correlation between wheel-base and price is statistically
significant, although the linear relationship isn't extremely strong (~0.585).
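As a cross-check, the same coefficient follows directly from the definition of Pearson's r (the covariance divided by the product of the standard deviations); a sketch using NumPy:

# Pearson r = cov(X, Y) / (std(X) * std(Y)), using population (ddof=0) statistics
x, y = df['wheel-base'], df['price']
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))
print(r_manual)                 # ~0.585, matching stats.pearsonr above
print(np.corrcoef(x, y)[0, 1])  # same value via NumPy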

Horsepower vs. Price

In [76]: pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.8096068016571054 with a P-value of P = 6.273536270650504e-48

Since the p-value is < 0.001, the correlation between horsepower and price is
statistically significant, and the linear relationship is quite strong (~0.809, close to 1).

Length vs. Price

In [77]: pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.6906283804483642 with a P-value of P = 8.016477466158759e-30

Since the p-value is < 0.001, the correlation between length and price is statistically
significant, and the linear relationship is moderately strong (~0.691).

Similarly checking with others

Width Vs Price

In [78]: pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.7512653440522672 with a P-value of P = 9.200033551048217e-38

Curb-Weight Vs Price

In [79]: pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.8344145257702846 with a P-value of P = 2.1895772388936914e-53

Engine-Size Vs Price

In [80]: pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.8723351674455185 with a P-value of P = 9.265491622198389e-64

Bore Vs Price

In [81]: pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.5431553832626602 with a P-value of P = 8.049189483935489e-17

City-Mpg Vs Price

In [82]: pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is -0.6865710067844677 with a P-value of P = 2.321132065567674e-29

Highway-mpg Vs Price

In [83]: pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])

print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is -0.7046922650589529 with a P-value of P = 1.7495471144477352e-31
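The repeated cells above could equally be written as a single loop over the candidate predictors; a compact sketch (the column list is our own selection):

# Pearson r and p-value for several numeric predictors against price, in one pass
for col in ['wheel-base', 'horsepower', 'length', 'width', 'curb-weight',
            'engine-size', 'bore', 'city-mpg', 'highway-mpg']:
    coef, p = stats.pearsonr(df[col], df['price'])
    print(col, ": r =", round(coef, 3), ", p =", p)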

ANOVA
The Analysis of Variance (ANOVA) is a statistical method used to test whether there are
significant differences between the means of two or more groups. ANOVA returns two
parameters:

F-test score: ANOVA assumes the means of all groups are the same, calculates how much the
actual means deviate from the assumption, and reports it as the F-test score. A larger score
means there is a larger difference between the means.

P-value: P-value tells how statistically significant our calculated score value is.

If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA
to return a sizeable F-test score and a small p-value.

Drive Wheels
Since ANOVA analyzes the difference between different groups of the same variable, the
groupby function will come in handy. Because the ANOVA algorithm averages the data
automatically, we do not need to take the average beforehand.

To see if different types of 'drive-wheels' impact 'price', we group the data.

In [84]: grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])


grouped_test2.head(2)

Out[84]: drive-wheels price

0 rwd 13495.0

1 rwd 16500.0

3 fwd 13950.0

4 4wd 17450.0

5 fwd 15250.0

136 4wd 7603.0

In [85]: df_gptest

Out[85]: drive-wheels body-style price

0 rwd convertible 13495.0

1 rwd convertible 16500.0

2 rwd hatchback 16500.0

3 fwd sedan 13950.0

4 4wd sedan 17450.0

... ... ... ...

196 rwd sedan 16845.0

197 rwd sedan 19045.0

198 rwd sedan 21485.0

199 rwd sedan 22470.0

200 rwd sedan 22625.0

201 rows × 3 columns

We can obtain the values of each group using the method "get_group".

In [86]: grouped_test2.get_group('4wd')['price']

Out[86]:
4 17450.0
136 7603.0
140 9233.0
141 11259.0
144 8013.0
145 11694.0
150 7898.0
151 8778.0
Name: price, dtype: float64

In [87]: # ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23

This is a great result with a large F-test score showing a strong correlation and a P-value of
almost 0 implying almost certain statistical significance. But does this mean all three tested
groups are all this highly correlated?

Let's examine them separately.

fwd and rwd


In [88]: f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val )

ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23

4wd and rwd


In [89]: f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333

4wd and fwd


In [90]: f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])

print("ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 0.665465750252303 , P = 0.41620116697845666

We now have a better idea of what our data looks like and which variables are important to take
into account when predicting the car price. We have narrowed it down to the following
variables:

Continuous numerical variables:


Length

Width

Curb-weight

Engine-size

Horsepower

City-mpg

Highway-mpg

Wheel-base

Bore

Categorical variables:
Drive-wheels

Model Development
Linear Regression and Multiple Linear Regression
Let's load the modules for linear regression:

In [91]: from sklearn.linear_model import LinearRegression

Create the linear regression object:

In [92]: lm = LinearRegression()
lm

Out[92]: LinearRegression()

How could highway-mpg help us predict car price?


For this example, we want to look at how highway-mpg can help us predict car price. Using
simple linear regression, we will create a linear function with "highway-mpg" as the predictor
variable and the "price" as the response variable.

In [93]: X = df[['highway-mpg']]
Y = df['price']

Fit the linear model using highway-mpg:

In [94]: lm.fit(X, Y)

Out[94]: LinearRegression()

We can output a prediction:

In [95]: Yhat=lm.predict(X)
Yhat[0:5]

Out[95]: array([16236.50464347, 16236.50464347, 17058.23802179, 13771.3045085 ,
       20345.17153508])

In [96]: import matplotlib.pyplot as plt

# Plot the actual values


plt.scatter(X, Y, color='blue')

plt.xlabel("Highway MPG")
plt.ylabel("Price")
plt.title("Actual Prices vs. Highway MPG")

# Plot the predicted values


plt.plot(X, Yhat, color='red')
plt.show()

What is the value of the intercept?

In [97]: lm.intercept_

Out[97]: 38423.305858157386

What is the value of the slope?

In [98]: lm.coef_

Out[98]: array([-821.73337832])

Similarly, how could engine size help us predict price?

Set lm1, variables X1 = engine size and Y1 = price.


In [99]: lm1 = LinearRegression()

X1 = df[["engine-size"]]
Y1 = df["price"]

lm1.fit(X1,Y1)

Out[99]: LinearRegression()

In [100… Yhat1 = lm1.predict(X1)

Yhat1[0:5]

Out[100]: array([13728.4631336 , 13728.4631336 , 17399.38347881, 10224.40280408,
        14729.62322775])

In [101… # Plot the actual values


plt.scatter(X1, Y1, color='blue')
plt.xlabel("Engine Size")
plt.ylabel("Price")
plt.title("Actual Prices vs. Engine Size")

# Plot the predicted values


plt.plot(X1, Yhat1, color='red')
plt.show()

In [102… lm1.intercept_

Out[102]: -7963.338906281049


In [103… lm1.coef_

Out[103]: array([166.86001569])

Equation of predicted line


In [104… Price=-7963.34 + 166.86*df['engine-size']

In [105… Price

Out[105]:
0 13728.46
1 13728.46
2 17399.38
3 10224.40
4 14729.62
...
196 15563.92
197 15563.92
198 20903.44
199 16231.36
200 15563.92
Name: engine-size, Length: 201, dtype: float64

In [106… df["price"]

Out[106]:
0 13495.0
1 16500.0
2 16500.0
3 13950.0
4 17450.0
...
196 16845.0
197 19045.0
198 21485.0
199 22470.0
200 22625.0
Name: price, Length: 201, dtype: float64

What if we want to predict car price using more than one variable?
Multiple Linear Regression
From the previous section we know that other good predictors of price could be:

Horsepower

Curb-weight

Engine-size

Highway-mpg

Let's develop a model using these variables as the predictor variables.


In [169… Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]

Fit the linear model using the four above-mentioned variables.

In [170… lm.fit(Z, df["price"])

Out[170]: LinearRegression()

In [171… Yhat2 = lm.predict(Z)

# Create a scatter plot of the actual values and predicted values


plt.scatter(df["price"], Yhat2, color='blue')
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs. Predicted Prices")

# Add a diagonal line to represent perfect predictions


xmin = min(df["price"])
xmax = max(df["price"])
plt.plot([xmin, xmax], [xmin, xmax], color='red')

plt.show()

What is the value of the intercept and slope?

In [109… lm.intercept_

Out[109]: -15811.863767729243


In [110… lm.coef_

Out[110]: array([53.53022809,  4.70805253, 81.51280006, 36.1593925 ])

The Linear Function we get

Price = -15811.86 + 53.53 x horsepower + 4.71 x curb-weight +
81.51 x engine-size + 36.16 x highway-mpg
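This equation can also be assembled programmatically from lm.intercept_ and lm.coef_, which avoids copying the numbers by hand; a minimal sketch:

# build the fitted equation string directly from the model's parameters
terms = " + ".join("{:.2f} x {}".format(c, name) for c, name in zip(lm.coef_, Z.columns))
print("Price = {:.2f} + {}".format(lm.intercept_, terms))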

Now I will make a linear regression where the response
variable is price and the predictor variables are normalized-losses
and highway-mpg
In [111… lm2 = LinearRegression()

In [172… Z2 = df[["normalized-losses" , "highway-mpg"]]


lm2.fit(Z2, df["price"])
yhat3 = lm2.predict(Z2)

plt.scatter(df["price"], yhat3, color='blue')


plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs. Predicted Prices")

# Add a diagonal line to represent perfect predictions


xmin = min(df["price"])
xmax = max(df["price"])
plt.plot([xmin, xmax], [xmin, xmax], color='red')

plt.show()


Intercept and Slope

In [113… lm2.intercept_

Out[113]: 38201.31327245728

In [114… lm2.coef_

Out[114]: array([ 1.49789586, -820.45434016])

Model Evaluation Using Visualization


Now that we've developed some models, how do we evaluate our models and choose the best
one? One way to do this is by using a visualization.

Import the visualization package, seaborn:

Regression Plot
When it comes to simple linear regression, an excellent way to visualize the fit of our model is
by using regression plots.

In [115… width = 12
height = 10

plt.figure(figsize=(width, height))
sas.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)

Out[115]: (0.0, 48175.27099289158)

We can see from this plot that price is negatively correlated to highway-mpg since the
regression slope is negative.

One thing to keep in mind when looking at a regression plot is to pay attention to how
scattered the data points are around the regression line. This will give you a good indication of
the variance of the data and whether a linear model would be the best fit or not. If the data is
too far off from the line, this linear model might not be the best model for this data.

Let's compare this plot to the regression plot of "peak-rpm".

In [116… plt.figure(figsize=(width, height))


sas.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)

Out[116]: (0.0, 47414.1)


Comparing the regression plot of "peak-rpm" and "highway-mpg", we see that the points for
"highway-mpg" are much closer to the generated line and, on average, decrease. The points for
"peak-rpm" have more spread around the predicted line and it is much harder to determine if
the points are decreasing or increasing as the "peak-rpm" increases.

In [117… df[["peak-rpm","highway-mpg","price"]].corr()

Out[117]: peak-rpm highway-mpg price

peak-rpm 1.000000 -0.058598 -0.101616

highway-mpg -0.058598 1.000000 -0.704692

price -0.101616 -0.704692 1.000000

Residual Plot
A good way to visualize the variance of the data is to use a residual plot.

In [118… width = 12
height = 10
plt.figure(figsize=(width, height))
sas.residplot(x=df['highway-mpg'],y=df['price'])
plt.show()

What is this plot telling us?

We can see from this residual plot that the residuals are not randomly spread around the x-axis,
leading us to believe that maybe a non-linear model is more appropriate for this data.

How do we visualize a model for Multiple Linear Regression?

This gets a bit more complicated because you can't visualize it with a regression or residual plot.

One way to look at the fit of the model is by looking at the distribution plot. We can look at the
distribution of the fitted values that result from the model and compare it to the distribution of
the actual values.

First, let's make a prediction:

In [119… Y_hat = lm.predict(Z)

In [174… plt.figure(figsize=(width, height))

ax1 = sas.distplot(df['price'], hist=False, color="r", label="Actual Value")

sas.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax1)

plt.title('Actual vs Fitted Values for Price')


plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

D:\Python\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot`


is a deprecated function and will be removed in a future version. Please adapt your c
ode to use either `displot` (a figure-level function with similar flexibility) or `kd
eplot` (an axes-level function for kernel density plots).
warnings.warn(msg, FutureWarning)
D:\Python\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot`
is a deprecated function and will be removed in a future version. Please adapt your c
ode to use either `displot` (a figure-level function with similar flexibility) or `kd
eplot` (an axes-level function for kernel density plots).
warnings.warn(msg, FutureWarning)

We can see that the fitted values are reasonably close to the actual values since the two
distributions overlap a bit. However, there is definitely some room for improvement.
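As the warnings above note, distplot is deprecated; the same comparison can be drawn with the non-deprecated kdeplot. A sketch under that assumption:

# equivalent plot using kdeplot, the suggested replacement for distplot(hist=False)
plt.figure(figsize=(width, height))
ax1 = sas.kdeplot(df['price'], color="r", label="Actual Value")
sas.kdeplot(Y_hat, color="b", label="Fitted Values", ax=ax1)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.legend()
plt.show()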

Polynomial Regression and Pipelines



Polynomial regression is a particular case of the general linear regression model or multiple
linear regression models.

We get non-linear relationships by squaring or setting higher-order terms of the predictor
variables.

There are different orders of polynomial regression:

Quadratic - 2nd Order

Yhat = a + b1*X + b2*X^2

Cubic - 3rd Order

Yhat = a + b1*X + b2*X^2 + b3*X^3

Higher-Order:

Y = a + b1*X + b2*X^2 + b3*X^3 + ...

We saw earlier that a linear model did not provide the best fit while using "highway-mpg" as
the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.

We will use the following function to plot the data:

In [121… def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ Length')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

This code defines a function named "PlotPolly". The function takes four
parameters: "model", the polynomial regression model; "independent_variable", the
model's input data; "dependent_variable", the model's output data; and "Name",
the name of the independent variable.

The function then generates new data points with np.linspace, uses matplotlib to plot the
original data points along with the polynomial fit, sets the plot's background color, adds
labels to the x and y axes, and displays the plot.


Overall, this function can be used to visualize the results of a polynomial regression model
and gain insight into how well the model fits the data.

Let's get the variables:

In [122… x = df['highway-mpg']
y = df['price']

Let's fit the polynomial using the function polyfit, then use the function poly1d to display the
polynomial function.

In [123… # Here we use a polynomial of the 3rd order (cubic)


f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

-1.557 x^3 + 204.8 x^2 - 8965 x + 1.379e+05

Let's plot the function:

In [124… PlotPolly(p, x, y, 'highway-mpg')

In [125… np.polyfit(x, y, 3)

Out[125]: array([-1.55663829e+00, 2.04754306e+02, -8.96543312e+03, 1.37923594e+05])

We can already see from plotting that this polynomial model performs better
than the linear model. This is because the generated polynomial function
"hits" more of the data points.

11th-Order Polynomial
In [126… f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1,x,y, 'Highway MPG')

-1.243e-08 x^11 + 4.722e-06 x^10 - 0.0008028 x^9 + 0.08056 x^8 - 5.297 x^7 + 239.5 x^6 - 7588 x^5 + 1.684e+05 x^4 - 2.565e+06 x^3 + 2.551e+07 x^2 - 1.491e+08 x + 3.879e+08

We can perform a polynomial transform on multiple features. First, we import the module:

In [127… from sklearn.preprocessing import PolynomialFeatures

We create a PolynomialFeatures object of degree 2:

In [128… pr=PolynomialFeatures(degree=2)
pr

Out[128]: PolynomialFeatures()

In [129… Z_pr=pr.fit_transform(Z)

In [130… Z.shape

Out[130]: (201, 4)

In [131… Z_pr.shape

Out[131]: (201, 15)
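
The jump from 4 columns to 15 is the degree-2 expansion: a bias column, the 4 original features, and all squares and pairwise products. As a quick sanity check, the number of monomials of total degree at most d in n variables is C(n + d, d) — a small sketch (the variable names here are just for illustration):

from math import comb

n_features, degree = 4, 2
# C(4 + 2, 2) = C(6, 2) = 15 output columns, including the bias term
print(comb(n_features + degree, degree))  # 15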

Pipeline
Data Pipelines simplify the steps of processing the data. We use the module Pipeline to create a
pipeline. We also use StandardScaler as a step in our pipeline.

In [132… from sklearn.pipeline import Pipeline


from sklearn.preprocessing import StandardScaler

We create the pipeline by creating a list of tuples including the name of the model or estimator
and its corresponding constructor.

In [133… Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]

We input the list as an argument to the pipeline constructor:

In [134… pipe=Pipeline(Input)
pipe

Out[134]: Pipeline(steps=[('scale', StandardScaler()),
                          ('polynomial', PolynomialFeatures(include_bias=False)),
                          ('model', LinearRegression())])

First, we convert the data type Z to type float to avoid conversion warnings that may appear as
a result of StandardScaler taking float inputs.

Then, we can normalize the data, perform a transform and fit the model simultaneously.

In [135… Z = Z.astype(float)
pipe.fit(Z,y)

Out[135]: Pipeline(steps=[('scale', StandardScaler()),
                          ('polynomial', PolynomialFeatures(include_bias=False)),
                          ('model', LinearRegression())])

Similarly, we can normalize the data, perform a transform and produce a prediction
simultaneously.

In [136… ypipe=pipe.predict(Z)
ypipe[0:4]

Out[136]: array([13102.93329646, 13102.93329646, 18226.43450275, 10391.09183955])

Create a pipeline that standardizes the data, then produce a prediction using a linear
regression model with the features Z and target y.
In [137… Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(Z,y)

ypipe=pipe.predict(Z)
ypipe[0:10]

Out[137]: array([13699.07700462, 13699.07700462, 19052.71346719, 10620.61524404,
                 15520.90025344, 13869.27463809, 15455.88834114, 15973.77411958,
                 17612.7829335 , 10722.47987021])

Measures for In-Sample Evaluation

Two measures commonly used to evaluate how well a model fits the data in-sample are:

R^2 / R-squared

Mean Squared Error (MSE)
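
Both measures can be computed directly from the residuals; a minimal sketch of the definitions (assuming NumPy arrays y_true and y_pred of equal length):

def mse(y_true, y_pred):
    # Mean of the squared residuals
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    # 1 minus (residual sum of squares / total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

In practice we use sklearn's built-in score method, mean_squared_error, and r2_score, as below.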

Model 1: Simple Linear Regression


Let's calculate the R^2:

In [138… #highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))

The R-square is: 0.4965911884339175

Let's calculate the MSE:

We can predict the output i.e., "yhat" using the predict method, where X is the input variable:

In [139… Yhat=lm.predict(X)
print('The output of the first four predicted values is: ', Yhat[0:4])

The output of the first four predicted values is: [16236.50464347 16236.50464347 17058.23802179 13771.3045085 ]

Let's import the function mean_squared_error from the module metrics:

In [140… from sklearn.metrics import mean_squared_error

We can compare the predicted results with the actual results:

In [141… mse = mean_squared_error(df['price'], Yhat)


print('The mean square error of price and predicted value is: ', mse)

The mean square error of price and predicted value is: 31635042.944639895

Model 2: Multiple Linear Regression


In [142… # fit the model
lm.fit(Z, df['price'])
# Find the R^2
print('The R-square is: ', lm.score(Z, df['price']))

The R-square is: 0.8093732522175299

In [143… Y_predict_multifit = lm.predict(Z)

We compare the predicted results with the actual results:

In [144… print('The mean square error of price and predicted value using multifit is: ', \
    mean_squared_error(df['price'], Y_predict_multifit))

The mean square error of price and predicted value using multifit is: 11979300.349818885

Model 3: Polynomial Fit


Let’s import the function r2_score from the module metrics as we are using a different function.

In [145… from sklearn.metrics import r2_score

We apply the function to get the value of R^2:

In [146… r_squared = r2_score(y, p(x))


print('The R-square value is: ', r_squared)

The R-square value is: 0.6741946663906513

In [147… mean_squared_error(df['price'], p(x))

Out[147]: 20474146.42636125

Prediction and Decision Making


In [148… import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Create a new input:

In [149… new_input=np.arange(1, 100, 1).reshape(-1, 1)

Fit the model

In [150… lm.fit(X, Y)
lm

Out[150]: LinearRegression()

Produce a prediction:

In [151… yhat=lm.predict(new_input)
yhat[0:5]

D:\Python\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(

Out[151]: array([37601.57247984, 36779.83910151, 35958.10572319, 35136.37234487,
                 34314.63896655])
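
The UserWarning appears because lm was fitted on a DataFrame (so it remembers the feature name) while new_input is a plain NumPy array. One way to avoid it — a sketch, assuming the model was fitted on the 'highway-mpg' column — is to wrap the new input in a DataFrame with the same column name:

# Hypothetical fix: give the new input the same column name the model was fitted with
new_input_df = pd.DataFrame(new_input, columns=['highway-mpg'])
yhat = lm.predict(new_input_df)  # identical predictions, no feature-name warning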

We can plot the data:

In [152… plt.plot(new_input, yhat)


plt.show()

Decision Making: Determining a Good Model Fit


Now that we have visualized the different models, and generated the R-squared and MSE
values for the fits, how do we determine a good model fit?

What is a good R-squared value?

When comparing models, the model with the higher R-squared value is a better fit for the data.

What is a good MSE?

When comparing models, the model with the smallest MSE value is a better fit for the data.

Let's take a look at the values for the different models.


Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.

R-squared: 0.4965911884339175

MSE: 3.16 x10^7

Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.

R-squared: 0.8093732522175299

MSE: 1.2 x10^7

Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.

R-squared: 0.6741946663906513

MSE: 2.05 x 10^7

Conclusion:
Comparing these three models, we conclude that the MLR model is the best model for
predicting price from our dataset. This result makes sense, since we have 27 variables in total
and we know that more than one of them is a potential predictor of the final car price.

Part 1: Training and Testing


An important step in testing your model is to split your data into training and testing data. We
will place the target data price in a separate dataframe y_data:

In [153… y_data = df['price']

Drop price data in dataframe x_data:

In [154… x_data=df.drop('price',axis=1)

Now, we randomly split our data into training and testing data using the function
train_test_split.

In [155… from sklearn.model_selection import train_test_split

In [156… x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.10, random_state=1)  # random_state value assumed; it was truncated in the export

In [157… print("Number of test sample :", x_test.shape[0])


print("Number of training samples :", x_train.shape[0])

Number of test sample : 21


Number of training samples : 180

The test_size parameter sets the proportion of data that is split into the testing set. In the above,
the testing set is 10% of the total dataset.
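
For example, holding out a larger share of the data is just a change to that parameter — a sketch, assuming the same x_data and y_data (the names with the suffix 1 are only for illustration):

x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.40, random_state=0)
print("Number of test samples :", x_test1.shape[0])       # 81 of the 201 rows (40%, rounded up)
print("Number of training samples :", x_train1.shape[0])  # the remaining 120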

Let's import LinearRegression from the module linear_model.


In [158… from sklearn.linear_model import LinearRegression

We create a Linear Regression object:

In [159… lre = LinearRegression()

We fit the model using the feature "horsepower":

In [160… lre.fit(x_train[['horsepower']], y_train)

Out[160]: LinearRegression()

Let's find the R-square:

In [161… print("R-Square for test :" ,lre.score(x_test[['horsepower']], y_test))


print("R-Square for training :" ,lre.score(x_train[['horsepower']], y_train))

R-Square for test : 0.3635480624962413


R-Square for training : 0.662028747521533

We can see the R^2 is much smaller using the test data compared to the training data.

Cross-Validation Score
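
The notebook ends here, but the heading points at the natural next step: scoring the model with k-fold cross-validation. A minimal sketch of what that cell would typically contain — assuming the same lre, x_data, and y_data, with 4 folds chosen purely for illustration:

from sklearn.model_selection import cross_val_score

# R^2 score for each of the 4 folds, using horsepower as the single predictor
Rcross = cross_val_score(lre, x_data[['horsepower']], y_data, cv=4)
print("Fold scores:", Rcross)
print("Mean R^2:", Rcross.mean(), "  Std:", Rcross.std())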