Data Analysis
DATA ACQUISITION
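The cells that load the raw data are not reproduced in this export; a minimal sketch, assuming a local copy of the automobile data file (the path is a placeholder) with no header row:

import pandas as pd
import numpy as np

path = "imports-85.data"            # placeholder path to the raw automobile dataset
df = pd.read_csv(path, header=None)  # the file has no header row, so columns are labelled 0-25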
In [3]: df.head(5)
Out[3]:
   0    1            2    3    4     5            6    7      8     9   ...   16    17    18    19    20   21
0  3    ?  alfa-romero  gas  std   two  convertible  rwd  front  88.6  ...  130  mpfi  3.47  2.68   9.0  111
1  3    ?  alfa-romero  gas  std   two  convertible  rwd  front  88.6  ...  130  mpfi  3.47  2.68   9.0  111
2  1    ?  alfa-romero  gas  std   two    hatchback  rwd  front  94.5  ...  152  mpfi  2.68  3.47   9.0  154
3  2  164         audi  gas  std  four        sedan  fwd  front  99.8  ...  109  mpfi  3.19  3.40  10.0  102
4  2  164         audi  gas  std  four        sedan  4wd  front  99.4  ...  136  mpfi  3.19  3.40   8.0  115

5 rows × 26 columns
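The headers list assigned in the next cell is not shown in this export; it can be reconstructed from the dtype listing further below:

headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors",
           "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width",
           "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size",
           "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]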
df.columns = headers
df.head(5)
Out[4]:
   symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
0          3                 ?  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
1          3                 ?  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
2          1                 ?  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
3          2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
4          2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 26 columns
In [5]: df.dtypes
Out[5]:
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object
In [6]: df.describe()
What if we would also like to check all the columns, including those that are of type object?
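A minimal sketch of the call that summarises every column, including the object ones:

df.describe(include="all")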
Out[7]:
        symboling normalized-losses    make fuel-type aspiration num-of-doors body-style drive-wheels engine-location  wheel-base  ...
count  205.000000               205     205       205        205          205        205          205             205  205.000000  ...
top           NaN                 ?  toyota       gas        std         four      sedan          fwd           front         NaN  ...
mean     0.834146               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN   98.756585  ...
std      1.245307               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN    6.021776  ...
min     -2.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN   86.600000  ...
25%      0.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN   94.500000  ...
50%      1.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN   97.000000  ...
75%      2.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN  102.400000  ...
max      3.000000               NaN     NaN       NaN        NaN          NaN        NaN          NaN             NaN  120.900000  ...

11 rows × 26 columns
In [8]: df.info()
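The info() output is not reproduced here, and the Out[9] table below already shows NaN where the "?" placeholders used to be, so the intervening (unshown) cell presumably converted them; a minimal sketch of that step:

df.replace("?", np.nan, inplace=True)  # turn the "?" placeholders into proper NaN values
df.head(5)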
Out[9]:
   symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
0          3               NaN  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
1          3               NaN  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
2          1               NaN  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
3          2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
4          2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 26 columns
There are two methods to detect missing data: .isnull() and .notnull().
The output is a boolean value indicating whether the value that is passed in is missing data.
For .isnull(), "True" means the value is a missing value, while "False" means the value is not a missing value.
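A minimal sketch of how missing_data, used in the next cell, can be built:

missing_data = df.isnull()  # True where a value is missing, False otherwise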
missing_data.head(5)
Out[10]:
   symboling normalized-losses   make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base  ...
0      False              True  False     False      False        False      False        False           False      False  ...
1      False              True  False     False      False        False      False        False           False      False  ...
2      False              True  False     False      False        False      False        False           False      False  ...
3      False             False  False     False      False        False      False        False           False      False  ...
4      False             False  False     False      False        False      False        False           False      False  ...

5 rows × 26 columns
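The per-column counts below can be produced by looping over the boolean dataframe; a minimal sketch:

for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")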
normalized-losses
False 164
True 41
Name: normalized-losses, dtype: int64
make
False 205
Name: make, dtype: int64
fuel-type
False 205
Name: fuel-type, dtype: int64
aspiration
False 205
Name: aspiration, dtype: int64
num-of-doors
False 203
True 2
Name: num-of-doors, dtype: int64
body-style
False 205
Name: body-style, dtype: int64
drive-wheels
False 205
Name: drive-wheels, dtype: int64
engine-location
False 205
Name: engine-location, dtype: int64
wheel-base
False 205
Name: wheel-base, dtype: int64
length
False 205
Name: length, dtype: int64
width
False 205
Name: width, dtype: int64
height
False 205
Name: height, dtype: int64
curb-weight
False 205
Name: curb-weight, dtype: int64
engine-type
False 205
num-of-cylinders
False 205
Name: num-of-cylinders, dtype: int64
engine-size
False 205
Name: engine-size, dtype: int64
fuel-system
False 205
Name: fuel-system, dtype: int64
bore
False 201
True 4
Name: bore, dtype: int64
stroke
False 201
True 4
Name: stroke, dtype: int64
compression-ratio
False 205
Name: compression-ratio, dtype: int64
horsepower
False 203
True 2
Name: horsepower, dtype: int64
peak-rpm
False 203
True 2
Name: peak-rpm, dtype: int64
city-mpg
False 205
Name: city-mpg, dtype: int64
highway-mpg
False 205
Name: highway-mpg, dtype: int64
price
False 201
True 4
Name: price, dtype: int64
Replace data
a. Replace it with the mean
b. Replace it with the most frequent value
Replace by mean:
"normalized-losses": 41 missing values, replace them with the mean
"stroke": 4 missing values, replace them with the mean
"bore": 4 missing values, replace them with the mean
"horsepower": 2 missing values, replace them with the mean
"peak-rpm": 2 missing values, replace them with the mean

Replace by frequency:
"num-of-doors": 2 missing values, replace them with "four". Reason: 84% of sedans have four doors; since "four" is the most frequent value, it is the most likely to occur.
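A minimal sketch of the replace-by-mean step for two of the columns (the same pattern applies to "stroke", "horsepower" and "peak-rpm"); the variable names are illustrative:

avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)

avg_bore = df["bore"].astype("float").mean(axis=0)
df["bore"].replace(np.nan, avg_bore, inplace=True)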
Average of bore: 3.3297512437810957
Average of stroke: 3.2554228855721337
Average of horsepower: 104.25615763546799
Average of peak-rpm: 5125.369458128079
Out[22]:
four    114
two      89
Name: num-of-doors, dtype: int64
We can see that four doors are the most common type. We can also use the
".idxmax()" method to calculate the most common type automatically:
In [23]: df['num-of-doors'].value_counts().idxmax()
Out[23]: 'four'
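A minimal sketch of the replace-by-frequency step, using the most common value found above:

df["num-of-doors"].replace(np.nan, "four", inplace=True)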
Finally, let's drop all rows that do not have price data:
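A minimal sketch of the drop, assuming the index is reset afterwards so that it stays contiguous:

df.dropna(subset=["price"], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)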
In [27]: df.head()
Out[27]:
   symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
0          3             122.0  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
1          3             122.0  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
2          1             122.0  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
3          2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
4          2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 26 columns
In [28]: df.dtypes
Out[28]:
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object
As we can see, some columns are still stored as object even though they should be float or int.
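A minimal sketch of the conversions that produce the dtypes shown in the next cell:

df[["bore", "stroke", "peak-rpm", "price"]] = df[["bore", "stroke", "peak-rpm", "price"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")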
In [30]: df.dtypes
Out[30]:
symboling int64
normalized-losses int32
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower object
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
dtype: object
Data Standardization
Data is usually collected from different agencies in different formats. (Data standardization is
also a term for a particular type of data normalization where we subtract the mean and divide
by the standard deviation.)
What is standardization?
Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons.
Example
Transform mpg to L/100km:
In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are represented
by mpg (miles per gallon) unit. Assume we are developing an application in a country that
accepts the fuel consumption with L/100km standard.
In [31]: df.head()
Out[31]:
   symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
0          3               122  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
1          3               122  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
2          1               122  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
3          2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
4          2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 26 columns
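The conversion cell itself is not shown; a minimal sketch, assuming the extra 27th column in the Out[32] table below is the converted city consumption (235 divided by the mpg value gives litres per 100 km):

df["city-L/100km"] = 235 / df["city-mpg"]  # convert city-mpg to L/100km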
Out[32]:
   symboling normalized-losses         make fuel-type aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  ...
0          3               122  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
1          3               122  alfa-romero       gas        std          two  convertible          rwd           front        88.6  ...
2          1               122  alfa-romero       gas        std          two    hatchback          rwd           front        94.5  ...
3          2               164         audi       gas        std         four        sedan          fwd           front        99.8  ...
4          2               164         audi       gas        std         four        sedan          4wd           front        99.4  ...

5 rows × 27 columns
Data Normalization
Why normalization?
Normalization is the process of transforming values of several variables into a similar range.
Typical normalizations include scaling the variable so the variable average is 0, scaling the
variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
Sample Code
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
Binning
Why binning?
Binning is a process of transforming continuous numerical variables into discrete categorical
'bins' for grouped analysis.
Since we want to include the minimum value of horsepower, we want to set start_value =
min(df["horsepower"]).
Since we want to include the maximum value of horsepower, we want to set end_value =
max(df["horsepower"]).
Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated
= 4.
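A minimal sketch of the binning step, assuming "horsepower" has already been made numeric:

df["horsepower"] = df["horsepower"].astype(int)
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)  # 4 dividers -> 3 equal-width bins
group_names = ["Low", "Medium", "High"]
df["horsepower-binned"] = pd.cut(df["horsepower"], bins, labels=group_names, include_lowest=True)
df[["horsepower", "horsepower-binned"]].head(20)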
0 111 Low
1 111 Low
2 154 Medium
3 102 Low
4 115 Low
5 110 Low
6 110 Low
7 110 Low
8 140 Medium
9 101 Low
10 101 Low
11 121 Medium
12 121 Medium
13 121 Medium
14 182 Medium
15 182 Medium
16 182 Medium
17 48 Low
18 70 Low
19 70 Low
In [38]: df["horsepower-binned"].value_counts()
Out[38]:
Low 153
Medium 43
High 5
Name: horsepower-binned, dtype: int64
Bins Visualization
Normally, a histogram is used to visualize the distribution of bins we created above.
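A minimal sketch of such a histogram:

import matplotlib.pyplot as plt

plt.hist(df["horsepower"], bins=3)   # three bins, matching the binning above
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
plt.show()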
The plot above shows the binning result for the attribute "horsepower".
Example
We see the column "fuel-type" has two unique values: "gas" or "diesel". Regression doesn't
understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-
type" to indicator variables.
We will use pandas' method 'get_dummies' to assign numerical values to different categories of
fuel type.
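A minimal sketch of creating the indicator variables; the two tables below show the resulting frame before and after its columns are renamed to "fuel-type-diesel" and "fuel-type-gas":

dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.head()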
In [41]: df.columns
dummy_variable_1.head()

   diesel  gas
0       0    1
1       0    1
2       0    1
3       0    1
4       0    1

After renaming the columns:

   fuel-type-diesel  fuel-type-gas
0                 0              1
1                 0              1
2                 0              1
3                 0              1
4                 0              1
In the dataframe, column 'fuel-type' has values for 'gas' and 'diesel' as 0s and 1s now.
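A minimal sketch of the rename, merge and drop steps that produce the 29-column frame shown below:

dummy_variable_1.rename(columns={"gas": "fuel-type-gas", "diesel": "fuel-type-diesel"}, inplace=True)
df = pd.concat([df, dummy_variable_1], axis=1)  # append the indicator columns
df.drop("fuel-type", axis=1, inplace=True)      # the original categorical column is no longer needed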
In [46]: df.head()
Out[46]:
   symboling normalized-losses         make aspiration num-of-doors   body-style drive-wheels engine-location  wheel-base  length  ...
0          3               122  alfa-romero        std          two  convertible          rwd           front        88.6   168.8  ...
1          3               122  alfa-romero        std          two  convertible          rwd           front        88.6   168.8  ...
2          1               122  alfa-romero        std          two    hatchback          rwd           front        94.5   171.2  ...

5 rows × 29 columns
Out[48]:
                   symboling  normalized-losses  wheel-base    length     width    height  curb-weight  engine-size  ...
normalized-losses   0.466264           1.000000   -0.056661  0.019424  0.086802 -0.373737     0.099404     0.112360  ...
compression-ratio  -0.182196          -0.114713    0.250313  0.159733  0.189867  0.259737     0.156433     0.028889  ...
highway-mpg         0.036233          -0.181877   -0.543304 -0.698142 -0.680635 -0.104812    -0.794889    -0.679571  ...
fuel-type-diesel   -0.196735          -0.101546    0.307237  0.211187  0.244356  0.281578     0.221046     0.070779  ...
In order to start understanding the (linear) relationship between an individual variable and the
price, we can use "regplot" which plots the scatterplot plus the fitted regression line for the
data.
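A minimal sketch of such a plot, assuming seaborn is imported as sns:

import seaborn as sns

sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)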
Out[49]: <AxesSubplot:xlabel='engine-size', ylabel='price'>
As the engine-size goes up, the price goes up: this indicates a positive direct correlation
between these two variables. Engine size seems like a pretty good predictor of price since the
regression line is almost a perfect diagonal line.
Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-
mpg" and "price".
Out[51]: <AxesSubplot:xlabel='highway-mpg', ylabel='price'>
As highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship
between these two variables. Highway mpg could potentially be a predictor of price.
Out[53]: <AxesSubplot:xlabel='peak-rpm', ylabel='price'>
Peak rpm does not seem like a good predictor of the price at all since the regression line is
close to horizontal. Also, the data points are very scattered and far from the fitted line, showing
lots of variability. Therefore, it's not a reliable variable.
In [54]: df[['peak-rpm','price']].corr()
Categorical Variables
These are variables that describe a 'characteristic' of a data unit, and are selected from a small
group of categories. The categorical variables can have the type "object" or "int64". A good way
to visualize categorical variables is by using boxplots.
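A minimal sketch of a boxplot of price by body-style:

sns.boxplot(x="body-style", y="price", data=df)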
Out[55]: <AxesSubplot:xlabel='body-style', ylabel='price'>
We see that the distributions of price between the different body-style categories have a significant overlap, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":
Out[56]: <AxesSubplot:xlabel='engine-location', ylabel='price'>
Here we see that the distributions of price between the two engine-location categories, front and rear, are distinct enough to take engine-location as a potentially good predictor of price.
Out[57]: <AxesSubplot:xlabel='drive-wheels', ylabel='price'>
Here we see that the distribution of price between the different drive-wheels categories differs.
As such, drive-wheels could potentially be a predictor of price.
In [58]: df.describe()
In [59]: df.describe(include=['object'])
Out[59]:
          make aspiration num-of-doors body-style drive-wheels engine-location engine-type num-of-cylinders fuel-system
count      201        201          201        201          201             201         201              201         201
unique      22          2            2          5            3               2           6                7           8
top     toyota        std         four      sedan          fwd           front         ohc             four        mpfi
Value Counts
Value counts is a good way of understanding how many units of each characteristic/variable we
have. We can apply the "value_counts" method on the column "drive-wheels". Don’t forget the
method "value_counts" only works on pandas series, not pandas dataframes. As a result, we
only include one bracket df ['drive-wheels'], not two brackets df [['drive-wheels']].
In [60]: df['drive-wheels'].value_counts()
Out[60]:
fwd 118
rwd 75
4wd 8
Name: drive-wheels, dtype: int64
In [61]: df['drive-wheels'].value_counts().to_frame()
Out[61]: drive-wheels
fwd 118
rwd 75
4wd 8
Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and
rename the column 'drive-wheels' to 'value_counts'.
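A minimal sketch of those steps (the engine-location counts further below follow the same pattern):

drive_wheels_counts = df["drive-wheels"].value_counts().to_frame()
drive_wheels_counts.rename(columns={"drive-wheels": "value_counts"}, inplace=True)
drive_wheels_counts.index.name = "drive-wheels"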
Out[62]: value_counts
fwd 118
rwd 75
4wd 8
Out[63]: value_counts
drive-wheels
fwd 118
rwd 75
4wd 8
Out[64]: value_counts
engine-location
front 198
rear 3
Basics of Grouping
The "groupby" method groups data by different categories. The data is grouped based on one
or several variables, and analysis is performed on the individual groups.
For example, let's group by the variable "drive-wheels". We see that there are 3 different
categories of drive wheels.
In [65]: df['drive-wheels'].unique()
We can then calculate the average price for each of the different categories of
data.
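A minimal sketch of the grouped average:

df_group_one = df[["drive-wheels", "price"]]
df_group_one = df_group_one.groupby(["drive-wheels"], as_index=False).mean()
df_group_one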
  drive-wheels         price
0          4wd  10241.000000
1          fwd   9244.779661
2          rwd  19757.613333
From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-
wheel and front-wheel are approximately the same in price.
You can also group by multiple variables. For example, let's group by both 'drive-wheels' and
'body-style'. This groups the dataframe by the unique combination of 'drive-wheels' and 'body-
style'. We can store the results in the variable 'grouped_test1'.
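A minimal sketch of the two-variable grouping (df_gptest is reused later in the ANOVA section):

df_gptest = df[["drive-wheels", "body-style", "price"]]
grouped_test1 = df_gptest.groupby(["drive-wheels", "body-style"], as_index=False).mean()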
This grouped data is much easier to visualize when it is made into a pivot table
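A minimal sketch of the pivot:

grouped_pivot = grouped_test1.pivot(index="drive-wheels", columns="body-style")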
Out[69]: (pivot table of average price, with drive-wheels as the rows and body-style as the columns)
Often, we won't have data for some of the pivot cells. We can fill these missing cells with the
value 0, but any other value could potentially be used as well. It should be mentioned that
missing data is quite a complex subject and is an entire course on its own.
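A minimal sketch of filling the empty cells:

grouped_pivot = grouped_pivot.fillna(0)  # fill missing cells with 0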
Out[70]: (the same pivot table, with the missing cells filled with 0)
The heatmap plots the target variable (price) proportional to colour with respect to the variables
'drive-wheel' and 'body-style' on the vertical and horizontal axis, respectively. This allows us to
visualize how the price is related to 'drive-wheel' and 'body-style'.
The default labels convey no useful information to us. Let's change that:
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap="RdBu")

# label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

# put the ticks at the centre of each cell, then insert the labels
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

fig.colorbar(im)
plt.show()
Correlation: a measure of the extent of interdependence between variables.
Causation: the relationship of cause and effect between two variables.
It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.
Pearson Correlation
The Pearson Correlation measures the linear dependence between two variables X and Y.
1: Perfect positive linear correlation. 0: No linear correlation, the two variables most likely do not
affect each other. -1: Perfect negative linear correlation.
Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.
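A minimal sketch using scipy.stats:

from scipy import stats

pearson_coef, p_value = stats.pearsonr(df["wheel-base"], df["price"])
print("Pearson coefficient:", pearson_coef, " P-value:", p_value)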
Since the p-value is < 0.001, the correlation between wheel-base and price is statistically
significant, although the linear relationship isn't extremely strong (~0.585).
Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).
Since the p-value is < 0.001, the correlation between length and price is statistically
significant, and the linear relationship is moderately strong (~0.691).
Width vs. Price
Curb-Weight vs. Price
Engine-Size vs. Price
Bore vs. Price
City-MPG vs. Price
Highway-MPG vs. Price
ANOVA
The Analysis of Variance (ANOVA) is a statistical method used to test whether there are
significant differences between the means of two or more groups. ANOVA returns two
parameters:
F-test score: ANOVA assumes the means of all groups are the same, calculates how much the
actual means deviate from the assumption, and reports it as the F-test score. A larger score
means there is a larger difference between the means.
P-value: P-value tells how statistically significant our calculated score value is.
If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA
to return a sizeable F-test score and a small p-value.
Drive Wheels
Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.
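A minimal sketch of the grouping used below:

grouped_test2 = df_gptest[["drive-wheels", "price"]].groupby(["drive-wheels"])
grouped_test2.head(2)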
  drive-wheels    price
0          rwd  13495.0
1          rwd  16500.0
3          fwd  13950.0
4          4wd  17450.0
5          fwd  15250.0
In [85]: df_gptest
We can obtain the values of a single group by using the "get_group" method.
In [86]: grouped_test2.get_group('4wd')['price']
Out[86]:
4      17450.0
136 7603.0
140 9233.0
141 11259.0
144 8013.0
145 11694.0
150 7898.0
151 8778.0
Name: price, dtype: float64
In [87]: # ANOVA
         f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'],
                                       grouped_test2.get_group('rwd')['price'],
                                       grouped_test2.get_group('4wd')['price'])
This is a great result with a large F-test score showing a strong correlation and a P-value of
almost 0 implying almost certain statistical significance. But does this mean all three tested
groups are all this highly correlated?
We now have a better idea of what our data looks like and which variables are important to take
into account when predicting the car price. We have narrowed it down to the following
variables:
Width
Curb-weight
Engine-size
Horsepower
City-mpg
Highway-mpg
Wheel-base
Bore
Categorical variables:
Drive-wheels
Model Development
Linear Regression and Multiple Linear Regression
Let's load the modules for linear regression:
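The import cell itself is not shown in this export; it is the standard scikit-learn one:

from sklearn.linear_model import LinearRegression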
In [92]: lm = LinearRegression()
lm
Out[92]: LinearRegression()

In [93]: X = df[['highway-mpg']]
         Y = df['price']

In [94]: lm.fit(X, Y)
Out[94]: LinearRegression()

In [95]: Yhat=lm.predict(X)
         Yhat[0:5]

In [97]: lm.intercept_
Out[97]: 38423.305858157386

In [98]: lm.coef_
Out[98]: array([-821.73337832])
lm1 = LinearRegression()
X1 = df[["engine-size"]]
Y1 = df["price"]
lm1.fit(X1, Y1)

Out[99]: LinearRegression()

Yhat1 = lm1.predict(X1)
Yhat1[0:5]
In [102]: lm1.intercept_
Out[102]: -7963.338906281049

In [103]: lm1.coef_
Out[103]: array([166.86001569])

In [105]: Price
Out[105]:
0      13728.46
1      13728.46
2      17399.38
3      10224.40
4      14729.62
         ...
196    15563.92
197    15563.92
198    20903.44
199    16231.36
200    15563.92
Name: engine-size, Length: 201, dtype: float64

In [106]: df["price"]
Out[106]:
1 16500.0
2 16500.0
3 13950.0
4 17450.0
...
196 16845.0
197 19045.0
198 21485.0
199 22470.0
200 22625.0
Name: price, Length: 201, dtype: float64
Horsepower
Curb-weight
Engine-size
Highway-mpg
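A minimal sketch of the multiple-regression fit with these four predictors (Z is the name used for this predictor frame later in the notebook):

Z = df[["horsepower", "curb-weight", "engine-size", "highway-mpg"]]
lm.fit(Z, df["price"])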
Out[170]: LinearRegression()

In [109]: lm.intercept_
Out[109]: -15811.863767729243

In [110]: lm.coef_

In [113]: lm2.intercept_
Out[113]: 38201.31327245728

In [114]: lm2.coef_
Regression Plot
When it comes to simple linear regression, an excellent way to visualize the fit of our model is
by using regression plots.
In [115]: width = 12
          height = 10
          plt.figure(figsize=(width, height))
          sns.regplot(x="highway-mpg", y="price", data=df)
          plt.ylim(0,)

Out[115]: (0.0, 48175.27099289158)
We can see from this plot that price is negatively correlated to highway-mpg since the
regression slope is negative.
One thing to keep in mind when looking at a regression plot is to pay attention to how
scattered the data points are around the regression line. This will give you a good indication of
the variance of the data and whether a linear model would be the best fit or not. If the data is
too far off from the line, this linear model might not be the best model for this data.
Out[116]: (0.0, 47414.1)
Comparing the regression plot of "peak-rpm" and "highway-mpg", we see that the points for
"highway-mpg" are much closer to the generated line and, on average, decrease. The points for
"peak-rpm" have more spread around the predicted line and it is much harder to determine if
the points are decreasing or increasing as the "peak-rpm" increases.
In [117]: df[["peak-rpm","highway-mpg","price"]].corr()
Residual Plot
A good way to visualize the variance of the data is to use a residual plot.
In [118]: width = 12
          height = 10
          plt.figure(figsize=(width, height))
          sns.residplot(x=df['highway-mpg'], y=df['price'])
          plt.show()
We can see from this residual plot that the residuals are not randomly spread around the x-axis,
leading us to believe that maybe a non-linear model is more appropriate for this data.
One way to look at the fit of the model is by looking at the distribution plot. We can look at the
distribution of the fitted values that result from the model and compare it to the distribution of
the actual values.
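A minimal sketch of such a distribution plot, comparing the actual prices with the fitted values Yhat (kdeplot is used here in place of the older distplot); the plt.show()/plt.close() lines below close out the cell:

ax1 = sns.kdeplot(df["price"], color="r", label="Actual Value")
sns.kdeplot(Yhat, color="b", label="Fitted Values", ax=ax1)
plt.title("Actual vs Fitted Values for Price")
plt.xlabel("Price (in dollars)")
plt.ylabel("Proportion of Cars")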
plt.show()
plt.close()
We can see that the fitted values are reasonably close to the actual values since the two
distributions overlap a bit. However, there is definitely some room for improvement.
Polynomial regression is a particular case of the general linear regression model or multiple
linear regression models.
Yhat = a + b1*X + b2*X^2

Yhat = a + b1*X + b2*X^2 + b3*X^3

Higher-Order:
Y = a + b1*X + b2*X^2 + b3*X^3 + ...
We saw earlier that a linear model did not provide the best fit while using "highway-mpg" as
the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.
plt.show()
plt.close()
This code defines a function named "PlotPolly". The function takes four parameters: "model", the polynomial regression model; "independent_variable", the model's input data; "dependent_variable", the model's output data; and "Name", the name of the independent variable.
The function then generates new data points with np.linspace, uses matplotlib to plot the original data points together with the polynomial fit, sets the plot's background colour, adds labels to the x and y axes, and displays the plot.
Overall, this function can be used to visualize the results of a polynomial regression model and gain insight into how well the model fits the data.
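A minimal sketch of PlotPolly matching that description (the linspace endpoints are illustrative values spanning the highway-mpg range):

def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)    # new x values for a smooth curve
    y_new = model(x_new)                # evaluate the polynomial model

    plt.plot(independent_variable, dependent_variable, ".", x_new, y_new, "-")
    plt.title("Polynomial Fit for Price ~ " + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))  # light grey background
    plt.xlabel(Name)
    plt.ylabel("Price of Cars")
    plt.show()
    plt.close()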
In [122]: x = df['highway-mpg']
y = df['price']
Let's fit the polynomial using the function polyfit, then use the function poly1d to display the
polynomial function.
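A minimal sketch of the fit whose printed form appears below:

f = np.polyfit(x, y, 3)   # 3rd-order polynomial coefficients
p = np.poly1d(f)          # wrap them in a callable polynomial
print(p)
PlotPolly(p, x, y, "highway-mpg")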
-1.557 x^3 + 204.8 x^2 - 8965 x + 1.379e+05

In [125]: np.polyfit(x, y, 3)
We can already see from plotting that this polynomial model performs better
than the linear model. This is because the generated polynomial function
"hits" more of the data points.
11th-Order Polynomial

In [126]: f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1,x,y, 'Highway MPG')
-1.243e-08 x^11 + 4.722e-06 x^10 - 0.0008028 x^9 + 0.08056 x^8 - 5.297 x^7 + 239.5 x^6
- 7588 x^5 + 1.684e+05 x^4 - 2.565e+06 x^3 + 2.551e+07 x^2 - 1.491e+08 x + 3.879e+08
We can perform a polynomial transform on multiple features. First, we import the module:
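The import is the usual scikit-learn preprocessing one:

from sklearn.preprocessing import PolynomialFeatures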
In [128]: pr = PolynomialFeatures(degree=2)
          pr
Out[128]: PolynomialFeatures()

In [129]: Z_pr = pr.fit_transform(Z)

In [130]: Z.shape
Out[130]: (201, 4)

In [131]: Z_pr.shape
Out[131]: (201, 15)
Pipeline
Data Pipelines simplify the steps of processing the data. We use the module Pipeline to create a
pipeline. We also use StandardScaler as a step in our pipeline.
We create the pipeline by creating a list of tuples including the name of the model or estimator
and its corresponding constructor.
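A minimal sketch of the Input list implied by the pipeline printed below:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Input = [("scale", StandardScaler()),
         ("polynomial", PolynomialFeatures(include_bias=False)),
         ("model", LinearRegression())]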
In [134]: pipe = Pipeline(Input)
          pipe
Out[134]:
Pipeline(steps=[('scale', StandardScaler()),
                ('polynomial', PolynomialFeatures(include_bias=False)),
                ('model', LinearRegression())])
First, we convert the data type Z to type float to avoid conversion warnings that may appear as
a result of StandardScaler taking float inputs.
Then, we can normalize the data, perform a transform and fit the model simultaneously.
In [135]: Z = Z.astype(float)
          pipe.fit(Z, y)
Out[135]:
Pipeline(steps=[('scale', StandardScaler()),
                ('polynomial', PolynomialFeatures(include_bias=False)),
                ('model', LinearRegression())])
Similarly, we can normalize the data, perform a transform and produce a prediction
simultaneously.
In [136]: ypipe=pipe.predict(Z)
ypipe[0:4]
pipe=Pipeline(Input)
pipe.fit(Z,y)
ypipe=pipe.predict(Z)
ypipe[0:10]
In [138]: #highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))
We can predict the output i.e., "yhat" using the predict method, where X is the input variable:
In [139]: Yhat=lm.predict(X)
          print('The output of the first four predicted value is: ', Yhat[0:4])

The output of the first four predicted value is:  [16236.50464347 16236.50464347 17058.23802179 13771.3045085 ]
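mean_squared_error, used in the next cell, comes from sklearn.metrics; a minimal sketch of the import and of the simple-model MSE:

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(df["price"], Yhat)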
In [144]: print('The mean square error of price and predicted value using multifit is: ', \
                mean_squared_error(df['price'], Y_predict_multifit))

The mean square error of price and predicted value using multifit is:  11979300.349818885

Out[147]: 20474146.42636125
%matplotlib inline
In [150]: lm.fit(X, Y)
          lm
Out[150]: LinearRegression()
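new_input, used in the prediction cell below, is not defined in this export; a minimal sketch, assuming an evenly spaced column of highway-mpg values:

new_input = np.arange(1, 100, 1).reshape(-1, 1)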
Produce a prediction:
In [151]: yhat=lm.predict(new_input)
yhat[0:5]
When comparing models, the model with the higher R-squared value is a better fit for the data.
When comparing models, the model with the smallest MSE value is a better fit for the data.
Simple Linear Regression: R-squared: 0.49659118843391759
Multiple Linear Regression: R-squared: 0.80896354913783497
Polynomial Fit: R-squared: 0.6741946663906514
Conclusion:
Comparing these three models, we conclude that the MLR model is the best model to be able
to predict price from our dataset. This result makes sense since we have 27 variables in total and
we know that more than one of those variables are potential predictors of the final car price.
In [154]: x_data=df.drop('price',axis=1)
Now, we randomly split our data into training and testing data using the function
train_test_split.
The test_size parameter sets the proportion of data that is split into the testing set. In the above,
the testing set is 10% of the total dataset.
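A minimal sketch of the split (the random_state value is an illustrative choice):

from sklearn.model_selection import train_test_split

y_data = df["price"]
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)
print("number of test samples:", x_test.shape[0])
print("number of training samples:", x_train.shape[0])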
Out[160]: LinearRegression()
We can see the R^2 is much smaller using the test data compared to the training data.
Cross-Validation Score
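A minimal sketch of a cross-validation score on a single predictor (the estimator name lre and the choice of "horsepower" are illustrative):

from sklearn.model_selection import cross_val_score

lre = LinearRegression()
Rcross = cross_val_score(lre, x_data[["horsepower"]], y_data, cv=4)
print("Mean of the folds:", Rcross.mean(), " Std:", Rcross.std())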