Linear Regression

The document outlines a data preprocessing and analysis workflow for a CO2 emissions dataset, including data loading, cleaning, and feature encoding. It employs linear regression to predict CO2 emissions based on various vehicle attributes, evaluating model performance with metrics such as R-squared and Mean Absolute Error. Visualizations, including correlation heatmaps and scatter plots, are also utilized to illustrate relationships and model predictions.

Uploaded by

hmussawar477

Data Preprocessing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv('co2.csv')
df.head()

    Make       Model  Vehicle Class  Engine Size(L)  Cylinders  Transmission  \
0  ACURA         ILX        COMPACT             2.0          4           AS5
1  ACURA         ILX        COMPACT             2.4          4            M6
2  ACURA  ILX HYBRID        COMPACT             1.5          4           AV7
3  ACURA     MDX 4WD    SUV - SMALL             3.5          6           AS6
4  ACURA     RDX AWD    SUV - SMALL             3.5          6           AS6

   Fuel Type  Fuel Consumption City (L/100 km)  \
0          Z                               9.9
1          Z                              11.2
2          Z                               6.0
3          Z                              12.7
4          Z                              12.1

   Fuel Consumption Hwy (L/100 km)  Fuel Consumption Comb (L/100 km)  \
0                              6.7                               8.5
1                              7.7                               9.6
2                              5.8                               5.9
3                              9.1                              11.1
4                              8.7                              10.6

   Fuel Consumption Comb (mpg)  CO2 Emissions(g/km)
0                           33                  196
1                           29                  221
2                           48                  136
3                           25                  255
4                           27                  244

df.shape
df.columns

Index(['Make', 'Model', 'Vehicle Class', 'Engine Size(L)', 'Cylinders',
       'Transmission', 'Fuel Type', 'Fuel Consumption City (L/100 km)',
       'Fuel Consumption Hwy (L/100 km)', 'Fuel Consumption Comb (L/100 km)',
       'Fuel Consumption Comb (mpg)', 'CO2 Emissions(g/km)'],
      dtype='object')

df = df.drop(['Make', 'Model','Vehicle Class','Transmission'], axis=1)


df.shape

(7385, 8)
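
The cells shown here drop columns but do not display a check for missing values or duplicate rows. A minimal sketch of what such a check could look like follows. The `toy` frame and its values are made up for illustration and are not taken from co2.csv; whether duplicates should be dropped from the real data is a judgment call the notebook does not address.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from co2.csv.
toy = pd.DataFrame({
    "Engine Size(L)": [2.0, 2.4, np.nan, 2.0],
    "Cylinders": [4, 4, 6, 4],
    "CO2 Emissions(g/km)": [196, 221, 255, 196],
})

missing_per_column = toy.isnull().sum()        # count of NaN values per column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)

# Drop incomplete rows first, then exact duplicates.
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)
print(cleaned.shape)
```

On the real data, the same three calls (`isnull().sum()`, `duplicated().sum()`, `dropna()`) would be applied to `df` directly.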

df["Fuel Type"].value_counts()

Fuel Type
X 3637
Z 3202
E 370
D 175
N 1
Name: count, dtype: int64

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
df["Fuel Type"] = le.fit_transform(df["Fuel Type"])
df["Fuel Type"].value_counts()

Fuel Type
3 3637
4 3202
1 370
0 175
2 1
Name: count, dtype: int64
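
The counts above show that LabelEncoder mapped the fuel codes in sorted order (D=0, E=1, N=2, X=3, Z=4). A linear model treats those integers as a numeric scale, although fuel type has no natural order. One common alternative, sketched here as an assumption rather than as the notebook's method, is one-hot encoding with `pd.get_dummies`. The `toy` frame is a made-up sample that only reuses the five fuel codes.

```python
import pandas as pd

# Toy sample using the five fuel-type codes that appear in the dataset.
toy = pd.DataFrame({"Fuel Type": ["X", "Z", "E", "D", "N", "X"]})

# One indicator column per category. drop_first=True removes the first
# category (D) so the indicators are not perfectly collinear with the
# intercept of a linear model.
encoded = pd.get_dummies(toy, columns=["Fuel Type"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

With this encoding each fuel type gets its own coefficient, measured relative to the dropped category.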

Correlation
correlation = df.corr()
correlation
                                  Engine Size(L)  Cylinders  Fuel Type  \
Engine Size(L)                          1.000000   0.927653   0.058296
Cylinders                               0.927653   1.000000   0.125175
Fuel Type                               0.058296   0.125175   1.000000
Fuel Consumption City (L/100 km)        0.831379   0.800702  -0.075605
Fuel Consumption Hwy (L/100 km)         0.761526   0.715252  -0.129812
Fuel Consumption Comb (L/100 km)        0.817060   0.780534  -0.095539
Fuel Consumption Comb (mpg)            -0.757854  -0.719321  -0.016880
CO2 Emissions(g/km)                     0.851145   0.832644   0.100306

                                  Fuel Consumption City (L/100 km)  \
Engine Size(L)                                            0.831379
Cylinders                                                 0.800702
Fuel Type                                                -0.075605
Fuel Consumption City (L/100 km)                          1.000000
Fuel Consumption Hwy (L/100 km)                           0.948180
Fuel Consumption Comb (L/100 km)                          0.993810
Fuel Consumption Comb (mpg)                              -0.927059
CO2 Emissions(g/km)                                       0.919592

                                  Fuel Consumption Hwy (L/100 km)  \
Engine Size(L)                                           0.761526
Cylinders                                                0.715252
Fuel Type                                               -0.129812
Fuel Consumption City (L/100 km)                         0.948180
Fuel Consumption Hwy (L/100 km)                          1.000000
Fuel Consumption Comb (L/100 km)                         0.977299
Fuel Consumption Comb (mpg)                             -0.890638
CO2 Emissions(g/km)                                      0.883536

                                  Fuel Consumption Comb (L/100 km)  \
Engine Size(L)                                            0.817060
Cylinders                                                 0.780534
Fuel Type                                                -0.095539
Fuel Consumption City (L/100 km)                          0.993810
Fuel Consumption Hwy (L/100 km)                           0.977299
Fuel Consumption Comb (L/100 km)                          1.000000
Fuel Consumption Comb (mpg)                              -0.925576
CO2 Emissions(g/km)                                       0.918052

                                  Fuel Consumption Comb (mpg)  \
Engine Size(L)                                      -0.757854
Cylinders                                           -0.719321
Fuel Type                                           -0.016880
Fuel Consumption City (L/100 km)                    -0.927059
Fuel Consumption Hwy (L/100 km)                     -0.890638
Fuel Consumption Comb (L/100 km)                    -0.925576
Fuel Consumption Comb (mpg)                          1.000000
CO2 Emissions(g/km)                                 -0.907426

                                  CO2 Emissions(g/km)
Engine Size(L)                               0.851145
Cylinders                                    0.832644
Fuel Type                                    0.100306
Fuel Consumption City (L/100 km)             0.919592
Fuel Consumption Hwy (L/100 km)              0.883536
Fuel Consumption Comb (L/100 km)             0.918052
Fuel Consumption Comb (mpg)                 -0.907426
CO2 Emissions(g/km)                          1.000000

# constructing a heatmap to understand the correlation


plt.figure(figsize=(8,8))
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f',
annot=True, annot_kws={'size':8}, cmap='Blues')

<Axes: >
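
Several of the fuel consumption columns are almost perfectly correlated with each other, which can make individual regression coefficients hard to interpret. The sketch below lists feature pairs above a cut-off, using the three fuel consumption correlations copied from the matrix printed earlier. The shortened labels and the 0.95 threshold are assumptions chosen for illustration.

```python
from itertools import combinations

import pandas as pd

# Sub-matrix copied from the correlation output (L/100 km columns only).
cols = ["City", "Hwy", "Comb"]  # shortened labels, used only in this sketch
corr = pd.DataFrame(
    [[1.000000, 0.948180, 0.993810],
     [0.948180, 1.000000, 0.977299],
     [0.993810, 0.977299, 1.000000]],
    index=cols, columns=cols,
)

threshold = 0.95  # hypothetical cut-off

# Visit each unordered pair once and keep those above the threshold.
high = [
    (a, b, float(corr.loc[a, b]))
    for a, b in combinations(cols, 2)
    if abs(corr.loc[a, b]) > threshold
]

for a, b, value in high:
    print(f"{a} vs {b}: {value:.6f}")
```

On the full matrix, the same loop over `correlation.columns` would flag candidates for removal before fitting.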
DATA SPLITTING
X = df.drop(['CO2 Emissions(g/km)'], axis=1)
Y = df['CO2 Emissions(g/km)']

X.head()

   Engine Size(L)  Cylinders  Fuel Type  Fuel Consumption City (L/100 km)  \
0             2.0          4          4                               9.9
1             2.4          4          4                              11.2
2             1.5          4          4                               6.0
3             3.5          6          4                              12.7
4             3.5          6          4                              12.1

   Fuel Consumption Hwy (L/100 km)  Fuel Consumption Comb (L/100 km)  \
0                              6.7                               8.5
1                              7.7                               9.6
2                              5.8                               5.9
3                              9.1                              11.1
4                              8.7                              10.6

   Fuel Consumption Comb (mpg)
0                           33
1                           29
2                           48
3                           25
4                           27

from sklearn.model_selection import train_test_split


X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(
    X, Y, test_size=0.25, random_state=25)
print("Size of Train X = " , len(X_TRAIN))
print("Size of Train Y = " , len(Y_TRAIN))
print("Size of Test X = " , len(X_TEST))
print("Size of Test Y = " , len(Y_TEST))

Size of Train X = 5538
Size of Train Y = 5538
Size of Test X = 1847
Size of Test Y = 1847
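
The printed sizes can be checked by hand. They are consistent with the test share of the 7385 rows being rounded up and the remainder going to training. That rounding rule is inferred from the numbers above rather than stated in the notebook.

```python
import math

n_samples = 7385   # rows reported by df.shape
test_size = 0.25   # fraction passed to train_test_split

# 7385 * 0.25 = 1846.25, which rounds up to 1847 test rows.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print("test rows :", n_test)
print("train rows:", n_train)
```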

LINEAR REGRESSION
from sklearn.linear_model import LinearRegression
model= LinearRegression()
model.fit(X_TRAIN, Y_TRAIN)

LinearRegression()
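
LinearRegression fits an ordinary least squares model: an intercept plus one coefficient per feature, chosen to minimise the sum of squared residuals. The sketch below solves the same kind of problem with NumPy on synthetic data, to show what the fit computes. The data and the true coefficients (3, 2, 0.5) are invented for illustration and have no connection to co2.csv.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not from co2.csv): y = 3 + 2*x1 + 0.5*x2, no noise.
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of X_design @ beta = y.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, coefs = beta[0], beta[1:]
print("intercept   :", intercept)
print("coefficients:", coefs)
```

For the fitted scikit-learn model, the corresponding values are available as `model.intercept_` and `model.coef_`.
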
Evaluation
Prediction on Training Data
# predictions on the training data
training_data_prediction = model.predict(X_TRAIN)
print(training_data_prediction)

[277.53821201 312.33640233 147.36539728 ... 298.22931218 239.77783577
 201.51288521]

# R squared score (coefficient of determination)
score_1 = metrics.r2_score(Y_TRAIN, training_data_prediction)

# Mean Absolute Error
score_2 = metrics.mean_absolute_error(Y_TRAIN, training_data_prediction)

print("R squared : ", score_1)
print('Mean Absolute Error : ', score_2)

R squared : 0.9124830358066793
Mean Absolute Error : 11.128722988272692
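
Both metrics can be computed directly from the residuals. The sketch below works through the two definitions on four hypothetical values, which are not taken from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted values (g/km), for illustration only.
y_true = np.array([200.0, 250.0, 300.0, 150.0])
y_pred = np.array([210.0, 240.0, 290.0, 160.0])

residuals = y_true - y_pred

# MAE: average absolute residual.
mae = np.mean(np.abs(residuals))

# R squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE      :", mae)
print("R squared:", r2)
```

MAE is in the units of the target, so the 11.13 reported above is in g/km. R squared is unitless.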

plt.scatter(Y_TRAIN, training_data_prediction)
plt.xlabel("Actual CO2 Emissions (g/km)")
plt.ylabel("Predicted CO2 Emissions (g/km)")
plt.title("Actual vs Predicted CO2 Emissions (Training Data)")
plt.show()
Prediction on Test Data
y_pred = model.predict(X_TEST)
y_pred
print(y_pred)

[232.4062013  280.46048354 246.9986957  ... 232.6947596  188.04758241
 175.35879895]

# R squared Score
score_1 = metrics.r2_score(Y_TEST, y_pred)

# Mean Absolute Error
score_2 = metrics.mean_absolute_error(Y_TEST, y_pred)

print("R squared Score : ", score_1)
print('Mean Absolute Error : ', score_2)

R squared Score : 0.915458485471068
Mean Absolute Error : 10.936146875715842
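
The notebook imports `mean_squared_error` but never calls it. A minimal sketch of how it could be used to report root mean squared error (RMSE), which penalises large errors more heavily than MAE, is shown below on three hypothetical values. On the real data, `Y_TEST` and `y_pred` would take their place.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values (g/km), for illustration only.
y_true = np.array([200.0, 250.0, 300.0])
y_hat = np.array([210.0, 240.0, 300.0])

mse = mean_squared_error(y_true, y_hat)  # mean of squared residuals
rmse = np.sqrt(mse)                      # back in the units of the target

print("MSE :", mse)
print("RMSE:", rmse)
```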

plt.scatter(Y_TEST, y_pred)
plt.xlabel("Actual CO2 Emissions (g/km)")
plt.ylabel("Predicted CO2 Emissions (g/km)")
plt.title("Actual vs Predicted CO2 Emissions (Test Data)")
plt.show()
