
Certificate in AI 2.0: Machine Learning

Lab 02: Multiple Linear Regression

Assignment 02
Introduction
In this lab, we'll build on the fundamentals of linear regression that we covered in Lab 01.
Instead of implementing the algorithms from scratch, we'll use scikit-learn (sklearn) to focus on
the concepts of multiple linear regression, feature scaling, and model evaluation.

We'll work with a real-world dataset, the Fish Market dataset, which contains body
measurements for fish of several species along with their weight, which we will try to predict.

Part 1: Basic Setup and Data Loading


First, let's import the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns

# Set the random seed for reproducibility
np.random.seed(42)
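Setting NumPy's global seed keeps any NumPy-based randomness reproducible. For the
train/test splits below we also pass random_state=42 to train_test_split explicitly, so each
split is reproducible on its own.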

Now, let's load and explore the Auto MPG dataset:

# Load the data (the Fish Market dataset, stored as Fish.csv)
auto_data = pd.read_csv('Fish.csv')

# Display the first few rows
print("First 5 rows of the dataset:")
auto_data.head(5)

First 5 rows of the dataset:

  Species  Weight  Length1  Length2  Length3   Height   Width
0   Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1   Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2   Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3   Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4   Bream   430.0     26.5     29.0     34.0  12.4440  5.1340

# Get basic information about the dataset
auto_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Species 159 non-null object
1 Weight 159 non-null float64
2 Length1 159 non-null float64
3 Length2 159 non-null float64
4 Length3 159 non-null float64
5 Height 159 non-null float64
6 Width 159 non-null float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB
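Species is the only non-numeric column. This lab uses only the numeric measurements, but if
we wanted to include species in a regression it would first need to be encoded, for example
one-hot encoded. A minimal sketch (the auto_encoded name is just for illustration):

# One-hot encode the categorical Species column (illustrative; not used below)
auto_encoded = pd.get_dummies(auto_data, columns=['Species'], drop_first=True)
print(auto_encoded.columns.tolist())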

# Summary statistics
print("\nSummary statistics:")
auto_data.describe()

Summary statistics:

            Weight     Length1     Length2     Length3      Height       Width
count   159.000000  159.000000  159.000000  159.000000  159.000000  159.000000
mean    398.326415   26.247170   28.415723   31.227044    8.970994    4.417486
std     357.978317    9.996441   10.716328   11.610246    4.286208    1.685804
min       0.000000    7.500000    8.400000    8.800000    1.728400    1.047600
25%     120.000000   19.050000   21.000000   23.150000    5.944800    3.385650
50%     273.000000   25.200000   27.300000   29.400000    7.786000    4.248500
75%     650.000000   32.700000   35.500000   39.650000   12.365900    5.584500
max    1650.000000   59.000000   63.400000   68.000000   18.957000    8.142000

# Check for missing values
print("\nMissing values in each column:")
auto_data.isnull().sum()

Missing values in each column:

Species 0
Weight 0
Length1 0
Length2 0
Length3 0
Height 0
Width 0
dtype: int64
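There are no missing values here, so no cleaning is required. Had there been any, a common
minimal approach (sketched below; not needed for this dataset) is to drop or impute them:

# Illustrative only: drop rows with missing values, or fill with the column median
auto_clean = auto_data.dropna()
# auto_filled = auto_data.fillna(auto_data.median(numeric_only=True))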

Let's visualize the relationships between the features and the target (Weight):

auto_data.columns

Index(['Species', 'Weight', 'Length1', 'Length2', 'Length3', 'Height',
       'Width'],
      dtype='object')

# Create a pair plot to see relationships between the numeric features and
# the target. pairplot creates its own figure, so plt.figure() isn't needed;
# Species is categorical, so it is left out of the scatter plots.
sns.pairplot(auto_data,
             x_vars=['Length1', 'Length2', 'Length3', 'Height', 'Width'],
             y_vars=['Weight'], height=3)
plt.show()


# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = auto_data.select_dtypes(include=['float64',
'int64']).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
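To read the heatmap's most relevant row directly, we can also print each feature's correlation
with the target, sorted (a small convenience snippet):

# Correlations with the target, strongest first
print(correlation_matrix['Weight'].sort_values(ascending=False))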

Part 2: Simple Linear Regression with sklearn

Before diving into multiple regression, let's quickly implement a simple linear regression using
sklearn to see how it compares to our implementation from Lab 01. We'll use 'Height' as our
single feature, since it has a strong correlation with 'Weight'.

# Extract feature and target
X_simple = auto_data[['Height']].values
y = auto_data['Weight'].values

# Split the data into training and testing sets
X_simple_train, X_simple_test, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42)
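With test_size=0.2 and 159 rows, the split leaves 127 training samples and 32 test samples;
a quick shape check confirms this:

# Verify the split sizes
print(X_simple_train.shape, X_simple_test.shape)  # (127, 1) (32, 1)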

# Create and train the model
simple_model = LinearRegression()
simple_model.fit(X_simple_train, y_train)

LinearRegression()

# Make predictions
y_simple_train_pred = simple_model.predict(X_simple_train)
y_simple_test_pred = simple_model.predict(X_simple_test)

print(y_simple_train_pred)
print(y_simple_test_pred)

[129.31217172 323.93630223 602.8797132  221.62420462 277.79245131
 221.90401178 442.2095737  606.13399216 845.21704785 273.32770222
-30.08058725 185.88796363 369.28939378 172.25040579 196.22866315
794.17657157 238.29097914 159.09338634 511.83289529 724.14570476
460.06857005 479.2779401 609.63158171 854.65749824 -29.70345586
575.81141151 225.13395969 606.90650324 292.79254838 990.74110394
392.92701633 292.96286579 421.5038436 717.53982259 212.08643
232.47585635 550.70175997 527.38044116 360.44505431 253.49180744
196.22866315 877.32187848 243.8384603 548.77960641 845.10755809
790.88579602 493.01890493 192.77973572 228.07193491 433.91268302
278.89951444 -22.87859417 239.02091087 239.26422145 575.5559354
266.48459225 64.56722719 983.5147798 313.20630579 -16.45519494
628.57331012 250.04288001 24.71295468 237.74353034 249.05747217
371.13855416 677.36924633 509.03482366 527.28311693 270.60870653
102.2317045 162.47540336 167.63358759 921.96328659 248.38836808
620.65355084 203.0413593 524.42421765 563.47556526 473.84603147
206.17398298 209.9148831 433.91268302 507.63578784 207.32970822
701.85237314 495.29385883 696.25622987 565.37338776 323.57133637
624.3032095 758.61064798 191.61184495 -21.56471706 317.12360608
712.47287983 293.40082482 300.40816944 426.91750393 311.3571454
299.89721723 -11.52815576 503.49950803 308.26710108 221.62420462
822.82030924 383.91235945 -15.94424273 197.1593261 189.11791154
-23.31655321 609.09629844 389.92213071 211.06452558 285.93119011
241.6669134 527.38044116 193.03521183 82.57220988 603.06219613
28.32611675 770.19831421 541.45595804 325.66380733 759.60822135
263.10865799 356.23578133]
[ 166.22238624 -2.22152619 188.65562144 359.7272881 167.39635978
1003.07695019 -44.89820139 281.02848198 259.54415804 499.98975296
730.37445553 601.43201526 789.0244701 390.89537301 837.83865461
947.10943472 616.63892632 285.97985223 293.58330776 753.89042279
-15.36638011 800.88586073 686.90702095 715.33786187 949.97441677
611.07319687 231.66076592 345.43279171 -44.26559389 -15.36638011
265.11597025 129.04453008]

# Calculate metrics
simple_train_mse = mean_squared_error(y_train, y_simple_train_pred)
simple_test_mse = mean_squared_error(y_test, y_simple_test_pred)

print("Simple Linear Regression Results:")
print(f"Intercept: {simple_model.intercept_:.4f}")
print(f"Coefficient (Height): {simple_model.coef_[0]:.6f}")
print(f"Training MSE: {simple_train_mse:.4f}")
print(f"Test MSE: {simple_test_mse:.4f}")

Simple Linear Regression Results:

Intercept: -150.0327
Coefficient (Height): 60.827644
Training MSE: 64227.1955
Test MSE: 45896.9478
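Since MSE is in squared units of the target, the square root of the test MSE is easier to
interpret: √45896.95 ≈ 214.2, i.e. the simple model is off by roughly 214 grams on a typical
test fish. As a one-liner:

# RMSE, in the same units as Weight
print(np.sqrt(simple_test_mse))  # ≈ 214.2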

# Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(X_simple_train, y_train, color='blue', alpha=0.5,
            label='Training data')
plt.scatter(X_simple_test, y_test, color='red', alpha=0.5,
            label='Test data')

# Plot the regression line
x_line = np.array([min(X_simple[:, 0]), max(X_simple[:, 0])]).reshape(-1, 1)
y_line = simple_model.predict(x_line)
plt.plot(x_line, y_line, 'g-', linewidth=2,
         label=f'Regression line: y = {simple_model.intercept_:.2f} '
               f'+ {simple_model.coef_[0]:.6f}x')

plt.title('Simple Linear Regression: Weight vs. Height')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.grid(True)
plt.show()

The model gives us a linear equation of the form:

Weight = b0 + b1 × Height
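As a quick sanity check using the coefficients printed above: for a fish with Height = 10, the
model predicts Weight ≈ -150.03 + 60.83 × 10 ≈ 458.2. The same check in code:

# Worked example with the fitted model
print(simple_model.predict([[10.0]]))  # ≈ [458.24]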

Part 3: Multiple Linear Regression

Now, let's build a multiple linear regression model using all the numeric features:

# Extract features and target
X_multi = auto_data[['Length1', 'Length2', 'Length3', 'Height',
                     'Width']].values
y = auto_data['Weight'].values

# Split the data
X_multi_train, X_multi_test, y_train, y_test = train_test_split(
    X_multi, y, test_size=0.2, random_state=42)

# Create and train the model
multi_model = LinearRegression()
multi_model.fit(X_multi_train, y_train)

LinearRegression()

# Make predictions
y_multi_train_pred = multi_model.predict(X_multi_train)
y_multi_test_pred = multi_model.predict(X_multi_test)

# Calculate metrics (model.score returns R^2 for regression, not MSE)
multi_train_r2 = multi_model.score(X_multi_train, y_train)
multi_test_r2 = multi_model.score(X_multi_test, y_test)

print(f"Multiple Linear Regression Results:")


print(f"Intercept: {multi_model.intercept_:.4f}")
print("Coefficients:")
feature_names = ['Length1', 'Length2', 'Length3', 'Height', 'Width']
for name, coef in zip(feature_names, multi_model.coef_):
print(f" {name}: {coef:.6f}")
print(f"Training MSE: {multi_train_mse:.4f}")
print(f"Test MSE: {multi_test_mse:.4f}")

Multiple Linear Regression Results:

Intercept: -515.3057
Coefficients:
  Length1: 43.535265
  Length2: 7.821796
  Length3: -25.256701
  Height: 23.228912
  Width: 27.066493
Training R^2: 0.8839
Test R^2: 0.8821
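Note the negative coefficient on Length3 even though all three lengths correlate positively with
Weight: the three length measurements are nearly collinear, which makes the individual
coefficients unstable and hard to interpret. A minimal sketch to quantify this (assuming the
statsmodels package is installed; it isn't used elsewhere in this lab):

# Variance inflation factors: values far above ~10 signal strong multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

for i, name in enumerate(feature_names):
    vif = variance_inflation_factor(X_multi_train, i)
    print(f"{name}: VIF = {vif:.1f}")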

# Visualize the predictions vs. actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_train, y_multi_train_pred, color='blue', alpha=0.5,
            label='Training data')
plt.scatter(y_test, y_multi_test_pred, color='red', alpha=0.5,
            label='Test data')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', linewidth=2)
plt.title('Multiple Linear Regression: Predicted vs. Actual Weight')
plt.xlabel('Actual Weight')
plt.ylabel('Predicted Weight')
plt.legend()
plt.grid(True)
plt.show()
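A residual plot is a useful complement to the scatter above: for a well-specified linear model
the residuals should hover around zero with no obvious pattern. A quick sketch:

# Residuals vs. predictions on the test set
residuals = y_test - y_multi_test_pred
plt.figure(figsize=(8, 5))
plt.scatter(y_multi_test_pred, residuals, alpha=0.5)
plt.axhline(0, color='k', linestyle='--')
plt.xlabel('Predicted Weight')
plt.ylabel('Residual')
plt.title('Residuals (Test Set)')
plt.show()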
Part 4: Feature Scaling

One important aspect of multiple linear regression is the scale of the features. Features
measured on very different scales produce coefficients that can't be compared directly, and
many learning algorithms (gradient descent, regularized models) are sensitive to feature scale.
Let's implement feature scaling:
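StandardScaler standardizes each feature using the training set's statistics: z = (x − μ) / σ,
where μ and σ are that feature's mean and standard deviation. After scaling, every feature has
mean 0 and standard deviation 1 on the training data.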

# Create a scaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and test data
X_multi_train_scaled = scaler.fit_transform(X_multi_train)
X_multi_test_scaled = scaler.transform(X_multi_test)
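Note that the scaler is fit on the training data only and then applied to the test data, which
avoids leaking test-set statistics into training. The same pattern can be expressed more
concisely with a Pipeline (a sketch of an equivalent alternative, not used in the rest of this
lab):

# Equivalent scale-then-fit pattern bundled into a single estimator
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_multi_train, y_train)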

# Create and train the model on scaled data
multi_model_scaled = LinearRegression()
multi_model_scaled.fit(X_multi_train_scaled, y_train)

LinearRegression()

# Make predictions
y_multi_train_pred_scaled = multi_model_scaled.predict(X_multi_train_scaled)
y_multi_test_pred_scaled = multi_model_scaled.predict(X_multi_test_scaled)

# Calculate metrics
multi_train_mse_scaled = mean_squared_error(y_train, y_multi_train_pred_scaled)
multi_test_mse_scaled = mean_squared_error(y_test, y_multi_test_pred_scaled)

multi_train_r2_scaled = r2_score(y_train, y_multi_train_pred_scaled)
multi_test_r2_scaled = r2_score(y_test, y_multi_test_pred_scaled)

print(f"Multiple Linear Regression Results (with Feature Scaling):")


print(f"Intercept: {multi_model_scaled.intercept_:.4f}")
print("Coefficients (scaled features):")
for name, coef in zip(feature_names, multi_model_scaled.coef_):
print(f" {name}: {coef:.6f}")
print(f"Training MSE: {multi_train_mse_scaled:.4f}")
print(f"Test MSE: {multi_test_mse_scaled:.4f}")
print(f"Training R^2: {multi_train_r2_scaled:.4f}")
print(f"Test R^2: {multi_test_r2_scaled:.4f}")

Multiple Linear Regression Results (with Feature Scaling):

Intercept: 386.7945
Coefficients (scaled features):
Length1: 432.274726
Length2: 83.013041
Length3: -288.567976
Height: 92.523216
Width: 44.067409
Training MSE: 14273.4542
Test MSE: 16763.8872
Training R^2: 0.8839
Test R^2: 0.8821
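For ordinary least squares, standardizing the features changes the coefficients and intercept
but not the fitted model itself, so the predictions (and hence R^2) match the unscaled model,
as the identical R^2 values above show. A quick check:

# OLS predictions are unchanged by standardization (up to floating-point error)
print(np.allclose(y_multi_test_pred, y_multi_test_pred_scaled))  # True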

# Update feature names to match the features used in the models
feature_names = ['Length1', 'Length2', 'Length3', 'Height', 'Width']

# Compare feature importance before and after scaling
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.bar(feature_names, np.abs(multi_model.coef_))
plt.title('Feature Importance (Unscaled)')
plt.xticks(rotation=45)
plt.ylabel('|Coefficient|')

plt.subplot(1, 2, 2)
plt.bar(feature_names, np.abs(multi_model_scaled.coef_))
plt.title('Feature Importance (Scaled)')
plt.xticks(rotation=45)
plt.ylabel('|Coefficient|')
plt.tight_layout()
plt.show()
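Only the right-hand chart supports a fair comparison: once the features share a common scale,
the coefficient magnitudes can be read as relative importance (here Length1 dominates), whereas
the unscaled coefficients mix each feature's effect with its units.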
