
1 Supervised Machine Learning - Regression

1.1 Simple Linear Regression


1. Implement a simple linear regression model
2. Demonstrate training, predicting and evaluating the model
Instructions to complete the lab exercises:
1. Open the python notebook file under the Lab folder
2. Read the problem statement in the exercise and the expected output
3. Uncomment the indicated lines, remove the blanks and fill in your answer
4. Run your code to produce the expected output.
Note: Data files are stored in the dataset folder.

1.2 Import the data set


1. Using numpy, import the labelled dataset from “Income3.csv”
2. Understand / Explore the dataset
3. Plot the dataset to determine whether a linear regression model is suitable
4. Normalise (scale) the dataset. This step is important for multiple-variable regression. For
single-variable regression it helps to limit the data range during gradient descent.

[1]: # 1) Import the data from Income3.csv


# 2) Understand / Explore the dataset

import matplotlib.pyplot as plt


import numpy as np
import pandas as pd

file = "./dataset/Income3.csv"
data = np.genfromtxt(file, dtype=None, names=True, delimiter=",", encoding=None)

print (data.shape)
print (data.dtype)
print (data[:10])

(20,)
[('Observation', '<i4'), ('Years_of_Higher_Education_x', '<i4'), ('Income_y',
'<i4')]
[( 1, 6, 89617) ( 2, 0, 39826) ( 3, 6, 79894) ( 4, 3, 56547)
( 5, 4, 64795) ( 6, 0, 31007) ( 7, 5, 76229) ( 8, 6, 81044)
( 9, 0, 39467) (10, 6, 83546)]

1.2.1 Plot the data of interest, such as Years_of_Higher_Education_x vs Income_y


1.2.2 Normalise the data and plot again
• Use a scatter plot to plot the original data of years and income.
• Normalise the data and plot it again.

Standardise the data:
(X - min(X)) / (max(X) - min(X))
The peak-to-peak function ptp() returns the difference between the maximum and the minimum.
• scale X and Y
– normX = (X - X.min(0)) / X.ptp(0)
– normY = (Y - Y.min(0)) / Y.ptp(0)

[2]: # 3) Plot the data of interest, such as Years_of_Higher_Education_x vs Income_y


# 4) Normalise the data and plot again

X = data[data.dtype.names[1]]
Y = data[data.dtype.names[2]]

plt.scatter(X,Y)
plt.ylabel(data.dtype.names[2])
plt.xlabel(data.dtype.names[1])
plt.show()

##scale X and Y
normX = (X - X.min(0)) / X.ptp(0)
normY = (Y - Y.min(0)) / Y.ptp(0)

plt.scatter(normX,normY)
plt.ylabel("Norm " + data.dtype.names[2])
plt.xlabel("Norm " + data.dtype.names[1])
plt.show()

As you can see from the scatter plot, there appears to be a linear relationship between x and y.

1.3 Build the model


The function takes in the coefficients a and b and the labelled data X and Y.
It returns the predicted Y values, the cost of the prediction, and the derivatives of the cost
with respect to a and b. A possible completion is sketched after the cell below.

[3]: # Building the model
# Calculate the predicted values, the cost and the derivatives
# X and Y are the original data and label

def forward(a, b, X, Y):

    m = X.shape[0]  # Number of elements in X

    Y_pred, cost, D_a, D_b = 0, 0, 0, 0

    ## Uncomment the following lines and fill in with your answer

    ## The predicted value of Y

    #Y_pred = ________________________________________________

    ## Calculate the cost

    #cost = (1/m) * np.sum(np.square((Y_pred - Y)))

    ## Calculate Derivative a, b

    #D_a = ___________________________________________________

    #D_b = ___________________________________________________

    return {
        "Y_pred": Y_pred,
        "cost": cost,
        "D_a": D_a,
        "D_b": D_b
    }
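For reference, a possible completion of the blanks (the cost line is already given in the cell; the derivatives follow from differentiating that mean squared error with respect to a and b):

def forward(a, b, X, Y):

    m = X.shape[0]  # Number of elements in X

    # Predicted value of Y for the linear model
    Y_pred = a * X + b

    # Mean squared error cost, as given in the cell above
    cost = (1/m) * np.sum(np.square(Y_pred - Y))

    # Derivatives of the cost with respect to a and b
    D_a = (2/m) * np.sum((Y_pred - Y) * X)
    D_b = (2/m) * np.sum(Y_pred - Y)

    return {
        "Y_pred": Y_pred,
        "cost": cost,
        "D_a": D_a,
        "D_b": D_b
    }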

1.4 Perform the Gradient Descent
1. Initialise the model variables a and b (the code below initialises them to 0.1)
2. Initialise the learning rate to a small number, e.g. 0.001
3. Initialise the number of iterations (epochs) to perform gradient descent
4. Plot the cost vs epochs (number of iterations)

[4]: # Calculate costs based on the updated coefficients and derivatives


# epoch: the number of iterations of gradient descent
# learning_rate: the step size of each update

def gradient(X, Y, learning_rate = 0.001, epoch = 50000):

    a = 0.1
    b = 0.1

    cost = np.zeros(epoch)

    for i in range(epoch):
        parameters = {}

        ## Uncomment the following line and fill in with your answer

        parameters = forward(a, b, X, Y)
        a = a - learning_rate * parameters["D_a"]  # Update a
        b = b - learning_rate * parameters["D_b"]  # Update b
        cost[i] = parameters["cost"]

    return a, b, cost

1.4.1 Calculate the gradient and plot the cost per epoch as shown in the figure:

[5]: # 4. plot the cost vs epochs (number of iterations)
# Performing Gradient Descent / Training model

learning_rate = 0.01 # The learning Rate


epoch = 1000 # The number of iterations to perform gradient descent
a, b, cost = gradient(normX, normY, learning_rate, epoch)
print ("a - " + str(a), "\nb - " + str(b))
plt.plot(cost)
plt.show()

a - 0.7147201151839904
b - 0.15562350762122823

We can see from the cost vs epoch plot that the cost/error decreases over time (epochs).

We can stop the gradient descent once increasing the epoch count no longer reduces the cost significantly, as sketched below.
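As a minimal sketch (not part of the lab skeleton; the tolerance value here is illustrative), the gradient function could be given an early-stopping check so it returns as soon as the cost stops improving:

def gradient_early_stop(X, Y, learning_rate=0.001, epoch=50000, tol=1e-9):
    a, b = 0.1, 0.1
    cost = []
    for i in range(epoch):
        parameters = forward(a, b, X, Y)
        a = a - learning_rate * parameters["D_a"]  # Update a
        b = b - learning_rate * parameters["D_b"]  # Update b
        cost.append(parameters["cost"])
        # Stop once the per-epoch cost improvement falls below the tolerance
        if i > 0 and cost[i-1] - cost[i] < tol:
            break
    return a, b, np.array(cost)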

1.5 Draw the predicted line over the dataset points


Let's see the linear model on the plotted dataset. We will use the a and b from our gradient descent
function and overlay the model on the dataset plot.
Draw a scatter plot of the normalised X and Y data with the regression line, as shown below:

Note: Make sure you have completed the forward and gradient function steps

[6]: #1) Plot the normalised original data with the regression line

Y_pred_min = a*min(normX) + b
Y_pred_max = a*max(normX) + b

print(Y_pred_min , Y_pred_max)
print(min(normX), max(normX))
plt.scatter(normX, normY)
plt.plot([min(normX), max(normX)], [Y_pred_min, Y_pred_max], color='red')  # regression line

plt.show()

0.15562350762122823 0.8703436228052186
0.0 1.0

Observe that the model fits the data points well.
We can also plot it on the original value range.

[7]: #2) Plot the original data with the regression line

Y_pred_min = Y_pred_min*Y.ptp(0) + Y.min()


Y_pred_max = Y_pred_max*Y.ptp(0) + Y.min()

print(Y_pred_min , Y_pred_max)
plt.scatter(X, Y)
plt.plot([min(X), max(X)], [Y_pred_min, Y_pred_max], color='red')  # regression line

plt.show()

40128.09378168019 82017.83973261385

1.6 Evaluating the model using R squared & Adjusted R squared
When performing regression, we can determine whether the model performs well by checking the mean
error. For the standard error of the estimate, the smaller the error, the better the model. Another
statistical measurement that can be used to determine the performance of a regression model is the
coefficient of determination, R-squared.
Another measurement, the Adjusted R-squared, is a better measurement because it takes into
consideration the number of variables used in the model.
Both measurements should be within the range of 0 to 1. Values outside this range are possible,
however; this indicates that the model is not appropriate for the dataset.
Variance explained by the model:
• Sum of squares of residual errors: ss_err = sum((Y - Y_pred)^2)
• Total sum of squares: ss_tot = sum((Y - mean(Y))^2)
• R-squared: r2 = 1 - ss_err / ss_tot
• Adjusted R-squared: adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1), where n is the number of samples and k is the number of features

[8]: # 1) R-squared and Adjusted R-squared function

def r_squared(Y, Y_pred, num_feature):

    ss_err, ss_tot, r2, adjusted_r2 = 0, 0, 0, 0

    ## Calculate sum squares residual errors and sum squares total

    #ss_err = ________________________________________________

    #ss_tot = ________________________________________________

    ## Calculate R-square and Adjusted R-square

    r2 = 1 - (ss_err/ss_tot)

    adjusted_r2 = 1 - ( (1-r2) * (len(Y)-1) / (len(Y)-num_feature-1) )

    return r2, adjusted_r2

Y_pred = a * normX + b
#print(normX)
#print(Y_pred)
r2, adjusted_r2 = r_squared(normY, Y_pred, 1)

print (r2)
print (adjusted_r2)

0.9457699888541073
0.9427572104571134
We can see that we have a high r-squared and adjusted r-squared. This means that the model fits
the dataset well. A possible completion of the blanks is sketched below.
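For reference, a possible completion of the two blanks, assuming the standard definitions given in section 1.6:

ss_err = np.sum(np.square(Y - Y_pred))      # Sum of squares of residual errors
ss_tot = np.sum(np.square(Y - np.mean(Y)))  # Total sum of squares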
[9]: # 2) Display both original and predicted result

a, b, cost = gradient(X, Y, learning_rate, epoch)

print ("a - " + str(a), "\nb - " + str(b))


Y_pred = a * X + b
#params = forward(a, b, X, Y)
display = pd.DataFrame()
display['Original']=Y
display['Predicted']=Y_pred #params['Y_pred']
print(display)

a - 7738.283872313206
b - 37043.63924056803
Original Predicted
0 89617 83473.342474
1 39826 37043.639241
2 79894 83473.342474
3 56547 60258.490858
4 64795 67996.774730
5 31007 37043.639241
6 76229 75735.058602
7 81044 83473.342474
8 39467 37043.639241
9 83546 83473.342474
10 68852 67996.774730
11 79357 75735.058602
12 68901 67996.774730
13 49198 52520.206985
14 47407 44781.923113
15 89594 83473.342474
16 56696 52520.206985
17 55078 52520.206985
18 70756 75735.058602
19 79069 83473.342474

2 Exercise 1: Re-run the gradient descent with different learning rates and epochs

Use the code segment below to try it out.
1. At a learning rate of 0.01, how many epochs should be sufficient, i.e. beyond what point does
increasing the epoch count not reduce the cost significantly further?
2. When you increase the learning rate, the number of epochs needed to reach the previous cost is
reduced. How far can you increase the learning rate before the gradient descent stops converging?
3. Plot each set of cost output vs epoch as shown:

[10]: #1) Set learning rate 0.01 and epoch to 10000

#learning_rate = ________________________________________________

#epoch = ________________________________________________

#2) Call gradient function to calculate and display a, b and cost

#a, b, cost = ________________________________________________

#print ("a - " + str(a), "\nb - " + str(b))

#3) Plot different range of cost data

#plt.plot(cost[5000:6000], color='orange')
#plt.plot(cost[6000:7000], color='green')
#plt.plot(cost[7000:8000], color='black')
#plt.plot(cost[8000:9000], color='red')
#plt.plot(cost[9000:10000], color='blue')
#plt.show()

a - 0.7875370424239243
b - 0.10677062212005925
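One way to fill in the blanks, following the comments in the cell (the plotting lines are already given):

learning_rate = 0.01
epoch = 10000

a, b, cost = gradient(normX, normY, learning_rate, epoch)
print ("a - " + str(a), "\nb - " + str(b))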

3 Exercise 2:
The dataset auto-mpg-data.csv contains the Miles-Per-Gallon (fuel consumption) of 398 car
models. There are 8 input variables X (features), and the target variable Y is MPG (label). Use
simple linear regression (one variable) to predict the fuel consumption of a car based on its weight.
1. Read and load the data from auto-mpg-data.csv
2. Plot the data MPG vs Weight
3. Standardise the data using the max-min formula
4. Calculate the coefficient a, intercept b and cost with learning rate 0.01 and epoch 10000
5. Use a and b to get the predicted Y
6. Calculate R-squared and Adjusted R-squared to evaluate the model

[2]: import matplotlib.pyplot as plt


import numpy as np

##1. Read and load data from auto-mpg.data.csv

#file = "dataset/auto-mpg-clean.csv"

#data = ____________________________________________________________________________

#print(data.shape)
#print(data.dtype)
#print(data[:10])

#X = ______________________________________________________________

#Y = ______________________________________________________________

##2. Plot scatter diagram with MPG and Weight

#_______________________________________________________________

#plt.ylabel("mpg")
#plt.xlabel("weight")
#plt.show()

(392,)
[('mpg', '<f8'), ('cylinders', '<i4'), ('displacement', '<f8'), ('horsepower',
'<i4'), ('weight', '<i4'), ('acceleration', '<f8'), ('model_year', '<i4'),
('origin', '<i4'), ('car_name', '<U36')]
[(26. , 4, 97., 46, 1835, 20.5, 70, 2, 'volkswagen 1131 deluxe sedan')
(26. , 4, 97., 46, 1950, 21. , 73, 2, 'volkswagen super beetle')
(43.1, 4, 90., 48, 1985, 21.5, 78, 2, 'volkswagen rabbit custom diesel')
(44.3, 4, 90., 48, 2085, 21.7, 80, 2, 'vw rabbit c (diesel)')
(43.4, 4, 90., 48, 2335, 23.7, 80, 2, 'vw dasher (diesel)')
(29. , 4, 68., 49, 1867, 19.5, 73, 2, 'fiat 128')
(31. , 4, 76., 52, 1649, 16.5, 74, 3, 'toyota corona')
(29. , 4, 85., 52, 2035, 22.2, 76, 1, 'chevrolet chevette')
(32.8, 4, 78., 52, 1985, 19.4, 78, 3, 'mazda glc deluxe')
(44. , 4, 97., 52, 2130, 24.6, 82, 2, 'vw pickup')]
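A possible completion, following the loading pattern of the first cell in this lab (the field names weight and mpg come from the printed dtype above):

file = "dataset/auto-mpg-clean.csv"
data = np.genfromtxt(file, dtype=None, names=True, delimiter=",", encoding=None)

print(data.shape)
print(data.dtype)
print(data[:10])

X = data["weight"]
Y = data["mpg"]

plt.scatter(X, Y)
plt.ylabel("mpg")
plt.xlabel("weight")
plt.show()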

3.1 Normalise (scale) X and Y


Standardise the data using the max-min formula

[12]: ##3. Standardise the data using max, min formula

normX, normY = 0 , 0

#normX = ____________________________________________________________

#normY = ____________________________________________________________
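A possible completion, mirroring the scaling used in section 1.2:

normX = (X - X.min(0)) / X.ptp(0)
normY = (Y - Y.min(0)) / Y.ptp(0)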

3.2 Perform gradient descent for the linear model and plot the cost graph
• Plot the cost vs epoch graph as shown

[13]: ## Set the learning and epoch

#learning_rate = ________________________________________________

#epoch = ________________________________________________

##4. Call function to calculate and display coefficients a, intercept b and cost

#a, b, cost = ________________________________________________

#________________________________________________

## Plot the cost

#plt.plot(cost)
#plt.show()

a - -0.7173044078461626
b - 0.66172173887927
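A possible completion, using the settings from step 4 of the exercise (learning rate 0.01, epoch 10000):

learning_rate = 0.01
epoch = 10000

a, b, cost = gradient(normX, normY, learning_rate, epoch)
print ("a - " + str(a), "\nb - " + str(b))

plt.plot(cost)
plt.show()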

[15]: ## 5. Use a and b to get predicted Y

#Y_pred = ________________________________________________

## 6. Calculate and display R-square and Adjusted R-square to evaluate model

#r2, adjusted_r2 = ________________________________________________

#print(r2)
#print(adjusted_r2)

0.6926304308718862
0.6918423037715578
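A possible completion, reusing the r_squared function from section 1.6 with one feature (num_feature = 1):

Y_pred = a * normX + b

r2, adjusted_r2 = r_squared(normY, Y_pred, 1)

print(r2)
print(adjusted_r2)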

3.3 Plot the linear model overlaid on the dataset


Calculate the predicted values at the minimum and maximum of X and plot the regression chart as shown:

[14]: ## Uncomment following lines and fill in with your answer
## Use a and b to get min and max of Y prediction

Y_pred_min, Y_pred_max = 0, 0

Y_pred_min = a*min(normX) + b
Y_pred_max = a*max(normX) + b

plt.scatter(normX, normY)
plt.plot([min(normX), max(normX)], [Y_pred_min, Y_pred_max], color='red')  # regression line

#plt.plot([min(X), max(X)], [min(Y_pred), max(Y_pred)], color='red')  # regression line

plt.show()

Evaluate the model and display the error output.
We can see that this time the r-squared is not very high, even though we cannot reduce the cost
further. Simple linear regression may not be suitable for this dataset; we will try another model
in the next lab.

3.4 References:
• https://matplotlib.org/tutorials/introductory/pyplot.html
• https://numpy.org/doc/stable/reference/generated/numpy.ndarray.min.html
• https://numpy.org/doc/stable/reference/generated/numpy.ptp.html
• https://numpy.org/doc/stable/reference/generated/numpy.zeros.html
