Linear Regression
import numpy as np
import matplotlib.pyplot as plt

file = "./dataset/Income3.csv"
data = np.genfromtxt(file, dtype=None, names=True, delimiter=",", encoding=None)
print(data.shape)
print(data.dtype)
print(data[:10])
(20,)
[('Observation', '<i4'), ('Years_of_Higher_Education_x', '<i4'), ('Income_y',
'<i4')]
[( 1, 6, 89617) ( 2, 0, 39826) ( 3, 6, 79894) ( 4, 3, 56547)
( 5, 4, 64795) ( 6, 0, 31007) ( 7, 5, 76229) ( 8, 6, 81044)
( 9, 0, 39467) (10, 6, 83546)]
Standardise the data:
(X - min(X)) / (max(X) - min(X))
The peak-to-peak function (np.ptp) returns the difference between the maximum and the minimum.
• scale X and Y
– normX = (X - X.min(0)) / X.ptp(0)
– normY = (Y - Y.min(0)) / Y.ptp(0)
X = data[data.dtype.names[1]]
Y = data[data.dtype.names[2]]
plt.scatter(X,Y)
plt.ylabel(data.dtype.names[2])
plt.xlabel(data.dtype.names[1])
plt.show()
##scale X and Y
normX = (X - X.min(0)) / X.ptp(0)
normY = (Y - Y.min(0)) / Y.ptp(0)
plt.scatter(normX,normY)
plt.ylabel("Norm " + data.dtype.names[2])
plt.xlabel("Norm " + data.dtype.names[1])
plt.show()
As the scatter plot shows, the data appears to have a linear relationship between X and Y.
[3]: # Building the model
# Calculate the predicted value, the cost and the gradients.
# X and Y are the original data and labels.
def forward(a, b, X, Y):  # (def line restored; the name and arguments are inferred from the call in the gradient descent cell)
    #Y_pred = ________________________________________________
    #cost = __________________________________________________
    ## Calculate the derivatives with respect to a and b
    #D_a = ___________________________________________________
    #D_b = ___________________________________________________
    return {
        "Y_pred" : Y_pred,
        "cost" : cost,
        "D_a" : D_a,
        "D_b" : D_b
    }
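As a reference, here is one possible completion of the forward step, assuming a mean-squared-error cost (a sketch, not necessarily the official solution):

def forward(a, b, X, Y):
    n = len(X)
    Y_pred = a * X + b                           # predicted values
    cost = np.sum((Y - Y_pred) ** 2) / n         # mean squared error
    D_a = (-2 / n) * np.sum(X * (Y - Y_pred))    # d(cost)/d(a)
    D_b = (-2 / n) * np.sum(Y - Y_pred)          # d(cost)/d(b)
    return {"Y_pred": Y_pred, "cost": cost, "D_a": D_a, "D_b": D_b}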
1.4 Perform the Gradient Descent
1. initialise the model variables to zero
2. initialise the learning rate to a small number, e.g. 0.001
3. initialise the number of iterations to perform gradient descent
4. plot the cost vs epochs (number of iterations)
def gradient_descent(a, b, X, Y, learning_rate, epoch):
    # (function signature inferred from the calls later in the notebook)
    cost = np.zeros(epoch)
    for i in range(epoch):
        parameters = forward(a, b, X, Y)
        a = a - learning_rate * parameters["D_a"]  # Update a
        b = b - learning_rate * parameters["D_b"]  # Update b
        cost[i] = parameters["cost"]
    return a, b, cost
1.4.1 Calculate the gradients and plot the cost per epoch, as shown in the figure:
[5]: # 4. plot the cost vs epochs (number of iterations)
# Performing Gradient Descent / Training model
a - 0.7147201151839904
b - 0.15562350762122823
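A minimal sketch of the training call that could produce output like the above, using the gradient_descent function defined earlier (the exact learning rate and epoch count used here are assumptions):

a, b = 0, 0              # 1. initialise the model variables to zero
learning_rate = 0.001    # 2. small learning rate (assumed value)
epoch = 10000            # 3. number of iterations (assumed value)
a, b, cost = gradient_descent(a, b, normX, normY, learning_rate, epoch)
print("a -", a)
print("b -", b)
plt.plot(cost)           # 4. plot the cost vs epochs
plt.xlabel("epoch")
plt.ylabel("cost")
plt.show()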
We can see from the cost vs epoch plot that the cost/error decreases over time (epochs).
We can stop gradient descent once the cost no longer decreases appreciably with additional epochs.
Note: Make sure you have completed the forward and gradient function steps
[6]: #1) Plot the normalised original data with the regression line
Y_pred_min = a*min(normX) + b
Y_pred_max = a*max(normX) + b
print(Y_pred_min , Y_pred_max)
print(min(normX), max(normX))
plt.scatter(normX, normY)
plt.plot([min(normX), max(normX)], [Y_pred_min, Y_pred_max], color='red')  # regression line
plt.show()
0.15562350762122823 0.8703436228052186
0.0 1.0
Observe that the model fits the data points well.
We can also plot it over the original value range.
[7]: #2) Plot the original data with the regression line
# Convert the normalised predictions back to the original scale
# (this denormalisation step is inferred from the printed output below)
Y_pred_min = Y_pred_min * Y.ptp(0) + Y.min(0)
Y_pred_max = Y_pred_max * Y.ptp(0) + Y.min(0)
print(Y_pred_min, Y_pred_max)
plt.scatter(X, Y)
plt.plot([min(X), max(X)], [Y_pred_min, Y_pred_max], color='red')  # regression line
plt.show()
40128.09378168019 82017.83973261385
1.6 Evaluating the model using R-squared & Adjusted R-squared
When performing regression, we can judge whether the model performs well by checking the error.
For the standard error of the estimate, the smaller the error, the better the model. Another
statistical measure of a regression model's performance is the coefficient of determination,
R-squared.
The Adjusted R-squared is a refinement that also takes into account the number of variables used
in the model.
Both measures should fall within the range 0 to 1. Values outside this range are possible,
however, and indicate that the model is not appropriate for the dataset.
Variance explained by the model (based on the sums of squares of the residual errors):
R2 = 1 - (SS_err / SS_tot), where SS_err = sum((Y - Y_pred)^2) is the residual sum of squares
and SS_tot = sum((Y - mean(Y))^2) is the total sum of squares. Adjusted R2 = 1 - (1 - R2) *
(n - 1) / (n - k - 1), where n is the number of samples and k the number of predictors.
def r_squared(Y, Y_pred, k):
    # (function wrapper inferred from the call below)
    ss_err, ss_tot, r2, adjusted_r2 = 0, 0, 0, 0
    #ss_err = ________________________________________________
    #ss_tot = ________________________________________________
    r2 = 1 - (ss_err/ss_tot)
    #adjusted_r2 = ___________________________________________
    return r2, adjusted_r2
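One possible completion of r_squared, using the formulas above (a sketch; n is the sample size and k the number of predictors):

def r_squared(Y, Y_pred, k):
    n = len(Y)
    ss_err = np.sum((Y - Y_pred) ** 2)      # residual sum of squares
    ss_tot = np.sum((Y - np.mean(Y)) ** 2)  # total sum of squares
    r2 = 1 - (ss_err / ss_tot)
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adjusted_r2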
Y_pred = a * normX + b
#print(normX)
#print(Y_pred)
r2, adjusted_r2 = r_squared(normY, Y_pred, 1)
print (r2)
print (adjusted_r2)
0.9457699888541073
0.9427572104571134
We can see that we have a high R-squared and Adjusted R-squared. This means that the model fits
the dataset well.
[9]: # 2) Display both original and predicted result
a - 7738.283872313206
b - 37043.63924056803
Original Predicted
0 89617 83473.342474
1 39826 37043.639241
2 79894 83473.342474
3 56547 60258.490858
4 64795 67996.774730
5 31007 37043.639241
6 76229 75735.058602
7 81044 83473.342474
8 39467 37043.639241
9 83546 83473.342474
10 68852 67996.774730
11 79357 75735.058602
12 68901 67996.774730
13 49198 52520.206985
14 47407 44781.923113
15 89594 83473.342474
16 56696 52520.206985
17 55078 52520.206985
18 70756 75735.058602
19 79069 83473.342474
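The original-scale coefficients and the table above could be produced by a sketch along the following lines (pandas is assumed here purely for the tabular display; it does not appear in the original):

import pandas as pd  # assumed, for the tabular display only

# Convert the coefficients trained on the normalised data back to the original scale
a_orig = a * Y.ptp(0) / X.ptp(0)
b_orig = b * Y.ptp(0) + Y.min(0) - a_orig * X.min(0)
print("a -", a_orig)
print("b -", b_orig)
Y_pred_orig = a_orig * X + b_orig
print(pd.DataFrame({"Original": Y, "Predicted": Y_pred_orig}))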
[10]: #1) Set learning rate 0.01 and epoch to 10000
#learning_rate = ________________________________________________
#epoch = ________________________________________________
#plt.plot(cost[5000:6000], color='orange')
#plt.plot(cost[6000:7000], color='green')
#plt.plot(cost[7000:8000], color='black')
#plt.plot(cost[8000:9000], color='red')
#plt.plot(cost[9000:10000], color='blue')
#plt.show()
a - 0.7875370424239243
b - 0.10677062212005925
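A possible completion for the cell above (the learning rate and epoch values come from the comment in the cell; gradient_descent as defined earlier):

learning_rate = 0.01
epoch = 10000
a, b = 0, 0
a, b, cost = gradient_descent(a, b, normX, normY, learning_rate, epoch)
print("a -", a)
print("b -", b)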
3 Exercise 2:
The dataset auto-mpg-data.csv contains the Miles-Per-Gallon (fuel consumption) of 398 car
models. There are 8 input variables X (features), and the target variable Y is MPG (label). Use
simple linear regression (one variable) to predict the fuel consumption of a car based on its weight.
1. Read and load data from auto-mpg.data.csv
2. Plot the data MPG and Weight
3. Standardise the data using the max, min formula
4. Calculate coefficients a, b and the cost with learning rate 0.01 and epoch 10000
5. Use a and b to get the predicted Y
6. Calculate R-squared and Adjusted R-squared to evaluate the model
#file = "dataset/auto-mpg-clean.csv"
#data = ____________________________________________________________________________
#print(data.shape)
#print(data.dtype)
#print(data[:10])
#X = ______________________________________________________________
#Y = ______________________________________________________________
#_______________________________________________________________
#plt.ylabel("mpg")
#plt.xlabel("weight")
#plt.show()
(392,)
[('mpg', '<f8'), ('cylinders', '<i4'), ('displacement', '<f8'), ('horsepower',
'<i4'), ('weight', '<i4'), ('acceleration', '<f8'), ('model_year', '<i4'),
('origin', '<i4'), ('car_name', '<U36')]
[(26. , 4, 97., 46, 1835, 20.5, 70, 2, 'volkswagen 1131 deluxe sedan')
(26. , 4, 97., 46, 1950, 21. , 73, 2, 'volkswagen super beetle')
(43.1, 4, 90., 48, 1985, 21.5, 78, 2, 'volkswagen rabbit custom diesel')
(44.3, 4, 90., 48, 2085, 21.7, 80, 2, 'vw rabbit c (diesel)')
(43.4, 4, 90., 48, 2335, 23.7, 80, 2, 'vw dasher (diesel)')
(29. , 4, 68., 49, 1867, 19.5, 73, 2, 'fiat 128')
(31. , 4, 76., 52, 1649, 16.5, 74, 3, 'toyota corona')
(29. , 4, 85., 52, 2035, 22.2, 76, 1, 'chevrolet chevette')
(32.8, 4, 78., 52, 1985, 19.4, 78, 3, 'mazda glc deluxe')
(44. , 4, 97., 52, 2130, 24.6, 82, 2, 'vw pickup')]
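A possible completion for the loading and plotting cell (a sketch; the file path is the one given in the cell, and the column names are taken from the dtype output above):

file = "dataset/auto-mpg-clean.csv"
data = np.genfromtxt(file, dtype=None, names=True, delimiter=",", encoding=None)
print(data.shape)
print(data.dtype)
print(data[:10])
X = data["weight"]
Y = data["mpg"]
plt.scatter(X, Y)
plt.ylabel("mpg")
plt.xlabel("weight")
plt.show()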
normX, normY = 0 , 0
#normX = ____________________________________________________________
#normY = ____________________________________________________________
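These can be completed with the same min/ptp scaling used in Exercise 1:

normX = (X - X.min(0)) / X.ptp(0)
normY = (Y - Y.min(0)) / Y.ptp(0)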
3.2 Perform gradient descent for the linear model and plot the cost graph
• Plot the cost graph vs epoch as shown
[13]: ## Set the learning rate and epoch
#learning_rate = ________________________________________________
#epoch = ________________________________________________
##4. Call function to calculate and display coefficients a, intercept b and cost
#________________________________________________
#plt.plot(cost)
#plt.show()
a - -0.7173044078461626
b - 0.66172173887927
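This cell can be completed the same way as in Exercise 1, for example:

learning_rate = 0.01
epoch = 10000
a, b, cost = gradient_descent(0, 0, normX, normY, learning_rate, epoch)
print("a -", a)
print("b -", b)
plt.plot(cost)
plt.show()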
[15]: ## 5. Use a and b to get predicted Y
#Y_pred = ________________________________________________
#print(r2)
#print(adjusted_r2)
0.6926304308718862
0.6918423037715578
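A possible completion, reusing the r_squared helper from earlier:

Y_pred = a * normX + b
r2, adjusted_r2 = r_squared(normY, Y_pred, 1)
print(r2)
print(adjusted_r2)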
[14]: ## Uncomment the following lines and fill in with your answer
## Use a and b to get the min and max of the Y prediction
Y_pred_min = a*min(normX) + b
Y_pred_max = a*max(normX) + b
plt.scatter(normX, normY)
plt.plot([min(normX), max(normX)], [Y_pred_min, Y_pred_max], color='red')  # regression line
plt.show()
5. Evaluate the model and display the error output.
We can see that this time the R-squared is not very high, even though the cost cannot be reduced
further. Simple linear regression may not be suitable for this dataset; we will try another model
in the next lab.
3.4 References:
• https://matplotlib.org/tutorials/introductory/pyplot.html
• https://numpy.org/doc/stable/reference/generated/numpy.ndarray.min.html
• https://numpy.org/doc/stable/reference/generated/numpy.ptp.html
• https://numpy.org/doc/stable/reference/generated/numpy.zeros.html