Topic_7_Linear_regression
Topic_7_Linear_regression
ISE791
Table of Contents
1 Regression
1.1 Learning Outcomes
1.1.1 What is the difference between correlation and regression?
1.2 Regression vs Correlation
2 Linear Regression
2.1 Linear Regression Overview
3 Single Input Single Output (SISO) Linear regression
3.1 Solution Method for SISO Linear Regression
3.1.1 Example: SISO Linear Regression
3.1.2 Gradient Descent for Linear Regression
3.1.3 Example-2: SISO Linear Regression
3.1.3.1 How to find the value β1 and β0 ?
3.2 How does the error function looks for our data?
3.3 What are the best values for β1 and β0 ?
3.4 Where do the best values for β1 and β0 fall in the error function?
3.4.1 Gradient descent algorithm
4 Multiple Input Single Output (MISO) Linear Regression
4.1 Gradient Descent for MISO Linear Regression
4.2 Closed Form Solution for MISO Linear Regression
4.2.1 Example MISO Linear Regression
4.3 Penalized Linear Regression (PLR)
4.3.1 Solution of the Penalized Linear Regression
4.3.2 Example Penalized Regressions
4.3.3 Lasso vs Ridge: Summary
5 Cross-validation for Parameter Selection
6 Case Study
6.1 Case Study - 1
7 Concluding Remarks
7.1 Handling Big-Data
7.2 Concluding Remarks
8 References
8.1 Theory:
8.2 Data Sets:
Regression
A process for modeling the relationship between variables of interest.
Example: If you know the relationship between education and income (the more someone is educated, the more money they make), we could predict someone's income based on their education. Simply speaking, learning such a
relationship is regression.
Learning Outcomes
1. Implement linear and penalized regression methods (CLO-2)
2. Execute gradient descent algorithm for regression problem (CLO-2)
3. Process real data before applying regression (CLO-1)
4. Apply cross-validation and hyper-parameter selection (CLO-4)
1 65 129
2 67 126
3 68 143
4 70 156
5 71 161
6 72 158
7 72 168
8 73 166
9 73 182
10 75 201
In [1]: # 1. Calculate the correlation between *Attitude* and *Score* using python.
import pandas as pd
# import matplotlib.pyplot as plt
# import numpy as np
#corr = df.loc[:,['Attitude','Score']].corr()
#corr = df.iloc[:,[1,2]].corr()
display(corr)
print(f"The correlation between Attitude and Score is {corr.loc['Attitude','Score']}")
Attitude Score
1. If attitude decrease, then score will increase. True or False. Explain your answer based on Part-1.
False, since the direction of relationship is positive.
2. If attitude of a new student is 74 units, then can you estimated his score before the exam using correlation?
No, the correlation shows the strength and direction of the relationship. However, it cannot depict the actual relation.
Regression vs Correlation
Correlation:
Regression:
1. Regression may identify how one or more variables are related to an output variable.
2. Specifically, it will provide details of how input variables affects the output variable.
3. Beyond estimating a relationship, regression is a way of predicting an output variable from one or more input variables.
Linear Regression
The most common form of regression used in data analysis.
It assumes the relationship of the input variables and the output variables is linear (can be expressed as a line or hyperplane).
We will start with Single Input Single Output (SISO) linear regression.
Notations: Let x be the input variables, and let y be the output variable. The linear regression model can be stated as:
Model:
y = β0 + β1 x ,
where β1 represents the slope of the x, and β0 is the intercept for the equation.
Goal: Linear regression estimates the best values of β0 and β1 . So, when a new or previously unobserved data point x comes with unknown value of y, using the value of x, and estimated β0 and β1 values, one can find estimated
value of y, say y^. The goal of linear regression is to have y^ as close as possible to y.
sdy
β1 = r
sdx
and
β0 = ¯y¯¯ − β1¯x¯¯
where r is the Pearson's correlation coefficient, sdx and sdy represent the standard deviation of x and y variables respectively, ¯x
¯¯ and ¯y
¯¯ represent the means of x and y variables respectively.
1 65 129
2 67 126
3 68 143
4 70 156
5 71 161
6 72 158
7 72 168
8 73 166
9 73 182
10 75 201
1. Visualize the relationship between Attitude and Score (plot Attitude on x-axis, and Score on y-axis).
2. Is the relationship linear, comment.
3. Identify the linear relationship between Attitude and Score of the students using the formula. Then verify the result using python.
4. If a new participant with positive attitude of 78 is taking the exam, then what is the estimated score of the participant.
5. If new participants with positive attitude of 78, 74, 68 and 69 are taking the exam, then what are the estimated scores for the participants.
6. The estimated score and predicted score of student with attitude of 68 are different. Is there some error in the approach?
In [2]: # 1. Visualize the relationship between Attitude and Score (plot Attitude on x-axis, and Score on y-axis).
%matplotlib inline
# %matplotlib notebook
# %matplotlib qt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
1. Identify the linear relationship between Attitude and Score of the students using the formula. Then verify the result using python.
Pearson’s correlation coefficient r = 0.94. Mean of Attitude = 70.6, mean of Score = 159.
S.D. of Attitude = 2.94, S.D. of Score = 21.64.
Thus,
21.64
β1 = 0.94 × = 6.93
2.94
## The following code computes betas using the formula and python,
## but without using the skliear library
# import numpy as np
# x=df['Attitude'].values
# y=df['Score'].values
# r = np.corrcoef(x, y)[0,1]
# sdx = np.std(x)
# sdy = np.std(y)
# beta1 = r * (sdy / sdx)
# beta0 = np.mean(y) - beta1*np.mean(x)
# print(r,np.mean(x),np.mean(y),sdx,sdy,beta1,beta0)
1. If a new participant with positive attitude of 78 is taking the exam, then what is the estimated score of the participant.
β1 = 6.93
β0 = −330.46
In [4]: x_new = 78
y_new = reg.predict(np.array(x_new).reshape(1, -1)) #single value or single record
print(f'The estimated score for the student with attiude of {x_new} is {np.round(y_new,2)}.')
In [5]: # 5. If new participants with positive attitude of 78,74,68 and 69 are taking the exam,
# then what are the estimated scores for the participants.
x_new = [78,74,68,69]
y_new = reg.predict(np.array(x_new).reshape(-1, 1)) #single column
print(f'The estimated scores for the students with attiudes of {x_new} are {np.round(y_new,2).tolist()} respectively .')
The estimated scores for the students with attiudes of [78, 74, 68, 69] are [210.3, 182.57, 140.97, 147.91] respectively .
1. The estimated score and predicted score of student with attitude of 68 are different. Is there some error in the approach?
From the plot, we can see that the relationship is not perfectly linear. Thus, the estimated values will not give the perfect results. There will be some error. The linear regression tries to minimize the mean squared error of the
input data points. We will see all these details in the following slides.
import pandas as pd
df = pd.read_csv('data/Regression-2.csv', delimiter =',')
display(df.T)
# df.describe().T
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y 3 4 8 4 6 9 8 12 15 26 35 40 45 54 49 59 60 62 63 68
1. Can we model the data using a linear relationship? If yes, then suggest the linear model, and highlight the unknowns.
y = β1 x + β0
where β1 is the line's slope and β0 is the line's y-intercept are the two unknowns.
Idea:
Mechanism:
1. define an error function (a.k.a cost function), that measures how good a given line is,
2. for a given value of β1 and β0 , the error function returns an error value based on how well the line fits our data.
1 n
ϵ= ∑((β1 xi + β0 ) − yi )2 .
n
i=1
Solution Method:
1. values of value of β1 and β0 that minimizes the above function will give best line fitted for our data.
2. minimization of the above function can be done using different algorithms.
3. We will focus on gradient descent algorithm.
%matplotlib inline
# %matplotlib qt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
b1Grid = np.linspace(0,8)
b0Grid = np.linspace(-20,10)
def error_function(x,y,b0,b1,e=0):
for i,yi in enumerate(y):
e += np.square((b1*x[i]+b0) - yi)
return e/len(y)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(mv, bv, z, color='b', alpha=0.3)
ax.contour(mv, bv, z, 100, offset =-1000)
ax.set_xlabel('b1')
ax.set_ylabel('b0')
ax.set_zlabel('Z')
ax.view_init(elev=40, azim=120)
plt.show()
reg = LinearRegression(fit_intercept=True)
reg.fit(df[['x']], df['y']) #single column
# reg.fit(df['x'].values.reshape(-1,1), df['y']) #single column
best_b1 = reg.coef_[0]
best_b0 = reg.intercept_
print(f'The best values for b1 and b0 are {np.round(best_b1,2)} and {np.round(best_b0,2)} respectively.')
The best values for b1 and b0 are 3.98 and -10.26 respectively.
Where do the best values for β1 and β0 fall in the error function?
The values of β1 and β0 fall at the lowest point of the error function.
1 n
ϵ= ∑((β1 xi + β0 ) − yi )2 .
n
i=1
∂ϵ 2 n
= ∑((β1 xi + β0 ) − yi )xi ,
∂β1 n
i=1
∂ϵ 2 n
= ∑((β1 xi + β0 ) − yi ).
∂β0 n
i=1
Algorithm:
Initialize:
Start with random β1 and β0 values, say β old and β old .
1 0
Calculate ϵold
repeat {
Update:
β1new = β1old − λ n2 ∑ni=1 ((β1old xi + β0old ) − yi )xi
β0new = β0old − λ n2 ∑ni=1 ((β1old xi + β0old ) − yi )
Calculate ϵnew
} until (ϵnew >= ϵold )
In the above algorithm λ > 0 is step size, and is usually selected as a small number. Here we do not have to consider the second order information because the function is convex.
def error_function(x,y,b0,b1):
e=0
for i,yi in enumerate(y):
e += ((b1*x[i]+b0) - yi)**2
return e/len(y)
def partial_b1(x,y,b0,b1):
e=0
for i,yi in enumerate(y):
e += ((b1*x[i]+b0) - yi)*x[i]
return 2*e/len(y)
def partial_b0(x,y,b0,b1):
e=0
for i,yi in enumerate(y):
e += ((b1*x[i]+b0) - yi)
return 2*e/len(y)
# Given
#initialize
b1_Old = 10
b0_Old = 10
mse_Old = error_function(x,y,b0_Old,b1_Old)
lam = 0.005
print('Initial Values:', b0_Old, b1_Old, mse_Old,'\n')
##repeate
while True:
b1_New = b1_Old - lam * partial_b1(x,y,b0_Old,b1_Old)
b0_New = b0_Old - lam * partial_b0(x,y,b0_Old,b1_Old)
mse_New = error_function(x,y,b0_New,b1_New)
if mse_New >= mse_Old:
b0_New,b1_New,mse_New = b0_Old,b1_Old,mse_Old
break
else:
b0_Old,b1_Old,mse_Old = b0_New,b1_New,mse_New
So far we have seen one input variable's relationship with one output variable.
In many practical applications, we will be dealing with multiple input variables.
To incorporate multiple variables, consider the following notations:
Let the n observations be represented by (xi , yi ), for i = 1, … , n
where xi is a vector in RP , and P are the number of input variables.
Let β0 , the intercept
Let βj be the coefficients of variable xi , for j = 1, … , P , and for i = 1, … , n
j
For ease of notation, we introduce xi0 = 1 for i = 1, … , n
P
h(xi ) = ∑ βj xij ∀i = 1, … , n.
j=0
1 n 2
J(β) = ∑ (h(xi ) − yi ) .
n
i=0
∂ 2 n
J(β) = ∑ (h(xi ) − yi ) xij
∂βj n
i=0
2 n
βj := βj − λ ∑ (h(xi ) − yi ) xij
n
i=0
In the above algorithm λ > 0 is step size, and is usually selected as a small number. Here we do not have to consider the second order information because the J(β) function is convex.
β∗ = (XT X)−1 XT y,
where X ∈ Rn×(P +1) , each column of X represent an input variable, the first column of X contains all ones, and y ∈ Rn represents the column containing output variable.
# display(corr)
In [7]: # 3. Identify the top three correlated input variables to the output variable.
corr=corr.apply(lambda x: np.abs(x))
sorted_corr = sorted_corr['y'].index
print('The top three correlated input variables are: ',sorted_corr[1:4].tolist())
print(sorted_corr)
The top three correlated input variables are: ['x1', 'x5', 'x6']
Index(['y', 'x1', 'x5', 'x6', 'x2', 'x8', 'x7', 'x4', 'x3'], dtype='object')
In [8]: #4: Calculate the coefficient estimates using OLS closed form.
# import numpy as np
Xo = df.iloc[:,0:-1].values
y = df.iloc[:,-1].values
X = np.c_[np.ones(len(y)), Xo]
#print(Xo)
#print(y)
#print(X)
The closed form estimates are: [-0.0, 0.58, 0.23, -0.14, 0.12, 0.27, -0.13, 0.03, 0.11]
In [15… # 5: Calculate the coefficient estimates using scikit learn LinearRegression module.
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(Xo, y)
best_beta = np.round(reg.coef_,2)
best_beta_0 = np.round(reg.intercept_,2)
print(f'The best values for the estimates are :', best_beta_0, best_beta.tolist())
The best values for the estimates are : -0.0 [0.58, 0.23, -0.14, 0.12, 0.27, -0.13, 0.03, 0.11]
The coefficients obtained by minimizing the squared error of the training data, may not perform in general good on the new data.
This phenomenon is called as over-fitting.
In order to control over-fitting, a penalized term (or regularization term) is added to the error function.
The aim of the penalized function is to set some of the coefficients to zero.
1 n
∑ (h(xi ) − yi ) + αP (β),
2
J(β) =
n
i=1
where P (β) is the penalty or regularized term, and α is the regularization coefficient.
In our class, we will look at the following two popular penalization linear regression models:
Ridge regression:
1 n P
∑ (h(x ) − yi ) + α ∑ βj2
i 2
n
i=1 j=0
LASSO regression:
1 n P
∑ (h(x ) − yi ) + α ∑ |βj |
i 2
n
i=1 j=0
β∗ = (XT X + αI)−1 XT y,
LASSO regression:
No closed form solution, but many efficient algorithms exists.
Note: gradient descent can be used for solving the above penalized linear regression problems.
1. Find the coefficient estimates using Ridge regression for alpha = 500, using sci-kit learn.
2. Find the coefficient estimates using Lasso regression for alpha = 0.5, using sci-kit learn.
3. Are the coefficients different from OLS. What can you infer.
In [16… #1. Find the coefficient estimates using Ridge regression for 𝑎𝑙𝑝ℎ𝑎=500 , using sci-kit learn.
import pandas as pd
df = pd.read_csv('data/Regression-3.csv', delimiter =',')
Xo = df.iloc[:,0:-1].values
y = df.iloc[:,-1].values
The best values for the estimates are : -0.0 [0.1, 0.06, 0.01, 0.02, 0.07, 0.06, 0.04, 0.04]
In [17… #2. Find the coefficient estimates using Lasso regression for 𝑎𝑙𝑝ℎ𝑎=0.5 , using sci-kit learn.
from sklearn.linear_model import Lasso
regl = Lasso(alpha=0.5)
regl.fit(Xo, y)
best_beta = np.round(regl.coef_,2)
best_beta_0 = np.round(regl.intercept_,2)
print(f'The best values for the estimates are :', best_beta_0, best_beta.tolist())
The best values for the estimates are : -0.0 [0.23, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
1. Are the coefficients different from OLS. What can you infer.
Yes, the coefficients are different due to penalized objective. Following can be inferred:
OLS vs Ridge: OLS and Ridge have all non-zero coefficients. However, the coefficients given by Ridge are less in terms of absolute value (shrunken coefficients)
OLS vs Lasso: OLS has all non-zero coefficients, whereas Lasso has ONE non-zero coefficients (selected coefficient). Thus, Lasso claims that only the first column is relevant to the output variable, for alpha=0.5.
Train-Testing:
Cross-validation (CV):
Case Study
Let's test the power of linear regression
Case Study - 1
Data:
Historical data of real estate prices per unit area are collected for a city. Following 6 input features are recorded over the time:
X1=the transaction date
X2=the house age (unit: year)
X3=the distance to the nearest MRT station (unit: meter)
X4=the number of nearby convenience stores (integer)
X5=the geographic coordinate, latitude. (unit: degree)
X6=the geographic coordinate, longitude. (unit: degree)
Values of the 6 input features for (one observation) is written in a row. Furthermore, the corresponding real estate price per unit area (output variable) values are stored under the column Y in the corresponding row. The data is given
in Regression-4.csv file.
Hypothesis:
Our underlying hypothesis is that the input variables are linearly related with the output variable.
Objective:
The objective of this case study is to identify the input variables' relationship with the output variable. Specifically, conduct a regression analysis, and estimate the best coefficients that capture the underlying relationship. Note: It is
not necessary that all the input variables are related to the output variable.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X1 transaction date 414 non-null float64
1 X2 house age 414 non-null float64
2 X3 distance to the nearest MRT station 414 non-null float64
3 X4 number of convenience stores 414 non-null int64
4 X5 latitude 414 non-null float64
5 X6 longitude 414 non-null float64
6 Y house price of unit area 414 non-null float64
dtypes: float64(6), int64(1)
memory usage: 22.8 KB
None
Out[4]: count mean std min 25% 50% 75% max
X1 transaction date 414.0 2013.148971 0.281967 2012.66700 2012.917000 2013.16700 2013.417000 2013.58300
X2 house age 414.0 17.712560 11.392485 0.00000 9.025000 16.10000 28.150000 43.80000
X3 distance to the nearest MRT station 414.0 1083.885689 1262.109595 23.38284 289.324800 492.23130 1454.279000 6488.02100
X4 number of convenience stores 414.0 4.094203 2.945562 0.00000 1.000000 4.00000 6.000000 10.00000
Y house price of unit area 414.0 37.980193 13.606488 7.60000 27.700000 38.45000 46.600000 117.50000
scaler.fit(np.c_[X_train,y_train])
A_train = scaler.transform(np.c_[X_train,y_train])
X_train = A_train[:,:-1]
y_train = A_train[:,-1]
A_test = scaler.transform(np.c_[X_test,y_test])
X_test = A_test[:,:-1]
y_test = A_test[:,-1]
# print(A_train)
## OLS
from sklearn.linear_model import LinearRegression
reg1 = LinearRegression(fit_intercept=False).fit(X_train, y_train)
y_pred1 = reg1.predict(X_test)
print('The MSE using OLS is:', mean_squared_error(y_test, y_pred1))
## Ridge
from sklearn.linear_model import RidgeCV
reg2 = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], fit_intercept=False,cv=10).fit(X_train, y_train)
y_pred2 = reg2.predict(X_test)
print('The MSE using Ridge is:', mean_squared_error(y_test, y_pred2))
## Lasso
from sklearn.linear_model import LassoCV
reg3 = LassoCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3],
fit_intercept=False,cv=10, random_state=0).fit(X_train, y_train)
y_pred3 = reg3.predict(X_test)
print('The MSE using Lasso is:', mean_squared_error(y_test, y_pred3))
#best_beta = np.round(reg3.coef_,2)
#best_beta_0 = np.round(reg3.intercept_,2)
#print(f'The best values for the estimates are :', best_beta_0, best_beta.tolist())
Lasso performs better than OLS and Ridge for this data.
From the regression analysis, we can say that input variable X6 seems to have no relationship with Y.
Given a new information related to X1, X2, X3, X4 & X5, using the above lasso reg3 model, one can predict the price (Y).
Concluding Remarks
Further knowledge in regression
Handling Big-Data
When n is very huge, the close form solution requires high computer memory.
Typically, gradient descent is preferred over closed form solution in such cases.
The gradient descent can be applied in following ways:
Batch mode: The gradient descent algorithm considers all n data points while calculating the gradient
Stochastic/Incremental mode: The gradient descent algorithm considers one of the n data points at a time while calculating the gradient
Mini-Batch mode: The gradient descent algorithm considers a sample from n data points at a time while calculating the gradient
When n is very huge, typically, Stochastic or Mini-batch mode is used.
Concluding Remarks
Accuracy
Training Time
Number of Parameters
Number of Features
Choosing the Right Estimator
ElasticNet
References
Theory:
1. Chirag Shah, "A Hands-On Introduction to Data Science," Cambridge University Press, 2020, Section 3.8.1, 8.3, 8.4, 12.4.
2. Bishop, Christopher M., "Pattern recognition and machine learning," Springer, 2006, Section 3.1.4.
Data Sets:
1. Regression-1 and Regression-2: Chirag Shah, "A Hands-On Introduction to Data Science," Cambridge University Press, 2020.
2. Regression-3: Stamey, Thomas A., et al. "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients." The Journal of urology 141.5 (1989): 1076-1083.
3. Regression-4: Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
In [ ]: