
Introduction to Regression.

Please install scikit-learn and matplotlib. We will use scikit-learn for the data analysis and matplotlib for plotting. Pandas will be used for reading the data.

1 Linear Regression

Figure 1: Linear regression

Regression analysis is a conceptually simple method for investigating functional relationships among
variables. These variables are known as the dependent/response variable (y) and the independent/predictor
variables (x).

y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + b \qquad (1)

a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \qquad (2)

b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n} \qquad (3)
Please consider a dataset of the marks distribution of a class, as given below:

Listing 1: marks_Btech.dat

Names Maths Phys Chem Bio Eng Soc-Science Comp-Science


Rishi 92 80 91 78 76 89 95
Arpit 91 77 74 81 83 91 96
Raj 95 92 78 91 78 79 99
Indra 78 72 68 81 84 89 82
Vinay 89 81 78 86 81 84 92
Ashok 48 39 37 38 51 57 61
Arya 38 27 31 38 38 51 44
Priti 91 90 91 95 98 99 98
Som 90 89 95 78 91 91 96
Akash 38 37 34 71 81 78 51
Sruti 98 92 78 91 78 68 100
Palash 79 72 68 81 84 89 86
Arun 89 81 78 86 81 80 92
Rajesh 80 99 97 88 81 84 84
Veer 88 27 31 38 38 45 92
Sarti 94 90 91 95 98 99 97
Somen 89 89 95 78 91 89 93
Aksh 88 77 74 71 81 84 92
Arpita 93 77 74 81 83 88 96

1.1 Single variable linear regression


In the program given below, we study the regression between the marks scored by students in maths and
in computer science. The marks scored in maths form X; the marks scored in computer science form y.
First, we split the data (X, y) into training (70%) and test (30%) sets. We then fit the training data to
calculate the regression coefficient and intercept and, finally, predict the line of best fit for y.

The least-squares regression line for a set of n data points is given by y = ax + b.
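
As a quick cross-check of equations (2) and (3), the slope a and intercept b can be computed directly with NumPy. This is a minimal sketch using two small hypothetical arrays (not taken from the dataset), independent of the listing below:

import numpy as np

# Hypothetical example marks, for illustration only
x = np.array([92, 91, 95, 78, 89], dtype=float)
y = np.array([95, 96, 99, 82, 92], dtype=float)

n = len(x)
# Slope from equation (2) and intercept from equation (3)
a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b = (np.sum(y) - a * np.sum(x)) / n
print(a, b)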

Listing 2: linear_regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Read the whitespace-separated marks file
df = pd.read_table("marks_Btech.dat", sep=r"\s+")

# X: marks in maths (column 1), y: marks in computer science (column 7)
X = df.iloc[:, 1:2].values
print(X)
y = df.iloc[:, 7].values
print(y)

# Split the data into 70% training and 30% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train)
print(X_test)
print(y_train)
print(y_test)

# Fit an ordinary least-squares model on the training data
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

print(reg.coef_)
print(reg.intercept_)

# Predict computer-science marks for the test set
y_pred = reg.predict(X_test)

# Plot the test data (black), training data (red) and predictions (blue)
plt.scatter(X_test, y_test, color='black', linewidth=3)
plt.scatter(X_train, y_train, color='red', linewidth=1)
plt.scatter(X_test, y_pred, color='blue', linewidth=3)
plt.show()
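
To get a quick sense of the quality of the fit, the model can also be scored on the held-out data. The following lines are a small optional addition to Listing 2 (they assume reg, X_test, y_test and y_pred from the listing above); reg.score returns the coefficient of determination R²:

# Coefficient of determination (R^2) on the test data
print(reg.score(X_test, y_test))

# Mean squared error between predictions and true test values
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))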

1.2 Multivariable linear regression


The dependent variable y depends on n independent variables x_1, x_2, ..., x_n:

y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + b \qquad (4)

Suppose we want to calculate the linear regression between the marks scored in all the remaining
subjects and the marks scored in computer science. We perform multivariable linear regression:

Listing 3: Multi_linear

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

df = pd.read_table("marks_Btech.dat", sep=r"\s+")

# X: marks in all subjects except computer science, y: marks in computer science
X = df.iloc[:, 1:-1].values
print(X, len(X))
y = df.iloc[:, 7].values
print(y, len(y))

# Split the data into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train, len(X_train))
print(X_test)
print(y_train, len(y_train))
print(y_test)

# Fit an ordinary least-squares model on the training data
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

print(reg.coef_)

# Predict computer-science marks for the test set
y_pred = reg.predict(X_test)

print(y_test)
print(reg.intercept_)
print(y_pred)

2 Polynomial regression
y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \qquad (5)

The data consist of n observations of the dependent variable y on the independent variable x.

In the program given below, we use polynomial regression of order 1 to 6 to study the regression
between the marks scored by the students in maths and in computer science.

Figure 2: Linear regression

Listing 4: Poly_regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_table("marks_Btech.dat", sep=r"\s+")

# X: marks in maths, y: marks in computer science
X = df.iloc[:, 1:2].values
print(X)
y = df.iloc[:, 7].values
print(y)

# Split the data into 70% training and 30% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

reg = linear_model.LinearRegression()

# Example: polynomial regression of degree 2, fitted on the training set
poly_reg = PolynomialFeatures(degree=2)
x_poly = poly_reg.fit_transform(X_train)
reg.fit(x_poly, y_train)
y_pred = reg.predict(poly_reg.transform(X_test))
print(reg.coef_)
print(reg.intercept_)

# Polynomials of degree 1 to 6, each fitted on the training set and scored on the test set
for i in range(1, 7):
    poly_reg = PolynomialFeatures(degree=i)
    x_poly = poly_reg.fit_transform(X_train)
    reg.fit(x_poly, y_train)
    print('Degree of Equation:', i)
    print('Coefficient:', reg.coef_)
    print('Intercept:', reg.intercept_)
    print('Accuracy Score:', reg.score(poly_reg.transform(X_test), y_test))

3 Assignment
• Please use matplotlib to plot y_pred as a function of X_test using polynomial regression of
order 1 to 6 and compare them in the same plot.

• In the program written above, we performed single-variable polynomial regression. Please perform
multivariable polynomial regression to predict y, the marks scored in computer science,
as a function of the marks scored in all the other subjects.

4 Dealing with collinearity


Multicollinearity arises when the independent variables (X) are nearly linearly dependent on each other.
Multicollinearity can create inaccurate estimates of the regression coefficients, inflate the standard
errors of the regression coefficients, give false and non-significant coefficient values, and degrade the
predictability of the model.
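
One simple way to check for such dependence before fitting any model is to inspect the pairwise correlations between the feature columns. This is a rough sketch, assuming the marks_Btech.dat file from Listing 1 is present; correlations close to ±1 indicate strongly collinear features:

import pandas as pd

df = pd.read_table("marks_Btech.dat", sep=r"\s+")

# Correlation matrix of the subject columns (all columns except the names)
corr = df.iloc[:, 1:].corr()
print(corr.round(2))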

4.1 Ridge
Ridge regression addresses some of the problems of ordinary least squares by imposing a penalty on
the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares,

\min_{\beta}\left( \| Y - X\beta \|_2^2 + \lambda \| \beta \|_2^2 \right) \qquad (6)

Here, λ is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the
greater the amount of shrinkage, and the more robust the coefficients become to collinearity.

1. It shrinks the parameters and is therefore mostly used to prevent multicollinearity.

2. It reduces the model complexity by coefficient shrinkage, as the small sweep over λ below illustrates.
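
To see the shrinkage in practice, one can refit the ridge model for a few increasing values of alpha (λ in equation 6) and watch the magnitude of the coefficients decrease. A minimal sketch, assuming the same marks_Btech.dat file and train/test split as in Listing 5 below:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

df = pd.read_table("marks_Btech.dat", sep=r"\s+")
X = df.iloc[:, 1:-1].values
y = df.iloc[:, 7].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Larger alpha -> stronger penalty -> smaller coefficients
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha, fit_intercept=True)
    model.fit(X_train, y_train)
    print(alpha, np.round(model.coef_, 3), round(model.score(X_test, y_test), 3))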

Listing 5: Ridge

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

df = pd.read_table("marks_Btech.dat", sep=r"\s+")

# X: marks in all subjects except computer science, y: marks in computer science
X = df.iloc[:, 1:-1].values
print(X)
y = df.iloc[:, 7].values
print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Ridge regression with regularization strength alpha
model = Ridge(alpha=1.0, fit_intercept=True)
model.fit(X_train, y_train)
print(model.coef_)

y_pred = model.predict(X_test)
print(y_test, y_pred)
print(model.intercept_)

Figure 3: Ridge and LASSO

4.2 LASSO
LASSO (Least Absolute Shrinkage and Selection Operator) is a technique similar to ridge. LASSO keeps
only some of the features and shrinks the coefficients of the others to zero. This property, known as
feature selection, is absent in ridge regression. In LASSO, instead of adding the squares of β to the
penalty, we add the absolute values of β.

\min_{\beta}\left( \| Y - X\beta \|_2^2 + \lambda \| \beta \|_1 \right) \qquad (7)

1. It is generally used when we have a large number of features, because it automatically performs
feature selection, as the small sketch below illustrates.

LASSO is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients,
effectively reducing the number of variables upon which the given solution depends. For this
reason, the LASSO and its variants are fundamental to the field of compressed sensing.
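
A quick way to see this feature-selection behaviour is to count how many coefficients remain non-zero as the penalty grows. A minimal sketch, assuming the same data file and split as in Listing 6 below:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

df = pd.read_table("marks_Btech.dat", sep=r"\s+")
X = df.iloc[:, 1:-1].values
y = df.iloc[:, 7].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Larger alpha -> more coefficients driven exactly to zero
for alpha in [0.01, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha, fit_intercept=True, max_iter=10000)
    lasso.fit(X_train, y_train)
    print(alpha, np.sum(lasso.coef_ != 0), np.round(lasso.coef_, 3))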

Listing 6: Lasso

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

df = pd.read_table("marks_Btech.dat", sep=r"\s+")

# X: marks in all subjects except computer science, y: marks in computer science
X = df.iloc[:, 1:-1].values
print(X)
y = df.iloc[:, 7].values
print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# LASSO regression with an L1 penalty of strength alpha
lasso_model = Lasso(alpha=1.0, fit_intercept=True)
lasso_model.fit(X_train, y_train)
print(lasso_model.coef_)

y_pred = lasso_model.predict(X_test)
print(y_test, y_pred)
print(lasso_model.intercept_)

4.3 ENet
Elastic net regression is basically a hybrid of ridge and lasso regression. This combination allows
learning a sparse model, in which only a few of the weights are non-zero as in LASSO, while still
maintaining the regularization properties of ridge (keeping the same number of features, but reducing
the magnitude of the coefficients).

\min_{\beta}\left( \| Y - X\beta \|_2^2 + \lambda_1 \| \beta \|_1 + \lambda_2 \| \beta \|_2^2 \right) \qquad (8)
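
Note that scikit-learn's ElasticNet does not take λ1 and λ2 directly; it uses an overall strength alpha and a mixing parameter l1_ratio (roughly, the L1 part grows with alpha · l1_ratio and the L2 part with alpha · (1 − l1_ratio)). A minimal sketch of this parameterization, assuming the same data file and split as in the listings above:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

df = pd.read_table("marks_Btech.dat", sep=r"\s+")
X = df.iloc[:, 1:-1].values
y = df.iloc[:, 7].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# alpha sets the overall penalty strength; l1_ratio mixes the L1 and L2 parts
# (l1_ratio close to 1 behaves like LASSO, close to 0 behaves like ridge)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True)
enet.fit(X_train, y_train)
print(enet.coef_)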

Listing 7: ENet

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

df = pd.read_table("marks_Btech.dat", sep=r"\s+")

# X: marks in all subjects except computer science, y: marks in computer science
X = df.iloc[:, 1:-1].values
print(X)
y = df.iloc[:, 7].values
print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train)
print(X_test)
print(y_train)
print(y_test)

# Elastic net regression combining L1 and L2 penalties
enet = ElasticNet(alpha=3.0, fit_intercept=True)
enet.fit(X_train, y_train)
print(enet.coef_)

y_pred = enet.predict(X_test)
print(y_test, y_pred)
print(enet.intercept_)

4.4 Feature reduction


We may reduce the number of features using the following program. Please note that in this case the
predicted values of y (y_pred) differ a lot from the test values of y (y_test); the short sketch after
Listing 8 quantifies this difference. Feature reduction generally works only when some of the features
are very alike (redundant), which is not so in this case.

Listing 8: Feature_reduction

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

df = pd.read_table("marks_Btech.dat", sep=r"\s+")

# X: marks in all subjects except computer science, y: marks in computer science
X = df.iloc[:, 1:-1].values
print(X, len(X))
y = df.iloc[:, 7].values
print(y, len(y))
print(X.shape)

# Keep only the two features that score highest on the chi-squared test against y
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)
print(X_new)

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=0)

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

print(reg.coef_)

y_pred = reg.predict(X_test)

print(y_test)
print(reg.intercept_)
print(y_pred)
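
To quantify how far the predictions after feature reduction are from the true test values, the fitted model can be scored with standard error metrics. A small optional addition, assuming reg, X_test, y_test and y_pred from Listing 8 above:

from sklearn.metrics import mean_squared_error, r2_score

# A low MSE and an R^2 close to 1 indicate good predictions; with the reduced
# feature set the scores are expected to be noticeably worse here.
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))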

5 Assignment
• Please also go through (i) support vector machine regression and (ii) kernel ridge regression.
