Lab 6 - Linear Regression and Multiple Linear Regression

The document provides an overview of linear regression, including simple and multiple linear regression implementations in Python. It explains the assumptions of linear regression, such as linearity, multicollinearity, autocorrelation, and homoscedasticity, and discusses applications in various fields like economics and finance. Additionally, it includes code examples for implementing linear regression using datasets like Boston housing and diabetes data.

Linear Regression (Python Implementation)

Linear regression is a statistical method for modeling the relationship between a dependent variable and a given set of independent variables.
Note: For simplicity, we refer to the dependent variable as the response and to the independent variables as features.

To provide a basic understanding of linear regression, we start with its most basic version: simple linear regression.

Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature. It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).

Let us consider a dataset where we have a value of the response y for every feature x (the same values used in the code below):

x: 0  1  2  3  4  5  6  7  8  9
y: 1  3  2  5  7  8  8  9  10  12

For generality, we define:

x as the feature vector, i.e. x = [x_1, x_2, ..., x_n],
y as the response vector, i.e. y = [y_1, y_2, ..., y_n]

for n observations (in the above example, n = 10).
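
For reference, the least-squares coefficient estimates that the code below computes are (a standard result, written out here so the code can be checked against it):

\[
b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2},
\qquad
b_0 = \bar{y} - b_1\,\bar{x},
\]

where \(\bar{x}\) and \(\bar{y}\) are the sample means; the fitted line is then \(y = b_0 + b_1 x\).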

A scatter plot of the above dataset looks like this:

Code: Python implementation of the above technique
1) Run IDLE
2) Click File > New File, type the following code, and save it as LRM.py
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
3) Click Run>Run Module and observe the following output and model graph

Output:
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
And the graph obtained looks like this:

4) Change x and y to different values, run the LRM.py file, and note down the output in your observation.

5) Use the diabetes data set from UCI and the Pima Indians Diabetes data set to perform linear regression modeling, and note down the steps and outputs in your observation (a starting sketch is given below).
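
As a starting point for step 5, a minimal sketch follows. It reuses the functions from LRM.py on a single feature; the scikit-learn diabetes dataset stands in here for the UCI/Pima files, which you would instead load from CSV (e.g. with np.loadtxt):

from sklearn import datasets
from LRM import estimate_coef, plot_regression_line

# load the diabetes dataset (stand-in for the UCI/Pima CSV files)
diabetes = datasets.load_diabetes()

# pick a single feature: column 2 is the (standardized) BMI
x = diabetes.data[:, 2]
y = diabetes.target

# estimate and plot exactly as in LRM.py
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
plot_regression_line(x, y, b)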
Multiple Linear Regression

Code: Python implementation of multiple linear regression on the Boston house pricing dataset using scikit-learn.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

# load the boston dataset
# (note: load_boston requires scikit-learn < 1.2, where it was removed)
boston = datasets.load_boston(return_X_y=False)

# defining feature matrix (X) and response vector (y)
X = boston.data
y = boston.target

# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)

# create linear regression object
reg = linear_model.LinearRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# regression coefficients
print('Coefficients: ', reg.coef_)

# variance score: 1 means perfect prediction
print('Variance score: {}'.format(reg.score(X_test, y_test)))

# plot for residual error

## setting plot style
plt.style.use('fivethirtyeight')

## plotting residual errors in training data
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
            color="green", s=10, label='Train data')

## plotting residual errors in test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
            color="blue", s=10, label='Test data')

## plotting line for zero residual error
plt.hlines(y=0, xmin=0, xmax=50, linewidth=2)

## plotting legend
plt.legend(loc='upper right')

## plot title
plt.title("Residual errors")

## method call for showing the plot
plt.show()

Output:
Coefficients:
[ -8.80740828e-02 6.72507352e-02 5.10280463e-02 2.18879172e+00
-1.72283734e+01 3.62985243e+00 2.13933641e-03 -1.36531300e+00
2.88788067e-01 -1.22618657e-02 -8.36014969e-01 9.53058061e-03
-5.05036163e-01]
Variance score: 0.720898784611

And the residual error plot looks like this:


Exercise:
Use the diabetes data set from UCI and the Pima Indians Diabetes data set and perform multiple linear regression. Also compare the results of the above analysis for the two data sets (a starting sketch is given below).
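
As a starting point for the exercise, here is a minimal sketch that mirrors the Boston code; again the scikit-learn diabetes dataset stands in for the UCI/Pima files, which you would load from CSV and split into X and y yourself:

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# load the diabetes dataset (stand-in for the UCI/Pima CSV files)
X, y = datasets.load_diabetes(return_X_y=True)

# same split and model as in the Boston example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

print('Coefficients: ', reg.coef_)
print('Variance score: {}'.format(reg.score(X_test, y_test)))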

In the Boston housing example above, we determined the accuracy score using the Explained Variance Score.

We define:

explained_variance_score = 1 - Var{y - y'} / Var{y}

where y' is the estimated target output, y the corresponding (correct) target output, and Var is the variance, i.e. the square of the standard deviation.
The best possible score is 1.0; lower values are worse. (Strictly speaking, reg.score returns the coefficient of determination R^2, which coincides with the explained variance score when the residuals have zero mean.)
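
Both scores can be computed directly with sklearn.metrics; here is a short, self-contained sketch on a small made-up example, showing where the two differ:

import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# small made-up example: true targets and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# r2_score is what reg.score(X_test, y_test) computes; the explained
# variance score ignores any constant bias in the residuals, so the
# two coincide exactly when the residuals have zero mean
print('R^2:                ', r2_score(y_true, y_pred))
print('Explained variance: ', explained_variance_score(y_true, y_pred))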

Assumptions:
Given below are the basic assumptions that a linear regression model makes about the dataset to which it is applied:

• Linear relationship: The relationship between the response and feature variables should be linear. The linearity assumption can be tested using scatter plots. As shown below, the first figure represents linearly related variables, whereas the variables in the second and third figures are most likely non-linear; so the first figure will give better predictions using linear regression.

• Little or no multicollinearity: It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are not independent of each other.

• Little or no autocorrelation: Another assumption is that there is little or no autocorrelation in the data. Autocorrelation occurs when the residual errors are not independent of each other.

• Homoscedasticity: Homoscedasticity describes a situation in which the error term (that is, the "noise" or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. As shown below, figure 1 has homoscedasticity while figure 2 has heteroscedasticity. (Simple numerical checks for these assumptions are sketched after this list.)
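
As a rough numerical complement to the visual checks, the sketch below uses statsmodels and is meant to be appended to the Boston script above (where X_train, y_train, and reg are defined); the thresholds in the comments are rules of thumb, not hard rules:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

residuals = y_train - reg.predict(X_train)

# autocorrelation: a Durbin-Watson statistic near 2 suggests little
# autocorrelation in the residuals
print('Durbin-Watson:', durbin_watson(residuals))

# multicollinearity: variance inflation factor per feature; values
# well above 10 are commonly read as a warning sign
vif = [variance_inflation_factor(X_train, i) for i in range(X_train.shape[1])]
print('VIF per feature:', np.round(vif, 2))

# homoscedasticity: the residual plot drawn earlier should show a
# roughly constant spread of residuals across the predicted values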
Applications:

• Trend lines: A trend line represents the variation in quantitative data with the passage of time (like GDP, oil prices, etc.). These trends usually follow a linear relationship. Hence, linear regression can be applied to predict future values. However, this method suffers from a lack of scientific validity in cases where other potential changes can affect the data.

• Economics: Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumer spending, fixed investment spending, inventory investment, purchases of a country's exports, spending on imports, the demand to hold liquid assets, labor demand, and labor supply.

• Finance: The capital asset pricing model uses linear regression to analyze and quantify the systematic risk of an investment.

• Biology: Linear regression is used to model causal relationships between parameters in biological systems.
References:

• https://www.geeksforgeeks.org/linear-regression-python-implementation/
• https://en.wikipedia.org/wiki/Linear_regression
• https://en.wikipedia.org/wiki/Simple_linear_regression
• http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
• http://www.statisticssolutions.com/assumptions-of-linear-regression/
