DSBDAL_Assignment no 4

This document outlines the process of creating a linear regression model using the Boston Housing Dataset to predict home prices based on features like the number of rooms and distance to employment centers. It explains both simple and multiple linear regression techniques using Python libraries such as statsmodels and scikit-learn, detailing the steps to fit the model, interpret results, and visualize predictions. The document also emphasizes the importance of understanding model performance through metrics like R-squared and coefficients.
Experiment no 4

4. Data Analytics I
Create a Linear Regression Model using Python/R to predict home prices using the Boston Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about various houses in Boston through different parameters. There are 506 samples and 14 feature variables in this dataset.

The objective is to predict house prices using the given features.

Theory

Linear regression

Linear regression is an approach for modeling the relationship between two (simple linear regression) or more variables (multiple linear regression). In simple linear regression, one variable is considered the predictor or independent variable, while the other variable is viewed as the outcome or dependent variable.

Here’s the linear regression equation:

y = b0 + b1x1 + b2x2 + … + bnxn

where y is the dependent variable (target value), x1, x2, … xn are the independent variables (predictors), b0 is the intercept, b1, b2, … bn are the coefficients, and n is the number of predictors.

If the equation isn’t clear, the picture below might help.

[Figure: illustration of a linear relationship between two variables. Credit: Quora]
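As a quick worked example (the numbers here are made up purely for illustration), evaluating the equation with two predictors:

b0, b1, b2 = 1.0, 2.0, 0.5 # intercept and two coefficients
x1, x2 = 3.0, 4.0 # two predictor values
y = b0 + b1 * x1 + b2 * x2 # 1.0 + 6.0 + 2.0
print(y) # 9.0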

In the picture, you can see a linear relationship. That is, as the independent variable increases or decreases, the dependent variable changes with it along a straight line.

Linear regression can be used to make simple predictions, such as predicting exam scores based on the number of hours studied, the salary of an employee based on years of experience, and so on.

Enough theory! Let’s learn how to make a linear regression in Python.

Linear Regression in Python

There are different ways to make a linear regression in Python. The 2 most popular options are the statsmodels and scikit-learn libraries.

First, let’s have a look at the data we’re going to use to create a linear model.

The Data

To make a linear regression in Python, we’re going to use a dataset that contains Boston house prices. The original dataset comes from the sklearn library, but I simplified it, so we can focus on building our first linear regression.

You can download this dataset on my Github or on Google Drive. Make sure to leave this CSV file in the same directory where your Python script is located.

Let’s have a look at this dataset. To do so, import pandas and run the code below.
import pandas as pd
df_boston = pd.read_csv('Boston House Prices.csv')
df_boston
[Image by author: the df_boston dataframe]

There are 3 columns. The “Value” column contains the median value of owner-occupied homes in $1000’s (this is what we want to predict, that is, our target value). The “Rooms” and “Distance” columns contain the average number of rooms per dwelling and the weighted distances to five Boston employment centers (both are the predictors).

To sum it up, we want to predict home values based on the number of rooms a home has and its distance to employment centers.
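Before modeling, a quick sanity check of the data never hurts. Here is a minimal sketch (assuming the df_boston dataframe and the column names loaded above):

print(df_boston.shape) # (rows, columns)
print(df_boston[['Value', 'Rooms', 'Distance']].describe()) # summary statistics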
Linear Regression with Statsmodels

Statsmodels is a module that helps us conduct statistical tests and estimate models. It provides an extensive list of results for each estimator.

If you have installed Python through Anaconda, you already have statsmodels installed. If not, you can install it either with conda or pip.

# pip
pip install statsmodels

# conda
conda install -c conda-forge statsmodels

Once you have statsmodels installed, import it with the following line of code.

import statsmodels.api as sm

The first thing to do before creating a linear regression is to define the dependent and independent variables. We’ve already discussed them in the previous section. The dependent variable is the value we want to predict and is also known as the target value. On the other hand, the independent variables are the predictors.

In our dataset we have 2 predictors, so we can use either or both of them.

Let’s start with a simple linear regression. A simple linear regression estimates the relationship between one independent variable and one dependent variable.
Simple Linear Regression

For this example, I’ll choose “Rooms” as our predictor/independent variable.

 Dependent variable: “Value”

 Independent variable: “Rooms”

Let’s define the dependent and independent variables in our code as well.
y = df_boston['Value'] # dependent variable
x = df_boston['Rooms'] # independent variable

Throughout this guide, I’ll be using linear algebra notation — lower case letters will be used for vectors and upper case letters will be used for matrices.

Fitting the model

Now it’s time to fit the model. To explain what fitting a model means, consider the following generic equation used for simple linear regression.

𝑦 = 𝑎𝑥 + 𝑏

Fitting the model means finding the optimal values of a and b, so we obtain a line that best fits the data points. A well-fitted model produces more accurate outcomes, so only after fitting the model can we predict the target value using the predictors.
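To make “finding the optimal values of a and b” concrete: OLS picks the line that minimizes the sum of squared residuals, and for simple linear regression this has a closed-form solution. Here is a minimal sketch with numpy (it should reproduce the coefficients statsmodels reports below):

import numpy as np

x_arr = df_boston['Rooms'].to_numpy()
y_arr = df_boston['Value'].to_numpy()

# slope a = cov(x, y) / var(x), intercept b = mean(y) - a * mean(x)
a = np.cov(x_arr, y_arr, bias=True)[0, 1] / np.var(x_arr)
b = y_arr.mean() - a * x_arr.mean()
print(a, b) # should match the coefficients in the regression table below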

Now let’s fit a model using statsmodels. First, we add a constant before fitting the model (sklearn adds it by default) and then we fit the model using the .fit() method.

x = sm.add_constant(x) # adding a constant
lm = sm.OLS(y, x).fit() # fitting the model

“lm” stands for linear model and represents our fitted model. This variable will help us predict our target value.

>>> lm.predict(x)
0      25.232623
1      24.305975
2      31.030253
3      29.919727
4      31.231138
         ...
501    24.603318
502    20.346831
503    27.822178
504    26.328552
505    19.661029

The code above predicts home values (printed output) based on the data inside the “Rooms” column.

The Regression Table

Although we could predict the target values, the analysis isn’t done yet. We need to know how this linear model performs. The regression table can help us with that. This table provides an extensive list of results that reveal how good or bad our model is.

To obtain the regression table, run the code below:

lm.summary()

You will obtain this table:

[Image by author: the OLS regression results summary table]

The table is titled “OLS Regression Results.” OLS stands for Ordinary Least Squares, and this is the most common method to estimate linear regression.

Let’s have a look at some important results in the first and second tables.

 Dep. Variable: This is the dependent variable (in our example, “Value” is our target value)
 R-squared: Takes values from 0 to 1. R-squared values close to 0 correspond to a regression that explains none of the variability of the data, while values close to 1 correspond to a regression that explains the entire variability of the data. The R-squared obtained tells us that the number of rooms explains 48.4% of the variability in house values (a minimal computation sketch follows this list).

 Coef: These are the coefficients (a, b) we’ve seen in the model equation before.

 Std error: Represents the accuracy of the prediction. The lower the standard error, the better the prediction.

 t, P>|t| (p-value): The t scores and p-values are used for hypothesis tests. The “Rooms” variable has a statistically significant p-value. Also, we can say at a 95% confidence level that the value of the “Rooms” coefficient lies between 8.279 and 9.925.
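To make R-squared concrete, here is a minimal sketch computing it by hand from the fitted model (using the lm, x, and y defined above):

import numpy as np

y_hat = lm.predict(x) # model predictions
ss_res = np.sum((y - y_hat) ** 2) # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2) # total sum of squares
print(1 - ss_res / ss_tot) # should match the R-squared in the table (about 0.484)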

Linear Regression Equation

From the table above, let’s use the coefficients (coef) to create the linear equation and then plot the regression line with the data points.

# Linear equation: 𝑦 = 𝑎𝑥 + 𝑏
# Rooms coef (a): 9.1021
# Constant coef (b): -34.6706
y_pred = 9.1021 * x['Rooms'] - 34.6706

where y_pred (also known as yhat) is the predicted value of y (the dependent variable) in the regression equation.
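As a quick sanity check, this hand-built equation should reproduce the model’s own predictions, up to the rounding of the coefficients:

y_hat = lm.predict(x) # statsmodels predictions
print((y_pred - y_hat).abs().max()) # tiny difference, caused only by rounding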

Linear Regression Plot

To plot the equation, let’s use seaborn.

import seaborn as sns
import matplotlib.pyplot as plt

# plotting the data points
sns.scatterplot(x=x['Rooms'], y=y)
# plotting the regression line
sns.lineplot(x=x['Rooms'], y=y_pred, color='red')
# axes
plt.xlim(0)
plt.ylim(0)
plt.show()

The code above produces the following plot.

[Image by author: scatter plot of the data points with the red regression line]

The red line is the linear regression we built using Python. We can say that this is the line that best fits the blue data points.

Congratulations! You’ve just built your first simple linear regression in Python. If you’re up for a challenge, check how to make a multiple linear regression.

Multiple Linear Regression

Now that you already know the core concepts of linear regression, we can easily create a multiple linear regression.

Let’s start by setting the dependent and independent variables. In this case, we’re going to use 2 independent variables.

 Dependent variable: “Value”

 Independent variables: “Rooms” and “Distance”

Let’s define the dependent and independent variables in our code as well.

y = df_boston['Value'] # dependent variable
X = df_boston[['Rooms', 'Distance']] # independent variables

Now let’s add a constant and fit the model.

X = sm.add_constant(X) # adding a constant
lm = sm.OLS(y, X).fit() # fitting the model

Let’s have a look at the results.

lm.summary()

[Image by author: OLS regression results for the multiple linear regression]

The R-squared increased a bit. Also, there’s a new line in the second table that represents the parameters for the “Distance” variable. The analysis of this table is similar to the simple linear regression, but if you have any questions, feel free to let me know in the comment section.
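Once fitted, the model can also score a house it has never seen. Here is a minimal sketch (the Rooms and Distance values below are hypothetical, and statsmodels expects the same columns we fitted with, including the constant):

new_house = pd.DataFrame({'const': [1.0], 'Rooms': [6.0], 'Distance': [4.0]})
print(lm.predict(new_house)) # predicted median value in $1000’s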

Linear Regression with sklearn

Scikit-learn is the standard machine learning library in Python, and it can also help us make either a simple linear regression or a multiple linear regression.

Since we deeply analyzed the simple linear regression using statsmodels before, now let’s make a multiple linear regression with sklearn.

First, let’s install sklearn. If you have installed Python through Anaconda, you already have sklearn installed. If not, you can install it either with conda or pip.

# pip
pip install scikit-learn

# conda
conda install -c conda-forge scikit-learn

Now let’s import linear_model from the sklearn library.

from sklearn import linear_model

The dependent and independent variables will be the following.

y = df_boston['Value'] # dependent variable
X = df_boston[['Rooms', 'Distance']] # independent variables

Now we have to fit the model (note that the argument order differs between the libraries: sklearn’s fit method takes (X, y), while statsmodels’ OLS takes (y, X)).

lm = linear_model.LinearRegression()
lm.fit(X, y) # fitting the model

Similarly to statsmodels, we use the predict method to predict the target value in sklearn.

lm.predict(X)

However, unlike statsmodels, we don’t get a summary table using .summary(). Instead, we have to call each element one by one.

>>> lm.score(X, y)
0.495
>>> lm.coef_
array([8.80141183, 0.48884854])
>>> lm.intercept_
-34.636050175473315

The results are the same as the table we obtained with statsmodels.
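Those few numbers fully define the model. As a minimal sketch, the predictions can be reconstructed by hand from them:

import numpy as np

manual = X.to_numpy() @ lm.coef_ + lm.intercept_ # X times coefficients, plus intercept
print(np.allclose(manual, lm.predict(X))) # True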

Note that we didn’t split the data into training and test sets, for the sake of simplicity. Splitting the data before building the model is a popular approach to avoid overfitting, as sketched below.
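For completeness, here is a minimal sketch of such a split using scikit-learn’s train_test_split (the 80/20 ratio and the random_state value are arbitrary choices):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lm = linear_model.LinearRegression()
lm.fit(X_train, y_train) # fit on the training data only
print(lm.score(X_test, y_test)) # R-squared on unseen test data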
