Simple Regression Model Fitting

This document discusses simple linear regression modeling. It begins by introducing linear regression as a supervised learning technique for modeling continuous data. It then walks through the steps of applying linear regression to the Boston housing dataset using R, including data splitting, model building, and assessing model fit and variable significance. Multiple regression models are created and evaluated, and some variables are removed to improve model fit. Non-linearity is also explored by adding squared terms to some variables.

Uploaded by

Aparna Singh

Simple Regression Model fitting

https://datascienceplus.com/how-to-apply-linear-regression-in-r/
Machine learning (ML) is a field of study that gives a machine the capability to understand data
and to learn from it. ML is not only about analytical modelling; it is an end-to-end process
that broadly involves the following steps:
– Defining the problem statement.
– Collecting data.
– Exploring, cleaning and transforming data.
– Building the analytical model.
– Creating dashboards and deploying the model.
Machine learning has two distinct fields of study: supervised learning and unsupervised learning.
A supervised learning technique predicts a response from a set of input features. Unsupervised
learning has no response variable; it explores the associations and interactions between the
input features. In the following sections, we discuss linear regression, which is an example of a
supervised learning technique.
Supervised Learning & Regression
Linear regression is a supervised modelling technique for continuous data. The model fits the line
that is closest to all observations in the dataset. The basic assumption is that the functional form
is a line and that a line can be fitted close to all observations. Please note that if this
assumption of linearity is far from reality, the model is bound to have an error (a bias towards
linearity), however well one tries to fit it.
Let’s analyse the basic equation for any supervised learning algorithm:
Y = F(X) + ε
The above equation has three terms:
Y – the response variable.
F(X) – the function that depends on the set of input features X.
ε – the random error. For an ideal model, this should be random and should not depend on any
input.
In linear regression, we assume that the functional form F(X) is linear, so we can write the
equation as below. The next step is to find the coefficients (β0, β1, …) of the model.
Y = β0 + β1 X + ε (for simple regression)
Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + βp Xp + ε (for multiple regression)
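To make the simple regression equation concrete, here is a minimal sketch in R. It simulates data from Y = β0 + β1 X + ε with made-up coefficients (β0 = 3, β1 = 2, chosen purely for illustration) and compares the lm() fit with the closed-form least-squares estimates.

```r
# Simulate data from Y = b0 + b1 * X + e with assumed (illustrative) coefficients.
set.seed(42)                          # arbitrary seed, only for reproducibility
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(n, sd = 1)     # b0 = 3, b1 = 2 are assumptions of this sketch

# Least-squares fit with lm().
fit <- lm(y ~ x)

# Closed-form estimates for simple regression:
#   b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)
b1_manual <- cov(x, y) / var(x)
b0_manual <- mean(y) - b1_manual * mean(x)

print(coef(fit))
print(c(b0 = b0_manual, b1 = b1_manual))
```

Both printed vectors should agree, and both should be close to (but not exactly) the coefficients used in the simulation.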
How to apply linear regression
The coefficients of a linear regression are estimated from sample data. The basic assumption
here is that the sample is not biased. This assumption ensures that the estimates do not
systematically overestimate or underestimate the true coefficients. The idea is that a particular
sample may overestimate or underestimate a coefficient, but if one draws many samples and
estimates the coefficient each time, the average of those estimates will be very close to the true
value.
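The idea of averaging over many samples can be illustrated with a small simulation. This is a sketch under assumed values (true slope 2, sample size 50, 1000 repetitions), not part of the original analysis.

```r
# Repeatedly draw a sample, fit a simple regression, and record the slope.
set.seed(123)                         # arbitrary seed
true_slope <- 2                       # assumed value for the illustration

one_estimate <- function(n = 50) {
  x <- runif(n, min = 0, max = 10)
  y <- 3 + true_slope * x + rnorm(n, sd = 2)
  unname(coef(lm(y ~ x))[2])          # estimated slope for this sample
}

slopes <- replicate(1000, one_estimate())

print(range(slopes))                  # individual samples over- or underestimate
print(mean(slopes))                   # the average sits very close to the true slope
```

Individual estimates scatter on both sides of the true slope, while their mean is much closer to it than a typical single estimate.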
Extract the data and create the training and testing sample
For the current model, let’s take the Boston dataset that is part of the MASS package in R.
The code below lists the features available in the Boston dataset. The problem statement is to
predict ‘medv’ (the median home value) based on the set of input features.

library(MASS)
library(ggplot2)
head(Boston) # Preview the first rows of the data set
names(Boston) # Names of the variables in the data set
str(Boston) # Structure (type and sample values) of each variable
set.seed(1) # Fix the random seed so that the random split is reproducible
Split the sample data and make the model
Split the input data into a training set and a testing set, and build the model on the training
set. With an 80-20 split of the 506 observations, the training dataset has 404 observations and
the testing dataset has 102 observations.
# Draw 80% of the row indices at random for the training set
row.number = sample(1:nrow(Boston), 0.8 * nrow(Boston))
train = Boston[row.number, ]
test = Boston[-row.number, ]
dim(train)
dim(test)
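As a sanity check on the split, the sketch below repeats it with an explicit floor() on the sample size and verifies that the two sets are disjoint and together cover the data. The names train_idx, train_set and test_set are hypothetical, chosen so that they do not clash with the objects above.

```r
library(MASS)                         # provides the Boston data frame

set.seed(1)
n <- nrow(Boston)

# floor() makes the truncation of 0.8 * 506 = 404.8 explicit.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_set <- Boston[train_idx, ]
test_set  <- Boston[-train_idx, ]

# The two sets must not share rows and must cover all observations.
stopifnot(length(intersect(rownames(train_set), rownames(test_set))) == 0)
stopifnot(nrow(train_set) + nrow(test_set) == n)

print(c(train = nrow(train_set), test = nrow(test_set)))
```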
Explore the response variable
Let’s check the distribution of the response variable ‘medv’. We compare three versions of it:
the original values, a log transformation and a square root transformation. Both ‘log’ and ‘sqrt’
do a decent job of bringing the distribution of ‘medv’ closer to normal. In the following model,
we have selected the ‘log’ transformation, but it is also possible to try out the ‘sqrt’
transformation.

ggplot(train, aes(log(medv))) + geom_density(fill="blue")

ggplot(train, aes(sqrt(medv))) + geom_density(fill="blue")
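The visual comparison can be complemented with a number. The sketch below computes the sample skewness of ‘medv’ and of its two transformations on the full Boston data (so the values will differ slightly from those of the training sample). The skewness() helper is defined here for illustration and is not a base R function.

```r
library(MASS)

# Moment-based sample skewness; 0 for a perfectly symmetric distribution.
skewness <- function(v) {
  z <- v - mean(v)
  mean(z^3) / mean(z^2)^1.5
}

skews <- c(
  raw  = skewness(Boston$medv),
  log  = skewness(log(Boston$medv)),
  sqrt = skewness(sqrt(Boston$medv))
)
print(round(skews, 2))
```

A value closer to zero indicates a more symmetric distribution, which is one informal way to compare candidate transformations.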

Model Building – Model 1
As a first step, we fit a multiple regression model that includes all input variables.
# Fit the default model; '.' means all other columns are used as predictors
model1 = lm(log(medv)~., data=train)
summary(model1)
par(mfrow=c(2,2)) # Arrange the four diagnostic plots in a 2 x 2 grid
plot(model1)

Observation from summary (model1)
Is there a relationship between predictor and response variables?
We can answer this using the F statistic, which measures the collective effect of all predictor
variables on the response variable. In this model, F=102.3 is far greater than 1, so it can be
concluded that there is a relationship between the predictors and the response variable.
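The F statistic and its p-value can also be extracted programmatically. This sketch fits the same formula on the full Boston data for self-containment, so its F value will not match the 102.3 quoted for the training sample.

```r
library(MASS)

fit <- lm(log(medv) ~ ., data = Boston)

# summary()$fstatistic is a named vector: value, numdf, dendf.
fs <- summary(fit)$fstatistic
print(fs)

# Upper-tail probability of the F distribution gives the overall p-value.
p_value <- pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]], lower.tail = FALSE)
print(p_value)
```

A very small p-value supports the conclusion that at least one predictor is related to the response.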
Which of the predictor variables are significant?
We can decide this from the ‘p-value’ of each coefficient. The smaller the ‘p’ value, the more
significant the variable. From the ‘summary’ output we can see that ‘zn’, ‘age’ and ‘indus’ are
less significant features, as their ‘p’ values are large. In the next model, we can remove these
variables.
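Rather than reading p-values off the printed summary, one can pull them from the coefficient matrix. The sketch below again uses the full Boston data, so the ranking may differ from the training-sample fit; the 0.05 cut-off is a conventional choice, not something prescribed by the article.

```r
library(MASS)

fit <- lm(log(medv) ~ ., data = Boston)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|).
coefs <- coef(summary(fit))

# p-values of the predictors (intercept dropped), largest first.
p_values <- sort(coefs[-1, "Pr(>|t|)"], decreasing = TRUE)
print(round(p_values, 4))

# Candidates for removal under the assumed 0.05 threshold.
weak <- names(p_values)[p_values > 0.05]
print(weak)
```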
Does the model fit well?
We can answer this from the R2 (multiple R-squared) value, which indicates how much of the
variation in the response is captured by the model. An R2 close to 1 means that the model explains
a large share of the variance and is therefore a good fit. In this case, the value is 0.7733, so
the model explains about 77% of the variance in log(medv) and is a reasonably good fit.
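To show what R2 measures, the sketch below recomputes it by hand as 1 − RSS/TSS, together with the adjusted R2 that penalises the number of predictors. It uses the full Boston data, so the values differ from the 0.7733 reported for the training sample.

```r
library(MASS)

fit <- lm(log(medv) ~ ., data = Boston)
y   <- log(Boston$medv)

rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((y - mean(y))^2)           # total sum of squares
r2  <- 1 - rss / tss

n <- nrow(Boston)
p <- length(coef(fit)) - 1            # number of predictors, intercept excluded
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(R2 = r2, adjusted_R2 = adj_r2))
```

Both values should match `summary(fit)$r.squared` and `summary(fit)$adj.r.squared`.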
Observation from the plot
Fitted vs Residual graph

Residuals should be scattered randomly, with no visible pattern in the graph, and their average
should be close to zero. In the plot for this model, the red trend line is almost at zero except
at the lowest fitted values.

Normal Q-Q Plot

The Q-Q plot shows whether the residuals are normally distributed. Ideally, the points should lie
on the dotted line. If they do not, the model needs to be reworked to make the residuals normal.
In the plot for this model, most of the points are on the line except towards the ends.

Scale-Location

This shows how the residuals are spread and whether the residuals have an equal variance or not.

Residuals vs Leverage

The plot helps to find influential observations. Here we need to check for points that fall
outside the dashed lines, which are contours of Cook's distance. A point outside the dashed line
is an influential point, and removing it will noticeably change the regression coefficients.
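Influence can also be inspected numerically with cooks.distance() and hatvalues(). The sketch below uses the full Boston data, and the 4/n cut-off is a common rule of thumb assumed here for illustration; it is not the threshold drawn by plot().

```r
library(MASS)

fit <- lm(log(medv) ~ ., data = Boston)

cd  <- cooks.distance(fit)            # influence of each observation on the fit
lev <- hatvalues(fit)                 # leverage of each observation

threshold <- 4 / nrow(Boston)         # assumed rule-of-thumb cut-off
flagged   <- which(cd > threshold)

# The five observations with the largest Cook's distance.
print(head(sort(cd, decreasing = TRUE), 5))
print(length(flagged))
```

Flagged rows are candidates for closer inspection rather than automatic removal.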
Model Building – Model 2
As the next step, we can remove the three less significant features (‘zn’, ‘age’ and ‘indus’) and
check the model again.
# remove the less significant feature
model2 = update(model1, ~.-zn-indus-age)
summary(model2)

Observation from summary (model2)
Is there a relationship between predictor and response variables?
F=131.2 is far greater than 1 and is also higher than the F value of the previous model. It can be
concluded that there is a relationship between the predictors and the response variable.
Which of the variables are significant?
In this model, all the predictors are significant.
Does the model fit well?
R2 = 0.7696 is close to 1, so this model is a good fit. Please note that this value has decreased
a little from the first model, but this is acceptable: removing three predictors caused a drop
from 0.7733 to 0.7696, which is small. In other words, the three predictors together explained
only a small part of the variance (0.0037), so it is better to drop them and keep the simpler
model.
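A more formal way to judge whether dropping the three predictors is acceptable is a nested-model F test with anova(), alongside the adjusted R2. This is a sketch on the full Boston data with hypothetical object names (full, reduced), so its numbers will not match those quoted for the training sample.

```r
library(MASS)

full    <- lm(log(medv) ~ ., data = Boston)
reduced <- update(full, . ~ . - zn - indus - age)

# F test of the reduced model against the full model (3 extra coefficients).
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R2 penalises extra predictors, so it is a fairer comparison than R2.
print(c(full    = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))
```

A large p-value in the anova() table would suggest that the dropped predictors add little once the others are in the model.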
Observation of the plot
All the four plots look similar to the previous model and we don’t see any major effect.
Check for predictor vs Residual Plot
In the next step, we check the residuals against each significant feature from Model 2 and look
for any pattern. Ideally, the residuals should be scattered randomly around zero with no visible
pattern. In the following plots, we can see some non-linear patterns for features like ‘crim’,
‘rm’, ‘nox’ etc.
library(gridExtra) # provides grid.arrange()
# Residuals of model 2 against each retained predictor, with a smoothed trend line
plot1 = ggplot(train, aes(crim, residuals(model2))) + geom_point() + geom_smooth()
plot2 = ggplot(train, aes(chas, residuals(model2))) + geom_point() + geom_smooth()
plot3 = ggplot(train, aes(nox, residuals(model2))) + geom_point() + geom_smooth()
plot4 = ggplot(train, aes(rm, residuals(model2))) + geom_point() + geom_smooth()
plot5 = ggplot(train, aes(dis, residuals(model2))) + geom_point() + geom_smooth()
plot6 = ggplot(train, aes(rad, residuals(model2))) + geom_point() + geom_smooth()
plot7 = ggplot(train, aes(tax, residuals(model2))) + geom_point() + geom_smooth()
plot8 = ggplot(train, aes(ptratio, residuals(model2))) + geom_point() + geom_smooth()
plot9 = ggplot(train, aes(black, residuals(model2))) + geom_point() + geom_smooth()
plot10 = ggplot(train, aes(lstat, residuals(model2))) + geom_point() + geom_smooth()
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, plot8, plot9, plot10, ncol = 5, nrow = 2)

Model Building – Model 3 & Model 4
We can now enhance the model by adding squared terms to check for non-linearity. We first build
model3 by introducing a squared term for every feature from model 2. In the next iteration, we
remove the insignificant terms from the model.
# Model 3: all features from model 2 plus their squared terms
model3 = lm(log(medv)~crim+chas+nox+rm+dis+rad+tax+ptratio+
              black+lstat+ I(crim^2)+ I(chas^2)+I(nox^2)+ I(rm^2)+ I(dis^2)+
              I(rad^2)+ I(tax^2)+ I(ptratio^2)+ I(black^2)+ I(lstat^2), data=train)
summary(model3)
Output:
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.273e+00  8.749e-01   9.456  < 2e-16 ***
crim         -3.291e-02  4.505e-03  -7.306 1.61e-12 ***
chas          1.124e-01  3.223e-02   3.487 0.000546 ***
nox          -6.286e-01  1.074e+00  -0.585 0.558693
rm           -8.026e-01  1.324e-01  -6.063 3.20e-09 ***
dis          -1.202e-01  2.452e-02  -4.900 1.41e-06 ***
rad           1.628e-02  9.436e-03   1.726 0.085217 .
tax          -3.393e-04  5.300e-04  -0.640 0.522477
ptratio      -1.592e-01  7.163e-02  -2.222 0.026843 *
black         1.314e-03  5.115e-04   2.568 0.010594 *
lstat        -5.419e-02  5.487e-03  -9.876  < 2e-16 ***
I(crim^2)     2.961e-04  6.690e-05   4.426 1.25e-05 ***
I(chas^2)            NA         NA      NA       NA
I(nox^2)     -2.450e-01  8.002e-01  -0.306 0.759664
I(rm^2)       6.752e-02  1.036e-02   6.520 2.22e-10 ***
I(dis^2)      6.899e-03  1.936e-03   3.564 0.000411 ***
I(rad^2)      2.739e-04  3.730e-04   0.734 0.463258
I(tax^2)     -4.613e-07  6.474e-07  -0.712 0.476601
I(ptratio^2)  3.751e-03  2.040e-03   1.839 0.066742 .
I(black^2)   -2.355e-06  1.129e-06  -2.085 0.037695 *
I(lstat^2)    7.380e-04  1.520e-04   4.854 1.77e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1627 on 384 degrees of freedom
Multiple R-squared: 0.8526, Adjusted R-squared: 0.8453
F-statistic: 116.9 on 19 and 384 DF, p-value: < 2.2e-16
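The NA row for I(chas^2) has a simple explanation: ‘chas’ is a 0/1 dummy variable, so its square is identical to the variable itself and lm() cannot estimate a separate coefficient for it. The following sketch demonstrates this on the full Boston data with a deliberately small model.

```r
library(MASS)

# chas takes only the values 0 and 1 ...
print(table(Boston$chas))

# ... so squaring it changes nothing.
same_as_square <- all(Boston$chas^2 == Boston$chas)
print(same_as_square)

# With both terms in the model, the second one is perfectly collinear
# with the first and lm() reports NA for its coefficient.
small_fit <- lm(log(medv) ~ chas + I(chas^2), data = Boston)
print(coef(small_fit))
```

Squared terms are therefore only meaningful for predictors that take more than two distinct values.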
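## Model 4: drop the terms treated as insignificant. I(chas^2) is NA in the
## output above because chas is a 0/1 dummy, so chas^2 equals chas.
model4 = update(model3, ~ . - nox - rad - tax - I(crim^2) - I(chas^2) - I(rad^2) -
                I(tax^2) - I(ptratio^2) - I(black^2))
summary(model4)
par(mfrow = c(2, 2)) # 2 x 2 grid for the four diagnostic plots
plot(model4)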

pred1 <- predict(model4, newdata = test)
rmse <- sqrt(sum((exp(pred1) - test$medv)^2)/length(test$medv))
c(RMSE = rmse, R2=summary(model4)$r.squared)
Output
RMSE        R2
4.8235100 0.8265999
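Note that the RMSE above is computed on the test set after back-transforming with exp(), while the R2 comes from the training fit on the log scale. The sketch below shows small helper functions (rmse and r2_out_of_sample, both hypothetical names) on made-up numbers, to make the back-transformation and an out-of-sample R2 explicit.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R2 on held-out data: 1 - RSS / TSS, on the scale of `actual`.
r2_out_of_sample <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up observed values and log-scale predictions, for illustration only.
actual   <- c(10, 20, 30)
pred_log <- log(c(12, 18, 33))

# Back-transform to the original scale before measuring the error.
pred <- exp(pred_log)

print(rmse(actual, pred))
print(r2_out_of_sample(actual, pred))
```

With the objects from the article, the same helpers could be called as `rmse(test$medv, exp(pred1))` and `r2_out_of_sample(test$medv, exp(pred1))`.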

par(mfrow=c(1,1))
plot(test$medv, exp(pred1))

Conclusion
The example shows how to approach linear regression modelling. The model that is created still has
scope for improvement: techniques like outlier detection and correlation detection can be applied
to further improve the prediction accuracy. One can also try more advanced techniques such as
random forests and boosting to check whether the accuracy can be improved further. A word of
warning: we should avoid overfitting the model to the training data, because an overfitted model
predicts less accurately on test data.
