19021141023
Description
Machine Learning (ML) is a field of study that gives a machine the capability to understand data and to learn from it. ML is not only analytics modelling; it is end-to-end modelling that broadly involves the following steps:
– Defining the problem statement.
– Collecting data.
– Exploring, cleaning, and transforming the data.
– Building the analytics model.
– Creating a dashboard and deploying the model.
Machine learning has two distinct fields of study – supervised learning and unsupervised
learning. A supervised learning technique generates a response based on a set of input
features. Unsupervised learning has no response variable; it explores the associations and
interactions among the input features. In the following sections, we discuss linear
regression, which is an example of a supervised learning technique.
Supervised Learning & Regression
Linear Regression is a supervised modelling technique for continuous data. The model fits a
line that is closest to all observations in the dataset. The basic assumption is that the
functional form is a line and that it is possible to fit a line close to all the observations.
Note that if this linearity assumption is far from reality, the model is bound to have an
error (a bias towards linearity), however well one tries to fit it.
Let’s analyse the basic equation behind any supervised learning algorithm:
Y = F(X) + ε
The above equation has three terms:
Y – the response variable.
F(X) – a function of the set of input features.
ε – the random error. For an ideal model, this should be random and should not depend on
any input.
In linear regression, we assume that the functional form F(X) is linear, so we can write the
equations below. The next step is to find the coefficients (β0, β1, …) for the model.
Y = β0 + β1X + ε ( for simple regression )
Y = β0 + β1X1 + β2X2 + β3X3 + …. + βpXp + ε ( for multiple regression )
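Both forms can be sketched on simulated data with base R’s `lm()`, which estimates the coefficients by least squares. This is a minimal illustration, assuming made-up true coefficients (2, 3, 1.5) and Gaussian noise:

```r
set.seed(1)

# Simple regression: true model is Y = 2 + 3*X + noise
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)
fit <- lm(y ~ x)
coef(fit)  # estimates of beta0 and beta1, near 2 and 3

# Multiple regression: add a second feature X2 with coefficient 1.5
x2 <- runif(100, 0, 5)
y2 <- 2 + 3 * x + 1.5 * x2 + rnorm(100, sd = 1)
fit2 <- lm(y2 ~ x + x2)
coef(fit2)  # estimates of beta0, beta1, beta2
```

The fitted coefficients will not equal the true values exactly because of the noise term ε; they only recover them approximately.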
How to apply linear regression
The coefficients for linear regression are calculated from sample data. The basic
assumption is that the sample is not biased. This assumption ensures that the sample does
not systematically overestimate or underestimate the coefficients. The idea is that a
particular sample may overestimate or underestimate, but if one takes multiple samples and
estimates the coefficients each time, the average of the coefficients across samples will be
close to the true values.
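This averaging idea can be checked with a small simulation: draw many samples from the same population, fit a regression to each, and average the slope estimates. A sketch, assuming an arbitrary true slope of 3:

```r
set.seed(42)
true_beta1 <- 3

# Fit the same model on 500 independent samples and keep each slope estimate
estimates <- replicate(500, {
  x <- runif(50, 0, 10)
  y <- 2 + true_beta1 * x + rnorm(50, sd = 2)
  coef(lm(y ~ x))[2]
})

mean(estimates)  # individual fits vary, but the average sits near 3
```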
Extract the data and create the training and testing sample
For the current model, let’s take the Toyota Corolla dataset, loaded into R Studio as a data
frame named Toyota. The problem statement is to predict ‘Price’ based on the set of input
features.
Syntax-
library(ggplot2)
# Inspect the data
names(Toyota)
str(Toyota)
set.seed(23)
# Remove variables: drop columns 4, 6 and 7 in a single step
# (removing one column at a time from Toyota would keep only the last removal)
Toyo = Toyota[, -c(4, 6, 7)]
Toyo
# Split the sample data to make models: 90% train, 10% test
# (the models below are fitted on the full Toyota data)
row.number = sample(1:nrow(Toyota), 0.9 * nrow(Toyota))
train = Toyota[row.number, ]
test = Toyota[-row.number, ]
dim(train)
dim(test)
# Check the distribution of the response; log(Price) is closest to normal
ggplot(train, aes(Price)) + geom_density(fill = "green")
ggplot(train, aes(log(Price))) + geom_density(fill = "green")
ggplot(train, aes(sqrt(Price))) + geom_density(fill = "green")
# Model fit: regress log(Price) on all remaining features
model1 = lm(log(Price) ~ ., data = train)
summary(model1)
# Diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(model1)
# Model building for model2: drop the less significant predictors CC and Doors
model2 = update(model1, ~. - CC - Doors)
summary(model2)
plot(model2)
# Predict on the test set and back-transform from the log scale
pred = predict(model2, newdata = test)
test$predicted_values = exp(pred)
test
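Since the code above holds out a test set, the predictions can be scored on the original price scale. A minimal sketch of the scoring logic as a helper function; calling it as `rmse_mape(test$Price, pred)` with the objects built above is an assumption that `Price` is present in `test`:

```r
# Score log-scale predictions against actual prices:
# back-transform with exp(), then compute RMSE and MAPE
rmse_mape <- function(actual, pred_log) {
  predicted <- exp(pred_log)
  c(rmse = sqrt(mean((actual - predicted)^2)),
    mape = mean(abs(actual - predicted) / actual))
}
```

Lower RMSE and MAPE on the test set indicate better out-of-sample accuracy, which is the quantity that matters when comparing model1 and model2.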
OUTPUT
For model1
Call:
lm(formula = log(Price) ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-0.78672 -0.06608 0.00443 0.07479 0.46791
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.623e+00 1.205e-01 71.577 < 2e-16 ***
Age -1.064e-02 2.437e-04 -43.667 < 2e-16 ***
KM -1.724e-06 1.223e-07 -14.094 < 2e-16 ***
FuelTypeDiesel 1.108e-01 4.878e-02 2.272 0.02327 *
FuelTypePetrol 7.334e-02 3.171e-02 2.313 0.02090 *
HP 2.955e-03 5.358e-04 5.515 4.22e-08 ***
MetColor 3.097e-03 7.099e-03 0.436 0.66273
Automatic 4.400e-02 1.477e-02 2.979 0.00294 **
CC -7.363e-05 5.123e-05 -1.437 0.15092
Doors 6.961e-03 3.777e-03 1.843 0.06556 .
Weight 9.604e-04 1.102e-04 8.715 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For model2
Call:
lm(formula = log(Price) ~ Age + KM + FuelType + HP + MetColor +
Automatic + Weight, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.77026 -0.06684 0.00552 0.07342 0.45475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.526e+00 1.110e-01 76.845 < 2e-16 ***
Age -1.067e-02 2.430e-04 -43.929 < 2e-16 ***
KM -1.722e-06 1.214e-07 -14.187 < 2e-16 ***
FuelTypeDiesel 5.699e-02 3.519e-02 1.620 0.10552
FuelTypePetrol 7.765e-02 3.168e-02 2.451 0.01439 *
HP 2.288e-03 3.162e-04 7.237 7.87e-13 ***
MetColor 3.378e-03 7.084e-03 0.477 0.63353
Automatic 4.006e-02 1.467e-02 2.730 0.00641 **
Weight 1.035e-03 1.036e-04 9.994 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Observations from summary(model1)
Is there a relationship between the predictors and the response variable?
F = 716.4 is far greater than 1, so it can be concluded that there is a relationship between
the predictors and the response variable.
Is this model a good fit?
R² = 0.8471 is close to 1, so this model is a good fit.
Observations from summary(model2)
Is there a relationship between the predictors and the response variable?
F = 893 is far greater than 1, and it exceeds the F value of model1. It can be concluded
that there is a relationship between the predictors and the response variable.
Is this model a good fit?
R² = 0.8468 is close to 1, so this model is a good fit.
Conclusion
The example shows how to approach linear regression modelling. The model created here
still has scope for improvement: techniques such as outlier detection and correlation
analysis can be applied to further improve prediction accuracy. One can also try advanced
techniques such as Random Forest and Boosting to check whether the accuracy can be
improved further. A word of warning: we should refrain from overfitting the model to the
training data, as overfitting reduces the model's accuracy on the test data.