
Aparna Singh

19021141023

Simple Regression Model fitting

Description
Machine Learning (ML) is a field of study that gives a machine the capability to understand data and to learn from it. ML is not only about analytics modelling; it is end-to-end modelling that broadly involves the following steps:
– Defining the problem statement
– Collecting data
– Exploring, cleaning and transforming the data
– Building the analytics model
– Creating a dashboard and deploying the model
Machine learning has two distinct fields of study – supervised learning and unsupervised
learning. A supervised learning technique generates a response based on a set of input
features. Unsupervised learning has no response variable; it explores the associations and
interactions among the input features (the short sketch below illustrates the contrast). In the
following topic, we will discuss linear regression, which is an example of a supervised
learning technique.
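A minimal sketch of the contrast, using R's built-in mtcars data (this dataset and these features are chosen purely for illustration):

# Supervised: a response (mpg) is modelled from an input feature (wt)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Unsupervised: no response variable; k-means simply looks for
# groups of similar cars in the feature space
cl <- kmeans(scale(mtcars[, c("wt", "hp")]), centers = 3)
cl$cluster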
Supervised Learning & Regression
Linear regression is a supervised modelling technique for continuous data. The model fits the
line that is closest to all observations in the dataset, in the least-squares sense: it minimises
the sum of squared residuals. The basic assumption is that the functional form is a line and
that such a line can be fitted to the data. Note that if the linearity assumption is far from
reality, the model is bound to carry an error (a bias towards linearity), however well one
tries to fit it.
Let’s analyse the basic equation for any supervised learning algorithm:
Y = F(X) + ε
The above equation has three terms:
Y – the response variable.
F(X) – the function of the set of input features.
ε – the random error. For an ideal model, this should be random and should not depend on
any input.
In linear regression, we assume that the functional form F(X) is linear, so we can write the
equation as below. The next step is to find the coefficients (β0, β1, …) of the model.
Y = β0 + β1X + ε ( for simple regression )
Y = β0 + β1X1 + β2X2 + β3X3 + … + βpXp + ε ( for multiple regression )
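A minimal sketch of fitting the simple regression equation with lm() on simulated data (the true values β0 = 2 and β1 = 3 are chosen here purely for illustration):

# Simulate Y = 2 + 3X + ε and check that lm() recovers the coefficients
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)
coef(lm(y ~ x))    # estimates should be close to (2, 3)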
How to apply linear regression
The coefficients of a linear regression are estimated from sample data. The basic
assumption here is that the sample is unbiased. This ensures that the estimates do not
systematically overestimate or underestimate the true coefficients: a particular sample may
overestimate or underestimate them, but if one draws many samples and estimates the
coefficients each time, the average of the estimates across samples will be on target, as the
sketch below demonstrates.
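A minimal simulation of that idea (the true slope of 3 is an illustrative assumption):

# Estimate the slope from 1000 independent samples; the average is on target
set.seed(2)
slopes <- replicate(1000, {
  x <- runif(50, 0, 10)
  y <- 2 + 3 * x + rnorm(50, sd = 2)
  coef(lm(y ~ x))[2]
})
mean(slopes)    # approximately 3, the true slope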
Extract the data and create the training and testing samples
For the current model, let’s take the Toyota Corolla dataset, loaded into R Studio as a data
frame named Toyota. The features available in the dataset can be listed with names() and
str(), as in the code below. The problem statement is to predict ‘Price’ based on the
remaining input features (Age, KM, FuelType, HP, MetColor, Automatic, CC, Doors and
Weight).

Syntax-

library(ggplot2)

# The Toyota Corolla data is assumed to be already loaded as a data
# frame named Toyota (e.g. read in with read.csv)
names(Toyota)                  # list the available features
str(Toyota)                    # inspect the type of each feature
set.seed(23)                   # fix the seed so the split is reproducible

# Remove variables. Note: the original ran Toyo=Toyota[,-4], Toyo=Toyota[,-6]
# and Toyo=Toyota[,-7] in sequence; each assignment overwrote the previous one,
# so only the last column was actually dropped. Dropping all three at once does
# what was intended. (Toyo is not used further; the models below use Toyota.)
Toyo = Toyota[, -c(4, 6, 7)]
Toyo

# Split the data 90/10 into training and testing samples
row.number = sample(1:nrow(Toyota), 0.9 * nrow(Toyota))
train = Toyota[row.number, ]
test = Toyota[-row.number, ]
dim(train)
dim(test)

# Price is right-skewed; compare transformations and pick the most symmetric
ggplot(train, aes(Price)) + geom_density(fill = "green")
ggplot(train, aes(log(Price))) + geom_density(fill = "green")
ggplot(train, aes(sqrt(Price))) + geom_density(fill = "green")

# Model 1: regress log(Price) on all remaining features
model1 = lm(log(Price) ~ ., data = train)
summary(model1)
par(mfrow = c(2, 2))           # 2x2 grid for the four diagnostic plots
plot(model1)

# Model 2: drop the least significant predictors, CC and Doors
model2 = update(model1, ~ . - CC - Doors)
summary(model2)
plot(model2)

# Predict on the test set and back-transform from the log scale
pred = predict(model2, newdata = test)
test$predicted_values = exp(pred)
test
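The predictions above are never scored against the actual prices. A minimal sketch of one way to do that, assuming the test set retains the true Price column:

# Root-mean-square error and mean absolute percentage error on the test set
rmse <- sqrt(mean((test$Price - test$predicted_values)^2))
mape <- mean(abs(test$Price - test$predicted_values) / test$Price)
rmse
mape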
OUTPUT
For model1

Call:
lm(formula = log(Price) ~ ., data = train)

Residuals:
Min 1Q Median 3Q Max
-0.78672 -0.06608 0.00443 0.07479 0.46791

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.623e+00 1.205e-01 71.577 < 2e-16 ***
Age -1.064e-02 2.437e-04 -43.667 < 2e-16 ***
KM -1.724e-06 1.223e-07 -14.094 < 2e-16 ***
FuelTypeDiesel 1.108e-01 4.878e-02 2.272 0.02327 *
FuelTypePetrol 7.334e-02 3.171e-02 2.313 0.02090 *
HP 2.955e-03 5.358e-04 5.515 4.22e-08 ***
MetColor 3.097e-03 7.099e-03 0.436 0.66273
Automatic 4.400e-02 1.477e-02 2.979 0.00294 **
CC -7.363e-05 5.123e-05 -1.437 0.15092
Doors 6.961e-03 3.777e-03 1.843 0.06556 .
Weight 9.604e-04 1.102e-04 8.715 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1181 on 1281 degrees of freedom


Multiple R-squared: 0.8483, Adjusted R-squared: 0.8471
F-statistic: 716.4 on 10 and 1281 DF, p-value: < 2.2e-16

For model2

Call:
lm(formula = log(Price) ~ Age + KM + FuelType + HP + MetColor +
Automatic + Weight, data = train)

Residuals:
Min 1Q Median 3Q Max
-0.77026 -0.06684 0.00552 0.07342 0.45475

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.526e+00 1.110e-01 76.845 < 2e-16 ***
Age -1.067e-02 2.430e-04 -43.929 < 2e-16 ***
KM -1.722e-06 1.214e-07 -14.187 < 2e-16 ***
FuelTypeDiesel 5.699e-02 3.519e-02 1.620 0.10552
FuelTypePetrol 7.765e-02 3.168e-02 2.451 0.01439 *
HP 2.288e-03 3.162e-04 7.237 7.87e-13 ***
MetColor 3.378e-03 7.084e-03 0.477 0.63353
Automatic 4.006e-02 1.467e-02 2.730 0.00641 **
Weight 1.035e-03 1.036e-04 9.994 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1182 on 1283 degrees of freedom


Multiple R-squared: 0.8478, Adjusted R-squared: 0.8468
F-statistic: 893 on 8 and 1283 DF, p-value: < 2.2e-16

Observations from summary(model1)
Is there a relationship between the predictors and the response variable?
F = 716.4 is far greater than 1, and the p-value of the F-statistic (< 2.2e-16) is essentially
zero. It can be concluded that at least one predictor is related to the response variable.
Is the model a good fit?
Adjusted R² = 0.8471 is close to 1, so the model is a good fit.
Observations from summary(model2)
Is there a relationship between the predictors and the response variable?
F = 893 is far greater than 1 and larger than the F value of model1. It can be concluded that
there is a relationship between the predictors and the response variable.
Is the model a good fit?
Adjusted R² = 0.8468 is close to 1 and almost identical to model1’s, so the reduced model is
also a good fit.
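These figures can be pulled directly from the fitted objects rather than read off the printed summaries:

# Extract the quoted statistics from the summary objects
summary(model1)$adj.r.squared      # 0.8471
summary(model2)$adj.r.squared      # 0.8468
summary(model2)$fstatistic         # value = 893 on 8 and 1283 DF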

Conclusion
The example shows how to approach linear regression modelling. The model created here
still has scope for improvement: techniques like outlier detection and correlation
(multicollinearity) detection can further improve the accuracy of the predictions. One can
also apply an advanced technique like random forests or boosting to check whether the
accuracy can be improved further; a sketch follows below. A word of warning: we should
refrain from overfitting the model to the training data, as an overfitted model’s accuracy will
drop on the test data.
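A minimal sketch of the random-forest check mentioned above, assuming the randomForest package is installed (install.packages("randomForest")) and that FuelType is a factor, as in the models above:

# Fit a random forest on the same training data and compare test RMSE
library(randomForest)
set.seed(23)
rf <- randomForest(log(Price) ~ ., data = train, ntree = 500)
rf_pred <- exp(predict(rf, newdata = test))
sqrt(mean((test$Price - rf_pred)^2))   # compare with the linear model's RMSE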
