Data Science for Engineers

Department of Computer Science and Engineering

Indian Institute of Technology, Madras

Lecture-37
Simple Linear Regression Model Assessment

(Refer Slide Time: 00:21)

Welcome to the third lecture on the implementation of simple linear regression using R. In
the last lecture, we looked at the first level of model assessment: we saw how good the
linear model that we built is, and we also saw how to identify the significant coefficients
in the linear model.
(Refer Slide Time: 00:35)

In this lecture, we are going to look at the second level of model assessment. As a part of
this, we are going to see whether we can improve the quality of the linear model and whether
we can identify bad measurements; by bad measurements, we mean outliers.

(Refer Slide Time: 00:52)

So, let us see what outliers are. Outliers are points which do not conform to the bulk
of the data. A point is considered an outlier if the corresponding standardized residual
falls outside the interval from minus 2 to plus 2, at the 5 percent significance level.
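
A minimal sketch of this rule in R. The lecture's bonds data is not reproduced in the transcript, so the data frame below is a hypothetical stand-in (35 rows with three planted outliers), and the names `bonds`, `bondsmod`, `CouponRate` and `BidPrice` are assumed spellings of the names spoken in the lecture.

```r
# Hypothetical stand-in for the lecture's bonds data: 35 rows, three planted outliers.
n <- 35
bonds <- data.frame(CouponRate = seq(3, 13, length.out = n))
bonds$BidPrice <- 75 + 3 * bonds$CouponRate + sin(seq_len(n))
bonds$BidPrice[c(4, 13, 35)] <- bonds$BidPrice[c(4, 13, 35)] + c(12, 25, -18)

# Fit the simple linear regression of bid price on coupon rate.
bondsmod <- lm(BidPrice ~ CouponRate, data = bonds)

# rstandard() returns the standardized residuals: each residual divided by
# its estimated standard deviation.
std_res <- rstandard(bondsmod)

# Positions of the points whose standardized residual lies outside [-2, 2].
flagged <- which(abs(std_res) > 2)
print(flagged)
```

Points flagged this way are candidates only; the lecture goes on to inspect them on the residual plot before removing any.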
(Refer Slide Time: 01:10)

Now, let us see how to handle these outliers. Even if we have several outliers which lie
outside the confidence region, we are going to identify and remove only one at a time, at
every iteration, and after doing so, we are going to fit a linear model on the reduced sample.

We are going to iterate till we detect no more outliers. We are going to start with the
residual analysis.
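
The iteration described above can be sketched as a loop in R. This is an automated variant under stated assumptions, not the lecture's own procedure (which picks points by eye on the residual plot): it drops the single point with the largest absolute standardized residual, refits, and stops when nothing lies outside plus or minus 2. The data frame is a hypothetical stand-in for the bonds data.

```r
# Hypothetical stand-in for the lecture's bonds data: 35 rows, three planted outliers.
n <- 35
bonds <- data.frame(CouponRate = seq(3, 13, length.out = n))
bonds$BidPrice <- 75 + 3 * bonds$CouponRate + sin(seq_len(n))
bonds$BidPrice[c(4, 13, 35)] <- bonds$BidPrice[c(4, 13, 35)] + c(12, 25, -18)

keep <- seq_len(nrow(bonds))          # original row numbers still in the sample
repeat {
  mod <- lm(BidPrice ~ CouponRate, data = bonds[keep, ])
  r <- unname(rstandard(mod))
  worst <- which.max(abs(r))          # position of the most extreme point
  if (abs(r[worst]) <= 2) break       # no more outliers: stop iterating
  keep <- keep[-worst]                # remove only one point per iteration
}

removed <- setdiff(seq_len(nrow(bonds)), keep)
print(removed)                        # original row numbers that were dropped
```

Tracking the original row numbers in `keep` avoids the index shift that appears once rows are deleted.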

(Refer Slide Time: 01:36)

So, you can see a plot function on the left hand side. For the residual plot, I am
going to plot the fitted values on the x axis and the standardized residuals from the model
we have built on the y axis. We built a linear model called bonds mod, and we are going to
calculate the standardized residuals for it.

So, that becomes my y. I am also giving a title, as I said earlier: the title for this plot
is residual plot, and my x label is nothing but predicted values for bid price. Similarly, my
y label is standardized residual. After doing so, we need to set the confidence region.
So, let us see how to do that. We again use the same command, abline. Now, abline is
what we used earlier to draw the fitted line onto the plot.

With the same command, we can mark the confidence region as well. I am going to set the
height, which is the argument h here, to 2 and the line type to 2, which gives a dashed line.
The height is where you want the horizontal line to be drawn, and the line type is nothing but
how you want the line to be drawn. You can have a solid line, a dashed line, or a dash and a
dot, so you have several options there. Similarly, I also need a lower confidence limit. So, I
am setting that to minus 2, and for the same limit I am setting the line type to 2.

Now, let us see how the plot looks. On the right hand side, I have the plot. We can see
that there are two lines drawn at plus and minus 2 that define the confidence region. On
the y axis, I have the standardized residuals and on the x axis, I have the predicted values
for bid price. From the plot, we can see that there are two outliers which are really far
out, there is one which is close to the upper confidence limit, and there is one which is
almost exactly on the lower confidence limit.
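
The residual plot described above can be sketched as follows. The data frame is a hypothetical stand-in for the bonds data, and the object and column names are assumed spellings; `plot`, `fitted`, `rstandard` and `abline` are standard R functions.

```r
# Hypothetical stand-in for the lecture's bonds data: 35 rows, three planted outliers.
n <- 35
bonds <- data.frame(CouponRate = seq(3, 13, length.out = n))
bonds$BidPrice <- 75 + 3 * bonds$CouponRate + sin(seq_len(n))
bonds$BidPrice[c(4, 13, 35)] <- bonds$BidPrice[c(4, 13, 35)] + c(12, 25, -18)
bondsmod <- lm(BidPrice ~ CouponRate, data = bonds)

# When run as a script, draw to a null device so that no file is written.
if (!interactive()) pdf(NULL)

# Fitted values on the x axis, standardized residuals on the y axis.
plot(fitted(bondsmod), rstandard(bondsmod),
     main = "Residual plot",
     xlab = "Predicted values for bid price",
     ylab = "Standardized residuals")

# Horizontal dashed lines (lty = 2) at the upper and lower limits.
abline(h = 2, lty = 2)
abline(h = -2, lty = 2)

if (!interactive()) invisible(dev.off())
```

The two `abline` calls could also be written as one, `abline(h = c(-2, 2), lty = 2)`, since `h` accepts a vector.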

So, let us see how to identify these. From the plot, we may not be able to tell which
points these are; by points, I mean the row IDs. We are going to use another function
called identify, which will help us find the indices of these samples. Now, let us see
what the identify function does.
(Refer Slide Time: 03:53)

So, it reads the position of the graphics pointer when the mouse button is pressed. It then
searches the coordinates given in x and y for the point closest to the pointer. If the
point is close enough to the pointer, its index will be returned as a part of the value
of the call.

Now, let us look at the syntax for it. identify is a function, and x and y are my input
parameters. What are my x and y? They are the coordinates of the points in the scatter
plot. Now, let us see how to use this function to identify the indices.

(Refer Slide Time: 04:27)

On my left, I have the same commands for the residual plot; this is what we saw in the last
slide. I now use the identify function to identify the indices. My inputs for this are the
fitted values of the bonds model and the standardized residuals. The reason I am giving the
fitted values is that these are the points on the plot whose indices I want to find: the plot
has the fitted values from the model and the standardized residuals from the model.

So, on this plot I want my indices to be identified, and I give the identify function the
same inputs that I used for the plot command. Again, if you see here, I have the fitted
values from the bonds model for the x parameter, and for the y parameter I have the
standardized residuals. Once you execute the command, you will not get the output
immediately. What will be displayed is the snippet on the left: you will see a finish
button and a message being displayed. On this plot, we need to click and identify each of
the points. Now, let us see how to do that.

(Refer Slide Time: 05:33)

So, I am displaying the command above to remind you of the fact that we are using the
fitted values and standardized residuals to identify. Clicking near a point adds it to
the list of identified points. So, if I click near this point, it is going to identify this
point and store it. Each point can be identified only once. If a point has already been
identified and you still click near it, then you will get a warning which reads: nearest
point already identified. If you do not click near any of the points, then a message is
displayed which says that no point is within 0.25 inches. So, if I click here, I do not have
any points close to it, and it will display a message saying no point within 0.25 inches.

(Refer Slide Time: 06:29)

Once you have identified all the outliers, you need to click the finish button that is
present on the top right corner of the graphical window; you can also press escape to
finish. After terminating, the indices are displayed on the console and on the plot.
You can see on the console that I have the indices displayed as 4, 13, 34 and 35,
but this gives you only the values. So, to know where my outliers lie, I am going to look
at the plot.

So, now I know the 13th point of the sample is the farthest, which is here, followed by
the 35th sample, followed by sample 4, and then I have one more sample, which is here,
which is the 34th sample. After identifying these outliers, we are going to remove them
one at a time and build a new model each time. Now, let us see how to do that.
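
A minimal sketch of the identify step, assuming a hypothetical stand-in for the bonds data. `identify` only works on an interactive screen device, so the sketch guards the call with `interactive()` and, when run as a script, falls back to flagging the points outside plus or minus 2 numerically. That fallback is an addition for illustration and is not part of the lecture.

```r
# Hypothetical stand-in for the lecture's bonds data: 35 rows, three planted outliers.
n <- 35
bonds <- data.frame(CouponRate = seq(3, 13, length.out = n))
bonds$BidPrice <- 75 + 3 * bonds$CouponRate + sin(seq_len(n))
bonds$BidPrice[c(4, 13, 35)] <- bonds$BidPrice[c(4, 13, 35)] + c(12, 25, -18)
bondsmod <- lm(BidPrice ~ CouponRate, data = bonds)

x <- fitted(bondsmod)        # same x as in the residual plot
y <- rstandard(bondsmod)     # same y as in the residual plot

if (interactive()) {
  plot(x, y, main = "Residual plot",
       xlab = "Predicted values for bid price",
       ylab = "Standardized residuals")
  abline(h = c(-2, 2), lty = 2)
  # Click near each suspect point, then press Finish or Esc.
  # The indices of the clicked points are returned.
  picked <- identify(x, y)
} else {
  # Non-interactive fallback (an assumption, not from the lecture).
  picked <- unname(which(abs(y) > 2))
}
print(picked)
```

The inputs to `identify` must be the same x and y vectors that were plotted; otherwise the clicks are matched against the wrong coordinates.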
(Refer Slide Time: 07:32)

The first point that I am going to remove is sample 13, that is, the 13th point, because it
is the farthest in the plot. To start with, I am going to create a new data frame called
bonds new, and it will have all rows of bonds except the 13th row. Then I am going to
create another object called bonds mod one, which is the linear model built on the new
data. So, I am going to regress bid price from the new data frame bonds new on coupon
rate from the same data set.

Now, after building the new linear model, which does not contain the 13th point (the
outlier), we are going to repeat the same process again; that is, on the residual plot, we
are going to identify the outliers for the new data. On my right, I already have the residual
plot with the outliers identified. From the snippet, we can see that for the new data, my
4th point, 33rd point and 34th point are being identified as outliers. This new data
contains only 34 observations, because we have already removed one observation. So, the
indices for the new data change: every row after the 13th moves up by one.
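
A minimal sketch of this step on a hypothetical stand-in for the bonds data (the names `bondsnew` and `bondsmod1` are assumed spellings of the spoken names). It also shows the index shift: after row 13 is dropped, the row originally labelled 35 sits at position 34.

```r
# Hypothetical stand-in for the lecture's bonds data: 35 rows, three planted outliers.
n <- 35
bonds <- data.frame(CouponRate = seq(3, 13, length.out = n))
bonds$BidPrice <- 75 + 3 * bonds$CouponRate + sin(seq_len(n))
bonds$BidPrice[c(4, 13, 35)] <- bonds$BidPrice[c(4, 13, 35)] + c(12, 25, -18)

# All rows of bonds except the 13th.
bondsnew <- bonds[-13, ]

# Refit the model on the reduced sample.
bondsmod1 <- lm(BidPrice ~ CouponRate, data = bondsnew)

# Values are positions in bondsnew; names are the original row labels.
pos <- which(abs(rstandard(bondsmod1)) > 2)
print(pos)
```

Because `identify` reports positions in the data that was plotted, its output for the reduced data corresponds to the values of `pos`, not to the original row labels.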

So, the farthest point in this data is the 34th point, after that I have the 4th point, and
then there is also one point on the line, which is the 33rd point for this new data. If you
compare this plot with the earlier plot, this point, which is now located here, was below
this line earlier; that is because we had an extreme outlier in the previous case that had a
smearing effect on the remaining points. Now, after building this new linear model, let us
take a look at the summary.
(Refer Slide Time: 09:21)

On the left is the summary of the old model bonds mod, which contains all the points. On
the right, I have the summary of the new model, which does not contain the 13th sample.
From the R squared values of the two models, we can see that there is a drastic change
from just removing one extreme point: from 0.17516, the R squared improves to 0.8077.
So, that is quite a drastic change. Now, let us remove all the other points one by one and
see how the R squared value changes.
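
The comparison of the two summaries can be sketched as below, again on a hypothetical stand-in for the bonds data, so the R squared values it prints will not match the figures quoted in the lecture. `summary()` of an `lm` object stores the value in its `r.squared` component.

```r
# Hypothetical stand-in for the lecture's bonds data: 35 rows, three planted outliers.
n <- 35
bonds <- data.frame(CouponRate = seq(3, 13, length.out = n))
bonds$BidPrice <- 75 + 3 * bonds$CouponRate + sin(seq_len(n))
bonds$BidPrice[c(4, 13, 35)] <- bonds$BidPrice[c(4, 13, 35)] + c(12, 25, -18)

bondsmod  <- lm(BidPrice ~ CouponRate, data = bonds)          # all points
bondsmod1 <- lm(BidPrice ~ CouponRate, data = bonds[-13, ])   # without row 13

# Extract R squared from each summary instead of reading it off the printout.
r2_old <- summary(bondsmod)$r.squared
r2_new <- summary(bondsmod1)$r.squared
print(c(old = r2_old, new = r2_new))
```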

(Refer Slide Time: 10:00)

(Refer Slide Time: 10:04)

I am removing the remaining points one by one. Earlier, I started by removing the 13th
point. Now, I am going to remove the 35th. So, let us see what the R squared value is
after removing the 35th point. The R squared changes from 0.80 to 0.88. So, there is
quite a big leap here as well.

So, after removing the 35th point, I am able to see a pretty good change in the R squared
value. Now, let us look at what the R squared value is if I remove the fourth point: the
R squared value improves from 0.88 to 0.98. So, that is also a pretty good jump. Note that
these indices are for the old data. I also have one more index to remove, which
is index 34. Now, let us see what happens if we remove this.

After I remove the 34th point, my R squared increases slightly, and only in the third
decimal place: from 0.9852 it increases to 0.9891. So, the difference is not huge.
We need not treat this point as an outlier by itself, because removing it does not improve
the model much further. Now, after removing all these four points, we are going to plot the
new regression line over the data.
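
A sketch of removing points cumulatively by their original indices and recording R squared after each removal. The data is a hypothetical stand-in with three planted outliers (the lecture removes four points), so the printed values are illustrative only.

```r
# Hypothetical stand-in for the lecture's bonds data: 35 rows, three planted outliers.
n <- 35
bonds <- data.frame(CouponRate = seq(3, 13, length.out = n))
bonds$BidPrice <- 75 + 3 * bonds$CouponRate + sin(seq_len(n))
bonds$BidPrice[c(4, 13, 35)] <- bonds$BidPrice[c(4, 13, 35)] + c(12, 25, -18)

# Original row indices, in the order in which they are dropped.
drop_order <- c(13, 35, 4)

# Start with R squared of the model on all rows.
r2 <- summary(lm(BidPrice ~ CouponRate, data = bonds))$r.squared
for (i in seq_along(drop_order)) {
  # Always index into the original data, so the indices never shift.
  reduced <- bonds[-drop_order[seq_len(i)], ]
  r2 <- c(r2, summary(lm(BidPrice ~ CouponRate, data = reduced))$r.squared)
}
names(r2) <- c("all rows", paste("minus", drop_order))
print(round(r2, 4))
```

Indexing the original data frame each time sidesteps the renumbering issue discussed earlier for the reduced data.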
(Refer Slide Time: 11:27)

So, on the left you can see that I have removed the four indices, namely 4, 13, 34 and 35.
I have removed these points from my data, and similarly for bid price, and I am going to
fit the new model. So, bonds mod one does not have any outliers now, and I am going to plot
the regression line over the data. Our regression line fits the data pretty well; though
there are some points which are really far away, they do not change the nature of the slope
drastically. So, this is a pretty good model, and we have removed all the possible outliers
that we thought were influencing the regression line.
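
The final refit and plot can be sketched as follows. The data is a hypothetical stand-in with three planted outliers at rows 4, 13 and 35, so only those rows are dropped here; the axis labels and title are also assumptions.

```r
# Hypothetical stand-in for the lecture's bonds data: 35 rows, three planted outliers.
n <- 35
bonds <- data.frame(CouponRate = seq(3, 13, length.out = n))
bonds$BidPrice <- 75 + 3 * bonds$CouponRate + sin(seq_len(n))
bonds$BidPrice[c(4, 13, 35)] <- bonds$BidPrice[c(4, 13, 35)] + c(12, 25, -18)

# Drop all identified outliers at once, using original indices.
outliers <- c(4, 13, 35)
clean <- bonds[-outliers, ]
bondsmod1 <- lm(BidPrice ~ CouponRate, data = clean)

# When run as a script, draw to a null device so that no file is written.
if (!interactive()) pdf(NULL)

# Scatter plot of the cleaned data with the fitted regression line on top.
plot(clean$CouponRate, clean$BidPrice,
     main = "Bid price vs coupon rate (outliers removed)",
     xlab = "Coupon rate", ylab = "Bid price")
abline(bondsmod1)

if (!interactive()) invisible(dev.off())
```

Passing the fitted `lm` object to `abline` draws the line using the model's intercept and slope.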

(Refer Slide Time: 12:19)

So, to summarize, in these three lectures we looked at the steps which are taken in
building a simple linear regression model. We saw how to interpret the results from the
summary, and for these models we looked at residual analysis. We looked at answering
questions such as how to treat outliers; we also saw how to identify significant
coefficients in our model and how good our model is. We also saw the need for checking
whether existing models can be refined, and then we built a refined model without any outliers.

Thank you.
