Bus 173 - Lecture 5

Linear regression with one regressor can be used to model the relationship between a dependent variable (Y) and a single independent variable (X). It assumes a linear relationship between X and Y, where the slope represents the effect of a one-unit change in X on Y. The population regression line defines this true relationship, but the parameters are unknown and must be estimated from a sample of data. Ordinary least squares (OLS) regression estimates the intercept and slope by finding the line that minimizes the sum of squared errors between predicted and actual Y values. It provides unbiased, efficient estimates if its assumptions are met. An example analyzed the relationship between test scores (Y) and student-teacher ratios (X) using OLS.


Lecture 5

Linear regression with one regressor

Niza Talukder
Introduction

• A state implements tough new penalties on drunk drivers. What is the effect
on highway fatalities?
• A school district cuts the sizes of its elementary classes: What is the effect on its students'
standardized test scores?
• You successfully complete one more year of college classes: What is the effect on your future
earnings?

All of these questions concern the unknown effect of changing one variable, X, on another variable, Y.

• This model postulates a linear relationship between X and Y: the slope of the line relating X and Y is the effect of a one-unit change in X on Y. Just as the mean of Y is an unknown characteristic of the population distribution of Y, the slope of the line relating X and Y is an unknown characteristic of the population joint distribution of X and Y. The econometric problem is to estimate this slope, that is, to estimate the effect on Y of a unit change in X, using a sample of data on these two variables.
• Consumption function

Y = β0 + β1X

Y = consumption (dependent variable)

X = income (independent or explanatory variable)

β0 and β1 are parameters of the model:

β0 = the intercept

β1 = the slope coefficient

Relationships between variables are generally inexact. To allow for this, econometricians modify the above equation as

Y = β0 + β1X + µ

µ = disturbance or error term. This is a random (stochastic) variable that has well-defined probabilistic properties. The error term accounts for all the factors that affect consumption but are not considered explicitly in the model.
• This is an example of a linear regression model, which hypothesizes that the dependent variable Y (consumption) is linearly related to the explanatory variable X (income), but that the relation between them is inexact and subject to individual variation.

• Regression versus correlation

The correlation coefficient measures the strength of association between two variables. In regression, by contrast, we try to estimate or predict the value of one variable on the basis of the fixed value of another variable.

• Regression versus causation

A regression describes association; by itself it does not establish that X causes Y. A causal interpretation requires additional assumptions.

 Population Regression Model

Y = β0 + β1X + µ

β0 + β1X is the population regression line or population regression function. This is the relationship that holds between X and Y on average over the entire population. Thus, if you know the value of X, then according to this population regression you could predict that the value of the dependent variable Y is β0 + β1X. Note that the intercept β0 often has only statistical meaning and no real-world meaning.
Different names for the dependent and independent variable in applied statistics:

Dependent variable      Independent variable
Explained variable      Explanatory variable
Predictand              Predictor
Regressand              Regressor
Response                Stimulus
Endogenous              Exogenous
Controlled variable     Control variable
Outcome                 Covariate
 Estimating the coefficients of a linear regression model

Topic: Analysis of the relationship between class size and performance of students

Test score = β0 + β1 × class size + other factors

Do we know the population values of β0 and β1? No, they are unknown. How do we estimate them?

Is it possible to consider the entire population to test whether test scores are affected by class size? No.
We need to use a random sample of data drawn from the population.

Let’s use the data set named copy of caschool, available in Google Classroom. This dataset contains data on test performance, school characteristics and student demographic backgrounds. The data used here are from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999. Test scores are the average of the reading and math scores on the Stanford 9 standardized test administered to 5th-grade students.
Variables of interest for a simple regression:
• Test score: district-wide average of reading and math scores for fifth graders
• Class size: number of students divided by number of teachers, i.e. the student-teacher ratio

• Before doing regression, we first need to describe the data through descriptive statistics. We also
analyze the graphical presentation of the data.
• Figure: scatter plot of test score versus student-teacher ratio
Correlation: -0.23
At the 10th percentile, the student-teacher ratio is 17.3, meaning 10% of districts have a student-teacher ratio below 17.3.
• The scatter plot reflects a weak negative relationship. Although larger classes tend to have lower test scores, there are other factors affecting test scores that keep the observations from falling perfectly along a straight line.

• If we want to draw the best-fit line through these data, then the slope of this line would be an estimate of β1 based on these data. The problem is that different people will draw different estimated lines. How do we choose among the many possible lines? The most common method is to choose the line that produces the ‘least squares’ fit to these data, i.e. to use the ordinary least squares (OLS) estimator.

• OLS: Developed by Gauss in 1795. The OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed data, where closeness is measured by the sum of the squared mistakes made in predicting Y given X.
OLS

• We want to draw a line that approximates the data points. The error is the vertical distance between the actual data point and the line. What we want to do is minimize the squared errors to get the best approximating line.
 • Suppose there are four observations on X and Y from a data set. You need to find estimates of the intercept and slope: denote the estimate of the intercept by b0 and the estimate of the slope coefficient by b1. The fitted line will be written as

Ŷ = b0 + b1X (equation 1)

Please note: the caret above Y simply indicates that it is a fitted or predicted value of Y corresponding to X, not the actual value. In class, I have denoted this with a small letter, y.
 We discussed before how drawing a freehand line leaves a lot of room for subjective judgment. Hence, we need a good method to calculate efficient estimates of b0 and b1.
Algebraically, the first step is to understand the residual, the estimate of the error term. It is the difference between the actual value Y and the fitted value Ŷ given by the regression line. It will be denoted by û:

û = Y - Ŷ (equation 2)

Substituting equation 1 into 2 gives

û = Y - b0 - b1X (equation 3)

The residual of each observation depends on our choice of b0 and b1. We want to choose the estimates in such a way that the residuals are as small as possible. The way to do this is to minimize the residual sum of squares (RSS), the sum of the squared residuals over all observations. The smaller the RSS, the better the fit.

Note: (a - b - c)2 = a2 + b2 + c2 - 2ab - 2ac + 2bc, which is useful when expanding the squared residuals.
OLS technique: the workout is shown in chapter 2 of the book by Dougherty.
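The minimization described above has a closed-form solution: the slope estimate is the sample covariance of X and Y divided by the sample variance of X, and the intercept makes the fitted line pass through the point of means. A minimal pure-Python sketch (the four observations are made up for illustration):

```python
# Closed-form OLS for one regressor:
# b1 = sum((x - xbar) * (y - ybar)) / sum((x - xbar)^2)
# b0 = ybar - b1 * xbar

def ols_fit(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: the "covariance over variance" form of the least squares estimator
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)
    return b0, b1

# Four made-up observations, as in the four-observation example above
x = [1, 2, 3, 4]
y = [3, 5, 7, 10]
b0, b1 = ols_fit(x, y)
print(round(b0, 10), round(b1, 10))  # 0.5 2.3
```

Any other choice of intercept and slope for these four points gives a larger RSS than (0.5, 2.3), which is exactly what "least squares" means.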
 The Least Squares Assumptions:

1) The error term has a conditional mean of zero given the independent variable. In short, the expected value of the error term is zero:

E(µ | X) = 0
E(µ) = 0

It simply means that the error term has no relationship with the independent variable X. In other words, the other factors affecting the dependent variable are not correlated with X.

2) The observations of X and Y are independently and identically distributed.

If the observations are drawn by simple random sampling from a single large population, then (Xi, Yi), i = 1, 2, …, n are i.i.d. (independently and identically distributed).
Example: Recall the age and income example discussed in class.
3) Large outliers are unlikely: observations with values of X and Y far outside the usual range of the data are unlikely. Large outliers can lead to misleading results, so it is important to examine any outliers and make sure they are correctly recorded.
 If the assumptions hold, then the estimates will be unbiased and efficient.

• Efficiency: how reliable your estimates are; an efficient estimator has the smallest variance among unbiased estimators.

• Bias: the difference between the expected value of the estimator and the parameter it estimates:
Bias(β̂) = E(β̂) - β
It follows that the bias of an unbiased estimator is 0.

• Biased estimator: an estimator whose expectation, or sampling mean, is different from the population value it is supposed to be estimating.
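Unbiasedness can be seen in a small simulation (a sketch with a made-up true model Y = 2 + 3X + u): across many random samples, the OLS slope estimates should average out to the true slope.

```python
import random

# Monte Carlo sketch of unbiasedness: under the least squares assumptions,
# the average of the OLS slope estimates across repeated samples should be
# close to the true slope (3 in this made-up model Y = 2 + 3X + u).
random.seed(0)

def ols_slope(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
           sum((xi - x_bar) ** 2 for xi in x)

n, reps = 50, 500
x = [i / n for i in range(n)]  # fixed regressor values in [0, 1)
slopes = []
for _ in range(reps):
    # the error u has mean zero and is unrelated to x (assumption 1)
    y = [2 + 3 * xi + random.uniform(-1, 1) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = sum(slopes) / reps
print(round(mean_slope, 2))  # close to 3
```

Each individual estimate misses the true slope because of the error term, but the misses average out: that is what "the bias is 0" means.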

Why least squares? Two useful characteristics of least squares:

- It has desirable properties: subject to some assumptions, it is efficient, consistent and unbiased.
- The algebra is comparatively straightforward.
 
After using the OLS technique to find the estimates of the population intercept and population slope, the regression equation can be written as:

Ŷ = b0 + b1X

where

b0 is the estimate of β0

b1 is the estimate of β1

The residual is the estimate of the error term. We cannot observe the error term, but we can observe the residual.

The test score and STR equation will now look like this:

TestScore-hat = b0 + b1 × STR
 
OLS estimates of the relationship between test scores and student teacher ratio:
When OLS is used to estimate the line relating the student-teacher ratio to test scores using the 420 observations, the estimated slope is -2.28 and the estimated intercept is 698.9. Accordingly, the OLS regression line for these 420 observations is

TestScore-hat = 698.9 - 2.28 × STR

where test score is the average test score in the district and STR is the student-teacher ratio. The caret (^) indicates that this is a predicted value based on the OLS regression line. The scatter plot below shows the OLS regression line superimposed over it.
Interpretation:

The slope of -2.28 means that an increase in the student-teacher ratio of one student per class is, on average, associated with a decline in district-wide test scores of 2.28 points on the test. A decrease in STR of 2 students per class is, on average, associated with an increase in test scores of 4.56 points:

(-2) × (-2.28) = 4.56

The negative slope indicates that more students per teacher, i.e. a larger class, is associated with poorer performance on the test.

Now, given the student-teacher ratio, it is possible to predict the district-wide test score.

Question: what happens if STR increases by 20 students per class? (Predicted test scores fall by 2.28 × 20 = 45.6 points.)

What is the predicted score for a district with 20 students per teacher?
• Predicted test score: 698.9 - 2.28 × 20 = 653.3

However, this prediction will not always be right because of the other factors that determine the district's performance. But the regression line gives a prediction (the OLS prediction) of what test scores would be for that district based on its STR, absent those other factors.
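The prediction above is just an evaluation of the estimated line; a one-line sketch using the coefficients reported in the text:

```python
# Prediction from the estimated OLS line reported in the text:
# TestScore-hat = 698.9 - 2.28 * STR
def predict_test_score(str_ratio):
    return 698.9 - 2.28 * str_ratio

print(round(predict_test_score(20), 1))  # 653.3, matching the worked example
# A decrease in STR of 2 students raises the predicted score by 2 * 2.28 points
print(round(predict_test_score(18) - predict_test_score(20), 2))  # 4.56
```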
 Measures of Goodness of Fit

Having estimated a linear regression, you might want to find out how well the regression line describes the data. Does the regressor account for much or little of the variation in the dependent variable? Are the observations tightly clustered around the regression line, or are they spread out?

R², also known as the coefficient of determination, measures the fraction of the variance in Y that is explained by X. It shows how well the OLS regression fits the data.
 Properties of R²

0 ≤ R² ≤ 1

• R² = 1 means a perfect fit: all data points fall exactly on the regression line.

• R² = 0 means X has no explanatory power for Y whatsoever.

• Bigger values of R² imply that X has more explanatory power for Y.

• R² is equal to the square of the correlation between X and Y.

 Y = Ŷ + û

• In this notation, R² is the ratio of the sample variance of Ŷ to the sample variance of Y. Mathematically, R² can be written as the ratio of the explained sum of squares to the total sum of squares.

• ESS (explained variation) = Σ(Ŷi - Ȳ)²

• TSS (total variation) = Σ(Yi - Ȳ)²

• R² = ESS / TSS

• TSS = ESS + SSR, where SSR = Σûi² is the sum of squared residuals. Thus, R² can also be expressed as 1 - SSR/TSS.
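The identity TSS = ESS + SSR, and the equivalence of the two R² formulas, can be checked numerically. A sketch with made-up data, fitting the line by the closed-form OLS formulas:

```python
# R^2 two ways: ESS/TSS and 1 - SSR/TSS. For a simple regression it also
# equals the squared correlation between X and Y. Data below are made up.
def r_squared(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]                     # fitted values
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained sum of squares
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of squared residuals
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    assert abs(tss - (ess + ssr)) < 1e-9                   # TSS = ESS + SSR
    return ess / tss                                       # same as 1 - ssr / tss

x = [1, 2, 3, 4]
y = [3, 5, 7, 10]
print(round(r_squared(x, y), 4))  # 0.9888
```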


 • Another measure of goodness of fit is the adjusted R².

SER: A measure of accuracy

The standard error of the regression (SER) is an estimator of the standard deviation of the regression error. The SER provides a measure of the precision of the estimates: it tells you about the spread of the observations around the regression line, measured in units of the dependent variable. Smaller values are better because they indicate that the observations are closer to the fitted line.

If the units of the dependent variable are dollars, then the SER measures the magnitude of a typical deviation from the regression line, that is, the magnitude of a typical regression error in dollars.
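Under the standard textbook convention, the SER is computed from the residuals as the square root of SSR/(n - 2); a minimal sketch (the residuals below are made up):

```python
import math

# SER sketch: standard error of the regression, sqrt(SSR / (n - 2));
# the divisor n - 2 adjusts for the two estimated coefficients.
def ser(residuals):
    n = len(residuals)
    ssr = sum(e ** 2 for e in residuals)
    return math.sqrt(ssr / (n - 2))

# Residuals are in the units of the dependent variable, so the SER is too
print(round(ser([0.2, -0.1, -0.4, 0.3]), 4))
```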
 Application to the test score data:

• Using the California test score data, the estimated R² is 0.051, or 5.1%. This means that STR explains 5.1% of the variance of the dependent variable, test score; that is, it accounts for about 5% of the variability in test scores.

Only 5%. What happened to the rest, i.e. roughly 95%? What explains the rest of the variability in test scores?

• A SER of 18.6 means that the standard deviation of the regression residuals is 18.6. Because the standard deviation is a measure of spread, a SER of 18.6 means that there is a large spread of the scatter plot around the regression line, measured in points on the test. The large spread means that predictions of test scores made using only the student-teacher ratio for a district will often be wrong by a large amount.

• What should we make of this low R² and high SER?


Example: Analyzing the relationship between educational attainment and earnings

• Data used for the analysis of earnings (measured in dollars) and years of schooling: data on earnings and years of schooling collected from a survey in 1992. The following is another example of a simple regression; a simple regression is one with a single independent variable.
 The first step in interpretation is to write the equation (discussed in class several times).

• Adjusted R-squared adjusts the R² statistic for the number of independent variables in the model.

Testing hypotheses about the slope: a test for the significance of regression coefficients

Now suppose I challenge your theory. I claim that the level of education has no effect on earnings, meaning that the regression line is flat, i.e. the coefficient on years of schooling is 0. I ask you if there is any evidence that the slope is zero. Can you reject my hypothesis that the coefficient on years of schooling is 0? Or should you accept it?

We will now discuss tests of hypotheses about the slope and intercept of the population regression line.

H0: β1 = 0
H1: β1 ≠ 0

Three steps to test the two-sided hypothesis:

1) Compute the standard error of the estimated slope.
2) Compute the t statistic:

t = (estimator - hypothesized value) / standard error

Check whether its absolute value exceeds the critical value or not.

3) Compute the p-value, which is the smallest significance level at which the null hypothesis can be rejected, given the test statistic actually observed.
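The steps above can be sketched directly. The coefficient and standard error below are hypothetical numbers for illustration, not the values from the regression output in the slides:

```python
# Two-sided test of H0: beta1 = 0 at the 5% significance level.
# With a large sample, the t statistic is compared with the critical
# value 1.96 from the standard normal distribution.
def t_statistic(estimate, hypothesized, std_error):
    return (estimate - hypothesized) / std_error

b1_hat, se_b1 = 0.081, 0.010       # hypothetical estimate and standard error
t = t_statistic(b1_hat, 0.0, se_b1)
reject_null = abs(t) > 1.96        # reject H0 when |t| exceeds the critical value
print(round(t, 2), reject_null)    # 8.1 True
```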
 Interpretation:

If the regression output shows that the t-statistic is greater than the critical value for a given significance level, you reject the null. Let's take the regression output in slide 27 as an example:

A t-statistic of 8.12 is greater than the critical value of 1.96 at the 5% significance level. So we have enough evidence to reject the null: the coefficient is significantly different from 0. Thus, years of education has explanatory power over earnings, meaning it is a significant factor that should be included in the regression model. The p-value of (approximately) 0 further shows evidence that years of schooling affects the earnings of individuals.

An R² of 0.1036 suggests that schooling explains 10.36% of the variability in earnings.


Standard error of a regression coefficient:

- Refer to slide 27.

- Beside the coefficients column, you can see the 'std. error' column. Here, the standard error is an estimate of the standard deviation of the coefficient. It can be thought of as a measure of the precision with which the regression coefficient is measured.

- The output also reports the standard deviation of the residuals in the model. In STATA output, this is shown by the Root MSE.
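Under homoskedasticity, the classical textbook formula for the standard error of the slope is the SER divided by the square root of the variation in X. A sketch with made-up values (note this is the homoskedasticity-only formula, not the robust standard error that STATA can report):

```python
import math

# Homoskedasticity-only standard error of the slope coefficient:
# SE(b1) = SER / sqrt(sum((x - xbar)^2)), with SER = sqrt(SSR / (n - 2)).
def slope_std_error(x, residuals):
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    ser = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return ser / math.sqrt(s_xx)

# Made-up x values and residuals for illustration
print(round(slope_std_error([1, 2, 3, 4], [0.2, -0.1, -0.4, 0.3]), 4))
```

The formula makes intuitive sense: the slope is estimated more precisely (smaller SE) when the residuals are small or when X varies a lot.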
 Regression when X is a binary variable:

• The discussion so far has focused on the case where the independent variable is continuous. Regression analysis can also be used when the regressor is binary, that is, when it takes only two values, 0 or 1. For example, X might be a worker's gender (= 1 if female, = 0 if male) or whether a class is small or large (= 1 if small, = 0 if large). Such binary regressors are called dummy variables; they let us introduce qualitative variables into a regression analysis that uses numerical data.

• Interpretation:
The interpretation of the slope changes when a binary variable is the independent variable. Suppose we have a variable D that equals either 0 or 1, depending on whether the student-teacher ratio is less than 20:

D = 1 if STR in the district < 20

D = 0 if STR in the district ≥ 20

What is the average test score when STR is greater than or equal to 20? β0
What is the average test score when STR is less than 20? β0 + β1
β1 is simply the difference in sample average test scores between the two groups.
 • With a dummy/binary variable, it is not useful to think of β1 as a slope; indeed, because D can take only two values, there is no line, so it does not make sense to talk about a slope. Instead, we will simply refer to β1 as the coefficient multiplying D in this regression, or the coefficient on D.
Female and bachelor are examples of dummy variables:

earning     bachelor   female
34.61538    1          0
19.23077    1          1
13.73626    0          1
19.23077    1          1
19.23077    1          0
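The group-means interpretation above can be checked with a sketch: regressing Y on a binary D by OLS gives an intercept equal to the D = 0 group mean and a coefficient on D equal to the difference in group means. The district scores below are toy numbers, not the textbook data:

```python
# With a binary regressor D, the OLS intercept is the mean of Y in the
# D = 0 group, and the coefficient on D is the difference in group means.
def binary_regression(d, y):
    y0 = [yi for di, yi in zip(d, y) if di == 0]
    y1 = [yi for di, yi in zip(d, y) if di == 1]
    b0 = sum(y0) / len(y0)       # average Y when D = 0
    b1 = sum(y1) / len(y1) - b0  # difference in sample averages
    return b0, b1

# Toy data: D = 1 if STR < 20 in the district, else 0
d = [1, 1, 0, 1, 0]
y = [660, 662, 650, 664, 652]    # made-up district test scores
b0, b1 = binary_regression(d, y)
print(b0, b1)  # 651.0 11.0
```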
Heteroskedasticity and Homoskedasticity:

• Homoskedasticity means that the variance of the errors is the same across all levels of the independent variable; in particular, the variance of the error does not depend on the independent variable. An important assumption of OLS is that the variance of the residuals is homoskedastic, or constant: the residuals must not vary systematically for lower or higher values of the independent variable.
• When the variance of the errors differs at different values of the independent variable, heteroskedasticity is indicated. A scatterplot of these variables will often show a cone-like shape, as the scatter (or variability) of the dependent variable widens or narrows as the value of the independent variable increases. When heteroskedasticity is marked, it can lead to serious distortion of findings and seriously weaken the analysis, thus increasing the possibility of a Type I error.
