Chapter 1 Simple Linear Regression

This document discusses simple linear regression models. It defines key terms like dependent and independent variables and the error term. It also explains how to estimate regression coefficients using ordinary least squares on sample data to find the line of best fit.


II. Simple regression model


Definition
• y and x are two variables representing some population, and we are interested in "explaining y in terms of x," or in "studying how y varies with changes in x."
• Examples: y is soybean crop yield and x is the amount of fertilizer; y is the hourly wage and x is years of education; y is a community crime rate and x is the number of police officers.
• Simple linear regression model:

y = β₀ + β₁x + u

• It is also called the two-variable linear regression model or bivariate linear regression model because it relates the two variables x and y.
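
As a quick illustration (not part of the original slides), the following Python sketch simulates data from this model with hypothetical parameter values β₀ = 1 and β₁ = 0.5; the variable names and numbers are made up for demonstration only:

import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 1.0, 0.5          # hypothetical intercept and slope
n = 100                          # sample size

x = rng.uniform(0, 10, size=n)   # observed explanatory variable
u = rng.normal(0, 1, size=n)     # unobserved error term with mean zero
y = beta0 + beta1 * x + u        # simple linear regression model: y = b0 + b1*x + u
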
The meaning of each term (1)
• The variables y and x have several different names that are used interchangeably.
• y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand.
• x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor.
• The terms "dependent variable" and "independent variable" are frequently used in econometrics.
The meaning of each term (2)
• The variable u, called the error term or disturbance in the relationship, represents factors other than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as being unobserved. We can usefully think of u as standing for "unobserved."
• The equation also addresses the issue of the functional relationship between y and x. If the other factors in u are held fixed, so that the change in u is zero (Δu = 0), then x has a linear effect on y:

Δy = β₁Δx  if  Δu = 0

• Thus, the change in y is simply β₁ multiplied by the change in x. This means that β₁ is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter β₀ also has its uses, although it is rarely central to an analysis.
Examples
EX1:
Suppose that soybean yield is determined by the model

yield = β₀ + β₁ fertilizer + u,

so that y = yield and x = fertilizer. The economist is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by β₁. The error term u contains factors such as land quality, rainfall, and so on. The coefficient β₁ measures the effect of fertilizer on yield, holding other factors fixed: Δyield = β₁Δfertilizer.
EX2:
A model relating a person's wage to observed education and other unobserved factors is

wage = β₀ + β₁ educ + u.

If wage is measured in dollars per hour and educ is years of education, then β₁ measures the change in hourly wage given another year of education, holding all other factors fixed. Some of those factors include labor force experience, innate ability, tenure with the current employer, work ethic, and innumerable other things.
• Before we state the key assumption about how x and u are related, there is one assumption about u that we can always make. As long as the intercept β₀ is included in the equation, nothing is lost by assuming that the average value of u in the population is zero:

E(u) = 0

• Because u and x are random variables, we can define the conditional distribution of u given any value of x. In particular, for any x, we can obtain the expected (or average) value of u for that slice of the population described by the value of x. The crucial assumption is that the average value of u does not depend on the value of x. We can write this as:

E(u | x) = E(u) = 0

• This means that, for any given value of x, the average of the unobservables is the same and therefore must equal the average value of u in the entire population.
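
Stated for completeness (a standard consequence, added here rather than taken from the slide): taking the expectation of y = β₀ + β₁x + u conditional on x and using E(u | x) = 0 gives the population regression function

E(y | x) = β₀ + β₁x,

so the average value of y is a linear function of x, which is exactly what the regression line describes.
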
Deriving the OLS Estimates
• Now that we have discussed the basic ingredients of the simple regression model, we will address the important issue of how to estimate the parameters β₀ and β₁ in the equation y = β₀ + β₁x + u.
• To do this, we need a sample from the population. Let {(xᵢ, yᵢ): i = 1, ..., n} denote a random sample of size n from the population. Since these data come from the population model, we can write:

yᵢ = β₀ + β₁xᵢ + uᵢ,  i = 1, ..., n

• uᵢ is the error term for observation i, since it contains all factors affecting yᵢ other than xᵢ.
• Example: xᵢ might be the annual income and yᵢ the annual savings for family i during a particular year. If we have collected data on 15 families, then n = 15. A scatter plot of such a data set is given in the following figure.
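
As an illustration (assuming numpy and matplotlib are available; the income and savings figures below are invented, not data from the text), a minimal sketch of such a scatter plot for n = 15 families could look like this:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

n = 15
income = rng.uniform(20_000, 120_000, size=n)           # xi: annual income (illustrative)
savings = 0.08 * income + rng.normal(0, 2_000, size=n)  # yi: annual savings with noise

plt.scatter(income, savings)
plt.xlabel("annual income (x)")
plt.ylabel("annual savings (y)")
plt.title("Random sample of n = 15 families")
plt.show()
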
Practice: Estimating parameters
• Let us approach the topic of regression analysis with an example. A mail order business adds a new
summer dress to its collection. The purchasing manager needs to know how many dresses to buy so
that by the end of the season the total quantity purchased equals the quantity ordered by customers. To
prevent stock shortages (i.e. customers going without wares) and stock surpluses (i.e. the business is left stuck with extra dresses), the purchasing manager decides to carry out a sales forecast.
• What’s the best way to forecast sales? The economist immediately thinks of several possible
predictors or influencing variables. How high are sales of a similar dress in the previous year? How
high is the price? How large is the image of the dress in the catalogue? How large is the advertising
budget for the dress? But we don’t only want to know which independent variables exert an influence;
we want to know how large the respective influence is. To know that catalogue image size has an
influence on the number of orders does not suffice. We need to find out the number of orders that can
be expected on average when the image size is, say, 50 sq cm.
Let us first consider the case where future
demand is estimated from the sales of a
similar dress from the previous year. The
following figure displays the association as
a scatterplot for 100 dresses of a given price
category, with the future demand plotted on
the y-axis and the demand from the previous
year plotted on the x-axis.

If all the points lay on the angle bisector (an angle bisector divides an angle into two angles of equal measure), the future demand of period (t) would equal the quantities sold in the previous year (t − 1). As is easy to see, this is only rarely the case. The resulting scatterplot contains some large deviations, producing a correlation coefficient of only r = 0.42.
Now if, instead of equivalent dresses from
the previous year, we take into account the
catalogue image size for the current season
(t), we arrive at the scatterplot in the new
following figure. We see immediately that
the data points lie much closer to the line,
which was drawn to best approximate the
course of the data. This line is more suited
for a sales forecast than a line produced
using the “equivalence method” in the
previous Figure.

The relatively large correlation coefficient of r = 0.95 confirms that the linear association between these variables is much stronger. The points lie much closer to the line, which means that the sales forecast will result in fewer costs for stock shortages and stock surpluses. But, again, this applies only to products of the same quality and in a specific price category.
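
The correlation coefficients quoted above (r = 0.42 and r = 0.95) come from the book's dress data, which is not reproduced here; the Python sketch below only shows how such a coefficient could be computed for two generic columns, using invented values:

import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length arrays."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.corrcoef(x, y)[0, 1]

# Illustrative values only -- not the dress data from the figures.
x = [30, 40, 47, 55, 60, 70]        # catalogue image size in sq cm
y = [195, 222, 248, 250, 262, 286]  # number of dresses sold
print(round(pearson_r(x, y), 2))
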
The linear equation consists of two components:
1. The intercept is where the line crosses the y-axis. We call this point α. It determines the distance along the y-axis between the line's crossing point and the origin.
2. The slope coefficient (β) indicates the slope of the line. From this coefficient we can determine to what extent catalogue image size impacts demand. If the slope of the line is two, the value on the y-axis changes by two units whenever the value on the x-axis changes by one unit. In other words, the flatter the slope, the less influence the x variable has on y.

The line in the scatterplot in our previous figure can be represented with the algebraic linear equation:

ŷ = α + βx

This equation intersects the y-axis at the value 138, so that α = 138. Its slope is calculated from the slope triangle (quotient): β = 82/40 ≈ 2.1. When the image size increases by 10 sq cm, the demand increases by 21 dresses. The total linear equation is:

ŷ = 138 + 2.1x

For a dress with an image size of 50 sq cm, we can expect sales to be 138 + 2.1 · 50 = 243 dresses.
With an image size of 70 sq cm, the expected sales are 138 + 2.1 · 70 = 285 dresses.
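
These two predictions follow mechanically from the fitted line; a small Python sketch of the arithmetic, with α = 138 and β = 2.1 taken from the text (the helper function name is ours, for illustration only):

alpha, beta = 138.0, 2.1   # intercept and slope of the fitted line from the text

def predict(image_size_sqcm):
    """Expected number of dresses sold for a given catalogue image size."""
    return alpha + beta * image_size_sqcm

print(predict(50))   # 243.0 dresses
print(predict(70))   # 285.0 dresses
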
This linear estimation approximates the average influence of the x variable on the y variable using a mathematical function. The estimated values are indicated by ŷᵢ, and the realized values are indicated by yᵢ. Although the linear estimation runs through the entire quadrant, the association between the x and y variables is only calculated for the area that contains data points, referred to as the data range. If we use the regression function for estimations outside this area (as part of a forecast, for instance), we must assume that the association outside the data range does not differ from the association within the data range.

To better illustrate this point, consider the figure above. The marked data point corresponds to dress model 23, which was advertised with an image size of 47.4 sq cm and which was later sold 248 times. The linear regression estimates average sales of 238 dresses for this image size. The difference between actual sales and estimated sales is referred to as the residual or the error term. It is calculated by:

εᵢ = yᵢ − ŷᵢ

For dress model 23 the residual is: ε₂₃ = y₂₃ − ŷ₂₃ = 248 − 237.5 = 10.5.
In this way, every data point can be expressed as a combination of the result of the linear regression and its residual:

yᵢ = ŷᵢ + εᵢ
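
The residual calculation for dress model 23 can be reproduced from the figures given in the text (image size 47.4 sq cm, actual sales 248); a minimal Python sketch, assuming the same α = 138 and β = 2.1:

alpha, beta = 138.0, 2.1

x_23 = 47.4                        # catalogue image size of dress model 23 (sq cm)
y_23 = 248                         # actual sales of dress model 23

y_hat_23 = alpha + beta * x_23     # fitted value: 138 + 2.1 * 47.4 = 237.54, about 237.5
residual_23 = y_23 - y_hat_23      # 248 - 237.54 = 10.46, about 10.5 as in the text
print(round(y_hat_23, 1), round(residual_23, 1))
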

We have yet to explain which rule applies for determining this line and how it can be derived algebraically. Up to now we have only required that the line run as closely as possible to as many data points as possible, and that deviations above and below the line be kept to a minimum and be distributed nonsystematically. The deviations in the figure between actual demand and the regression line create stock shortages when they are located above the line and stock surpluses when they are located below it.

Since we want to prevent both, we can position the line so that the sum of deviations between the realized points and the points on the line is as close to zero as possible. The problem with this approach is that a variety of possible lines with different qualities of fit all fulfil this condition. A selection of possible lines is shown in the figure above.
The reason for this is simple: the deviations above and below cancel each other out, resulting in a sum of zero.

All lines that run through the bivariate centroid (the value pair (x̄, ȳ) of the averages of x and y) fulfil the condition:

Σᵢ (yᵢ − ŷᵢ) = 0
But in view of the differences in quality among the lines, the condition above makes little sense as a construction criterion. Instead, we need a line that does not allow deviations to cancel each other out yet still limits the total sum of errors. Frequently, statisticians create a line that minimizes the sum of the squared deviations of the actual data points yᵢ from the points on the line ŷᵢ. The minimization of the entire deviation error is:

min Σᵢ (yᵢ − ŷᵢ)²

This method of generating the regression line is called the ordinary least squares method, or OLS. It can be shown that this line also runs through the bivariate centroid, i.e. the value pair (x̄, ȳ), but this time we have only a single regression line, which fulfils the condition of the minimal squared error. If we insert the equation of the regression line ŷᵢ = α + βxᵢ for ŷᵢ, we get:

f(α, β) = Σᵢ (yᵢ − α − βxᵢ)² → min

The minimum can be found by using the necessary conditions for a minimum: differentiating the function f(α, β) once with respect to α and once with respect to β, and setting both derivatives equal to zero.
From what we know about a mean, namely that Σᵢ yᵢ = n·ȳ and Σᵢ xᵢ = n·x̄, the first condition gives:

α = ȳ − βx̄

We should now rearrange the function and simplify it. Substituting α into the second condition and solving for β yields the OLS slope:

β = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
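
A compact Python sketch of these closed-form OLS solutions (the helper below simply implements the two formulas above and is applied to invented data, not the textbook's dress data):

import numpy as np

def ols(x, y):
    """Ordinary least squares estimates for the simple regression y = alpha + beta*x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha = y.mean() - beta * x.mean()
    return alpha, beta

# Illustrative values only.
x = [30, 40, 47, 55, 60, 70]
y = [195, 222, 248, 250, 262, 286]
alpha, beta = ols(x, y)
print(alpha, beta)   # the fitted line alpha + beta*x runs through (x-bar, y-bar)
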
