Lecture 4.3 Regression-1


Regression

What is Regression?
Relationships

In science, we frequently measure two or more variables on the same
individual (case, object, etc.). We do this to explore the nature of the
relationship among these variables. There are two basic types of
relationships.

• Cause-and-effect relationships.
• Functional relationships.

Function: a mathematical relationship enabling us to predict what
values of one variable (Y) correspond to given values of another
variable (X).

• Y: is referred to as the dependent variable, the response
variable or the predicted variable.
• X: is referred to as the independent variable, the explanatory
variable or the predictor variable.
Examples of variables
Each pair lists the dependent variable (Y) followed by the explanatory variable (X).
• Y: The time needed to fill a soft drink vending machine. X: The number of cases needed to fill the machine.
• Y: The tensile strength of wrapping paper. X: The percent of hardwood in the pulp batch.
• Y: Percent germination of begonia seeds. X: The intensity of light in an incubator.
• Y: The mean litter weight of test rats. X: The litter size.
• Y: Maintenance cost of tractors. X: The age of the tractor.
• Y: The repair time for a computer. X: The number of components which have to be changed.

In each case, the statement can be read as: Y is a function of X.

Two kinds of explanatory variables:
• Those we can control.
• Those over which we have little or no control.
Goal of regression
Develop a statistical model that can predict the
values of a dependent (response) variable
based upon the values of the independent
(explanatory) variables.
Simple Regression

A statistical model that utilizes one
quantitative independent variable
“X” to predict the quantitative
dependent variable “Y.”
Multiple Regression
A statistical model that utilizes two
or more quantitative and qualitative
explanatory variables (x1,..., xp) to
predict a quantitative dependent
variable Y.
Caution: as a rule of thumb, include at least two quantitative
explanatory variables.
An operations supervisor measured how long it takes one of her drivers to put 1, 2, 3 and 4
cases of soft drink into a soft drink machine. In this case the levels of the explanatory variable
X are {1, 2, 3, 4}, and she controls them. She might repeat the measurement a couple of times
at each level of X. A scatter plot of the resulting data might look like:

A forestry graduate student makes wrapping paper out of different percentages
of hardwood, then measures its tensile strength. He has the freedom to choose at
the beginning of the study to work with only five percentages, say {5%,
10%, 15%, 20%, and 25%}. A scatter plot of the resulting data might look like:

A farm manager is interested in the relationship between litter size and average
litter weight (average newborn piglet weight). She examines the farm records
over the last couple of years and records the litter size and average weight for all
births. A plot of the data pairs looks like the following:

A farm operations student is interested in the relationship between maintenance
cost and age of farm tractors. He performs a telephone interview survey of the 52
commercial potato growers in Putnam County, FL. One part of the questionnaire
provides information on tractor age and 1995 maintenance cost (fuel, lubricants,
repairs, etc). A plot of these data might look like:

Questions needing answers.
• What is the association between Y and X?
• How can changes in Y be explained by changes in X?
• What are the functional relationships between Y and X?
A functional relationship is symbolically written as:

Eq. 1: Y = f(X)

Example: A proportional relationship (e.g. fish weight to length).

Y = b1 X
b1 is the slope of the line.

STA6166-RegBasics 11
Example: Linear relationship (e.g. Y = cholesterol versus X = age)

Y = b0 + b1 X
b0 is the intercept,
b1 is the slope.

Interpretation of the Slope and the Intercept
• b0 is the estimated mean value of Y when the value of X is zero

• b1 is the estimated change in the mean value of Y as a result of
a one-unit increase in X
Simple Linear Regression Equation (Prediction
Line)
The simple linear regression equation provides an
estimate of the population regression line

Ŷi = b0 + b1 Xi

where:
• Ŷi is the estimated (or predicted) Y value for observation i
• b0 is the estimate of the regression intercept
• b1 is the estimate of the regression slope
• Xi is the value of X for observation i
Method of Least Squares
• Finds the straight line that best fits the data.

• Determine the straight line for which the sum of the squared differences
between the actual values (Y) and the values predicted from the
fitted line of regression (Y-hat) is as small as possible.
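
The least-squares criterion can be written out as a short derivation sketch. The sample size n and the sample means X-bar and Y-bar are notation assumed here and do not appear on the slides. Minimizing the sum of squared differences over b0 and b1 gives the usual closed-form estimates:

```latex
\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
  = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2

b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
           {\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{X}
```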
Measures of Variation
• Explained variation (sum of squares due to regression)
• Unexplained variation (error sum of squares)
• Total sum of squares
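
In symbols, the three measures are usually written as below. The abbreviations SST, SSR and SSE and the sample mean Y-bar are assumed notation, not taken from the slides. For a least-squares line with an intercept, the total variation splits exactly into the explained and unexplained parts:

```latex
SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad
SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

SST = SSR + SSE
```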
Simple Linear Regression
Example
• A real estate agent wishes to examine the relationship between the
selling price of a home and its size (measured in square feet)

• A random sample of 10 houses is selected


• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet
Simple Linear Regression Example: Data
House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
Simple Linear Regression Example: Scatter Plot

[Figure: House price model scatter plot. Vertical axis: House Price ($1000s). Horizontal axis: Square Feet.]
Simple Linear Regression Example:
Interpretation of b0
house price = 98.24833 + 0.10977 (square feet)

• b0 is the estimated mean value of Y when the value
of X is zero (if X = 0 is in the range of observed X
values)
• Because a house cannot have a square footage of 0,
b0 has no practical application
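
As a check on the fitted line, here is a minimal Python sketch that applies the standard closed-form least-squares formulas to the ten houses from the data slide. The function and variable names are illustrative choices, not part of the lecture.

```python
# House data from the slides: size in square feet (X) and price in $1000s (Y).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]


def fit_simple_linear(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


b0, b1 = fit_simple_linear(square_feet, price)
print(f"house price = {b0:.5f} + {b1:.5f} (square feet)")
```

Rounded to five decimals, the result agrees with the coefficients quoted on the slide (98.24833 and 0.10977).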
Multiple Regression (Linear Model)
Relationship between one dependent & two or
more independent variables is a linear function

Y = β0 + β1 X1 + β2 X2 + ... + βP XP + ε

where:
• Y is the dependent (response) variable
• β0 is the population Y-intercept
• β1, ..., βP are the population slopes
• X1, ..., XP are the independent (explanatory) variables
• ε is the random error
Mini-Case
Predict the consumption of home heating oil during January for
homes located around Screne Lakes. Two explanatory variables are
selected: average daily atmospheric temperature (°F) and the
amount of attic insulation (inches).
Mini-Case
Develop a model for estimating heating oil used for a single family
home in the month of January based on average temperature
and amount of insulation in inches.

Oil (Gal)    Temp (°F)    Insulation (in)
275.30       40           3
363.80       27           3
164.30       40           10
40.80        73           6
94.30        64           6
230.90       34           6
366.70       9            6
300.60       8            10
237.80       23           10
121.40       63           3
31.40        65           10
203.50       41           6
441.10       21           3
323.00       38           3
52.50        58           10
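
One way to develop such a model is to fit Y-hat = b0 + b1 x1 + b2 x2 to the fifteen homes by least squares. The sketch below solves the normal equations with plain Gaussian elimination so that it needs no external libraries. The helper names `solve` and `fit_multiple` are hypothetical, and in practice a statistics package would normally do this step.

```python
# Mini-case data: (oil in gallons, temperature in degrees F, insulation in inches).
data = [
    (275.30, 40, 3), (363.80, 27, 3), (164.30, 40, 10), (40.80, 73, 6),
    (94.30, 64, 6), (230.90, 34, 6), (366.70, 9, 6), (300.60, 8, 10),
    (237.80, 23, 10), (121.40, 63, 3), (31.40, 65, 10), (203.50, 41, 6),
    (441.10, 21, 3), (323.00, 38, 3), (52.50, 58, 10),
]


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple(rows):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    design = [[1.0, x1, x2] for _, x1, x2 in rows]
    y = [oil for oil, _, _ in rows]
    k = 3
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


b0, b1, b2 = fit_multiple(data)
print(f"Y-hat = {b0:.2f} + ({b1:.2f}) x1 + ({b2:.2f}) x2")
```

The fitted coefficients should agree, after rounding, with the equation given on the Multiple Regression Equation slide.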
Mini-Case
• What preliminary conclusions can home owners draw from the
data?

• What could a home owner expect heating oil consumption (in
gallons) to be if the outside temperature is 15 °F when the attic
insulation is 10 inches thick?
Multiple Regression Equation
[mini-case]

Y-hat = 562.15 - 5.44x1 - 20.01x2

where: x1 = temperature [degrees F]
       x2 = attic insulation [inches]
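
To answer the mini-case question about a 15 °F month with 10 inches of attic insulation, the equation can be evaluated directly. This is a minimal sketch using the rounded coefficients from the slide; the function name `predict_oil` is a hypothetical choice.

```python
def predict_oil(temp_f, insulation_in):
    """Predicted January heating oil use in gallons, using the slide's equation."""
    return 562.15 - 5.44 * temp_f - 20.01 * insulation_in


gallons = predict_oil(15, 10)
print(f"Predicted consumption: {gallons:.2f} gallons")  # 280.45 gallons
```

Both inputs lie inside the observed ranges in the data table (8 to 73 °F and 3 to 10 inches), so this prediction is an interpolation.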
Extrapolation
[Figure: Y plotted against X. Predictions inside the relevant range of X are interpolation; predictions on either side of that range are extrapolation.]
Simple Linear Regression Example: Making
Predictions
• When using a regression model for prediction, only
predict within the relevant range of data
• Do not try to extrapolate beyond the range of observed X’s

[Figure: scatter plot of House Price ($1000s) against Square Feet, with the relevant range for interpolation marked.]
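
The relevant-range rule can be enforced in code. The sketch below assumes the house-price line from the earlier slide and refuses to predict outside the observed square footage (1100 to 2450 in the sample). The function name and the choice to raise an error are illustrative, not something the lecture prescribes.

```python
# Observed X values from the house example and the coefficients quoted on the slide.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
B0, B1 = 98.24833, 0.10977


def predict_price(sqft, observed_x=square_feet):
    """Predict price in $1000s, but only inside the relevant range of X."""
    lo, hi = min(observed_x), max(observed_x)
    if not lo <= sqft <= hi:
        raise ValueError(
            f"{sqft} sq ft is outside the observed range [{lo}, {hi}]; "
            "refusing to extrapolate"
        )
    return B0 + B1 * sqft


print(f"{predict_price(2000):.2f}")  # interpolation: inside the relevant range
try:
    predict_price(3000)  # extrapolation: outside the relevant range
except ValueError as err:
    print(err)
```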
Assumptions of Regression
L.I.N.E
• Linearity
  • The relationship between X and Y is linear
• Independence of Errors
  • Error values are statistically independent
  • Particularly important when data are collected over a period of time
• Normality of Error
  • Error values are normally distributed for any given value of X
• Equal Variance (also called homoscedasticity)
  • The probability distribution of the errors has constant variance
Evaluating the Model
• After training and testing our model, we can use metrics to understand how well it predicts. For regression
models, four evaluation metrics are commonly used:
• Mean Absolute Error (MAE): Subtract the predicted values from the actual values to obtain the errors,
then take the mean of the absolute values of those errors. This metric gives a notion of the typical
size of the error in each prediction; the smaller (closer to 0), the better.

• R-squared (R2): This metric measures the proportion of variance in the target variable explained by the
model. An R2 score of 1 indicates a perfect fit, while a score of 0 indicates that the model is no better
than predicting the mean value of the target variable.
• Mean Squared Error (MSE): Similar to MAE, but it squares the errors instead of taking their absolute values.
As with MAE, the smaller (closer to 0), the better. Squaring makes large errors weigh more heavily than
small ones. Note that MSE is usually hard to interpret, because its values are in the squared units of the
data rather than on the same scale as the data.

• Root Mean Squared Error (RMSE): Addresses the interpretation problem of MSE by taking the square root
of its value, which brings it back to the same units as the data. This makes it easier to interpret
and useful when the error has to be reported alongside actual values. For example, an RMSE of 4.35 means
that predictions typically miss the actual value by about 4.35 units, either above or below it.
As with the other error metrics, the closer to 0, the better.
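
The four metrics can be computed in a few lines. The sketch below uses the house-price data and the fitted line from the earlier slides purely as an illustration. It evaluates the model on the same ten houses that were used to fit it rather than on a separate test set, and it divides by n in MSE; both are assumptions of this example.

```python
import math

# House example: actual prices in $1000s and predictions from the slide's fitted line.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
actual = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
predicted = [98.24833 + 0.10977 * x for x in square_feet]


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # mean of absolute errors
    mse = sum(e * e for e in errors) / n    # mean of squared errors
    rmse = math.sqrt(mse)                   # back in the units of Y
    y_mean = sum(y_true) / n
    ss_total = sum((yt - y_mean) ** 2 for yt in y_true)
    ss_error = sum(e * e for e in errors)
    r2 = 1 - ss_error / ss_total            # share of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here the RMSE is in thousands of dollars, the same unit as the house prices, while the MSE is in squared thousands of dollars.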
