Lecture 4.3 Regression-1
What is Regression?
Relationships
• Cause-and-effect relationships.
• Functional relationships.
A forestry graduate student makes wrapping paper out of different percentages
of hardwood and then measures its tensile strength. He is free to choose, at the
beginning of the study, to work with only five percentages, say {5%, 10%, 15%,
20%, and 25%}. A scatter plot of the resulting data might look like:
A farm manager is interested in the relationship between litter size and average
litter weight (average newborn piglet weight). She examines the farm records
over the last couple of years and records the litter size and average weight for all
births. A plot of the data pairs looks like the following:
A farm operations student is interested in the relationship between maintenance
cost and age of farm tractors. He performs a telephone interview survey of the 52
commercial potato growers in Putnam County, FL. One part of the questionnaire
provides information on tractor age and 1995 maintenance cost (fuel, lubricants,
repairs, etc.). A plot of these data might look like:
Questions needing answers.
• What is the association between Y and X?
• How can changes in Y be explained by changes in X?
• What are the functional relationships between Y and X?
A functional relationship is symbolically written as:

Y = f(X)    (Eq. 1)
Example: A proportional relationship (e.g. fish weight to length):

Y = b1 X

where b1 is the slope of the line.
Example: Linear relationship (e.g. Y = cholesterol versus X = age):

Y = b0 + b1 X

where b0 is the intercept and b1 is the slope.
Interpretation of the Slope and the Intercept
• b0 is the estimated mean value of Y when the value of X is zero.
• b1 is the estimated change in the mean value of Y per one-unit increase in X.
The fitted regression equation for observation i is:

Ŷi = b0 + b1 Xi

where Ŷi is the estimated (or predicted) Y value for observation i, b0 is the
estimate of the regression intercept, b1 is the estimate of the regression slope,
and Xi is the value of X for observation i.
Method of Least Squares
• The straight line that best fits the data.
• Determine the straight line for which the sum of squared differences between
the actual values (Y) and the values that would be predicted from the
fitted line of regression (Y-hat) is as small as possible (a sketch follows below).
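As a concrete illustration, here is a minimal Python sketch of this closed-form least-squares fit (the function name and arrays are illustrative, not from the lecture):

import numpy as np

def least_squares_fit(x, y):
    """Fit Y = b0 + b1*X by minimizing the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Closed-form least-squares slope: b1 = Sxy / Sxx
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # The fitted line always passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

The returned b0 and b1 are the estimates used in the fitted equation Ŷi = b0 + b1 Xi above.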
Measures of Variation
• Explained variation (sum of squares due to regression, SSR)
• Unexplained variation (error sum of squares, SSE)
• Total sum of squares (SST = SSR + SSE; see the sketch after this list)
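A minimal sketch of this decomposition in Python, assuming y holds the observed values and y_hat the fitted values from a least-squares regression (the names are illustrative):

import numpy as np

def variation_decomposition(y, y_hat):
    """Split the total variation in y into explained and unexplained parts."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)   # explained (regression) sum of squares
    sse = np.sum((y - y_hat) ** 2)          # unexplained (error) sum of squares
    return ssr, sse, sst                    # for a least-squares fit, ssr + sse == sst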
Simple Linear Regression
Example
• A real estate agent wishes to examine the relationship between the
selling price of a home and its size (measured in square feet)
The general form of the regression model is:

Y = β0 + β1 X1 + β2 X2 + … + βP XP

where Y is the dependent (response) variable and X1, X2, …, XP are the
independent (explanatory) variables. With a single explanatory variable
(P = 1), this reduces to the simple linear regression model used here.
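For the one-variable real estate example, a minimal sketch using scikit-learn's LinearRegression; the house sizes and prices below are hypothetical illustration values, not data from the lecture:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size (square feet) and selling price ($1000s)
size = np.array([[1400], [1600], [1700], [1875], [1100],
                 [1550], [2350], [2450], [1425], [1700]])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

model = LinearRegression().fit(size, price)
print(model.intercept_, model.coef_[0])    # b0 and b1
print(model.predict(np.array([[2000]])))   # predicted price for a 2000 sq ft home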
Mini-Case
Predict the consumption of home heating oil during January for
homes located around Screne Lakes. Two explanatory variables are
selected: average daily atmospheric temperature (°F) and the
amount of attic insulation (inches).
Mini-Case
Develop a model for estimating heating oil used for a single-family
home in the month of January based on average temperature and
amount of insulation in inches. The data are below, followed by a
fitting sketch.

Oil (Gal)   Temp (°F)   Insulation (in)
275.30      40           3
363.80      27           3
164.30      40          10
 40.80      73           6
 94.30      64           6
230.90      34           6
366.70       9           6
300.60       8          10
237.80      23          10
121.40      63           3
 31.40      65          10
203.50      41           6
441.10      21           3
323.00      38           3
 52.50      58          10
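A minimal sketch fitting the two-variable least-squares model to the table above with scikit-learn (the prediction point at the end is an arbitrary example):

import numpy as np
from sklearn.linear_model import LinearRegression

# Data transcribed from the mini-case table
oil = np.array([275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70,
                300.60, 237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58])
insul = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10])

X = np.column_stack([temp, insul])          # explanatory variables
model = LinearRegression().fit(X, oil)
print(model.intercept_, model.coef_)        # b0 and (b_temp, b_insul)
print(model.predict(np.array([[30, 6]])))   # predicted gallons at 30 °F, 6 in

Both fitted slopes should come out negative, consistent with the table: oil consumption falls as temperature rises and as insulation thickens.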
Mini-Case
• What preliminary conclusions can home owners draw from the
data?
[Diagram: the relevant range of X, with extrapolation regions on either side.]
Simple Linear Regression Example: Making
Predictions
• When using a regression model for prediction, only
predict within the relevant range of data
[Scatter plot: House Price ($1000s, 0–450) versus Square Feet (0–3000),
marking the relevant range for interpolation. Do not try to extrapolate
beyond the range of observed X's.]
Assumptions of Regression
L.I.N.E. (see the checks sketched after this list)
• Linearity
• The relationship between X and Y is linear
• Independence of Errors
• Error values are statistically independent
• Particularly important when data are collected over a period
of time
• Normality of Error
• Error values are normally distributed for any given value of X
• Equal Variance (also called homoscedasticity)
• The probability distribution of the errors has constant
variance
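These assumptions are usually checked on the residuals of the fitted model. A minimal sketch of some rough numeric checks, assuming arrays x, y, and y_hat from a fit; the function and the half-split variance check are illustrative devices, not from the lecture (graphical residual plots are the more common diagnostic):

import numpy as np
from scipy import stats

def check_line_assumptions(x, y, y_hat):
    """Rough numeric checks of the L.I.N.E. assumptions on the residuals."""
    x = np.asarray(x, dtype=float)
    resid = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    # Normality of errors: Shapiro-Wilk test (a large p-value gives no
    # evidence against normality)
    _, p_normal = stats.shapiro(resid)
    # Independence of errors: lag-1 autocorrelation (values near 0 are reassuring)
    lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    # Equal variance: compare residual spread at low vs high X
    # (a ratio near 1 is reassuring)
    order = np.argsort(x)
    half = len(x) // 2
    spread_ratio = resid[order[half:]].std() / resid[order[:half]].std()
    return {"shapiro_p": p_normal, "lag1_autocorr": lag1,
            "spread_ratio": spread_ratio}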
Evaluating the Model
• After training and testing our model, we can understand how well it predicts by using some metrics.
For regression models, four evaluation metrics are mainly used (a computation sketch follows this list):
• Mean Absolute Error (MAE): Subtract the predicted values from the actual values to obtain the
errors, then take the mean of the absolute values of those errors. This metric gives a notion of the
overall error for each prediction of the model; the smaller (closer to 0), the better.
• R-squared (R2): This metric measures the proportion of variance in the target variable explained by the
model. An R2 score of 1 indicates a perfect fit, while a score of 0 indicates that the model is no better
than predicting the mean value of the target variable.
• Mean Squared Error (MSE): Similar to MAE, but it squares the errors before averaging them.
As with MAE, the smaller (closer to 0), the better. The errors are squared so as to make large errors
weigh even more heavily. One thing to pay close attention to is that MSE is usually a hard metric
to interpret, because its values can be large and are not on the same scale as the data.
• Root Mean Squared Error (RMSE): Tries to solve the interpretation problem raised by MSE by taking
the square root of its final value, scaling it back to the same units as the data. It is easier to interpret
and useful when we need to report the error in the same units as the data. It shows how much the
predictions typically deviate: with an RMSE of 4.35, the model's predictions are off by about 4.35
units in either direction, above or below the actual value. The closer to 0, the better as well.
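A minimal sketch computing all four metrics with scikit-learn; y_true and y_pred are hypothetical illustration values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted values from a fitted regression model
y_true = np.array([275.3, 363.8, 164.3, 40.8, 94.3])
y_pred = np.array([260.1, 350.9, 180.2, 55.4, 101.7])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                 # scale MSE back to the data's units
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")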