Regression
The New York Times: https://www.nytimes.com/2023/12/05/science/bordeaux-red-wine-estate-machine-learning.html
Building a Model
• Ashenfelter used a method called linear regression
• Predicts an outcome variable, or dependent variable
• Predicts using a set of independent variables
• Independent variables:
• Age – older wines are more expensive
• Weather
• Average Growing Season Temperature
• Harvest Rain
• Winter Rain
The Data (1952 – 1978)
[Figure: four scatterplots of (Logarithm of) Price against Age of Wine (Years), Avg Growing Season Temp (Celsius), Harvest Rain (mm), and Winter Rain (mm)]
The Expert’s Reaction
Robert Parker, the world's most influential wine expert:
“Ashenfelter is an absolute total sham”
One-Variable Linear Regression
[Figure: scatterplot of (Logarithm of) Price (6.5 to 8.5) illustrating a one-variable linear regression]
The Regression Model
• One-variable regression model
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
• $y_i$ = dependent variable (wine price) for the $i$th observation
• $x_i$ = independent variable (temperature) for the $i$th observation
• $\epsilon_i$ = error term for the $i$th observation
• $\beta_0$ = intercept coefficient
• $\beta_1$ = regression coefficient for the independent variable
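As an illustration beyond the slides' own workflow, a one-variable model like this can be fit in Python with statsmodels; the file and column names ("wine.csv", "AGST", "Price") are hypothetical stand-ins for the Bordeaux data:

```python
# Sketch only: fitting the one-variable model above with statsmodels.
# "wine.csv", "AGST", and "Price" are hypothetical stand-ins for the data.
import pandas as pd
import statsmodels.api as sm

wine = pd.read_csv("wine.csv")
X = sm.add_constant(wine["AGST"])        # adds the intercept column (beta_0)
model = sm.OLS(wine["Price"], X).fit()   # least-squares estimates of beta_0, beta_1
print(model.params)                      # intercept and slope
print(model.rsquared)
```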
Selecting the Best Model
[Figure: scatterplots of (Logarithm of) Price vs. Avg Growing Season Temp (Celsius), comparing candidate regression lines by their sum of squared errors: SSE = 10.15, SSE = 6.03, and SSE = 5.73]
Other Error Measures
• SSE can be hard to interpret
• Depends on N
• Units are hard to understand
R²
• Compares the best model to a “baseline” model that does not use any variables
• The baseline predicts the same outcome for every observation
[Figure: scatterplots of (Logarithm of) Price, showing the regression model with SSE = 5.73 and the baseline model with SST = 10.15]
Interpreting R²
$R^2 = 1 - \dfrac{SSE}{SST}$
• R² captures the value added from using a model
• R² = 0 means no improvement over the baseline
• R² = 1 means a perfect predictive model
• Unitless and universally interpretable
• Can still be hard to compare between problems
• Good models for easy problems will have R² ≈ 1
• Good models for hard problems can still have R² ≈ 0
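A minimal sketch of this computation, using hypothetical numbers purely for illustration:

```python
# Sketch: R^2 from SSE and SST, per the formula above (hypothetical numbers).
import numpy as np

def r_squared(y, y_pred):
    sse = np.sum((y - y_pred) ** 2)      # model's sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)    # baseline (predict-the-mean) errors
    return 1 - sse / sst

y = np.array([6.5, 7.2, 8.1, 7.8])       # hypothetical log-prices
y_pred = np.array([6.7, 7.0, 8.0, 7.9])  # hypothetical predictions
print(r_squared(y, y_pred))
```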
Available Independent Variables
• So far, we have only used the Average Growing
Season Temperature to predict wine prices
Multiple Linear Regression
• Using each variable on its own:
• R² = 0.44 using Average Growing Season Temperature
• R² = 0.32 using Harvest Rain
• R² = 0.22 using France Population
• R² = 0.20 using Age
• R² = 0.02 using Winter Rain
The Regression Model
• Multiple linear regression model with k variables
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \epsilon_i$
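A sketch of fitting such a k-variable model in Python with statsmodels; the file and column names are hypothetical stand-ins for the wine data:

```python
# Sketch: multiple linear regression mirroring the k-variable equation above.
# File and column names are hypothetical stand-ins for the wine data.
import pandas as pd
import statsmodels.api as sm

wine = pd.read_csv("wine.csv")
X = sm.add_constant(wine[["AGST", "HarvestRain", "Age", "WinterRain"]])
model = sm.OLS(wine["Price"], X).fit()
print(model.summary())  # coefficients, R^2, and significance tests
```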
Adding Variables
Variables R²
Average Growing Season Temperature (AGST) 0.44
AGST, Harvest Rain 0.71
AGST, Harvest Rain, Age 0.79
AGST, Harvest Rain, Age, Winter Rain 0.83
AGST, Harvest Rain, Age, Winter Rain, Population 0.83
Selecting Variables
• Not all available variables should be used
• Each new variable requires more data
• Too many variables cause overfitting: high R² on the data used to build the model, but poor performance on unseen data
Understanding the Model and Coefficients
We have removed variables that were not significant.
Correlation
A measure of the linear relationship between two variables:
• +1 = perfect positive linear relationship
• 0 = no linear relationship
• -1 = perfect negative linear relationship
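For illustration, the Pearson correlation can be computed as below; the arrays are hypothetical:

```python
# Sketch: Pearson correlation between two variables (hypothetical values).
import numpy as np

x = np.array([15.0, 15.8, 16.2, 17.1])  # e.g., a temperature-like variable
y = np.array([6.6, 7.1, 7.4, 8.2])      # e.g., a log-price-like variable
print(np.corrcoef(x, y)[0, 1])          # lies between -1 and +1
```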
Examples of Correlation
[Figures: scatterplots illustrating correlation, involving (Logarithm of) Price, Avg Growing Season Temp (Celsius), France Population, and Age of Wine (Years)]
Testing the Assumptions of Multivariate Analysis
Some techniques are less affected by violating certain assumptions, which is termed robustness, but in all cases meeting some of the assumptions will be critical to a successful analysis. Thus, it is necessary to understand the role played by each assumption for every multivariate technique.
➢In almost all instances, the multivariate procedures will estimate the multivariate model and produce results
even when the assumptions are severely violated. Thus, the researcher must be aware of any assumption
violations and the implications they may have for the estimation process or the interpretation of the results.
➢Multivariate analysis requires that the assumptions underlying the statistical techniques be tested twice: first
for the separate variables, akin to the tests for a univariate analysis, and second for the multivariate model
variate, which acts collectively for the variables in the analysis and thus must meet the same assumptions as
individual variables.
➢Four of these assumptions potentially affect every univariate and multivariate statistical technique:
Data Assumptions:
❑ Normality (desired: Yes)
❑ Homoscedasticity (desired: Yes)
❑ Linearity (desired: Yes)
❑ Multicollinearity (desired: No, i.e., it should be absent)
Normality
Distribution
Let’s take a look and try it out
Normality: The most fundamental assumption in multivariate analysis is normality, referring to the shape
of the data distribution for an individual metric variable and its correspondence to the normal distribution.
➢If the variation from the normal distribution is sufficiently large, all resulting statistical tests
are invalid, because normality is required to use the F and t statistics. Both the univariate
and the multivariate statistical methods discussed in this text are based on the assumption of
univariate normality, with the multivariate methods also assuming multivariate normality.
➢If a variable is multivariate normal, it is also univariate normal. However, the reverse is not
necessarily true (two or more univariate normal variables are not necessarily multivariate
normal).
➢Thus, a situation in which all variables exhibit univariate normality will help gain, although
not guarantee, multivariate normality. Multivariate normality is more difficult to test, but
specialized tests are available in the techniques most affected by departures from
multivariate normality.
➢In most cases assessing and achieving univariate normality for all variables is sufficient, and we will
address multivariate normality only when it is especially critical. Even though large sample sizes tend to
diminish the detrimental effects of nonnormality, the researcher should always assess the normality for
all metric variables included in the analysis.
➢In most programs, the skewness and kurtosis of a normal distribution are given values of zero; values above or below zero denote departures from normality. For example, negative kurtosis values indicate a platykurtic (flatter) distribution, whereas positive values denote a leptokurtic (peaked) distribution. Likewise, positive skewness indicates a distribution with a longer right tail (its bulk shifted to the left), and negative skewness indicates a longer left tail (a rightward shift of the bulk).
➢Sample size has the effect of increasing statistical power by reducing sampling error, and it has a similar effect here: larger sample sizes reduce the detrimental effects of non-normality. In small samples of 50 or fewer observations, and especially if the sample size is less than 30 or so, significant departures from normality can have a substantial impact on the results. As sample sizes become large, the researcher can be less concerned about non-normal variables, except as they might lead to other assumption violations that have an impact in other ways.
Normality Testing Methods
1. Graphical analyses: histogram, normal P-P and Q-Q plots
2. Statistical tests: skewness and kurtosis z values, Shapiro-Wilk and Kolmogorov-Smirnov (K-S) tests
Graphical Analyses: The simplest diagnostic test for normality is a visual check of the histogram that compares the observed data values with a distribution approximating the normal distribution. Although appealing because of its simplicity, this method is problematic for smaller samples, where the construction of the histogram (e.g., the number of categories or the width of categories) can distort the visual portrayal to such an extent that the assessment is unreliable.
If either calculated z value exceeds the specified critical value, then the distribution is non-normal
in terms of that characteristic. The critical value is from a z distribution, based on the significance
level we desire. The most commonly used critical values are ±2.58 (.01 significance level) and
±1.96, which corresponds to a .05 error level. With these simple tests, the researcher can easily
assess the degree to which the skewness and peakedness of the distribution vary from the normal
distribution.
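The z values referred to here are conventionally computed from the sample skewness and kurtosis and the sample size N; as a sketch of the standard formulas:

$$z_{\text{skewness}} = \frac{\text{skewness}}{\sqrt{6/N}}, \qquad z_{\text{kurtosis}} = \frac{\text{kurtosis}}{\sqrt{24/N}}$$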
The two most common are the Shapiro-Wilk test and a modification of the Kolmogorov-Smirnov
test. Each calculates the level of significance for the differences from a normal distribution. The
researcher should always remember that tests of significance are less useful in small samples
(fewer than 30) and quite sensitive in large samples (exceeding 1,000 observations).
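A sketch of running both tests outside SPSS, in Python with scipy, on hypothetical data (note SPSS applies a Lilliefors correction to K-S, which scipy's plain kstest does not):

```python
# Sketch: Shapiro-Wilk and Kolmogorov-Smirnov tests via scipy (hypothetical data).
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(7.0, 0.6, 27)   # hypothetical log-prices
print(stats.shapiro(x))                              # Shapiro-Wilk W, p-value
z = (x - x.mean()) / x.std(ddof=1)                   # standardize before K-S
print(stats.kstest(z, "norm"))                       # K-S D statistic, p-value
# Note: SPSS reports a Lilliefors-corrected K-S; statsmodels' lilliefors()
# provides that correction if needed.
```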
Normality Hypothesis Testing in SPSS: Analyze -> Descriptive Statistics -> Explore -> Plots -> Normality plots with tests
Tests for Skewness and Kurtosis
• Skewness: -1 to +1 (or, more strictly, -0.5 to +0.5), or within three times the standard error
• Kurtosis: -3 to +3 (or, more strictly, -2 to +2), or within three times the standard error
• Values for asymmetry and kurtosis between -2 and +2 are considered acceptable in order to prove normal univariate distribution (George & Mallery, 2010)
• Relaxed rule:
  • Skewness > 1 = positively (right) skewed
  • Skewness < -1 = negatively (left) skewed
  • Skewness between -1 and 1 is fine
• Strict rule:
  • Abs(Skewness) > 3 × Std. Error = skewed
  • Same for kurtosis
P-P and Q-Q plots are two types of probability plots used to compare the distribution of a variable against a theoretical distribution. The P-P (probability-probability) plot compares the cumulative probabilities of the data to those of the theoretical distribution (e.g., the normal), while the Q-Q (quantile-quantile) plot compares the sample quantiles to the theoretical quantiles. These plots can help identify the underlying distribution of the data and whether it is normally distributed or not.
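A minimal sketch of producing a normal Q-Q-style plot in Python (hypothetical data; scipy's probplot plots sample versus theoretical quantiles):

```python
# Sketch: a normal Q-Q-style plot with scipy and matplotlib (hypothetical data).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(1).normal(7.0, 0.6, 27)  # hypothetical data
stats.probplot(x, dist="norm", plot=plt)            # sample vs. theoretical quantiles
plt.show()
```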
Tests for Normality
SPSS
1. Analyze
2. Descriptive Statistics -> Explore
3. Plots
4. Normality plots with tests
*Neither of these variables would be considered normally distributed according to the K-S or S-W measures, but a visual inspection shows that role conflict (left) is roughly normal and participation (right) is positively skewed.
So, ALWAYS conduct visual inspections!
➢Thus, the researcher should always use both the graphical plots and the statistical tests to assess the actual degree of departure from normality. This session confines the discussion to univariate normality tests.
Before and After Transformation
[Figure: a negatively skewed variable before and after a cube transformation]
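As a sketch of the idea (hypothetical data): cubing is one common remedy for negative skew, just as a log transform is for positive skew:

```python
# Sketch: skewness before and after a cube transformation (hypothetical data).
import numpy as np
from scipy.stats import skew

x = 10 - np.random.default_rng(2).exponential(1.0, 200)  # negatively skewed
print(skew(x))       # clearly negative
print(skew(x ** 3))  # cubing stretches the upper end, reducing negative skew
```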
Normality Write Up
➢ In assessing the normality of auction prices for wine, both the Kolmogorov-Smirnov and
Shapiro-Wilk tests were utilized. The Kolmogorov-Smirnov test yielded a statistic of .094 with a
significance level of .200, while the Shapiro-Wilk test produced a statistic of .950 with a
significance level of .215. Neither test indicated significant deviations from normality, as p-values were greater than the conventional alpha level of .05.
➢ The normal Q-Q plot further supports the conclusion of normality. Data points largely adhere
to the line of expected normal distribution, with only slight deviations, suggesting that the
auction prices do not significantly differ from a normal distribution.
➢ Lastly, the histogram of auction prices, with a mean of 7.04 and a standard deviation of .635
for the 27 observed wines, displays a fairly symmetrical distribution around the mean, which is
characteristic of a normal distribution. The overlaid normal curve fits the histogram reasonably
well, indicating that the data are approximately normally distributed.
➢ These statistical tests and visual inspections collectively suggest that the auction prices of wine
in this sample do not depart significantly from a normal distribution. It is, therefore,
reasonable to conclude that the auction prices for wine follow a normal distribution, allowing
for parametric statistical techniques that assume normality to be used in subsequent analyses.
➢ This interpretation should be treated with caution due to the small sample size (n = 27), which may not be representative of the population and could affect the power of the tests.
➢The histogram of regression standardized residuals appeared to be
normally distributed, with a mean of approximately 8.74E-15 (a value
close to zero) and a standard deviation of 0.920. The normal P-P plot
of regression standardized residuals showed points that closely
followed the expected line, suggesting that the residuals were
normally distributed. The normality of residuals suggested that the
assumptions for multiple regression were met, lending credibility to
the model. However, the relatively small sample size warrants caution
in generalizing these findings.
Homoscedasticity:
➢Dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s).
➢If the data also exhibit normality and linearity, this assumption is usually satisfied.
➢If a variable has this property, it means that the DV exhibits consistent/equal variance across the range of
predictor variable(s).
➢A simple way to assess whether a relationship is homoscedastic is a scatter plot with the IV on the x-axis and the DV (or the residuals) on the y-axis. If the vertical spread of the points is roughly constant across the range of the IV, we have homoscedasticity; if the spread fans out or narrows across that range, the relationship is heteroscedastic.
➢Homoscedasticity and heteroscedasticity are both related to linear regression. Homoscedasticity refers to the
condition where the variance of the error term around the regression line is constant for all levels of the
predictor. On the other hand, heteroscedasticity is the condition where the variance of the error term varies
across levels of the predictor. It's important to assess whether the residuals (the differences between the
observed and predicted values) of a regression model exhibit homoscedasticity or heteroscedasticity to ensure
its validity and reliability.
➢Homoscedasticity refers to the assumption that the variance of the errors is constant across all levels of the
independent variables. This is a key assumption in regression analysis because it ensures that the statistical
tests for significance are valid.
➢Imagine you are a researcher studying how different factors affect house prices. You
decide to use multiple regression, where the house price is the dependent variable, and
the factors you're looking at are the size of the house (in square feet), the number of
bedrooms, and the age of the house.
➢Homoscedasticity means that the spread or variability of house prices around the
predicted prices (based on your model) should be roughly the same, regardless of the
size of the house, the number of bedrooms, or the age of the house.
➢When you plot the residuals (the differences between the observed house prices and the
prices predicted by your model) against the predicted house prices, you should see a
random scatter of points, with no clear pattern and roughly equal spread across all levels
of predicted prices. Whether the predicted price is $100,000 or $500,000, the spread of
the residuals around each predicted value should be similar.
➢Example of Heteroscedasticity (the opposite of homoscedasticity): If you notice that for cheaper houses (e.g., predicted price around $100,000) the residuals are very tightly clustered (small variability), but for more expensive houses (e.g., predicted price around $500,000) the residuals are much more spread out (large variability), this is a sign of heteroscedasticity. It means the error variance is not constant.
[Plots for detection: scatterplot of standardized (Z) residuals / boxplot]
➢Nonmetric Independent Variables: In these analyses (e.g., ANOVA and MANOVA) the focus
now becomes the equality of the variance (single dependent variable) or the variance/covariance
matrices (multiple dependent variables) across the groups formed by the nonmetric independent
variables. The equality of variance/covariance matrices is also seen in discriminant analysis, but
in this technique the emphasis is on the spread of the independent variables across the groups
formed by the nonmetric dependent measure.
➢Graphical Tests of Equal Variance Dispersion: The test of homoscedasticity for two metric
variables is best examined graphically. Departures from an equal dispersion are shown by such
shapes as cones (small dispersion at one side of the graph, large dispersion at the opposite side)
or diamonds (a large number of points at the center of the distribution). Boxplots work well to
represent the degree of variation between groups formed by a categorical variable. The length of
the box and the whiskers each portray the variation of data within that group. Thus,
heteroscedasticity would be portrayed by substantial differences in the length of the boxes and whiskers between groups, representing the dispersion of observations in each group.
➢The statistical tests for equal variance dispersion assess the equality of variances within groups
formed by nonmetric variables. The most common test, the Levene test, is used to assess
whether the variances of a single metric variable are equal across any number of groups. If more
than one metric variable is being tested, so that the comparison involves the equality of
variance/covariance matrices, Box's M test is applicable.
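A sketch of the Levene test in Python with scipy, using hypothetical group values:

```python
# Sketch: Levene's test for equal variances across groups (hypothetical data).
from scipy import stats

group_a = [6.8, 7.1, 7.4, 7.9, 8.2]   # e.g., prices in group A
group_b = [6.5, 7.0, 7.3, 7.5, 8.4]   # e.g., prices in group B
stat, p = stats.levene(group_a, group_b)
print(stat, p)  # p < .05 would suggest unequal variances
```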
➢Heteroscedastic variables can be remedied through data transformations similar to those used to achieve normality. As mentioned earlier, heteroscedasticity is often the result of non-normality of one of the variables, and correction of the non-normality also remedies the unequal dispersion of variance.
➢We should also note that the issue of heteroscedasticity can be remedied directly in some statistical techniques without the need for transformation. For example, in multiple regression the standard errors can be corrected for heteroscedasticity to produce heteroscedasticity-consistent standard errors.
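A sketch of requesting heteroscedasticity-consistent (HC) standard errors in Python with statsmodels; the formula and column names are hypothetical:

```python
# Sketch: heteroscedasticity-consistent (HC3) standard errors in statsmodels.
# The file and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

wine = pd.read_csv("wine.csv")
model = smf.ols("Price ~ AGST + HarvestRain", data=wine).fit(cov_type="HC3")
print(model.summary())  # same coefficients, robust standard errors
```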
Homoscedasticity Write Up
One statistical check: regress the (absolute) residuals, as the DV, on the selected IVs, then inspect the ANOVA significance for the hypothesis of equal variance.
➢Linearity: an implicit assumption of all multivariate techniques based on correlational measures of association (e.g., multiple regression, factor analysis).
➢Many scatterplot programs can overlay a straight line depicting the linear characteristics of the relationship.
Linearity
• Linearity refers to the consistent slope of change that represents the relationship between the IV and the DV
• If the relationship between the IV and the DV is radically inconsistent, then it will not be captured well by linear techniques
➢An alternative approach is to run a simple regression analysis and to examine the residuals. A third approach is to explicitly model a nonlinear relationship by testing alternative model specifications (also known as curve fitting) that reflect the nonlinear elements.
➢If a nonlinear relationship is detected, the most direct approach is to transform one or both variables to achieve linearity. A number of available transformations are discussed later in this chapter. An alternative to data transformation is the creation of new variables (e.g., polynomial terms) to represent the nonlinear portion of the relationship; these additional variables can then be used within linear models.
Linearity Testing Methods
1. Graphical Method: scatter plot / matrix scatter plot with trend line
2. Statistical Method: ANOVA test for linearity and deviation from linearity
H0: There is no significant deviation from linearity.
H1: There is a significant deviation from linearity.
[Figure: example scatterplots labeled Good (linear) and Bad (nonlinear)]
Linearity Write Up
A linearity test was conducted to examine the relationship between the auction price of wine and the amount of winter rain. The analysis utilized an ANOVA framework to assess the linearity of the relationship and the deviation from linearity. The sample size for the test was 26. The results of the ANOVA indicated a linear relationship between the auction price of the wine and winter rain: the test for deviation from linearity was not significant, F(24, 25) = 1.883, p = .527, so the assumption of linearity was not violated.
Multicollinearity
• Multicollinearity is not desirable in regression (but shared variance is desirable in factor analysis!)
• It means that independent variables are too highly correlated with each other and share too much variance
• It reduces the accuracy of the coefficient estimates and inflates their standard errors (Hair)
• How much unique variance does the black circle actually account for?
Multicollinearity Testing Methods
1. Graphical Method: scatter plot (IV vs. IV)
Detecting Multicollinearity
• IV-to-IV correlation above 0.7 is bad
• An easy way to check is the Variance Inflation Factor (VIF): regress each IV on all the remaining IVs, compute VIF = 1 / (1 − R²) from that regression, and repeat with each IV swapped in as the dependent variable in turn
• The rules of thumb for the VIF are as follows:
  • VIF < 3: no problem
  • VIF > 3: potential problem
  • VIF > 5: very likely problem
  • VIF > 10: definitely a problem
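A sketch of the VIF computation in Python with statsmodels (hypothetical file and column names):

```python
# Sketch: VIF per independent variable with statsmodels (hypothetical names).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

wine = pd.read_csv("wine.csv")
X = sm.add_constant(wine[["AGST", "HarvestRain", "Age", "WinterRain"]])
for i, name in enumerate(X.columns):
    if name != "const":                 # skip the intercept column
        print(name, variance_inflation_factor(X.values, i))
```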
Handling Multicollinearity
Loyalty 2 and Loyalty 3 seem to be too similar in both of these tests; dropping Loyalty 2 fixed the problem.
Multicollinearity Write Up
In examining the predictors of the auction price of wine, a multiple regression analysis was
conducted. The variance inflation factor (VIF) and tolerance statistics were inspected to assess
multicollinearity among the independent variables. Generally, a VIF above 10 or tolerance below
0.1 might indicate serious multicollinearity concerns (James, Witten, Hastie, & Tibshirani, 2013).
In the current model, the VIF values ranged from 1.103 for Harvest Rain to 1.247 for Average
Growing Season Temperature, and tolerance values ranged from 0.802 to 0.907, suggesting that
multicollinearity is not a concern for this set of predictors. Each independent variable appears to
contribute unique information in predicting the auction price of the wine, without unduly inflating the standard errors of the coefficient estimates.
Transformations Related to Specific Relationship Types:
➢Log-log: taking the log of both X and Y gives coefficients interpretable as the percentage change in Y for a percentage change in X (an elasticity).
➢The researcher is faced with what may seem to be an impossible task: satisfy all of these statistical assumptions. Note that even though these statistical assumptions are important, the researcher must use judgment in how to interpret the tests for each assumption and when to apply remedies.
➢Even analyses with small sample sizes can withstand small, but significant, violations of the assumptions; the researcher must keep the research question of interest in view, striking a balance between the need to satisfy the assumptions and the robustness of the technique.
Multicollinearity:
➢Remove Highly Correlated Predictors: if two variables are highly correlated, consider removing one of them.
➢Principal Component Analysis (PCA): PCA or factor analysis can reduce the dimensionality of your data, replacing correlated predictors with uncorrelated components.
➢Regularization Techniques: methods like ridge regression or the lasso can help in handling multicollinearity.
Autocorrelation:
➢Adding Lag Variables: In time series data, adding lagged versions of the dependent variable as predictors can
help.
➢Differencing: Apply differencing to the time series data (subtracting the previous observation from the current
observation).
➢Use Time Series Models: Consider time series-specific models like ARIMA, which are designed to handle
autocorrelation.
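A sketch of lagging and differencing with pandas, using a hypothetical series:

```python
# Sketch: lag and difference transformations with pandas (hypothetical series).
import pandas as pd

prices = pd.Series([7.0, 7.2, 7.1, 7.5, 7.8])
lagged = prices.shift(1)  # lag-1 version, usable as an extra predictor
diffed = prices.diff()    # first difference: current minus previous value
print(pd.DataFrame({"price": prices, "lag1": lagged, "diff": diffed}))
```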
Overall Strategies:
➢Data Cleaning and Exploration: ensure your data is clean and explore it thoroughly before modeling.
➢Model Selection: sometimes the choice of model itself can be the issue; explore different types of models.
➢Consult Domain Knowledge: understanding the context and domain can sometimes provide insights into why assumptions are violated and how to address them.
➢Always remember that when applying these remedies, it's important to understand their impact
on your model and interpretation. Each remedy can change the nature of your model and should
be chosen carefully based on both statistical reasoning and an understanding of your data and
research question.
Out-of-Sample R²
Variables                                          Model R²   Test R²
Avg Growing Season Temp (AGST)                     0.44       0.79
AGST, Harvest Rain                                 0.71       -0.08
AGST, Harvest Rain, Age                            0.79       0.53
AGST, Harvest Rain, Age, Winter Rain               0.83       0.79
AGST, Harvest Rain, Age, Winter Rain, Population   0.83       0.76
• Better model R² does not necessarily mean better test set R²
• Need more data to be conclusive
• Out-of-sample R² can be negative!
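A sketch of computing out-of-sample R² via a train/test split (synthetic data for illustration; note the test-set R² can indeed come out negative):

```python
# Sketch: out-of-sample R^2 via a train/test split (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(27, 4))                      # hypothetical predictors
y = X @ np.array([0.6, -0.4, 0.02, 0.1]) + rng.normal(0, 0.3, size=27)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print(r2_score(y_te, model.predict(X_te)))        # can be negative on new data
```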
Predictive Ability
• Our wine model had a value of R² = 0.83
➢Harvestify: https://harvestify.herokuapp.com/
https://www.dreamstime.com/diseases-grape-leaves-caused-parasite-insect-bites-living-vines-does-not-affect-grapes-leaf-galls-look-like-image172727918
American Psychological Association (APA) Write Up
A linear regression analysis was conducted to predict auction prices of wine based on the age of the wine,
winter rain, harvest rain, and the average temperature during the growing season. The model significantly
predicted the auction price of wine, F(4, 22) = 26.390, p < .001, and accounted for approximately 82.8% of the
variance in auction prices (R² = .828, Adjusted R² = .796). The standard error of the estimate was .28.
Specifically, each unit increase in the average growing season temperature was associated with a .616 increase
in the auction price of wine (p < .001), holding all other variables constant. Conversely, each unit increase in
harvest rain was associated with a .004 decrease in the auction price of wine (p < .001). Additionally, winter rain
was positively correlated with auction price (B = .001, p = .024), and age of the wine was also positively
correlated with auction price (B = .024, p = .003).
The data assumption tests, conducted through graphical and statistical methods, indicated that non-linearity, non-normality,
heteroscedasticity, and multicollinearity were not a concern. These results suggest that the average growing
season temperature and the age of the wine are the strongest predictors of the auction price of wine, with the
average growing season temperature having the most substantial impact. However, it is important to note that
while this model is statistically significant, the practical significance should be considered alongside other
factors that might influence wine prices in auction settings.
The Results
• Parker:
• 1986 is “very good to sometimes exceptional”
• Ashenfelter:
• 1986 is mediocre
• 1989 will be “the wine of the century” and 1990 will be even better!
• In wine auctions,
• 1989 sold for more than twice the price of 1986
• 1990 sold for even higher prices!
• Later, Ashenfelter predicted 2000 and 2003 would be great
• Parker has stated that “2000 is the greatest vintage Bordeaux has ever produced”
The Analytics Edge
• A linear regression model with only a few variables can predict wine
prices well