
THE STATISTICAL SOMMELIER

An Introduction to Linear Regression


Bordeaux Wine

• Large differences in price and quality between years, although wine is produced in a similar way
• Meant to be aged, so it is hard to tell if wine will be good when it is on the market
• Expert tasters predict which ones will be good
• Can analytics be used to come up with a different system for judging wine?
1
Predicting the Quality of Wine

• March 1990 – Orley Ashenfelter, a Princeton University economics professor, claims he can predict wine quality without tasting the wine

2
The New York Times: https://www.nytimes.com/2023/12/05/science/bordeaux-red-wine-estate-machine-learning.html
Building a Model
• Ashenfelter used a method called linear regression
• Predicts an outcome variable, or dependent variable
• Predicts using a set of independent variables

• Dependent variable: typical price in 1990-1991 wine auctions (approximates quality)

• Independent variables:
• Age – older wines are more expensive
• Weather
• Average Growing Season Temperature
• Harvest Rain
• Winter Rain
3
The Data (1952 – 1978)
[Four scatter plots of (Logarithm of) Price against Age of Wine (Years), Avg Growing Season Temp (Celsius), Harvest Rain (mm), and Winter Rain (mm)]

4
The Expert’s Reaction
Robert Parker, the world's most influential wine expert:

"Ashenfelter is an absolute total sham"

"rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director"

5
One-Variable Linear Regression
[Scatter plot of (Logarithm of) Price vs. Avg Growing Season Temp (Celsius)]

1
The Regression Model
• One-variable regression model:

  y_i = β0 + β1 x_i + ε_i

  y_i = dependent variable (wine price) for the ith observation
  x_i = independent variable (temperature) for the ith observation
  ε_i = error term for the ith observation
  β0 = intercept coefficient
  β1 = regression coefficient for the independent variable

• The best model (choice of coefficients) has the smallest error terms

2
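As a minimal illustration of how the coefficients β0 and β1 can be estimated by least squares, here is a short Python sketch using NumPy; the temperature and price values are hypothetical placeholders, not the actual Bordeaux data.

```python
import numpy as np

# Hypothetical data (not the actual Bordeaux observations)
x = np.array([15.0, 15.8, 16.2, 16.7, 17.1])   # avg growing season temperature (Celsius)
y = np.array([6.6, 7.1, 7.3, 7.9, 8.2])        # (logarithm of) price

# Closed-form least-squares estimates:
# beta1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2), beta0 = mean(y) - beta1 * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

residuals = y - (beta0 + beta1 * x)             # estimated error terms
print(beta0, beta1, np.sum(residuals ** 2))     # coefficients and SSE
```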
Selecting the Best Model
[Scatter plot of (Logarithm of) Price vs. Avg Growing Season Temp (Celsius)]

3
Selecting the Best Model
[Scatter plot of (Logarithm of) Price vs. Avg Growing Season Temp (Celsius) with three candidate lines: SSE = 10.15, SSE = 6.03, SSE = 5.73]

4
Other Error Measures
• SSE can be hard to interpret
• Depends on N
• Units are hard to understand

• Root-Mean-Square Error (RMSE):

  RMSE = √(SSE / N)

• Normalized by N, in the units of the dependent variable

5
R2
• Compares the best model to a "baseline" model
• The baseline model does not use any variables
• Predicts the same outcome (price) regardless of the independent variable (temperature)

[Scatter plot of (Logarithm of) Price vs. Avg Growing Season Temp (Celsius) showing the baseline model]
6
R2
[Scatter plot of (Logarithm of) Price vs. Avg Growing Season Temp (Celsius): SSE = 5.73 for the regression line, SST = 10.15 for the baseline]

7
Interpreting R2

R2 = 1 − SSE / SST
• R2 captures value added from using a model
• R2 = 0 means no improvement over baseline
• R2 = 1 means a perfect predictive model
• Unitless and universally interpretable
• Can still be hard to compare between problems
• Good models for easy problems will have R2 ≈ 1
• Good models for hard problems can still have R2 ≈ 0
8
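To make SSE, RMSE, and R2 concrete, the following Python sketch computes all three from observed values and model predictions; the numbers are hypothetical, not the wine dataset.

```python
import numpy as np

def regression_metrics(y, y_pred):
    """Return SSE, RMSE, and R^2 for observed values y and predictions y_pred."""
    sse = np.sum((y - y_pred) ** 2)        # sum of squared errors of the model
    sst = np.sum((y - y.mean()) ** 2)      # baseline model: predict the mean of y everywhere
    rmse = np.sqrt(sse / len(y))           # RMSE = sqrt(SSE / N)
    r2 = 1 - sse / sst                     # R^2 = 1 - SSE / SST
    return sse, rmse, r2

# Hypothetical observed log prices and model predictions (not the real wine data)
y = np.array([6.6, 7.1, 7.3, 7.9, 8.2])
y_pred = np.array([6.7, 7.0, 7.4, 7.8, 8.1])
print(regression_metrics(y, y_pred))
```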
Available Independent Variables
• So far, we have only used the Average Growing
Season Temperature to predict wine prices

• Many different independent variables could be used


• Average Growing Season Temperature
• Harvest Rain
• Winter Rain
• Age of Wine (in 1990)
• Population of France

1
Multiple Linear Regression
• Using each variable on its own:
• R2 = 0.44 using Average Growing Season Temperature
• R2 = 0.32 using Harvest Rain
• R2 = 0.22 using France Population
• R2 = 0.20 using Age
• R2 = 0.02 using Winter Rain

• Multiple linear regression allows us to use all of these variables to improve our predictive ability

2
The Regression Model
• Multiple linear regression model with k variables:

  y_i = β0 + β1 x_1,i + β2 x_2,i + … + βk x_k,i + ε_i

  y_i = dependent variable (wine price) for the ith observation
  x_j,i = jth independent variable for the ith observation
  ε_i = error term for the ith observation
  β0 = intercept coefficient
  βj = regression coefficient for the jth independent variable

• Best model coefficients selected to minimize SSE

3
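A minimal sketch of fitting a multiple linear regression by minimizing SSE, using NumPy's least-squares solver; the predictor matrix and prices below are illustrative placeholders, not the 1952-1978 observations.

```python
import numpy as np

# Hypothetical predictors: columns are AGST, harvest rain, age, winter rain
# (illustrative numbers only, not the actual 1952-1978 observations)
X = np.array([
    [16.5, 100.0, 25.0, 600.0],
    [17.1,  80.0, 20.0, 550.0],
    [15.8, 180.0, 15.0, 700.0],
    [16.2, 130.0, 10.0, 500.0],
    [17.3,  70.0,  5.0, 650.0],
    [15.9, 150.0, 30.0, 580.0],
])
y = np.array([7.5, 7.9, 6.8, 7.1, 8.0, 7.0])     # hypothetical log prices

# Add an intercept column and solve for the coefficients that minimize SSE
X1 = np.column_stack([np.ones(len(X)), X])
beta, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)                                      # [beta0, beta1, ..., betak]
```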
Adding Variables

Variables R2
Average Growing Season Temperature (AGST) 0.44
AGST, Harvest Rain 0.71
AGST, Harvest Rain, Age 0.79
AGST, Harvest Rain, Age, Winter Rain 0.83
AGST, Harvest Rain, Age, Winter Rain, Population 0.83

• Adding more variables can improve the model


• Diminishing returns as more variables are added

4
Selecting Variables
• Not all available variables should be used
• Each new variable requires more data
• Causes overfitting: high R2 on data used to create model,
but bad performance on unseen data

• We will see later how to appropriately choose variables to remove

5
Understanding the Model and Coefficients
We remove variables based on significance, not sign

1
Correlation
A measure of the linear relationship between variables
+1 = perfect positive linear relationship
0 = no linear relationship
-1 = perfect negative linear relationship

2
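A short sketch of computing a Pearson correlation in Python; the winter rain and log price values are hypothetical, not the actual data.

```python
import numpy as np

# Hypothetical values for two variables (e.g., winter rain and log price)
winter_rain = np.array([600.0, 550.0, 700.0, 500.0, 650.0])
log_price = np.array([7.5, 7.9, 6.8, 7.1, 8.0])

# Pearson correlation: covariance divided by the product of the standard deviations
r = np.corrcoef(winter_rain, log_price)[0, 1]
print(r)    # values near +1 / -1 indicate a strong linear relationship, near 0 a weak one
```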
Examples of Correlation

[Scatter plot of (Logarithm of) Price vs. Winter Rain (mm)]

3
Examples of Correlation

[Scatter plot of Avg Growing Season Temp (Celsius) vs. Harvest Rain (mm)]

4
Examples of Correlation

[Scatter plot of Population of France (thousands) vs. Age of Wine (Years)]

5
Testing the Assumptions of Multivariate Analysis: Some techniques are less affected by violating certain

assumptions, which is termed robustness, but in all cases meeting some of the assumptions will be critical to a

successful analysis. Thus, it is necessary to understand the role played by each assumption for every

multivariate technique.

➢In almost all instances, the multivariate procedures will estimate the multivariate model and produce results

even when the assumptions are severely violated. Thus, the researcher must be aware of any assumption

violations and the implications they may have for the estimation process or the interpretation of the results.

➢Multivariate analysis requires that the assumptions underlying the statistical techniques be tested twice: first

for the separate variables, akin to the tests for a univariate analysis, and second for the multivariate model

variate, which acts collectively for the variables in the analysis and thus must meet the same assumptions as

individual variables.

➢Four assumptions potentially affect every univariate and multivariate statistical technique.

25
Data Assumptions:

❑ Normality (Yes)

❑ Homoscedasticity (Yes)

❑ Linearity (Yes)

❑ Multicollinearity (No)
Normality

➢Normality refers to the distributional assumptions of a variable.

➢Parametric Test: t tests and F tests assume normal distributions

➢Normality issues affect small sample sizes (<50) much more than large sample sizes (>200)

27
Distribution
Let’s take a look and try it out

To check distribution in SPSS:


1. Analyze,
2. Explore,
3. Plots: Histogram with normality plot

28
Normality: The most fundamental assumption in multivariate analysis is normality, referring to the shape
of the data distribution for an individual metric variable and its correspondence to the normal distribution.
➢If the variation from the normal distribution is sufficiently large, all resulting statistical tests
are invalid, because normality is required to use the F and t statistics. Both the univariate
and the multivariate statistical methods discussed in this text are based on the assumption of
univariate normality, with the multivariate methods also assuming multivariate normality.
➢If a variable is multivariate normal, it is also univariate normal. However, the reverse is not
necessarily true (two or more univariate normal variables are not necessarily multivariate
normal).
➢Thus, a situation in which all variables exhibit univariate normality will help achieve, although
not guarantee, multivariate normality. Multivariate normality is more difficult to test, but
specialized tests are available in the techniques most affected by departures from
multivariate normality.
29
➢In most cases assessing and achieving univariate normality for all variables is sufficient, and we will
address multivariate normality only when it is especially critical. Even though large sample sizes tend to
diminish the detrimental effects of nonnormality, the researcher should always assess the normality for
all metric variables included in the analysis.
➢In most programs, the skewness and kurtosis of a normal distribution are given values of zero. Then,
values above or below zero denote departures from normality. For example, negative kurtosis values
indicate a platykurtic (flatter) distribution, whereas positive values denote a leptokurtic (peaked)
distribution. Likewise, positive skewness values indicate a distribution shifted to the left (with a longer right tail), and negative values denote a rightward shift (with a longer left tail).
➢Sample size has the effect of increasing statistical power by reducing sampling error. It results in a similar effect here, in that larger sample sizes reduce the detrimental effects of non-normality. In small samples of 50 or fewer observations, and especially if the sample size is less than 30 or so, significant departures from normality can have a substantial impact on the results. As the sample sizes become large, the researcher can be less concerned about non-normal variables, except as they might lead to other assumption violations that do have an impact in other ways.
30
Normality Testing Methods

1. Graphical Method: Histogram with normal curve / Quantile-Quantile (Q-Q) Plot / P-P (Probability-Probability) Plot

2. Statistical Method: Shapiro-Wilk Test / Kolmogorov–Smirnov (K-S) Test

31
Graphical Analyses: The simplest diagnostic test for normality is a visual check of

the histogram that compares the observed data values with a distribution

approximating the normal distribution.

Although appealing because of its simplicity, this method is problematic for smaller

samples, where the construction of the histogram (e.g., the number of categories or

the width of categories) can distort the visual portrayal to such an extent that the

analysis is useless. A more reliable approach is the normal probability plot or

Quantile-Quantile (Q-Q) plot, which compares the cumulative distribution of actual

data values with the cumulative distribution of a normal distribution.

32
33
If either calculated z value exceeds the specified critical value, then the distribution is non-normal in terms of that characteristic. The critical value is from a z distribution, based on the significance level we desire. The most commonly used critical values are ±2.58 (.01 significance level) and ±1.96, which corresponds to a .05 error level. With these simple tests, the researcher can easily assess the degree to which the skewness and peakedness of the distribution vary from the normal distribution.
The two most common are the Shapiro-Wilk test and a modification of the Kolmogorov–Smirnov test. Each calculates the level of significance for the differences from a normal distribution. The researcher should always remember that tests of significance are less useful in small samples (fewer than 30) and quite sensitive in large samples (exceeding 1,000 observations).
34
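The z statistics referred to above are conventionally computed as skewness divided by √(6/N) and excess kurtosis divided by √(24/N); the slide containing the formulas did not survive extraction, so the sketch below assumes those standard forms and pairs them with SciPy's Shapiro-Wilk and Kolmogorov-Smirnov tests on simulated data.

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(1).normal(loc=7.0, scale=0.6, size=27)   # simulated "prices"

# Assumed standard z statistics for skewness and kurtosis:
# z_skew = skewness / sqrt(6/N), z_kurt = excess kurtosis / sqrt(24/N)
n = len(x)
z_skew = stats.skew(x) / np.sqrt(6.0 / n)
z_kurt = stats.kurtosis(x) / np.sqrt(24.0 / n)     # Fisher (excess) kurtosis
print(abs(z_skew) > 1.96, abs(z_kurt) > 1.96)      # compare against the .05 critical value

# Formal tests: Shapiro-Wilk, and K-S against a normal with estimated parameters (an approximation)
print(stats.shapiro(x))
print(stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))))
```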
Normality Hypothesis: Analysis->DS->Explore->Normality Test

Ho: There is no significant difference between the test data and a normal distribution

H1: There is a significant difference between the test data and a normal distribution

35
Skewness: -1 to +1 or -0.5 to +0.5, or within three times the standard error
Kurtosis: -3 to +3 or -2 to +2, or within three times the standard error

Tests for Skewness and Kurtosis
The values for asymmetry and kurtosis between -2 and +2 are considered acceptable in order to prove normal univariate distribution (George & Mallery, 2010).

• Relaxed rule:
  • Skewness > 1 = positive (right) skewed
  • Skewness < -1 = negative (left) skewed
  • Skewness between -1 and 1 is fine
• Strict rule:
  • Abs(Skewness) > 3 * Std. error = skewed
  • Same for Kurtosis
36
A P-P plot and a Q-Q plot are two types of probability plots used to compare the distribution of a variable with a theoretical distribution. The P-P (probability-probability) plot compares the empirical cumulative probabilities of the data with the cumulative probabilities of the theoretical (e.g., normal) distribution, while the Q-Q (quantile-quantile) plot compares the empirical quantiles with the theoretical quantiles. These plots can help identify the underlying distribution of the data and whether it is normally distributed or not.
Tests for Normality

SPSS
1. Analyze
2. Explore
3. Plots
4. Normality
*Neither of these variables would be considered normally distributed according to the K-S or S-W measures, but a visual inspection shows that role conflict (left) is roughly normal and participation (right) is positively skewed.
So, ALWAYS conduct visual inspections!

40
➢Thus, the researcher should always use both the graphical plots and any statistical

tests to assess the actual degree of departure from normality.

➢A number of data transformations are available to accommodate non-normal

distributions. This session confines the discussion to univariate normality tests and

transformations. However, when we examine other multivariate methods, such as

multivariate regression or multivariate analysis of variance, we discuss tests for

multivariate normality as well. Moreover, many times when non-normality is

indicated, it also contributes to other assumption violations; therefore, remedying

normality first may assist in meeting other statistical assumptions as well.

41
Before and After Transformation
[Histograms of a negatively skewed variable before and after a cubed transformation]

42
Normality Write Up
➢ In assessing the normality of auction prices for wine, both the Kolmogorov-Smirnov and
Shapiro-Wilk tests were utilized. The Kolmogorov-Smirnov test yielded a statistic of .094 with a
significance level of .200, while the Shapiro-Wilk test produced a statistic of .950 with a
significance level of .215. Neither test indicated significant deviations from normality, as p-
values were greater than the conventional alpha level of .05.
➢ The normal Q-Q plot further supports the conclusion of normality. Data points largely adhere
to the line of expected normal distribution, with only slight deviations, suggesting that the
auction prices do not significantly differ from a normal distribution.
➢ Lastly, the histogram of auction prices, with a mean of 7.04 and a standard deviation of .635
for the 27 observed wines, displays a fairly symmetrical distribution around the mean, which is
characteristic of a normal distribution. The overlaid normal curve fits the histogram reasonably
well, indicating that the data are approximately normally distributed.
➢ These statistical tests and visual inspections collectively suggest that the auction prices of wine
in this sample do not depart significantly from a normal distribution. It is, therefore,
reasonable to conclude that the auction prices for wine follow a normal distribution, allowing
for parametric statistical techniques that assume normality to be used in subsequent analyses.
➢ This interpretation should be treated with caution due to the small sample size (n=27), which
may not be representative of the population and could affect the power of the tests. 43
➢The histogram of regression standardized residuals appeared to be
normally distributed, with a mean of approximately 8.74E-15 (a value
close to zero) and a standard deviation of 0.920. The normal P-P plot
of regression standardized residuals showed points that closely
followed the expected line, suggesting that the residuals were
normally distributed. The normality of residuals suggested that the
assumptions for multiple regression were met, lending credibility to
the model. However, the relatively small sample size warrants caution
in generalizing these findings.

44
Homoscedasticity:
➢Dependent variable(s) exhibit equal levels of variance across the range

of predictor variable(s).

➢Homoscedasticity is desirable because the variance of the dependent

variable being explained in the dependence relationship should not be

concentrated in only a limited range of the independent values.

➢If the data exhibit normality and linearity, this assumption will be fulfilled
45
➢If a variable has this property, it means that the DV exhibits consistent/equal variance across the range of
predictor variable(s).

➢A simple way to assess homoscedasticity is to make a scatter plot with the IV on the x-axis and the DV (or the residuals) on the y-axis. If the spread of points is roughly even across the range of the IV, we have homoscedasticity; if the spread widens or narrows systematically (for example, in a cone shape), the relationship is heteroscedastic.

➢Homoscedasticity and heteroscedasticity are both related to linear regression. Homoscedasticity refers to the
condition where the variance of the error term around the regression line is constant for all levels of the
predictor. On the other hand, heteroscedasticity is the condition where the variance of the error term varies
across levels of the predictor. It's important to assess whether the residuals (the differences between the
observed and predicted values) of a regression model exhibit homoscedasticity or heteroscedasticity to ensure
its validity and reliability.

➢Homoscedasticity refers to the assumption that the variance of the errors is constant across all levels of the
independent variables. This is a key assumption in regression analysis because it ensures that the statistical
tests for significance are valid.
46
➢Imagine you are a researcher studying how different factors affect house prices. You
decide to use multiple regression, where the house price is the dependent variable, and
the factors you're looking at are the size of the house (in square feet), the number of
bedrooms, and the age of the house.

➢Homoscedasticity means that the spread or variability of house prices around the
predicted prices (based on your model) should be roughly the same, regardless of the
size of the house, the number of bedrooms, or the age of the house.

➢When you plot the residuals (the differences between the observed house prices and the
prices predicted by your model) against the predicted house prices, you should see a
random scatter of points, with no clear pattern and roughly equal spread across all levels
of predicted prices. Whether the predicted price is $100,000 or $500,000, the spread of
the residuals around each predicted value should be similar.
47
➢Example of Heteroscedasticity (the opposite of homoscedasticity): If you notice that for cheaper houses (e.g., predicted price around $100,000), the residuals are very tightly clustered (small variability), but for more expensive houses (e.g., predicted price around $500,000), the residuals are much more spread out (large variability), this is a sign of heteroscedasticity. It means the error variance is not constant.

➢In multiple regression, if the assumption of homoscedasticity is violated, it can


lead to inefficient estimates and affect the reliability of hypothesis testing. The
standard errors of the coefficients might be inaccurate, leading to incorrect
conclusions about the significance of the independent variables. This is why
checking for homoscedasticity is an important step in regression analysis.
48
➢Let's say your model predicts that a house with 3 bedrooms, 2000 square feet, and 10
years old should be $300,000. But in reality, the house sold for $310,000. The residual
(actual minus predicted) is $10,000.
➢You do this for many houses and plot these residuals. If the spread of these residuals is
roughly the same for houses predicted at $100,000, $300,000, or $500,000, that's
homoscedasticity.
➢If the residuals are small (tight cluster) for low-priced houses but large and spread out
for high-priced houses, this is heteroscedasticity. It indicates that your model's ability to
predict the price accurately varies with the price level, which can be problematic for
drawing reliable conclusions from your model.
➢In simple terms, homoscedasticity means that your model's accuracy or error is not
dependent on the value of the house price you're trying to predict. The reliability and
stability of your model don't change whether you're predicting the price of a modest
home or a luxury mansion.
49
50
Homoscedasticity Testing Methods
1. Graphical Method: Scatter Plot: *Z residual (Y) and Z pred (X)

/Boxplot

2. Statistical Method: Levene Test (Bivariate)/Breusch-Pagan Test &

Box’s M Test (Multivariate)

*Z: Standardized

51
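As a sketch of the statistical and graphical checks listed above, the following Python example fits an OLS model with statsmodels, runs a Breusch-Pagan test, and builds the standardized residuals and predicted values used in the Z residual vs. Z pred scatter plot; the data are simulated.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                            # simulated predictors
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.4, size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()

# Breusch-Pagan: tests whether the squared residuals depend on the predictors;
# a small p-value suggests heteroscedasticity (non-constant error variance)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)

# Graphical check: standardized residuals (Y) against standardized predicted values (X)
z_resid = (model.resid - model.resid.mean()) / model.resid.std()
z_pred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()
# A scatter plot of z_resid vs. z_pred should show a random band with no cone or funnel shape.
```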
➢Nonmetric Independent Variables: In these analyses (e.g., ANOVA and MANOVA) the focus
now becomes the equality of the variance (single dependent variable) or the variance/covariance
matrices (multiple dependent variables) across the groups formed by the nonmetric independent
variables. The equality of variance/covariance matrices is also seen in discriminant analysis, but
in this technique the emphasis is on the spread of the independent variables across the groups
formed by the nonmetric dependent measure.
➢Graphical Tests of Equal Variance Dispersion: The test of homoscedasticity for two metric
variables is best examined graphically. Departures from an equal dispersion are shown by such
shapes as cones (small dispersion at one side of the graph, large dispersion at the opposite side)
or diamonds (a large number of points at the center of the distribution). Boxplots work well to
represent the degree of variation between groups formed by a categorical variable. The length of
the box and the whiskers each portray the variation of data within that group. Thus,
heteroscedasticity would be portrayed by substantial differences in the length of the boxes and whiskers between groups representing the dispersion of observations in each group.
52
➢The statistical tests for equal variance dispersion assess the equality of variances within groups
formed by nonmetric variables. The most common test, the Levene test, is used to assess
whether the variances of a single metric variable are equal across any number of groups. If more
than one metric variable is being tested, so that the comparison involves the equality of
variance/covariance matrices, the Box’s M test is applicable.
➢Heteroscedastic variables can be remedied through data transformations similar to those used to
achieve normality. As mentioned earlier, many times heteroscedasticity is the result of non-
normality of one of the variables, and correction of the non-normality also remedies the unequal
dispersion of variance.
➢We should also note that the issue of heteroscedasticity can be remedied directly in some
statistical techniques without the need for transformation. For example, in multiple regression
the standard errors can be corrected for heteroscedasticity to produce heteroscedasticity-
consistent standard errors
53
Homoscedasticity Write Up

The scatterplot of the regression standardized predicted values against


the regression standardized residuals does not show any obvious
patterns or systematic deviations from the horizontal line at zero, which
suggests that the assumption of homoscedasticity (equal variances) is
met.

55
Note: use the residuals as the DV and run a regression on the selected IVs, then check the ANOVA significance to test the homoscedasticity hypothesis
➢Linearity: An implicit assumption of all multivariate techniques based on

correlational measures of association, including multiple regression, logistic

regression, factor analysis, and structural equation modeling, is linearity. Because

correlations represent only the linear association between variables, nonlinear

effects will not be represented in the correlation value.

➢Many scatterplot programs can show the straight line depicting the linear

relationship, enabling the researcher to better identify any nonlinear

characteristics.

57
58
Linearity

• Linearity refers to the consistent slope of change that represents the relationship

between an IV and a DV.

• If the relationship between the IV and the DV is radically inconsistent, then it will

throw off your SEM analyses as your data is not linear

• Sometimes you achieve this with transformations (log linear).

59
➢An alternative approach is to run a simple regression analysis and to examine the
residuals. A third approach is to explicitly model a nonlinear relationship by the
testing of alternative model specifications (also known as curve fitting) that reflect
the nonlinear elements.
➢If a nonlinear relationship is detected, the most direct approach is to transform one
or both variables to achieve linearity. A number of available transformations are
discussed later in this chapter. An alternative to data transformation is the creation
of new variables to represent the nonlinear portion of the relationship. The process
of creating and interpreting these additional variables, which can be used in all linear relationships, is discussed later.
60
Linearity Testing Methods
1. Graphical Method: Scatter Plot/Matrix Scatter Plot with Trend Line

2. Statistical Method: Test of Linearity / Curvilinear Regression (Curve Estimation)

61
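One way to operationalize the curve-fitting idea above is to compare a linear fit with a model that adds a squared term; the sketch below does this with statsmodels on simulated data (variable names are illustrative).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=80)                        # simulated predictor
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=80)     # simulated (linear) response

# Curve-fitting check: compare a linear model with one that adds a squared term
linear = sm.OLS(y, sm.add_constant(x)).fit()
quadratic = sm.OLS(y, sm.add_constant(np.column_stack([x, x ** 2]))).fit()

# A significant squared term (small p-value) or a clear jump in R^2 is evidence of non-linearity
print(linear.rsquared, quadratic.rsquared)
print(quadratic.pvalues[-1])                           # p-value of the squared term
```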
62
Ho: There is no significant deviation from linearity in the test data
H1: There is a significant deviation from linearity in the test data

Good
Bad
63
64
Linearity Write Up

A linearity test was conducted to examine the relationship between the auction

price of wine and the amount of winter rain. The analysis utilized an ANOVA

framework to assess the linearity of the relationship and the deviation from

linearity. The sample size for the test was 26. The results of the ANOVA indicated

that the linearity between the auction price of the wine and winter rain. the test

for deviation from linearity was not significant, F(24, 25) = 1.883, p = .527,

indicating that there is no evidence of a non-linear relationship.

65
Multicollinearity
• Multicollinearity is not desirable in regressions (but desirable in factor analysis!).

• It means that independent variables are too highly correlated with each other and
share too much variance

• Influences the accuracy of estimates for DV and inflates error terms for DV
(Hair).

• How much unique variance does the black circle actually account for?

66
67
Multicollinearity Testing Methods
1. Graphical Method: Scatter Plot

2. Statistical Method: Correlation above 0.7 between IVs / Variance Inflation Factor (VIF) above 10 / Tolerance less than 0.1

68
Detecting Multicollinearity
• IV to IV: correlation above 0.7 is bad
• An easy way to check this is to calculate a Variance Inflation Factor (VIF) for each independent variable: run a regression using one of the IVs as the dependent variable, regressing it on all the remaining IVs, then swap out the IVs one at a time.
• The rules of thumb for the VIF are as follows:
  • VIF < 3; no problem
  • VIF > 3; potential problem
  • VIF > 5; very likely a problem
  • VIF > 10; definitely a problem
69
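A small sketch of computing VIF values with statsmodels, where one predictor is deliberately constructed to be highly correlated with another so that its VIF is inflated; the data are simulated.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)        # deliberately correlated with x1
x3 = rng.normal(size=100)
X = sm.add_constant(np.column_stack([x1, x2, x3]))     # IVs plus an intercept column

# VIF for each IV: 1 / (1 - R^2) from regressing that IV on all the remaining IVs
for j in range(1, X.shape[1]):                         # skip the constant column
    print(f"VIF for x{j}: {variance_inflation_factor(X, j):.2f}")
```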
Handling Multicollinearity

Loyalty 2 and Loyalty 3 seem to be too similar in both of these tests.

Dropping Loyalty 2 fixed the problem.

70
Multicollinearity Write Up

In examining the predictors of the auction price of wine, a multiple regression analysis was

conducted. The variance inflation factor (VIF) and tolerance statistics were inspected to assess

multicollinearity among the independent variables. Generally, a VIF above 10 or tolerance below

0.1 might indicate serious multicollinearity concerns (James, Witten, Hastie, & Tibshirani, 2013).

In the current model, the VIF values ranged from 1.103 for Harvest Rain to 1.247 for Average

Growing Season Temperature, and tolerance values ranged from 0.802 to 0.907, suggesting that

multicollinearity is not a concern for this set of predictors. Each independent variable appears to

contribute unique information in predicting the auction price of the wine, without unduly

influencing the coefficients of the other variables in the model.


72
Fixing Issues

• Fix flat distribution with:


• Inverse: 1/X

• Fix negative skewed distribution with:


• Squared: X*X
• Cubed: X*X*X

• Fix positive skewed distribution with:


• Square root: SQRT(X)
• Logarithm: LG10(X)

73
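The transformations listed above can be applied directly with NumPy; this sketch uses a simulated positively skewed variable and checks that the log transform reduces the skew (appropriate value ranges, such as x > 0, are assumed).

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(4).lognormal(mean=0.0, sigma=0.8, size=200)   # positively skewed

# Transformations from the list above (each assumes an appropriate value range):
inverse = 1.0 / x            # for flat distributions (requires x != 0)
squared = x ** 2             # for negative skew
cubed = x ** 3               # for stronger negative skew
sqrt_x = np.sqrt(x)          # for positive skew (requires x >= 0)
log_x = np.log10(x)          # for stronger positive skew (requires x > 0)

print(stats.skew(x), stats.skew(log_x))   # the log transform should reduce the positive skew
```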
Transformations Related to Specific Relationship Types :

➢Log-linear: a log of the Y variable with an untransformed X variable provides an

estimate of the percentage change in Y given a one unit change in X.

➢Linear-log: a log of the X variable with an untransformed Y variable provides an

estimate of the unit change in Y for a percentage change in X

➢Log-log: a log of both X and Y provides the ratio of the percentage change of Y

given a percentage change in X, the definition of elasticity

74
➢The researcher is faced with what may seem to be an impossible task: satisfy all

of these statistical assumptions or risk a biased and flawed analysis. We want to

note that even though these statistical assumptions are important, the researcher

must use judgment in how to interpret the tests for each assumption and when to

apply remedies.

➢Even analyses with small sample sizes can withstand small, but significant,

departures from normality. What is more important for the researcher is to

understand the implications of each assumption with regard to the technique of

interest, striking a balance between the need to satisfy the assumptions versus the

robustness of the technique and research context.


75
Violation of Normality:
➢Transformation: Apply transformations to your data, such as log, square root, or inverse transformations,
which can help in normalizing the distribution of residuals.
➢Non-parametric Methods: If transformations don’t work, consider using non-parametric regression techniques
that don’t assume normality.
Violation of Linearity:
➢Transformations: Similar to addressing normality, transforming either the dependent or independent variables
can help. Adding Polynomial Terms: Include squared or cubic terms of the predictors (polynomial regression)
to capture the non-linear relationship.
➢Segmentation: Sometimes, breaking the data into different segments (where the linear assumption holds) can
be effective.
Violation of Homoscedasticity (Heteroscedasticity):
➢Transformations: Applying a transformation to the dependent variable can stabilize the variance.
➢Weighted Regression: Use weighted least squares regression, where more weight is given to observations
with smaller variances.
76
Multicollinearity:

➢Remove Highly Correlated Predictors: If two variables are highly correlated, consider removing one of them.

➢Principal Component Analysis (PCA): PCA or Factor Analysis can reduce the dimensionality of your data,

combining highly correlated variables into a smaller set of uncorrelated components.

➢Regularization Techniques: Methods like Ridge Regression or Lasso can help in handling multicollinearity.

Autocorrelation:

➢Adding Lag Variables: In time series data, adding lagged versions of the dependent variable as predictors can

help.

➢Differencing: Apply differencing to the time series data (subtracting the previous observation from the current

observation).

➢Use Time Series Models: Consider time series-specific models like ARIMA, which are designed to handle

autocorrelation.

77
Overall Strategies:

➢Data Cleaning and Exploration: Ensure your data is clean and explore it thoroughly before

modeling. Outliers or errors in data can often cause violations of assumptions.

➢Model Selection: Sometimes, the choice of model itself can be the issue. Explore different types

of models that are more suited to your data.

➢Consult Domain Knowledge: Understanding the context and domain can sometimes provide

insights into why assumptions are violated and how to address them.

➢Always remember that when applying these remedies, it's important to understand their impact

on your model and interpretation. Each remedy can change the nature of your model and should

be chosen carefully based on both statistical reasoning and an understanding of your data and

research question.

78
Out-of-Sample R2

Variables                                             Model R2   Test R2
Avg Growing Season Temp (AGST)                        0.44       0.79
AGST, Harvest Rain                                    0.71      -0.08
AGST, Harvest Rain, Age                               0.79       0.53
AGST, Harvest Rain, Age, Winter Rain                  0.83       0.79
AGST, Harvest Rain, Age, Winter Rain, Population      0.83       0.76
• Better model R2 does not necessarily mean better test set R2
• Need more data to be conclusive
• Out-of-sample R2 can be negative!
2
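To show how out-of-sample R2 is computed (and why it can be negative), here is a sketch that fits a model on a training split and evaluates R2 on held-out observations; the data are simulated, not the wine test set.

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = 1 - SSE / SST, where SST uses the mean of the evaluated set as the baseline."""
    sse = np.sum((y - y_pred) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1 - sse / sst

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))                           # simulated predictors
y = 1.0 + X @ np.array([0.6, -0.4, 0.2]) + rng.normal(scale=0.5, size=30)

# Fit on the first 25 observations, hold out the last 5 as a "test set"
X_train, X_test, y_train, y_test = X[:25], X[25:], y[:25], y[25:]
X1_train = np.column_stack([np.ones(len(X_train)), X_train])
beta, _, _, _ = np.linalg.lstsq(X1_train, y_train, rcond=None)

X1_test = np.column_stack([np.ones(len(X_test)), X_test])
print(r_squared(y_train, X1_train @ beta))             # in-sample (model) R^2
print(r_squared(y_test, X1_test @ beta))               # out-of-sample (test) R^2, can be negative
```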
Predictive Ability
• Our wine model had a value of R2 = 0.83

• Tells us our accuracy on the data that we used to build the model

• But how well does the model perform on new data?


• Bordeaux wine buyers profit from being able to predict
the quality of a wine years before it matures

1
Harvestify: https://harvestify.herokuapp.com/

https://www.dreamstime.com/diseases-grape-leaves-caused-parasite-insect-bites-living-vines-does-not-affect-grapes-leaf-galls-look-like-image172727918
American Psychological Association (APA) Write Up

1
A linear regression analysis was conducted to predict auction prices of wine based on the age of the wine,
winter rain, harvest rain, and the average temperature during the growing season. The model significantly
predicted the auction price of wine, F(4, 22) = 26.390, p < .001, and accounted for approximately 82.8% of the
variance in auction prices (R² = .828, Adjusted R² = .796). The standard error of the estimate was .28.
Specifically, each unit increase in the average growing season temperature was associated with a .616 increase
in the auction price of wine (p < .001), holding all other variables constant. Conversely, each unit increase in
harvest rain was associated with a .004 decrease in the auction price of wine (p < .001). Additionally, winter rain
was positively correlated with auction price (B = .001, p = .024), and age of the wine was also positively
correlated with auction price (B = .024, p = .003).
The data assumption tests through graphical and statistical methods indicated that non-linearity, non-normality, heteroscedasticity, and multicollinearity were not a concern. These results suggest that the average growing
season temperature and the age of the wine are the strongest predictors of the auction price of wine, with the
average growing season temperature having the most substantial impact. However, it is important to note that
while this model is statistically significant, the practical significance should be considered alongside other
factors that might influence wine prices in auction settings.
The Results

• Parker:
• 1986 is “very good to sometimes exceptional”
• Ashenfelter:
• 1986 is mediocre
• 1989 will be “the wine of the century” and 1990 will be even better!
• In wine auctions,
• 1989 sold for more than twice the price of 1986
• 1990 sold for even higher prices!
• Later, Ashenfelter predicted 2000 and 2003 would be great

• Parker has stated that “2000 is the greatest vintage Bordeaux has ever produced”
The Analytics Edge

• A linear regression model with only a few variables can predict wine
prices well

• In many cases, outperforms wine experts’ opinions

• A quantitative approach to a traditionally qualitative problem
