Regression
The New York Times: https://www.nytimes.com/2023/12/05/science/bordeaux-red-wine-estate-machine-learning.html
Building a Model
• Ashenfelter used a method called linear regression
• Predicts an outcome variable, or dependent variable
• Predicts using a set of independent variables
• Independent variables:
• Age – older wines are more expensive
• Weather
• Average Growing Season Temperature
• Harvest Rain
• Winter Rain
The Data (1952 – 1978)
[Figure: four scatterplots of (Logarithm of) Price against Age of Wine (Years), Avg Growing Season Temp (Celsius), Harvest Rain (mm), and Winter Rain (mm)]
The Expert’s Reaction
Robert Parker, the world's most influential wine expert:
“Ashenfelter is an absolute total sham”
One-Variable Linear Regression
[Figure: scatterplot of (Logarithm of) Price (6.5 to 8.5) illustrating a one-variable linear regression]
The Regression Model
• One-variable regression model
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
• $y_i$ = dependent variable (wine price) for the $i$th observation
• $x_i$ = independent variable (temperature) for the $i$th observation
• $\epsilon_i$ = error term for the $i$th observation
• $\beta_0$ = intercept coefficient
• $\beta_1$ = regression coefficient for the independent variable
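As an illustration beyond the slides' own workflow, a one-variable model like this can be fit in Python with statsmodels; the file and column names ("wine.csv", "AGST", "Price") are hypothetical stand-ins for the Bordeaux data:

```python
# Sketch only: fitting the one-variable model above with statsmodels.
# "wine.csv", "AGST", and "Price" are hypothetical stand-ins for the data.
import pandas as pd
import statsmodels.api as sm

wine = pd.read_csv("wine.csv")
X = sm.add_constant(wine["AGST"])        # adds the intercept column (beta_0)
model = sm.OLS(wine["Price"], X).fit()   # least-squares estimates of beta_0, beta_1
print(model.params)                      # intercept and slope
print(model.rsquared)
```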
Selecting the Best Model
[Figure: scatterplots of (Logarithm of) Price vs. Avg Growing Season Temp (Celsius), comparing candidate regression lines by their sum of squared errors: SSE = 10.15, SSE = 6.03, and SSE = 5.73]
Other Error Measures
• SSE can be hard to interpret
• Depends on N
• Units are hard to understand
R²
• Compares the best model to a “baseline” model that does not use any variables
• The baseline predicts the same outcome for every observation
[Figure: scatterplots of (Logarithm of) Price, showing the regression model with SSE = 5.73 and the baseline model with SST = 10.15]
Interpreting R²
$R^2 = 1 - \dfrac{SSE}{SST}$
• R² captures the value added from using a model
• R² = 0 means no improvement over the baseline
• R² = 1 means a perfect predictive model
• Unitless and universally interpretable
• Can still be hard to compare between problems
• Good models for easy problems will have R² ≈ 1
• Good models for hard problems can still have R² ≈ 0
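A minimal sketch of this computation, using hypothetical numbers purely for illustration:

```python
# Sketch: R^2 from SSE and SST, per the formula above (hypothetical numbers).
import numpy as np

def r_squared(y, y_pred):
    sse = np.sum((y - y_pred) ** 2)      # model's sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)    # baseline (predict-the-mean) errors
    return 1 - sse / sst

y = np.array([6.5, 7.2, 8.1, 7.8])       # hypothetical log-prices
y_pred = np.array([6.7, 7.0, 8.0, 7.9])  # hypothetical predictions
print(r_squared(y, y_pred))
```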
Available Independent Variables
• So far, we have only used the Average Growing
Season Temperature to predict wine prices
Multiple Linear Regression
• Using each variable on its own:
• R² = 0.44 using Average Growing Season Temperature
• R² = 0.32 using Harvest Rain
• R² = 0.22 using France Population
• R² = 0.20 using Age
• R² = 0.02 using Winter Rain
The Regression Model
• Multiple linear regression model with k variables
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \epsilon_i$
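A sketch of fitting such a k-variable model in Python with statsmodels; the file and column names are hypothetical stand-ins for the wine data:

```python
# Sketch: multiple linear regression mirroring the k-variable equation above.
# File and column names are hypothetical stand-ins for the wine data.
import pandas as pd
import statsmodels.api as sm

wine = pd.read_csv("wine.csv")
X = sm.add_constant(wine[["AGST", "HarvestRain", "Age", "WinterRain"]])
model = sm.OLS(wine["Price"], X).fit()
print(model.summary())  # coefficients, R^2, and significance tests
```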
Adding Variables
Variables R²
Average Growing Season Temperature (AGST) 0.44
AGST, Harvest Rain 0.71
AGST, Harvest Rain, Age 0.79
AGST, Harvest Rain, Age, Winter Rain 0.83
AGST, Harvest Rain, Age, Winter Rain, Population 0.83
Selecting Variables
• Not all available variables should be used
• Each new variable requires more data
• Too many variables cause overfitting: high R² on the data used to build the model, but poor performance on unseen data
Understanding the Model and Coefficients
We have removed variables that were not significant.
Correlation
A measure of the linear relationship between two variables:
• +1 = perfect positive linear relationship
• 0 = no linear relationship
• -1 = perfect negative linear relationship
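For illustration, the Pearson correlation can be computed as below; the arrays are hypothetical:

```python
# Sketch: Pearson correlation between two variables (hypothetical values).
import numpy as np

x = np.array([15.0, 15.8, 16.2, 17.1])  # e.g., a temperature-like variable
y = np.array([6.6, 7.1, 7.4, 8.2])      # e.g., a log-price-like variable
print(np.corrcoef(x, y)[0, 1])          # lies between -1 and +1
```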
Examples of Correlation
[Figures: scatterplots illustrating correlation, involving (Logarithm of) Price, Avg Growing Season Temp (Celsius), France Population, and Age of Wine (Years)]
Testing the Assumptions of Multivariate Analysis
Some techniques are less affected by violating certain assumptions, which is termed robustness, but in all cases meeting some of the assumptions will be critical to a successful analysis. Thus, it is necessary to understand the role played by each assumption for every multivariate technique.
➢In almost all instances, the multivariate procedures will estimate the multivariate model and produce results
even when the assumptions are severely violated. Thus, the researcher must be aware of any assumption
violations and the implications they may have for the estimation process or the interpretation of the results.
➢Multivariate analysis requires that the assumptions underlying the statistical techniques be tested twice: first
for the separate variables, akin to the tests for a univariate analysis, and second for the multivariate model
variate, which acts collectively for the variables in the analysis and thus must meet the same assumptions as
individual variables.
➢Four of these assumptions potentially affect every univariate and multivariate statistical technique:
Data Assumptions:
❑ Normality (desired: Yes)
❑ Homoscedasticity (desired: Yes)
❑ Linearity (desired: Yes)
❑ Multicollinearity (desired: No, i.e., it should be absent)
Normality
Distribution
Let’s take a look and try it out
Normality: The most fundamental assumption in multivariate analysis is normality, referring to the shape
of the data distribution for an individual metric variable and its correspondence to the normal distribution.
➢If the variation from the normal distribution is sufficiently large, all resulting statistical tests
are invalid, because normality is required to use the F and t statistics. Both the univariate
and the multivariate statistical methods discussed in this text are based on the assumption of
univariate normality, with the multivariate methods also assuming multivariate normality.
➢If a variable is multivariate normal, it is also univariate normal. However, the reverse is not
necessarily true (two or more univariate normal variables are not necessarily multivariate
normal).
➢Thus, a situation in which all variables exhibit univariate normality will help gain, although
not guarantee, multivariate normality. Multivariate normality is more difficult to test, but
specialized tests are available in the techniques most affected by departures from
multivariate normality.
➢In most cases assessing and achieving univariate normality for all variables is sufficient, and we will
address multivariate normality only when it is especially critical. Even though large sample sizes tend to
diminish the detrimental effects of nonnormality, the researcher should always assess the normality for
all metric variables included in the analysis.
➢In most programs, the skewness and kurtosis of a normal distribution are given values of zero; values above or below zero denote departures from normality. For example, negative kurtosis values indicate a platykurtic (flatter) distribution, whereas positive values denote a leptokurtic (peaked) distribution. Likewise, positive skewness indicates a distribution with a longer right tail (its bulk shifted to the left), and negative skewness indicates a longer left tail (a rightward shift of the bulk).
➢Sample size has the effect of increasing statistical power by reducing sampling error, and it has a similar effect here: larger sample sizes reduce the detrimental effects of non-normality. In small samples of 50 or fewer observations, and especially if the sample size is less than 30 or so, significant departures from normality can have a substantial impact on the results. As sample sizes become large, the researcher can be less concerned about non-normal variables, except as they might lead to other assumption violations that have an impact in other ways.
Normality Testing Methods
1. Graphical analyses: histogram, normal P-P and Q-Q plots
2. Statistical tests: skewness and kurtosis z values, Shapiro-Wilk and Kolmogorov-Smirnov (K-S) tests
Graphical Analyses: The simplest diagnostic test for normality is a visual check of the histogram that compares the observed data values with a distribution approximating the normal distribution. Although appealing because of its simplicity, this method is problematic for smaller samples, where the construction of the histogram (e.g., the number of categories or the width of categories) can distort the visual portrayal to such an extent that the assessment is unreliable.
If either calculated z value exceeds the specified critical value, then the distribution is non-normal
in terms of that characteristic. The critical value is from a z distribution, based on the significance
level we desire. The most commonly used critical values are ±2.58 (.01 significance level) and
±1.96, which corresponds to a .05 error level. With these simple tests, the researcher can easily
assess the degree to which the skewness and peakedness of the distribution vary from the normal
distribution.
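The z values referred to here are conventionally computed from the sample skewness and kurtosis and the sample size N; as a sketch of the standard formulas:

$$z_{\text{skewness}} = \frac{\text{skewness}}{\sqrt{6/N}}, \qquad z_{\text{kurtosis}} = \frac{\text{kurtosis}}{\sqrt{24/N}}$$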
The two most common are the Shapiro-Wilk test and a modification of the Kolmogorov-Smirnov
test. Each calculates the level of significance for the differences from a normal distribution. The
researcher should always remember that tests of significance are less useful in small samples
(fewer than 30) and quite sensitive in large samples (exceeding 1,000 observations).
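A sketch of running both tests outside SPSS, in Python with scipy, on hypothetical data (note SPSS applies a Lilliefors correction to K-S, which scipy's plain kstest does not):

```python
# Sketch: Shapiro-Wilk and Kolmogorov-Smirnov tests via scipy (hypothetical data).
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(7.0, 0.6, 27)   # hypothetical log-prices
print(stats.shapiro(x))                              # Shapiro-Wilk W, p-value
z = (x - x.mean()) / x.std(ddof=1)                   # standardize before K-S
print(stats.kstest(z, "norm"))                       # K-S D statistic, p-value
# Note: SPSS reports a Lilliefors-corrected K-S; statsmodels' lilliefors()
# provides that correction if needed.
```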
Normality Hypothesis Testing in SPSS: Analyze -> Descriptive Statistics -> Explore -> Plots -> Normality plots with tests
Tests for Skewness and Kurtosis
• Skewness: -1 to +1 (or, more strictly, -0.5 to +0.5), or within three times the standard error
• Kurtosis: -3 to +3 (or, more strictly, -2 to +2), or within three times the standard error
• Values for asymmetry and kurtosis between -2 and +2 are considered acceptable in order to prove normal univariate distribution (George & Mallery, 2010)
• Relaxed rule:
  • Skewness > 1 = positively (right) skewed
  • Skewness < -1 = negatively (left) skewed
  • Skewness between -1 and 1 is fine
• Strict rule:
  • Abs(Skewness) > 3 × Std. Error = skewed
  • Same for kurtosis
P-P and Q-Q plots are two types of probability plots used to compare the distribution of a variable against a theoretical distribution. The P-P (probability-probability) plot compares the cumulative probabilities of the data to those of the theoretical distribution (e.g., the normal), while the Q-Q (quantile-quantile) plot compares the sample quantiles to the theoretical quantiles. These plots can help identify the underlying distribution of the data and whether it is normally distributed or not.
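A minimal sketch of producing a normal Q-Q-style plot in Python (hypothetical data; scipy's probplot plots sample versus theoretical quantiles):

```python
# Sketch: a normal Q-Q-style plot with scipy and matplotlib (hypothetical data).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(1).normal(7.0, 0.6, 27)  # hypothetical data
stats.probplot(x, dist="norm", plot=plt)            # sample vs. theoretical quantiles
plt.show()
```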
Tests for Normality
SPSS
1. Analyze
2. Descriptive Statistics -> Explore
3. Plots
4. Normality plots with tests
*Neither of these variables would be considered normally distributed according to the K-S or S-W measures, but a visual inspection shows that role conflict (left) is roughly normal and participation (right) is positively skewed.
So, ALWAYS conduct visual inspections!
➢Thus, the researcher should always use both the graphical plots and the statistical tests to assess the actual degree of departure from normality. This session confines the discussion to univariate normality tests.
Before and After Transformation
[Figure: a negatively skewed variable before and after a cube transformation]
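As a sketch of the idea (hypothetical data): cubing is one common remedy for negative skew, just as a log transform is for positive skew:

```python
# Sketch: skewness before and after a cube transformation (hypothetical data).
import numpy as np
from scipy.stats import skew

x = 10 - np.random.default_rng(2).exponential(1.0, 200)  # negatively skewed
print(skew(x))       # clearly negative
print(skew(x ** 3))  # cubing stretches the upper end, reducing negative skew
```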
Normality Write Up
➢ In assessing the normality of auction prices for wine, both the Kolmogorov-Smirnov and
Shapiro-Wilk tests were utilized. The Kolmogorov-Smirnov test yielded a statistic of .094 with a
significance level of .200, while the Shapiro-Wilk test produced a statistic of .950 with a
significance level of .215. Neither test indicated significant deviations from normality, as p-values were greater than the conventional alpha level of .05.
➢ The normal Q-Q plot further supports the conclusion of normality. Data points largely adhere
to the line of expected normal distribution, with only slight deviations, suggesting that the
auction prices do not significantly differ from a normal distribution.
➢ Lastly, the histogram of auction prices, with a mean of 7.04 and a standard deviation of .635
for the 27 observed wines, displays a fairly symmetrical distribution around the mean, which is
characteristic of a normal distribution. The overlaid normal curve fits the histogram reasonably
well, indicating that the data are approximately normally distributed.
➢ These statistical tests and visual inspections collectively suggest that the auction prices of wine
in this sample do not depart significantly from a normal distribution. It is, therefore,
reasonable to conclude that the auction prices for wine follow a normal distribution, allowing
for parametric statistical techniques that assume normality to be used in subsequent analyses.
➢ This interpretation should be treated with caution due to the small sample size (n = 27), which may not be representative of the population and could affect the power of the tests.
➢The histogram of regression standardized residuals appeared to be
normally distributed, with a mean of approximately 8.74E-15 (a value
close to zero) and a standard deviation of 0.920. The normal P-P plot
of regression standardized residuals showed points that closely
followed the expected line, suggesting that the residuals were
normally distributed. The normality of residuals suggested that the
assumptions for multiple regression were met, lending credibility to
the model. However, the relatively small sample size warrants caution
in generalizing these findings.
Homoscedasticity:
➢Dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s).
➢If the data also exhibit normality and linearity, this assumption is usually satisfied.
➢If a variable has this property, it means that the DV exhibits consistent/equal variance across the range of
predictor variable(s).
➢A simple way to assess whether a relationship is homoscedastic is a scatter plot with the IV on the x-axis and the DV (or the residuals) on the y-axis. If the vertical spread of the points is roughly constant across the range of the IV, we have homoscedasticity; if the spread fans out or narrows across that range, the relationship is heteroscedastic.
➢Homoscedasticity and heteroscedasticity are both related to linear regression. Homoscedasticity refers to the
condition where the variance of the error term around the regression line is constant for all levels of the
predictor. On the other hand, heteroscedasticity is the condition where the variance of the error term varies
across levels of the predictor. It's important to assess whether the residuals (the differences between the
observed and predicted values) of a regression model exhibit homoscedasticity or heteroscedasticity to ensure
its validity and reliability.
➢Homoscedasticity refers to the assumption that the variance of the errors is constant across all levels of the
independent variables. This is a key assumption in regression analysis because it ensures that the statistical
tests for significance are valid.
➢Imagine you are a researcher studying how different factors affect house prices. You
decide to use multiple regression, where the house price is the dependent variable, and
the factors you're looking at are the size of the house (in square feet), the number of
bedrooms, and the age of the house.
➢Homoscedasticity means that the spread or variability of house prices around the
predicted prices (based on your model) should be roughly the same, regardless of the
size of the house, the number of bedrooms, or the age of the house.
➢When you plot the residuals (the differences between the observed house prices and the
prices predicted by your model) against the predicted house prices, you should see a
random scatter of points, with no clear pattern and roughly equal spread across all levels
of predicted prices. Whether the predicted price is $100,000 or $500,000, the spread of
the residuals around each predicted value should be similar.
➢Example of Heteroscedasticity (the opposite of homoscedasticity): If you notice that for cheaper houses (e.g., predicted price around $100,000) the residuals are very tightly clustered (small variability), but for more expensive houses (e.g., predicted price around $500,000) the residuals are much more spread out (large variability), this is a sign of heteroscedasticity. It means the error variance is not constant.
[Plots for detection: scatterplot of standardized (Z) residuals / boxplot]
➢Nonmetric Independent Variables: In these analyses (e.g., ANOVA and MANOVA) the focus
now becomes the equality of the variance (single dependent variable) or the variance/covariance
matrices (multiple dependent variables) across the groups formed by the nonmetric independent
variables. The equality of variance/covariance matrices is also seen in discriminant analysis, but
in this technique the emphasis is on the spread of the independent variables across the groups
formed by the nonmetric dependent measure.
➢Graphical Tests of Equal Variance Dispersion: The test of homoscedasticity for two metric
variables is best examined graphically. Departures from an equal dispersion are shown by such
shapes as cones (small dispersion at one side of the graph, large dispersion at the opposite side)
or diamonds (a large number of points at the center of the distribution). Boxplots work well to
represent the degree of variation between groups formed by a categorical variable. The length of
the box and the whiskers each portray the variation of data within that group. Thus,
heteroscedasticity would be portrayed by substantial differences in the length of the boxes and whiskers between groups, representing the dispersion of observations in each group.
➢The statistical tests for equal variance dispersion assess the equality of variances within groups
formed by nonmetric variables. The most common test, the Levene test, is used to assess
whether the variances of a single metric variable are equal across any number of groups. If more
than one metric variable is being tested, so that the comparison involves the equality of
variance/covariance matrices, Box's M test is applicable.
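A sketch of the Levene test in Python with scipy, using hypothetical group values:

```python
# Sketch: Levene's test for equal variances across groups (hypothetical data).
from scipy import stats

group_a = [6.8, 7.1, 7.4, 7.9, 8.2]   # e.g., prices in group A
group_b = [6.5, 7.0, 7.3, 7.5, 8.4]   # e.g., prices in group B
stat, p = stats.levene(group_a, group_b)
print(stat, p)  # p < .05 would suggest unequal variances
```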
➢Heteroscedastic variables can be remedied through data transformations similar to those used to achieve normality. As mentioned earlier, heteroscedasticity is often the result of non-normality of one of the variables, and correction of the non-normality also remedies the unequal dispersion of variance.
➢We should also note that the issue of heteroscedasticity can be remedied directly in some statistical techniques without the need for transformation. For example, in multiple regression the standard errors can be corrected for heteroscedasticity to produce heteroscedasticity-consistent standard errors.
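A sketch of requesting heteroscedasticity-consistent (HC) standard errors in Python with statsmodels; the formula and column names are hypothetical:

```python
# Sketch: heteroscedasticity-consistent (HC3) standard errors in statsmodels.
# The file and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

wine = pd.read_csv("wine.csv")
model = smf.ols("Price ~ AGST + HarvestRain", data=wine).fit(cov_type="HC3")
print(model.summary())  # same coefficients, robust standard errors
```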
Homoscedasticity Write Up
One statistical check: regress the (absolute) residuals, as the DV, on the selected IVs, then inspect the ANOVA significance for the hypothesis of equal variance.
➢Linearity: an implicit assumption of all multivariate techniques based on correlational measures of association (e.g., multiple regression, factor analysis).
➢Many scatterplot programs can overlay a straight line depicting the linear characteristics of the relationship.
Linearity
• Linearity refers to the consistent slope of change that represents the relationship between the IV and the DV
• If the relationship between the IV and the DV is radically inconsistent, then it will not be captured well by linear techniques
➢An alternative approach is to run a simple regression analysis and to examine the residuals. A third approach is to explicitly model a nonlinear relationship by testing alternative model specifications (also known as curve fitting) that reflect the nonlinear elements.
➢If a nonlinear relationship is detected, the most direct approach is to transform one or both variables to achieve linearity. A number of available transformations are discussed later in this chapter. An alternative to data transformation is the creation of new variables (e.g., polynomial terms) to represent the nonlinear portion of the relationship; these additional variables can then be used within linear models.
Linearity Testing Methods
1. Graphical Method: scatter plot / matrix scatter plot with trend line
2. Statistical Method: ANOVA test for linearity and deviation from linearity
H0: There is no significant deviation from linearity.
H1: There is a significant deviation from linearity.
[Figure: example scatterplots labeled Good (linear) and Bad (nonlinear)]
Linearity Write Up
A linearity test was conducted to examine the relationship between the auction price of wine and the amount of winter rain. The analysis utilized an ANOVA framework to assess the linearity of the relationship and the deviation from linearity. The sample size for the test was 26. The results of the ANOVA indicated a linear relationship between the auction price of the wine and winter rain: the test for deviation from linearity was not significant, F(24, 25) = 1.883, p = .527, so the assumption of linearity was not violated.
Multicollinearity
• Multicollinearity is not desirable in regression (but shared variance is desirable in factor analysis!)
• It means that independent variables are too highly correlated with each other and share too much variance
• It reduces the accuracy of the coefficient estimates and inflates their standard errors (Hair)
• How much unique variance does the black circle actually account for?
Multicollinearity Testing Methods
1. Graphical Method: scatter plot (IV vs. IV)
Detecting Multicollinearity
• IV-to-IV correlation above 0.7 is bad
• An easy way to check is the Variance Inflation Factor (VIF): regress each IV on all the remaining IVs, compute VIF = 1 / (1 − R²) from that regression, and repeat with each IV swapped in as the dependent variable in turn
• The rules of thumb for the VIF are as follows:
  • VIF < 3: no problem
  • VIF > 3: potential problem
  • VIF > 5: very likely problem
  • VIF > 10: definitely a problem
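A sketch of the VIF computation in Python with statsmodels (hypothetical file and column names):

```python
# Sketch: VIF per independent variable with statsmodels (hypothetical names).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

wine = pd.read_csv("wine.csv")
X = sm.add_constant(wine[["AGST", "HarvestRain", "Age", "WinterRain"]])
for i, name in enumerate(X.columns):
    if name != "const":                 # skip the intercept column
        print(name, variance_inflation_factor(X.values, i))
```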
Handling Multicollinearity
Loyalty 2 and Loyalty 3 seem to be too similar in both of these tests; dropping Loyalty 2 fixed the problem.
Multicollinearity Write Up
In examining the predictors of the auction price of wine, a multiple regression analysis was
conducted. The variance inflation factor (VIF) and tolerance statistics were inspected to assess
multicollinearity among the independent variables. Generally, a VIF above 10 or tolerance below
0.1 might indicate serious multicollinearity concerns (James, Witten, Hastie, & Tibshirani, 2013).
In the current model, the VIF values ranged from 1.103 for Harvest Rain to 1.247 for Average
Growing Season Temperature, and tolerance values ranged from 0.802 to 0.907, suggesting that
multicollinearity is not a concern for this set of predictors. Each independent variable appears to
contribute unique information in predicting the auction price of the wine, without unduly inflating the standard errors of the coefficient estimates.
Transformations Related to Specific Relationship Types:
➢Log-log: taking the log of both X and Y gives coefficients interpretable as the percentage change in Y for a percentage change in X (an elasticity).
➢The researcher is faced with what may seem to be an impossible task: satisfy all of these statistical assumptions. Note that even though these statistical assumptions are important, the researcher must use judgment in how to interpret the tests for each assumption and when to apply remedies.
➢Even analyses with small sample sizes can withstand small, but significant, violations of the assumptions; the researcher must keep the research question of interest in view, striking a balance between the need to satisfy the assumptions and the robustness of the technique.
Multicollinearity:
➢Remove Highly Correlated Predictors: if two variables are highly correlated, consider removing one of them.
➢Principal Component Analysis (PCA): PCA or factor analysis can reduce the dimensionality of your data, replacing correlated predictors with uncorrelated components.
➢Regularization Techniques: methods like ridge regression or the lasso can help in handling multicollinearity.
Autocorrelation:
➢Adding Lag Variables: In time series data, adding lagged versions of the dependent variable as predictors can
help.
➢Differencing: Apply differencing to the time series data (subtracting the previous observation from the current
observation).
➢Use Time Series Models: Consider time series-specific models like ARIMA, which are designed to handle
autocorrelation.
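A sketch of lagging and differencing with pandas, using a hypothetical series:

```python
# Sketch: lag and difference transformations with pandas (hypothetical series).
import pandas as pd

prices = pd.Series([7.0, 7.2, 7.1, 7.5, 7.8])
lagged = prices.shift(1)  # lag-1 version, usable as an extra predictor
diffed = prices.diff()    # first difference: current minus previous value
print(pd.DataFrame({"price": prices, "lag1": lagged, "diff": diffed}))
```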
Overall Strategies:
➢Data Cleaning and Exploration: ensure your data is clean and explore it thoroughly before modeling.
➢Model Selection: sometimes the choice of model itself can be the issue; explore different types of models.
➢Consult Domain Knowledge: understanding the context and domain can sometimes provide insights into why assumptions are violated and how to address them.
➢Always remember that when applying these remedies, it's important to understand their impact
on your model and interpretation. Each remedy can change the nature of your model and should
be chosen carefully based on both statistical reasoning and an understanding of your data and
research question.
Out-of-Sample R²
Variables                                          Model R²   Test R²
Avg Growing Season Temp (AGST)                     0.44       0.79
AGST, Harvest Rain                                 0.71       -0.08
AGST, Harvest Rain, Age                            0.79       0.53
AGST, Harvest Rain, Age, Winter Rain               0.83       0.79
AGST, Harvest Rain, Age, Winter Rain, Population   0.83       0.76
• Better model R² does not necessarily mean better test set R²
• Need more data to be conclusive
• Out-of-sample R² can be negative!
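A sketch of computing out-of-sample R² via a train/test split (synthetic data for illustration; note the test-set R² can indeed come out negative):

```python
# Sketch: out-of-sample R^2 via a train/test split (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(27, 4))                      # hypothetical predictors
y = X @ np.array([0.6, -0.4, 0.02, 0.1]) + rng.normal(0, 0.3, size=27)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print(r2_score(y_te, model.predict(X_te)))        # can be negative on new data
```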
Predictive Ability
• Our wine model had a value of R² = 0.83
➢Harvestify: https://harvestify.herokuapp.com/
https://www.dreamstime.com/diseases-grape-leaves-caused-parasite-insect-bites-living-vines-does-not-affect-grapes-leaf-galls-look-like-image172727918
American Psychological Association (APA) Write Up
A linear regression analysis was conducted to predict auction prices of wine based on the age of the wine,
winter rain, harvest rain, and the average temperature during the growing season. The model significantly
predicted the auction price of wine, F(4, 22) = 26.390, p < .001, and accounted for approximately 82.8% of the
variance in auction prices (R² = .828, Adjusted R² = .796). The standard error of the estimate was .28.
Specifically, each unit increase in the average growing season temperature was associated with a .616 increase
in the auction price of wine (p < .001), holding all other variables constant. Conversely, each unit increase in
harvest rain was associated with a .004 decrease in the auction price of wine (p < .001). Additionally, winter rain
was positively correlated with auction price (B = .001, p = .024), and age of the wine was also positively
correlated with auction price (B = .024, p = .003).
The data assumption tests, conducted through graphical and statistical methods, indicated that non-linearity, non-normality,
heteroscedasticity, and multicollinearity were not a concern. These results suggest that the average growing
season temperature and the age of the wine are the strongest predictors of the auction price of wine, with the
average growing season temperature having the most substantial impact. However, it is important to note that
while this model is statistically significant, the practical significance should be considered alongside other
factors that might influence wine prices in auction settings.
The Results
• Parker:
• 1986 is “very good to sometimes exceptional”
• Ashenfelter:
• 1986 is mediocre
• 1989 will be “the wine of the century” and 1990 will be even better!
• In wine auctions,
• 1989 sold for more than twice the price of 1986
• 1990 sold for even higher prices!
• Later, Ashenfelter predicted 2000 and 2003 would be great
• Parker has stated that “2000 is the greatest vintage Bordeaux has ever produced”
The Analytics Edge
• A linear regression model with only a few variables can predict wine
prices well