Regression and Correlation Analysis
06/03/2023
Regression Analysis
• A regression analysis generates an equation to describe the statistical
relationship between one or more predictors and the response
variable and to predict new observations. Linear regression usually
uses the ordinary least squares estimation method which derives the
equation by minimizing the sum of the squared residuals.
• Example: you work for a potato chip company that is analyzing
factors that affect the percentage of crumbled potato chips per
container before shipping (the response variable, Y). You conduct
the regression analysis and include the percentage of potato relative
to other ingredients and the cooking temperature (Celsius) as your
two predictors (X).
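As a rough illustration of least squares with two predictors, the sketch below solves the normal equations in plain Python. The data values, the helper names `solve` and `fit_ols`, and the "true" coefficients used to generate the response are all made up for illustration; they are not measurements from the potato chip example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ...] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the constant term
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: (percent potato, cooking temperature in Celsius).
predictors = [(40, 150), (45, 160), (50, 155), (55, 170), (60, 165), (65, 180)]
# Response generated from assumed coefficients so the fit can be checked.
crumbled = [2.0 + 0.10 * p + 0.05 * t for p, t in predictors]

coefficients = fit_ols(predictors, crumbled)
print(coefficients)
```

Because the made-up response is an exact linear function of the predictors, the fit should recover the assumed coefficients up to rounding error. Real data would leave nonzero residuals.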
• Quadratic Model
• Here, the predictor variable, X, is squared in order to model the curvature.
$Y = b_0 + b_1 X + b_2 X^2$
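As a supplementary note (not from the slide itself), differentiating the quadratic model shows why it captures curvature: the slope depends on X instead of being constant.

```latex
\frac{dY}{dX} = b_1 + 2\,b_2 X
```

When b_2 = 0 this reduces to the constant slope b_1 of a straight line.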
Assumptions of ordinary least squares
• Residuals have a mean of zero. Inclusion of a constant in the model
will force the mean to equal zero.
• All predictors are uncorrelated with the residuals.
• Residuals are not correlated with each other (no serial correlation).
• Residuals have a constant variance.
• No predictor variable is perfectly correlated (|r| = 1) with a different
predictor variable. It is best to avoid high but imperfect correlations
(multicollinearity) as well.
• Residuals are normally distributed.
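The first two properties can be checked numerically on a fitted model. The sketch below fits a simple regression to made-up data (the values and the helper name `fit_line` are illustrative assumptions) and confirms that, with a constant in the model, the sample residuals average to zero and have zero cross-product with the predictor. This holds by construction for the fitted residuals; the assumptions themselves concern the unobservable errors.

```python
def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Both quantities should be zero up to floating-point rounding.
residual_mean = sum(residuals) / len(residuals)
cross_product = sum(xi * ei for xi, ei in zip(x, residuals))
print(residual_mean, cross_product)
```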
The slope is 0. When x increases by 1, y neither increases nor
decreases. The y-intercept is -4.
Example
• A materials engineer at a furniture manufacturing site wants to assess
the stiffness of the particle board that the manufacturer uses. The
engineer measures the stiffness and the density of a sample of
particle board pieces.
• The engineer uses simple regression to determine whether the
density of the particles is associated with the stiffness of the board.
Interpretation of R-sq
In these results, the density of the particle board explains 84.5% of
the variation in the stiffness of the boards. The R-sq value indicates
that the model fits the data well.
Note: the higher the R-sq or R-sq(adj), the better the model fit.
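For reference, R-sq is one minus the ratio of the residual sum of squares to the total sum of squares, and R-sq(adj) penalizes it for the number of predictors. The sketch below computes both. The function names and the four data points are illustrative assumptions, not the particle board data.

```python
def r_squared(observed, fitted):
    """R-sq = 1 - SSE / SST."""
    mean_y = sum(observed) / len(observed)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    sst = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - sse / sst


def adjusted_r_squared(r_sq, n, p):
    """R-sq(adj) for n observations and p predictors (constant not counted)."""
    return 1.0 - (1.0 - r_sq) * (n - 1) / (n - p - 1)


# Hypothetical observed and fitted values.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]

r_sq = r_squared(observed, fitted)
r_sq_adj = adjusted_r_squared(r_sq, n=len(observed), p=1)
print(round(r_sq, 4), round(r_sq_adj, 4))  # 0.98 0.97
```

R-sq(adj) is the more useful figure when comparing models with different numbers of predictors, because R-sq never decreases when a predictor is added.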
Assumptions
• The predictors can be continuous or categorical. If you want to plot
the relationship between one continuous (numeric) predictor and a
continuous response, use a fitted line plot instead.
• The response variable should be continuous
• Collect data using best practices
• The correlation among the predictors, also known as multicollinearity,
should not be severe
• The model should provide a good fit to the data
Example
• A research chemist wants to understand how several predictors are
associated with the wrinkle resistance of cotton cloth. The chemist
examines 32 pieces of cotton cellulose produced at different settings
of curing time, curing temperature, formaldehyde concentration, and
catalyst ratio. The durable press rating, a measure of wrinkle
resistance, is recorded for each piece of cotton.
• The chemist performs a multiple regression analysis to fit a model
with the predictors and eliminate the predictors that do not have a
statistically significant relationship with the response.
Interpretation
The predictors temperature, catalyst ratio, and
formaldehyde concentration have p-values that are less
than the significance level of 0.05. These results indicate
that these predictors have a statistically significant effect
on wrinkle resistance. The p-value for time is greater than
0.05, which indicates that there is not enough evidence to
conclude that time is related to the response. The chemist
may want to refit the model without this predictor.
Interpretation
In these results, the model explains approximately 73% of
the variation in the response.
Guidelines exist for judging whether the variance inflation factors
(VIFs) are in an acceptable range. A rule of thumb commonly used in
practice is that a VIF below 10 is acceptable.
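The VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. With exactly two predictors, R_j^2 equals the squared Pearson correlation between them, which the sketch below uses. It covers only that two-predictor special case; the data and function names are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictor values with correlation 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

vif = vif_two_predictors(x1, x2)
print(round(vif, 3))  # 2.778
```

Under the rule of thumb above, this value would be considered acceptable. A correlation of about 0.95 between the two predictors would push the VIF past 10.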
Versus Fits
In this residuals versus fits plot, the points
do not appear to be randomly distributed
about zero. There appear to be clusters of
points that could represent different groups
in the data. You should investigate the
groups to determine their cause.
Versus Order
In this residuals versus order plot, the
residuals do not appear to be randomly
distributed about zero. The residuals
appear to systematically decrease as the
observation order increases. You should
investigate the trend to determine the
cause.
Correlation
• Use Correlation to measure the strength and direction of the association
between two variables. Minitab offers two methods of correlation: the
Pearson product moment correlation and the Spearman rank order
correlation.
• The Pearson correlation (also known as r), which is the most common
method, measures the linear relationship between two continuous
variables.
• If you are not certain whether your variables are linearly related, you
should create a scatter plot. If the relationship between the variables is
not linear, you may be able to use the Spearman rank order correlation
(also known as Spearman's rho). The Spearman correlation measures the
monotonic relationship between two continuous or ordinal variables.
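The difference between the two coefficients can be seen on a relationship that is monotonic but not linear. The sketch below is a plain-Python illustration, not Minitab's implementation. It computes Spearman's rho as the Pearson correlation of the ranks, with tied values given their average rank. The data and function names are assumptions made for the example.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y = x squared, which is monotonic but curved.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(round(r, 3), rho)  # 0.981 1.0
```

Spearman's rho is exactly 1 because y always increases with x, while the Pearson coefficient falls short of 1 because the increase is not along a straight line.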
Assumptions
• The data should be continuous or ordinal
• If you have categorical data, you should perform Cross Tabulation and Chi-
Square to examine the association between variables.
• The relationship between variables should be linear or monotonic
• If your variables do not have a linear or monotonic relationship, the results
from the correlation analysis will not accurately reflect the strength of the
relationship.
• Unusual values can have a strong effect on the results
• Because unusual values can have a strong effect on the results, use
Scatterplot or Fitted Line Plot to identify these values.
The following plots show data with specific Pearson correlation values
to illustrate different patterns in the strength and direction of the
relationships between variables.
[Figure: scatterplot panels titled "No relationship", "Moderate positive
relationship", and "Large positive relationship", each labeled with its
Pearson r.]
The following plots show data with specific Spearman correlation
coefficient values to illustrate different patterns in the strength and
direction of the relationships between variables.
[Figure: scatterplot panels titled "No relationship", "Strong positive
relationship", and "Strong negative relationship".]
Example
• An engineer at an aluminum castings plant assesses the relationship
between the hydrogen content and the porosity of aluminum alloy
castings. The engineer collects a random sample of 14 castings and
measures the following properties of each casting: hydrogen content,
porosity, and strength.
• The engineer uses the Pearson correlation to examine the strength
and direction of the linear relationship between each pair of
variables.
Interpretation
The Pearson correlation coefficient between hydrogen content and
porosity is 0.625 and represents a positive relationship between the
variables. As hydrogen increases, porosity also increases. The p-value
is 0.017, which is less than the significance level of 0.05. The
p-value indicates that the correlation is significant.
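The significance of a Pearson correlation is commonly tested with the statistic t = r * sqrt((n - 2) / (1 - r^2)) on n - 2 degrees of freedom. The sketch below applies it to the reported r = 0.625 with the n = 14 castings from the example. Comparing against the tabulated two-sided 5% critical value for 12 degrees of freedom (2.179) is a simplification chosen here to avoid computing the p-value, and the function name is an assumption.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing that the population correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


r = 0.625  # reported Pearson correlation, hydrogen content vs. porosity
n = 14     # number of castings in the sample

t_stat = correlation_t_statistic(r, n)

# Two-sided critical value of the t distribution, 12 degrees of freedom, alpha = 0.05.
T_CRITICAL_DF12 = 2.179
is_significant = abs(t_stat) > T_CRITICAL_DF12
print(round(t_stat, 3), is_significant)  # 2.774 True
```

The statistic exceeds the critical value, which agrees with the reported p-value being below 0.05.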
Problem 1
• The number of rotations per minute (RPM) is critical to the quality
of a wind generator. Several components affect the RPM of a
particular generator, among them the weight of the fans, the
speed of the wind, and the pressure. After having designed the
Conakry model of a wind generator, the reliability engineer
wants to build a model that will show how the “Rotation”
variable relates to the “Wind,” “Pressure,” and “Weight”
variables.
a. Show that “Wind” and “Pressure” are highly correlated.
b. Show that “Rotation” is highly dependent on the input
factors.
c. Show that only “Weight” is significant in the equation.
d. Show that the VIF is too high for “Wind” and “Pressure.”
e. Interpret the probability plot for the residuals.
Problem 2
• Organophosphate (OP) compounds are used as
pesticides. However, it is important to study their
effect on species that are exposed to them. In
the laboratory study Some Effects of
Organophosphate Pesticides on Wildlife Species,
by the Department of Fisheries and Wildlife at
Virginia Tech, an experiment was conducted in
which different dosages of a particular OP
pesticide were administered to 5 groups of 5
mice (Peromyscus leucopus). The 25 mice were
females of similar age and condition. One group
received no chemical. The basic response y was a
measure of activity in the brain. It was postulated
that brain activity would decrease with an
increase in OP dosage. The data are as follows:
• Determine the regression model and
interpret
• Construct an analysis-of-variance table
and interpret.
• Interpret the residual plots, R-sq.
• Test the correlation of the two variables
Problem 3
• The Statistics Consulting Center at
Virginia Tech analyzed data on normal
woodchucks for the Department of
Veterinary Medicine. The variables of
interest were body weight in grams and
heart weight in grams. It was desired to
develop a linear regression equation in
order to determine if there is a
significant linear relationship between
heart weight and total body weight.
• Test the correlation of two variables
• Interpret the results
Problem 4
• An experiment was conducted to study the size of squid
eaten by sharks and tuna. The regressor variables are
characteristics of the beaks of the squid. The data are given
as follows:
• In the study, the regressor variables and response
considered are
x1 = rostral length, in inches,
x2 = wing length, in inches,
x3 = rostral to notch length, in inches,
x4 = notch to wing length, in inches,
x5 = width, in inches,
y = weight, in pounds.