
Regression and Correlation Analysis

The document provides an overview of regression and correlation analysis, detailing simple and multiple linear regression, ordinary least squares regression, and the assumptions required for accurate modeling. It explains the roles of response and predictor variables, as well as the importance of residual analysis in validating model assumptions. Additionally, it includes examples and interpretations of regression results in practical scenarios.

Uploaded by

lemuel rena

06/03/2023

DATA ANALYTICS AND DESIGN OF EXPERIMENT
Dr. Ryan Jeffrey P. Curbano
Subject Professor

Regression and Correlation Analysis


Regression Analysis
• A regression analysis generates an equation to describe the statistical
relationship between one or more predictors and the response
variable and to predict new observations. Linear regression usually
uses the ordinary least squares estimation method which derives the
equation by minimizing the sum of the squared residuals.
• Example: You work for a potato chip company that is analyzing
factors that affect the percentage of crumbled potato chips per
container before shipping (the response variable, y). You conduct
the regression analysis and include the percentage of potato relative
to other ingredients and the cooking temperature (Celsius) as your
two predictors (x).

What is simple linear regression?


• Simple linear regression examines the linear relationship between
two continuous variables: one response (y) and one predictor (x).
When the two variables are related, it is possible to predict a
response value from a predictor value with better than chance
accuracy.
• Regression provides the line that "best" fits the data. This line can
then be used to:
✓ Examine how the response variable changes as the predictor variable
changes.
✓ Predict the value of the response variable (y) for a given value of the predictor variable (x).


What is multiple linear regression?


• Multiple linear regression examines the linear relationships
between one continuous response and two or more
predictors.
• If the number of predictors is large, then before fitting a
regression model with all the predictors, you should use
stepwise or best subsets model-selection techniques to
screen out predictors that are not associated with the response.

What is ordinary least squares regression?


• In ordinary least squares (OLS) regression, the estimated equation is
calculated by determining the equation that minimizes the sum of the
squared distances between the sample's data points and the values
predicted by the equation.

[Figure: Response vs. Predictor scatterplot with the fitted line]

• With one predictor (simple linear regression), the fitted line is the one for which the sum of the
squared vertical distances from each point to the line is as small as possible.
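
The least-squares idea above can be sketched in a few lines of plain Python. The data are invented purely for illustration (they do not come from these slides), and `fit_line` and `sse` are hypothetical helper names. The sketch computes the closed-form slope and intercept for one predictor and then checks that nudging either coefficient only increases the sum of squared residuals.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Illustrative data only (made up for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
best = sse(x, y, b0, b1)

# Shifting the intercept or the slope away from the fitted values
# makes the sum of squared residuals larger.
perturbed = [sse(x, y, b0 + d0, b1 + d1)
             for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]]

print(f"y = {b0:.3f} + {b1:.3f} x, SSE = {best:.4f}")
print("SSE of perturbed lines:", [round(v, 4) for v in perturbed])
```

Statistical packages such as Minitab carry out the same minimization internally, with additional output (standard errors, p-values) that this sketch leaves out.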


Assumptions that should be met for OLS regression
• OLS regression provides the most precise, unbiased estimates only
when the following assumptions are met:
• The regression model is linear in the coefficients. Least squares can
model curvature by transforming the variables (instead of the
coefficients). You must specify the correct functional form in order to
model any curvature.

• Quadratic Model
• Here, the predictor variable, X, is squared in order to model the curvature.
$Y = b_0 + b_1 X + b_2 X^2$
• Residuals have a mean of zero. Inclusion of a constant in the model
will force the mean to equal zero.
• All predictors are uncorrelated with the residuals.
• Residuals are not correlated with each other (no serial correlation).
• Residuals have a constant variance.
• No predictor variable is perfectly correlated (|r| = 1) with a different
predictor variable. It is best to avoid high but imperfect correlations
(multicollinearity) as well.
• Residuals are normally distributed.
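
To make "linear in the coefficients" concrete, here is a minimal sketch in plain Python that fits the quadratic model by ordinary least squares. The squared predictor is simply one more column in the design matrix. The data, the helper names `solve` and `fit_quadratic`, and the use of the normal equations are assumptions made for illustration, not something the slides prescribe. The sketch also checks two of the listed properties numerically: with a constant in the model, the residuals average to zero and are uncorrelated with the predictor.

```python
def solve(a, b):
    """Solve the square linear system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    design = [[1.0, xi, xi ** 2] for xi in x]  # columns: constant, x, x squared
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(3)]
    return solve(xtx, xty)


# Illustrative data only: y = 1 + 2x + 0.5x^2 plus small made-up offsets.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
offsets = [0.2, -0.1, 0.0, 0.1, -0.3, 0.1]
y = [1 + 2 * xi + 0.5 * xi ** 2 + e for xi, e in zip(x, offsets)]

b0, b1, b2 = fit_quadratic(x, y)
residuals = [yi - (b0 + b1 * xi + b2 * xi ** 2) for xi, yi in zip(x, y)]

# Both quantities are zero up to floating-point rounding.
mean_resid = sum(residuals) / len(residuals)
resid_dot_x = sum(r * xi for r, xi in zip(residuals, x))

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
print(f"mean residual = {mean_resid:.2e}, sum(residual * x) = {resid_dot_x:.2e}")
```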


Slope and intercept of the regression line


• The slope indicates the steepness of a line and the intercept indicates
the location where it intersects an axis.
• The slope and the intercept define the linear relationship between
two variables, and can be used to estimate an average rate of change.
The greater the magnitude of the slope, the steeper the line and the
greater the rate of change.
• By examining the equation of a line, you quickly can discern its slope
and y-intercept (where the line crosses the y-axis).

The slope is positive: 5. When x increases by 1, y increases by 5. The y-intercept is 2.

The slope is negative: −0.4. When x increases by 1, y decreases by 0.4. The y-intercept is 7.2.

The slope is 0. When x increases by 1, y neither increases nor decreases. The y-intercept is −4.
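
Written as equations, the three lines described above look as follows. The slope-intercept form $y = b_0 + b_1 x$, with $b_0$ the y-intercept and $b_1$ the slope, is the notation assumed here.

```latex
\begin{align*}
y &= 2 + 5x      && \text{(slope } 5,\ \text{y-intercept } 2\text{)} \\
y &= 7.2 - 0.4x  && \text{(slope } -0.4,\ \text{y-intercept } 7.2\text{)} \\
y &= -4          && \text{(slope } 0,\ \text{y-intercept } -4\text{)}
\end{align*}
```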


What are categorical, discrete, and continuous variables?
Quantitative variables can be classified as discrete or continuous.
• Categorical variable
❑Categorical variables contain a finite number of categories or distinct groups.
Categorical data might not have a logical order. For example, categorical predictors
include gender, material type, and payment method.
• Discrete variable
❑Discrete variables are numeric variables that have a countable number of values
between any two values. A discrete variable is always numeric. For example, the
number of customer complaints or the number of flaws or defects.
• Continuous variable
❑Continuous variables are numeric variables that have an infinite number of values
between any two values. A continuous variable can be numeric or date/time. For
example, the length of a part or the date and time a payment is received.


What are response and predictor variables?


• Variables of interest in an experiment (those that are measured or observed) are
called response or dependent variables. Other variables in the experiment that
affect the response and can be set or measured by the experimenter are called
predictor, explanatory, or independent variables.
• For example, you might want to determine the recommended baking time for a
cake recipe or provide care instructions for a new hybrid plant.


Regression analyses for continuous response variables
• Regression - Model the relationship between categorical or
continuous predictors and one response, and use the model to
predict response values for new observations. Easily include
interaction and polynomial terms, transform the response, or use
stepwise regression if needed.
In Minitab, choose Stat > Regression > Regression > Fit Regression Model.


Basic measures of association


• Correlation - Use to calculate Pearson's correlation or Spearman
rank-order correlation (also called Spearman's rho).
In Minitab, choose Stat > Basic Statistics > Correlation.

• Covariance - Use to calculate the covariance, a measure of the
relationship between two variables. The covariance is not
standardized, unlike the correlation coefficient.
In Minitab, choose Stat > Basic Statistics > Covariance.
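
The difference between the two measures can be seen in a small plain-Python sketch. The data are made up for illustration and the function names are hypothetical. The sketch uses the sample covariance (divisor n − 1) and shows that changing the units of one variable rescales the covariance but leaves the Pearson correlation unchanged, which is what "standardized" means here.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Pearson correlation: the covariance divided by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Illustrative data only: the same lengths recorded in metres and in centimetres.
length_m = [1.2, 1.5, 1.9, 2.4, 3.0]
weight_kg = [10.0, 13.0, 15.5, 21.0, 24.5]
length_cm = [100 * v for v in length_m]

cov_m = covariance(length_m, weight_kg)
cov_cm = covariance(length_cm, weight_kg)
r_m = pearson_r(length_m, weight_kg)
r_cm = pearson_r(length_cm, weight_kg)

print(f"covariance:  {cov_m:.3f} (m) vs {cov_cm:.3f} (cm)")
print(f"correlation: {r_m:.4f} (m) vs {r_cm:.4f} (cm)")
```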


Overview for Fitted Line Plot


• Use Fitted Line Plot to display the relationship between one
continuous predictor and a response. You can fit a linear, quadratic, or
cubic model to the data.
• A fitted line plot shows a scatterplot of the data with a regression line
representing the regression equation.
• For example, an engineer at a manufacturing site wants to examine
the relationship between energy consumption and the setting of a
machine used in the manufacturing process. The engineer thinks the
relationship between these variables is curvilinear. Therefore, the
engineer creates a fitted line plot and fits a quadratic model to the
data.


• Where to find this analysis


• To create a fitted line plot, choose Stat > Regression >
Fitted Line Plot.
• When to use an alternate analysis
• If you have one categorical predictor and no continuous
predictors, use One-Way ANOVA.
• If you have more than one predictor, use Fit Regression
Model.


Assumptions of Fitted Line Plots


• The data should include only one continuous predictor
• The response variable should be continuous
• Collect data using best practices
• To ensure that your results are valid, consider the following guidelines:
• Make certain that the data represent the population of interest.
• Collect enough data to provide the necessary precision.
• Measure variables as accurately and precisely as possible.
• Record the data in the order it is collected.
• The model should provide a good fit to the data


Determine how well the model fits your data


• R-sq
✓ R² is the percentage of variation in the response that is explained by the
model. The higher the R² value, the better the model fits your data. R² is
always between 0% and 100%. R² never decreases when you add
predictors to a model.
• R-sq (adj)
✓ Use adjusted R² when you want to compare models that have different
numbers of predictors. R² goes up (or at least does not go down) when you add a predictor to the
model, even when there is no real improvement to the model. The adjusted
R² value incorporates the number of predictors in the model to help you
choose the correct model.
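
As a minimal sketch of how the two statistics relate, the plain-Python functions below use the standard formulas R² = 1 − SSE/SST and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The function names and the numbers plugged in are illustrative assumptions, not results from these slides.

```python
def r_squared(y, fitted):
    """Share of the variation in y that is explained by the fitted values."""
    y_bar = sum(y) / len(y)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_error / ss_total


def adjusted_r_squared(r2, n, p):
    """Penalize R-squared for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Illustrative numbers only: the same R-squared reached with 1 or with 5 predictors.
n = 20
r2 = 0.80
adj_one = adjusted_r_squared(r2, n, p=1)
adj_five = adjusted_r_squared(r2, n, p=5)

print(f"R-sq = {r2:.1%}, R-sq(adj) with 1 predictor = {adj_one:.1%}, "
      f"with 5 predictors = {adj_five:.1%}")
```

With the same R², the model that needs more predictors receives the lower adjusted value, which is why the adjusted statistic is the one to compare across models of different size.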


Consider the following when you compare the R² values:
• Small samples do not provide a precise estimate of the strength of
the relationship between the response and predictors. If you need
R² to be more precise, you should use a larger sample (typically, 40
or more).
• R² is just one measure of how well the model fits the data. Even when
a model has a high R², you should check the residual plots to verify
that the model meets the model assumptions.


Determine whether your model meets the assumptions of the analysis
• Use the residual plots to help you determine whether the model is
adequate and meets the assumptions of the analysis.
• If the assumptions are not met, the model may not fit the data well
and you should use caution when you interpret the results.


Residuals versus fits plot


• Use the residuals versus fits plot to verify the
assumption that the residuals are randomly
distributed and have constant variance. Ideally, the
points should fall randomly on both sides of 0, with
no recognizable patterns in the points.


The patterns in the following table may indicate that the model does not meet the model assumptions.


Residuals versus order plot


• Use the residuals versus order plot to verify the assumption that the
residuals are independent from one another. Independent residuals
show no trends or patterns when displayed in time order. Patterns in
the points may indicate that residuals near each other may be
correlated, and thus, not independent. Ideally, the residuals on the
plot should fall randomly around the center line.


The following types of patterns may indicate that the residuals are dependent:
• Trend
• Cycle
• Shift


Normal probability plot


•Use the normal probability plot of the residuals
to verify the assumption that the residuals are
normally distributed. The normal probability
plot of the residuals should approximately
follow a straight line.
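
For readers who want to see what goes into such a plot, the sketch below pairs each sorted residual with a theoretical standard-normal quantile using only Python's standard library. The residuals are made up, `normal_plot_points` is a hypothetical helper, and the plotting positions (i − 0.3)/(n + 0.4) are one common convention assumed here; a given package may use a slightly different formula. If the residuals are roughly normal, plotting the pairs gives points close to a straight line.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its theoretical standard-normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.3) / (n + 0.4) for i = 1..n.
    quantiles = [std_normal.inv_cdf((i - 0.3) / (n + 0.4))
                 for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Illustrative residuals only (made up for this sketch).
residuals = [0.4, -1.1, 0.2, 1.3, -0.3, -0.6, 0.9, 0.1, -0.2, -0.7]
points = normal_plot_points(residuals)

for quantile, residual in points:
    print(f"{quantile:6.2f}  {residual:5.2f}")
```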


The patterns in the following table may indicate that the model does not meet the model assumptions.


Example
• A materials engineer at a furniture manufacturing site wants to assess
the stiffness of the particle board that the manufacturer uses. The
engineer measures the stiffness and the density of a sample of
particle board pieces.
• The engineer uses simple regression to determine whether the
density of the particles is associated with the stiffness of the board.


• Choose Stat > Regression > Fitted Line Plot.
• In Response, enter Stiffness.
• In Predictor, enter Density.
• Click Options. Under Display
Options, select Display
confidence interval and Display
prediction interval. Click OK.
• Click Graphs. Under Residual
Plots, select Four in one.
• Click OK in each dialog.


Results

Interpretation of the p-value
The p-value for the regression model is reported as 0.000, which means that the actual p-value is less than 0.0005. Because the p-value is less than the significance level of 0.05, the engineer can conclude that the association between stiffness and density is statistically significant.

Interpretation of the regression equation
In these results, the coefficient for the predictor, Density, is 3.541. The average stiffness of the particle board increases by approximately 3.5 for every 1-unit increase in density. The sign of the coefficient is positive, which indicates that as density increases, stiffness also increases.

Interpretation of R-sq
In these results, the density of the particle board explains 84.5% of the variation in the stiffness of the boards. The R² value indicates that the model fits the data well.

Note: The higher the R-sq or R-sq(adj), the better the model fit.


Versus Fits Interpretation
In this residuals versus fits plot, the points appear randomly scattered on the plot. However, the point in the upper right corner appears to be an outlier.

Versus Order Interpretation
In this residuals versus order plot, the outlier that is also visible on the other residual plots appears to correspond to the observation in row 21 of the worksheet.

Normal Probability Plot Interpretation
In this normal probability plot, the residuals generally appear to follow a straight line. However, the point in the upper right corner of the plot is far away from the line and appears to be an outlier, which was also visible on the other residual plots.


Multiple Regression Model


Overview for Fit Regression Model
• Use Fit Regression Model to describe the relationship
between a set of predictors and a continuous response
using the ordinary least squares method. After you perform
the analysis, you can:
• Predict the response for new observations.
• Plot the relationships among the variables.
• Find values that optimize one or more responses.
• To fit a regression model, choose Stat > Regression > Regression > Fit
Regression Model.


Assumptions
• The predictors can be continuous or categorical. If you only want to plot
the relationship between one continuous (numeric) predictor and a
continuous response, use Fitted Line Plot instead.
• The response variable should be continuous
• Collect data using best practices
• The correlation among the predictors, also known as multicollinearity,
should not be severe
• The model should provide a good fit to the data


Example
• A research chemist wants to understand how several predictors are
associated with the wrinkle resistance of cotton cloth. The chemist
examines 32 pieces of cotton cellulose produced at different settings
of curing time, curing temperature, formaldehyde concentration, and
catalyst ratio. The durable press rating, a measure of wrinkle
resistance, is recorded for each piece of cotton.
• The chemist performs a multiple regression analysis to fit a model
with the predictors and eliminate the predictors that do not have a
statistically significant relationship with the response


• Choose Stat > Regression > Regression > Fit Regression Model.
• In Responses, enter Rating.
• In Continuous predictors, enter Conc
Ratio Temp Time.
• Click Graphs.
• Under Residuals plots, choose Four in
one.
• In Residuals versus the variables, enter
Conc Ratio Temp Time.
• Click OK in each dialog box.


Results

Interpretation
The predictors temperature, catalyst ratio, and
formaldehyde concentration have p-values that are less
than the significance level of 0.05. These results indicate
that these predictors have a statistically significant effect
on wrinkle resistance. The p-value for time is greater than
0.05, which indicates that there is not enough evidence to
conclude that time is related to the response. The chemist
may want to refit the model without this predictor.

Interpretation of R-sq
In these results, the model explains approximately 73% of
the variation in the response.

Interpretation of VIF (Variance Inflation Factor)
In these results, every variance inflation factor is less than 10,
so there is no evidence of severe multicollinearity in the model.

There are guidelines for judging whether the VIFs are in an acceptable range.
A rule of thumb commonly used in practice is that a VIF below 10 is acceptable.

Note: Multicollinearity occurs when two or more predictors in the
model are correlated and provide redundant information about the
response.
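
The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. With only two predictors, that R² is just their squared Pearson correlation, which allows a short plain-Python sketch. The data and variable names below are invented for illustration and are not the chemist's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 is their squared correlation."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Illustrative data only.
temp = [120.0, 130.0, 140.0, 150.0, 160.0, 170.0]
cure_time = [3.0, 1.0, 4.0, 2.0, 6.0, 5.0]  # only moderately related to temp
# Nearly a rescaled copy of temp, so it carries redundant information.
temp_noisy = [1.8 * t + 32 + e
              for t, e in zip(temp, [0.5, -0.4, 0.3, -0.2, 0.1, -0.3])]

vif_ok = vif_two_predictors(temp, cure_time)
vif_bad = vif_two_predictors(temp, temp_noisy)

print(f"VIF (temp vs cure_time):  {vif_ok:.2f}")
print(f"VIF (temp vs temp_noisy): {vif_bad:.0f}")
```

The first pair stays well under the rule-of-thumb limit of 10, while the near-duplicate predictor pushes the VIF far above it. With three or more predictors, each VIF requires a full regression of one predictor on the others.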


Versus Fits
In this residuals versus fits plot, the points
do not appear to be randomly distributed
about zero. There appear to be clusters of
points that could represent different groups
in the data. You should investigate the
groups to determine their cause.

Versus Order
In this residuals versus order plot, the
residuals do not appear to be randomly
distributed about zero. The residuals
appear to systematically decrease as the
observation order increases. You should
investigate the trend to determine the
cause.

Normal Probability Plot


In this normal probability plot, the points
generally follow a straight line. There is no
evidence of nonnormality, outliers, or
unidentified variables.


Correlation
• Use Correlation to measure the strength and direction of the association
between two variables. Minitab offers two methods of correlation: the
Pearson product moment correlation and the Spearman rank order
correlation.
• The Pearson correlation (also known as r), which is the most common
method, measures the linear relationship between two continuous
variables.
• If you are not certain whether your variables are linearly related, you
should create a scatter plot. If the relationship between the variables is
not linear, you may be able to use the Spearman rank order correlation
(also known as Spearman's rho). The Spearman correlation measures the
monotonic relationship between two continuous or ordinal variables
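
The contrast between the two coefficients can be illustrated with a small plain-Python sketch. Spearman's rho is the Pearson correlation computed on the ranks of the data, so a relationship that is monotonic but curved gives rho = 1 while Pearson's r stays below 1. The data and helper names are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Illustrative data only: y always increases with x, but along a curve.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 3 for xi in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```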


Assumptions
• The data should be continuous or ordinal
• If you have categorical data, you should perform Cross Tabulation and Chi-
Square to examine the association between variables.
• The relationship between variables should be linear or monotonic
• If your variables do not have a linear or monotonic relationship, the results
from the correlation analysis will not accurately reflect the strength of the
relationship.
• Unusual values can have a strong effect on the results
• Because unusual values can have a strong effect on the results, use
Scatterplot or Fitted Line Plot to identify these values.


When to use Pearson's r and Spearman's rho?

• Use the Pearson correlation coefficient when the data are normally
distributed.
• Use the Spearman correlation coefficient when the data are not normally
distributed.
• To check whether the data are normally distributed, run a test of
normality such as the Anderson-Darling test.


Examine the linear relationship between variables (Pearson)


• Use the Pearson correlation coefficient to examine the strength and direction of the
linear relationship between two continuous variables.
• Strength
• The correlation coefficient can range in value from −1 to +1. The larger the
absolute value of the coefficient, the stronger the relationship between the
variables.
• For the Pearson correlation, an absolute value of 1 indicates a perfect linear
relationship. A correlation close to 0 indicates no linear relationship between the
variables.
• Direction
• The sign of the coefficient indicates the direction of the relationship. If both
variables tend to increase or decrease together, the coefficient is positive, and
the line that represents the correlation slopes upward. If one variable tends to
increase as the other decreases, the coefficient is negative, and the line that
represents the correlation slopes downward.


The following plots show data with specific correlation values to illustrate different
patterns in the strength and direction of the relationships between variables.

[Figure panels, Pearson r: no relationship; moderate positive relationship; large positive relationship; large negative relationship]

Large negative relationship: The points fall close to the line, which indicates that there is a strong negative relationship between the variables. The relationship is negative because, as one variable increases, the other variable decreases.


Examine the monotonic relationship between variables (Spearman)
• Strength
• The correlation coefficient can range in value from −1 to +1. The larger the
absolute value of the coefficient, the stronger the relationship between the
variables.
• Direction
• The sign of the coefficient indicates the direction of the relationship. If both
variables tend to increase or decrease together, the coefficient is positive, and
the line that represents the correlation slopes upward. If one variable tends
to increase as the other decreases, the coefficient is negative, and the line
that represents the correlation slopes downward.


The following plots show data with specific Spearman correlation coefficient values to illustrate
different patterns in the strength and direction of the relationships between variables.

[Figure panels, Spearman rho: no relationship; strong positive relationship; strong negative relationship]


Example
• An engineer at an aluminum castings plant assesses the relationship
between the hydrogen content and the porosity of aluminum alloy
castings. The engineer collects a random sample of 14 castings and
measures the following properties of each casting: hydrogen content,
porosity, and strength.
• The engineer uses the Pearson correlation to examine the strength
and direction of the linear relationship between each pair of
variables.


• Choose Stat > Basic Statistics > Correlation.
• In Variables, enter Hydrogen Porosity
Strength.
• Click OK.


Results

Interpretation
The Pearson correlation coefficient between hydrogen content and porosity is 0.625 and represents a positive relationship between the variables. As hydrogen increases, porosity also increases. The p-value is 0.017, which is less than the significance level of 0.05. The p-value indicates that the correlation is significant.

The Pearson correlation coefficient between hydrogen content and strength is −0.790 and the p-value is 0.001. The p-value is less than the significance level of 0.05, which indicates that the correlation is significant. As hydrogen content increases, strength tends to decrease.

The Pearson correlation coefficient between porosity and strength is −0.527 and the p-value is 0.053. The p-value is close to, but greater than, the significance level of 0.05, which provides inconclusive evidence for the association between porosity and strength.
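
The link between each coefficient and its significance can be checked with the usual t statistic for a correlation, t = r·√(n − 2) / √(1 − r²), which has n − 2 degrees of freedom. The sketch below plugs in the three coefficients reported above with n = 14 castings. The function name is hypothetical, and the critical value 2.179 is the two-sided 5% point of the t distribution with 12 degrees of freedom, taken from a standard t table rather than from the slides.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing a zero correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


n = 14              # castings in the example
t_critical = 2.179  # two-sided 5% critical value, 12 degrees of freedom (t table)

coefficients = {
    "hydrogen vs porosity": 0.625,
    "hydrogen vs strength": -0.790,
    "porosity vs strength": -0.527,
}

t_values = {pair: correlation_t_statistic(r, n) for pair, r in coefficients.items()}

for pair, t in t_values.items():
    verdict = "significant" if abs(t) > t_critical else "not significant"
    print(f"{pair}: t = {t:.2f} ({verdict} at the 0.05 level)")
```

The first two statistics exceed the critical value in absolute terms and the third falls just short of it, which matches the reported p-values of 0.017, 0.001, and 0.053.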


Results

Interpretation
In these results, the Spearman correlation between porosity and hydrogen is 0.590, which indicates that there is a positive relationship between the variables. The Spearman correlation between strength and hydrogen is −0.859 and between strength and porosity is −0.675. The relationship between these variables is negative, which indicates that as hydrogen and porosity increase, strength decreases.


Test of Normality Results

Because the p-value is greater than 0.05, there is not enough evidence to
conclude that the data depart from a normal distribution, so the data are
treated as normally distributed.


Activity for Regression and Correlation


Problem 1
• The number of rotations per minute (RPM) is critical to the quality of a
wind generator. Several components affect the RPM of a
particular generator, among them the weight of the fans, the
speed of the wind, and the pressure. After having designed the
Conakry model of a wind generator, the reliability engineer
wants to build a model that will show how the “Rotation”
variable relates to the “Wind,” “Pressure,” and “Weight”
variables.
a. Show that “Wind” and “Pressure” are highly correlated.
b. Show that “Rotation” is highly dependent on the input
factors.
c. Show that only “Weight” is significant in the equation.
d. Show that the VIF is too high for “Wind” and “Pressure.”
e. Interpret the probability plot for the residuals.


Problem 2
• Organophosphate (OP) compounds are used as
pesticides. However, it is important to study their
effect on species that are exposed to them. In
the laboratory study Some Effects of
Organophosphate Pesticides on Wildlife Species,
by the Department of Fisheries and Wildlife at
Virginia Tech, an experiment was conducted in
which different dosages of a particular OP
pesticide were administered to 5 groups of 5
mice (Peromyscus leucopus). The 25 mice were
females of similar age and condition. One group
received no chemical. The basic response y was a
measure of activity in the brain. It was postulated
that brain activity would decrease with an
increase in OP dosage. The data are as follows:
• Determine the regression model and
interpret it.
• Construct an analysis-of-variance table
and interpret it.
• Interpret the residual plots and R-sq.
• Test the correlation of the two variables.


Problem 3
• The Statistics Consulting Center at
Virginia Tech analyzed data on normal
woodchucks for the Department of
Veterinary Medicine. The variables of
interest were body weight in grams and
heart weight in grams. It was desired to
develop a linear regression equation in
order to determine if there is a
significant linear relationship between
heart weight and total body weight.
• Test the correlation of the two variables.
• Interpret the results.


Problem 4
• An experiment was conducted to study the size of squid
eaten by sharks and tuna. The regressor variables are
characteristics of the beaks of the squid. The data are given
as follows:
• In the study, the regressor variables and response
considered are
x1 = rostral length, in inches,
x2 = wing length, in inches,
x3 = rostral to notch length, in inches,
x4 = notch to wing length, in inches,
x5 = width, in inches,
y = weight, in pounds.

Determine and interpret the following:
• Regression equation, SE Coefficient, R-sq, VIF, and residual plots
