Activity 5 - Statistical Analysis and Design - Regression - Correlation
Submitted by:
Jericka Christine Alson, CIE
Statistical Analysis and Design
Discuss comprehensively the following terminologies and include the references.
• Why do different regression methods provide different values for R-squared, adjusted R-squared, and S for
the same model?
Residual Variance (also called unexplained variance or error variance) is the variance of any error (residual).
For example, in regression analysis, random fluctuations cause variation around the “true” regression line.
The total variance of a regression line is made up of two parts: explained variance and unexplained variance.
The unexplained variance is simply what’s left over when you subtract the variance due to regression from the
total variance of the dependent variable.
Reference:
https://www.statisticshowto.com/residual-variance/
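The decomposition above can be checked numerically. This is a small sketch with made-up data (not from any real study), using NumPy to show that the unexplained (residual) variation is exactly the total variation minus the variation due to regression:

```python
import numpy as np

# Illustrative data (made up): y regressed on x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a simple least-squares line
b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
residuals = y - fitted

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
ssr = np.sum((fitted - y.mean()) ** 2)   # explained (regression) SS
sse = np.sum(residuals ** 2)             # unexplained (error/residual) SS

# Unexplained variation = total variation - variation due to regression
assert np.isclose(sse, sst - ssr)
```

The identity SST = SSR + SSE holds exactly for any least-squares fit that includes an intercept.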
The log-likelihood is the expression that Minitab maximizes to determine optimal values of the estimated
coefficients (β).
Log-likelihood values cannot be used alone as an index of fit because they are a function of sample size, but they can be used to compare the fit of different coefficients. Because you want to maximize the log-likelihood, a higher value is better.
Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/what-is-log-likelihood/
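To make the comparison concrete, here is a sketch (made-up data, normal-errors assumption, not Minitab's internal routine) that evaluates the log-likelihood of two candidate coefficient choices and confirms the better-fitting one scores higher:

```python
import numpy as np

def normal_loglik(y, fitted):
    """Log-likelihood of a regression fit under normal errors,
    with the error variance set to its maximum-likelihood estimate."""
    n = len(y)
    sse = np.sum((y - fitted) ** 2)
    sigma2 = sse / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

# Made-up data: y is roughly 2x, so a slope of 2 fits far better than 0.5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

ll_good = normal_loglik(y, 2.0 * x)   # slope near the truth
ll_poor = normal_loglik(y, 0.5 * x)   # clearly worse slope

# Higher log-likelihood indicates the better-fitting coefficients
assert ll_good > ll_poor
```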
Adjusted sums of squares are measures of variation for different components of the model.
The order of the predictors in the model does not affect the calculation of the adjusted sums of squares.
In the Analysis of Variance table, Minitab separates the sums of squares into different components that describe
the variation due to different sources.
Adj SS Term
The adjusted sum of squares for a term is the increase in the regression sum of squares compared to a model with only the other terms. It quantifies the amount of variation in the response data that is explained by each term in the model.
Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/fit-regression-model/interpret-the-results/all-statistics-and-graphs/analysis-of-variance-table/#adj-ms
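The definition above can be sketched directly: the adjusted SS for a term is the drop in error SS (equivalently, the rise in regression SS) when that term is added to a model already containing all the other terms. This is an illustration with simulated data, not Minitab's ANOVA routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 * x1 + 2 * x2 + rng.normal(size=n)   # both terms truly matter

def sse(predictors, y):
    """Error sum of squares from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

# Adj SS for x2: the regression SS gained when x2 joins a model
# that already contains the other term (x1)
adj_ss_x2 = sse([x1], y) - sse([x1, x2], y)

# Each term is adjusted for all the others, so the order in which
# predictors are listed cannot change these values
adj_ss_x1 = sse([x2], y) - sse([x1, x2], y)
```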
What is lack-of-fit?
A regression model exhibits lack-of-fit when it fails to adequately describe the functional relationship between the experimental factors and the response variable.
Lack-of-fit can occur if important terms from the model, such as interactions or quadratic terms, are not included. It can also occur if several unusually large residuals result from fitting the model.
Minitab displays the lack-of-fit test when your data contain replicates (multiple observations with identical x-values). Replicates represent "pure error" because only random variation can cause differences between the observed response values.
To determine whether the model accurately fits the data, compare the p-value to your significance level.
Usually, a significance level (also called alpha or α) of 0.05 works well. An α of 0.05 means that your chance of
concluding that the model does not fit the data when it really does is only 5%.
P-value > α : There is no evidence that the model does not fit the data
If the p-value is larger than α, you cannot conclude that the model does not fit the data well.
Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/lack-of-fit-and-lack-of-fit-tests/
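The mechanics of the test can be sketched as follows: the error SS is split into pure error (replicates varying around their own group means) and lack-of-fit (everything else), and their mean squares are compared with an F-statistic. This is an illustration with made-up replicated data where a quadratic term is deliberately left out, not Minitab's implementation:

```python
import numpy as np

# Replicated data: two observations at each x-value ("pure error"),
# generated from a quadratic curve
x = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
y = np.array([1.1, 0.9, 3.8, 4.2, 9.1, 8.9, 15.9, 16.1])

# Fit a straight line, deliberately omitting the quadratic term
b1, b0 = np.polyfit(x, y, 1)
sse = np.sum((y - (b0 + b1 * x)) ** 2)

# Pure-error SS: replicate variation around each group mean
levels = np.unique(x)
ss_pe = sum(np.sum((y[x == v] - y[x == v].mean()) ** 2) for v in levels)
df_pe = len(y) - len(levels)

# Lack-of-fit SS: the error that pure error cannot account for
ss_lof = sse - ss_pe
df_lof = len(levels) - 2   # distinct x-values minus line parameters

f_stat = (ss_lof / df_lof) / (ss_pe / df_pe)
# A large F-statistic (small p-value) flags lack-of-fit; here it
# reflects the missing quadratic term
```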
Use the standard error of the coefficient to measure the precision of the estimate of the coefficient.
The smaller the standard error, the more precise the estimate. Dividing the coefficient by its standard error
calculates a t-value. If the p-value associated with this t-statistic is less than your alpha level, you conclude that
the coefficient is significantly different from zero.
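The coefficient, its standard error, and the resulting t-value can be computed by hand; this sketch uses simulated data and the standard OLS formulas (not Minitab output):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
y = 2.5 * x + rng.normal(size=n)   # true slope 2.5, noisy response

# OLS fit with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors of the coefficient estimates
resid = y - X @ beta
mse = np.sum(resid ** 2) / (n - X.shape[1])   # error variance estimate
cov_beta = mse * np.linalg.inv(X.T @ X)       # covariance of the estimates
se = np.sqrt(np.diag(cov_beta))

# Dividing the coefficient by its standard error gives the t-value;
# |t| far from zero corresponds to a small p-value
t_slope = beta[1] / se[1]
```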
R2 is the percentage of variation in the response that is explained by the model. It is calculated as 1 minus the ratio of the error sum of squares (which is the variation that is not explained by the model) to the total sum of squares (which is the total variation in the model).
Interpretation
• Use R2 to determine how well the model fits your data.
• The higher the R2 value, the better the model fits your data.
• R2 is always between 0% and 100%.
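The definition translates directly into code. A minimal sketch with made-up, nearly linear data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x

sse = np.sum((y - fitted) ** 2)      # variation not explained by the model
sst = np.sum((y - y.mean()) ** 2)    # total variation

r_squared = 1 - sse / sst
# Always between 0 and 1 for a fit with an intercept;
# closer to 1 means the data fall closer to the fitted line
```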
• The first plot illustrates a simple regression model that explains 85.5%
of the variation in the response.
• The second plot illustrates a model that explains 22.6% of the
variation in the response.
The more variation that is explained by the model, the closer the data
points fall to the fitted regression line. Theoretically, if a model could
explain 100% of the variation, the fitted values would always equal the
observed values and all of the data points would fall on the fitted line.
However, even if R2 is 100%, the model does not necessarily predict new
observations well.
• R2 always increases when you add additional predictors to a model. For example, the best five-predictor
model will always have an R2 that is at least as high as the best four-predictor model. Therefore, R2 is most
useful when you compare models of the same size.
• Small samples do not provide a precise estimate of the strength of the relationship between the response and
predictors. If you need R2 to be more precise, you should use a larger sample (typically, 40 or more).
• R2 is just one measure of how well the model fits the data. Even when a model has a high R2, you should
check the residual plots to verify that the model meets the model assumptions.
Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/best-subsets-regression/interpret-the-results/all-statistics/#r-sq
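The first caveat, that R2 never decreases when a predictor is added, can be demonstrated with simulated data. In this sketch the extra predictor is pure noise, yet R2 still cannot go down:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)
noise_pred = rng.normal(size=n)        # predictor unrelated to y
y = 2 * x1 + rng.normal(size=n)

def r2(predictors, y):
    """R-squared of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    return 1 - sse / np.sum((y - y.mean()) ** 2)

# Adding any predictor, even pure noise, cannot lower R-squared,
# which is why R-squared is best compared between models of equal size
assert r2([x1, noise_pred], y) >= r2([x1], y)
```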
Overview
• Use Best Subsets Regression to compare different regression models that contain subsets of the predictors you specify.
• Minitab selects the best-fitting models that contain one predictor, two predictors, and so on. The best-fitting models have the highest R2 values.
• Use best subsets regression when you have a continuous response variable and more than one continuous predictor.
• It is an efficient way to identify models that adequately fit your data with as few predictors as possible.
To perform best subsets regression, choose Stat > Regression > Regression > Best Subsets.
References:
https://www.statisticssolutions.com/what-is-logistic-regression/
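The search that best subsets performs can be sketched by brute force: fit every subset of each size and keep the one with the highest R2. This toy version uses simulated data and `itertools.combinations`; Minitab's implementation is more efficient, but the idea is the same:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n = 60
X_all = rng.normal(size=(n, 4))                     # four candidate predictors
y = 3 * X_all[:, 0] + 2 * X_all[:, 1] + rng.normal(size=n)   # only 0 and 1 matter

def r2(cols):
    """R-squared of an OLS fit using the given predictor columns."""
    X = np.column_stack([np.ones(n), X_all[:, cols]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    return 1 - sse / np.sum((y - y.mean()) ** 2)

# For each subset size, keep the subset with the highest R-squared
best = {}
for k in range(1, 5):
    best[k] = max(combinations(range(4), k), key=lambda c: r2(list(c)))
```

With this data, the best two-predictor subset should be the two columns that actually drive the response.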
• Use Partial Least Squares Regression (PLS) to describe the relationship between a set of predictors and one or more continuous responses.
• Use PLS when your predictors are highly collinear, or when you have more predictors than observations.
• PLS is also appropriate to use when the predictors are not fixed and are measured with error.
• PLS reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components, instead of on the original data.
• If you perform the analysis with correlated response variables, PLS can detect multivariate response patterns and weaker relationships than are possible with a separate analysis for each response.
For example, a chemical spectrography company uses PLS to model the relationship between spectral measurements (NIR, IR, UV), because these models include many variables that are correlated with one another.
To perform partial least squares regression, choose Stat > Regression > Partial Least Squares.
References:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/partial-least-squares/before-you-start/overview/
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/partial-least-squares/before-you-start/example-with-a-test-data-set/
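The "reduce to uncorrelated components, then regress" idea can be illustrated with a minimal NIPALS-style PLS for a single response. This is a sketch under simplifying assumptions (centered data, single response), not Minitab's algorithm; the collinear predictors that would break ordinary interpretation pose no problem here:

```python
import numpy as np

def pls1(X, y, n_components):
    """Minimal NIPALS-style PLS for one response (a sketch, not
    Minitab's implementation). X and y are centered internally."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)            # weight vector
        t = Xk @ w                        # score: an uncorrelated component
        p = Xk.T @ t / (t @ t)            # X loading
        qk = yk @ t / (t @ t)             # y loading
        Xk = Xk - np.outer(t, p)          # deflate before the next component
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)   # coefficients on original scale

# Highly collinear predictors: x2 is nearly a copy of x1
rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

B = pls1(X, y, n_components=1)
pred = (X - X.mean(axis=0)) @ B + y.mean()
```

A single component suffices here because the two predictors carry essentially one direction of information.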
• Why do different regression methods provide different values for R-squared, adjusted R-squared, and S for
the same model?
You can get different results for the same model if your data set contains missing values for any predictors.
Minitab removes all rows that contain missing values for any predictors that are in the list of predictors. Minitab
removes the rows whether or not the predictors are in the model. If you change the lists of predictors, the results
can change because of the missing values even though the model is the same.
For example, suppose the data set has the response in C1, the predictors in C2-C4, and one missing value in
C4. You perform an analysis and list all of the predictors. Then the row with the missing value is not used to
calculate the statistics, even for the model that contains only C2 and C3 as predictors. However, if you redo the
analysis and list only C2 and C3 as predictors, the entire data set is used to calculate the statistics. Therefore,
R-squared, adjusted R-squared, and S will differ for the same model.
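The C2-C4 scenario above can be reproduced in miniature. This sketch mimics Minitab's row-removal rule with simulated data: the same two-predictor model yields different R-squared values depending on whether the column with the missing value was listed:

```python
import numpy as np

def fit_r2(X, y):
    """R-squared of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    return 1 - sse / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(5)
n = 20
c2 = rng.normal(size=n)
c3 = rng.normal(size=n)
c4 = rng.normal(size=n)
y = c2 + c3 + rng.normal(size=n)
c4[0] = np.nan                      # one missing value in C4

# Listing C2-C4 drops the row with the missing C4 value,
# even though the fitted model uses only C2 and C3
keep = ~np.isnan(c4)
r2_listed_all = fit_r2(np.column_stack([c2, c3])[keep], y[keep])

# Listing only C2 and C3 keeps the entire data set
r2_listed_two = fit_r2(np.column_stack([c2, c3]), y)

# Same model, different rows used, so the statistics differ
```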
Reference:
https://support.minitab.com/en-us/minitab/19/help-and-how-to/statistical-modeling/regression/supporting-topics/goodness-of-fit-statistics/why-do-different-regression-methods-provide-different-results/