Activity 5 - Statistical Analysis and Design - Regression - Correlation

Statistical Analysis and Design


Activity 5
Submitted by:
Jericka Christine Alson, CIE
Discuss the following terminologies comprehensively and include references.

• Discuss the use of the following terms in statistics


1. Residual variance
2. Log-likelihood
3. Adj SS
4. Adj MS
5. Lack of fit test
6. Standard error of coefficient
7. R-sq

• Explain other regression methods and their applications


1. Best subset regression
2. Nonlinear regression
3. Logistic regression
4. Partial least squares regression

• Why do different regression methods provide different values for R-squared, adjusted R-squared, and S for
the same model?
1. Residual variance

Residual Variance (also called unexplained variance or error variance) is the variance of any error (residual).

For example, in regression analysis, random fluctuations cause variation around the “true” regression line.
The total variance of a regression line is made up of two parts: explained variance and unexplained variance.
The unexplained variance is simply what’s left over when you subtract the variance due to regression from the
total variance of the dependent variable.

Symbol for Residual Variance

The symbols σ² or s² are often used to denote unexplained variance.
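To make the decomposition concrete, here is a small NumPy sketch (illustrative, with made-up data): it fits a line by least squares, then splits the total sum of squares into explained and unexplained parts.

```python
import numpy as np

# Hypothetical data: y depends linearly on x plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=x.size)

# Fit y = b0 + b1*x by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Residual (unexplained) variance: SSE divided by the residual
# degrees of freedom (n minus the number of estimated coefficients).
sse = np.sum(residuals ** 2)
residual_variance = sse / (len(y) - X.shape[1])

# Total variation = explained + unexplained sums of squares.
sst = np.sum((y - y.mean()) ** 2)
ssr = sst - sse  # variation explained by the regression
print(residual_variance, ssr / sst)
```

Dividing by the residual degrees of freedom (here n − 2) rather than n makes the residual variance an unbiased estimate of the error variance.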

Reference:
https://www.statisticshowto.com/residual-variance/
2. Log-likelihood

The log-likelihood is the expression that Minitab maximizes to determine optimal values of the estimated
coefficients (β).

Log-likelihood values cannot be used alone as an index of fit because they are a function of sample size but can
be used to compare the fit of different coefficients. Because you want to maximize the log-likelihood, the higher
value is better.

For example, a log-likelihood value of -3 is better than -7.
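As an illustration of why the higher value is better, the sketch below (hypothetical data and candidate values, not from Minitab) computes a Gaussian log-likelihood for two candidate coefficient values; the candidate nearer the data receives the higher (less negative) value.

```python
import numpy as np

def gaussian_loglik(data, mu, sigma):
    # Sum of log N(x | mu, sigma^2) over all observations.
    n = data.size
    return (-0.5 * n * np.log(2 * np.pi * sigma ** 2)
            - np.sum((data - mu) ** 2) / (2 * sigma ** 2))

data = np.array([4.8, 5.1, 5.0, 4.9, 5.2])

# Two candidate values for the mean: the one closer to the data
# yields the higher log-likelihood, so it fits better.
ll_good = gaussian_loglik(data, mu=5.0, sigma=0.2)
ll_bad = gaussian_loglik(data, mu=6.0, sigma=0.2)
print(ll_good, ll_bad)  # the first is higher, so mu=5.0 fits better
```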

Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/what-is-log-likelihood/
3. Adj SS

Adjusted sums of squares are measures of variation for different components of the model.
The order of the predictors in the model does not affect the calculation of the adjusted sums of squares.

In the Analysis of Variance table, Minitab separates the sums of squares into different components that describe
the variation due to different sources.

Adj SS Term
The adjusted sum of squares for a term is the increase in the regression sum of squares compared to a model with only the other terms. It quantifies the amount of variation in the response data that is explained by each term in the model.
Adj SS Error
The error sum of squares is the sum of the squared residuals. It quantifies the variation in the data that the predictors do not explain.

Adj SS Total
The total sum of squares is the sum of the term sum of squares and the error sum of squares. It quantifies the total variation in the data.

Interpretation
• Minitab uses the adjusted sums of squares to calculate the p-value for a term.
• Minitab also uses the sums of squares to calculate the R2 statistic.
• Usually, you interpret the p-values and the R2 statistic instead of the sums of squares.
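A minimal NumPy sketch (made-up data) of how the adjusted SS for a term can be computed: it is the drop in the error sum of squares when the term joins a model that already contains every other term.

```python
import numpy as np

def sse(X, y):
    # Error sum of squares of the least-squares fit of y on X.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

# Hypothetical data: two predictors plus noise.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.3, size=n)

ones = np.ones(n)
full = np.column_stack([ones, x1, x2])
without_x1 = np.column_stack([ones, x2])

# Adj SS for x1: extra variation explained when x1 is added to a
# model that already contains all of the other terms.
adj_ss_x1 = sse(without_x1, y) - sse(full, y)

sst = np.sum((y - y.mean()) ** 2)  # Adj SS Total
ss_error = sse(full, y)            # Adj SS Error
print(adj_ss_x1, ss_error, sst)
```

Because the calculation conditions on all other terms, the result is the same no matter what order the predictors were entered in.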

Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/what-is-log-likelihood/
4. Adj MS

Adjusted mean squares measure how much variation a term or a model explains, assuming that all other terms are in the model, regardless of the order they were entered.

• Unlike the adjusted sums of squares, the adjusted mean squares consider the degrees of freedom.
• The adjusted mean square of the error (also called MSE or s2) is the variance around the fitted values.

Interpretation
• Minitab uses the adjusted mean squares to calculate the p-value for a term.
• Minitab also uses the adjusted mean squares to calculate the adjusted R2 statistic.
• Usually, you interpret the p-values and the adjusted R2 statistic instead of the adjusted mean squares.
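The MSE calculation can be sketched as follows (hypothetical data; NumPy only): the error sum of squares divided by its degrees of freedom gives s², the variance around the fitted values, and its square root is S.

```python
import numpy as np

# Hypothetical data: 30 observations, 2 predictors plus an intercept.
rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Adjusted mean square of the error (MSE, or s^2): the error sum of
# squares divided by its degrees of freedom, n - (p + 1).
df_error = n - X.shape[1]
mse = np.sum(residuals ** 2) / df_error
s = np.sqrt(mse)  # S, the standard error of the regression
print(mse, s)
```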

Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/fit-regression-model/interpret-the-results/all-statistics-and-graphs/analysis-of-variance-table/#adj-ms
5. Lack of fit test

What is lack-of-fit?

A regression model exhibits lack-of-fit when it fails to adequately describe the functional relationship between the experimental factors and the response variable.

Lack-of-fit can occur if important terms from the model, such as interactions or quadratic terms, are not included. It can also occur if several unusually large residuals result from fitting the model.

Lack-of-fit test in Minitab

Minitab displays the lack-of-fit test when your data contain replicates (multiple observations with identical x-values). Replicates represent "pure error" because only random variation can cause differences between the observed response values.

To determine whether the model accurately fits the data, compare the p-value to your significance level. Usually, a significance level (also called alpha or α) of 0.05 works well. An α of 0.05 means that your chance of concluding that the model does not fit the data when it really does is only 5%.

P-value < α : The model does not fit the data


If the p-value is less than or equal to α, you conclude that the model does not accurately fit
the data. To get a better model, you may need to add terms or transform your data.

P-value > α : There is no evidence that the model does not fit the data
If the p-value is larger than α, you cannot conclude that the model does not fit the data well.
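The lack-of-fit F-statistic itself can be sketched by splitting the error sum of squares into pure error (from replicates) and lack of fit. The code below (hypothetical replicated data; NumPy only) deliberately fits a straight line to a quadratic relationship, so the statistic comes out large.

```python
import numpy as np
from collections import defaultdict

# Hypothetical data with replicates: 3 observations at each x value.
# The true relationship is quadratic, but we fit a straight line,
# so the lack-of-fit statistic should be large.
rng = np.random.default_rng(3)
x = np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], 3)
y = 0.5 * x ** 2 + rng.normal(scale=0.2, size=x.size)

# Fit the (wrong) straight-line model.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta) ** 2)

# Pure error: variation of replicates around their group means.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
ss_pe = sum(np.sum((np.array(v) - np.mean(v)) ** 2) for v in groups.values())
df_pe = sum(len(v) - 1 for v in groups.values())

# Lack of fit: what remains of the error after removing pure error.
ss_lof = sse - ss_pe
df_lof = len(groups) - X.shape[1]

f_stat = (ss_lof / df_lof) / (ss_pe / df_pe)
print(f_stat)  # a large F suggests lack of fit
```

In practice the p-value for this F-statistic (on df_lof and df_pe degrees of freedom) is what gets compared to α.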

Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/lack-of-fit-and-lack-of-fit-tests/
6. Standard error of coefficient

• The standard deviation of an estimate is called the standard error.


• The standard error of the coefficient measures how precisely the
model estimates the coefficient's unknown value.
• The standard error of the coefficient is always positive.

Use the standard error of the coefficient to measure the precision of the estimate of the coefficient.
The smaller the standard error, the more precise the estimate. Dividing the coefficient by its standard error
calculates a t-value. If the p-value associated with this t-statistic is less than your alpha level, you conclude that
the coefficient is significantly different from zero.
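A rough NumPy sketch (hypothetical data) of these quantities, using the standard formula Var(b) = s²(XᵀX)⁻¹ for the coefficient covariance matrix:

```python
import numpy as np

# Hypothetical data: one predictor plus an intercept.
rng = np.random.default_rng(4)
n = 50
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Coefficient covariance matrix: s^2 * (X'X)^-1.
df_error = n - X.shape[1]
s2 = np.sum(residuals ** 2) / df_error
cov_beta = s2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))  # standard errors (always positive)

# t-value = coefficient divided by its standard error.
t_values = beta / se
print(se, t_values)
```

A large |t| (and hence a small p-value) indicates the coefficient is significantly different from zero.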

For example, a materials engineer at a furniture manufacturing site wants to assess the strength of the particle board that they use. The engineer collects stiffness data from particle board pieces with various densities at different temperatures and produces a linear regression output in which the standard errors of the coefficients appear in the third column.

The standard error of the Stiffness coefficient is smaller than that of Temp. Therefore, the model was able to estimate the coefficient for Stiffness with greater precision. In fact, the standard error of the Temp coefficient is about the same as the value of the coefficient itself, so the t-value of -1.03 is too small to declare statistical significance. The resulting p-value is much greater than common levels of α, so you cannot conclude that this coefficient differs from zero. You remove the Temp variable from the regression model and continue the analysis.

Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/what-is-the-standard-error-of-the-coefficient/
7. R-sq

R2 is the percentage of variation in the response that is explained by the model. It is calculated as 1 minus the ratio of the error sum of squares (the variation that is not explained by the model) to the total sum of squares (the total variation in the data).

Interpretation
• Use R2 to determine how well the model fits your data.
• The higher the R2 value, the better the model fits your data.
• R2 is always between 0% and 100%.
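As a small illustration (made-up data), the sketch below fits the same straight-line model to a low-noise and a high-noise response and compares the resulting R² values:

```python
import numpy as np

def r_squared(y, fitted):
    # R^2 = 1 - SSE / SST
    sse = np.sum((y - fitted) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse / sst

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 100)
X = np.column_stack([np.ones_like(x), x])

def fit_r2(y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return r_squared(y, X @ beta)

# Same underlying line; only the noise level differs.
y_low_noise = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=x.size)
y_high_noise = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)
r2_low = fit_r2(y_low_noise)
r2_high = fit_r2(y_high_noise)
print(r2_low, r2_high)  # the noisier data gives the lower R^2
```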
You can use a fitted line plot to graphically illustrate different R2 values.

• The first plot illustrates a simple regression model that explains 85.5%
of the variation in the response.
• The second plot illustrates a model that explains 22.6% of the
variation in the response.

The more variation that is explained by the model, the closer the data
points fall to the fitted regression line. Theoretically, if a model could
explain 100% of the variation, the fitted values would always equal the
observed values and all of the data points would fall on the fitted line.
However, even if R2 is 100%, the model does not necessarily predict new
observations well.
Consider the following issues when interpreting the R2 value:

• R2 always increases when you add additional predictors to a model. For example, the best five-predictor
model will always have an R2 that is at least as high as the best four-predictor model. Therefore, R2 is most
useful when you compare models of the same size.

• Small samples do not provide a precise estimate of the strength of the relationship between the response and
predictors. If you need R2 to be more precise, you should use a larger sample (typically, 40 or more).

• R2 is just one measure of how well the model fits the data. Even when a model has a high R2, you should
check the residual plots to verify that the model meets the model assumptions.

Reference:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/best-subsets-regression/interpret-the-results/all-statistics/#r-sq
1. Best subset regression

Overview

• Use Best Subsets Regression to compare different regression models that contain subsets of the predictors you specify.
• Minitab selects the best-fitting models that contain one predictor, two predictors, and so on. The best-fitting models have the highest R2 values.
• Use best subsets regression when you have a continuous response variable and more than one continuous predictor.
• It is an efficient way to identify models that adequately fit your data with as few predictors as possible.

To perform best subsets regression, choose Stat > Regression > Regression > Best Subsets.
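Outside Minitab, the same idea can be sketched by exhaustively fitting every subset and keeping, for each size, the model with the highest R² (illustrative NumPy code with made-up predictors; only x1 and x2 actually drive the response):

```python
import numpy as np
from itertools import combinations

# Hypothetical predictors; only x1 and x2 affect the response.
rng = np.random.default_rng(6)
n = 60
data = {f"x{i}": rng.normal(size=n) for i in range(1, 5)}
y = 2.0 * data["x1"] - 1.5 * data["x2"] + rng.normal(scale=0.5, size=n)

def r_squared(cols):
    X = np.column_stack([np.ones(n)] + [data[c] for c in cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

# For each subset size, keep the model with the highest R^2
# (the maximum-R^2 criterion described above).
best = {}
for k in range(1, 5):
    best[k] = max(combinations(sorted(data), k), key=r_squared)
print(best)
```

Exhaustive search is only feasible for a modest number of predictors, since the number of subsets grows as 2 to the power of the predictor count.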
Example:
Technicians measure heat flux as part of a solar thermal energy test. An energy engineer wants to determine how total heat flux is predicted by other variables: insolation, the position of the east, south, and north focal points, and the time of day.

To select a group of likely models for further analysis, the technicians use best subsets regression. In Minitab, best subsets regression uses the maximum R2 criterion to select likely models.
Interpret the results
The technicians identify several models to examine further. The model with all 5 predictors has the lowest value of S and the highest value of adjusted R2, approximately 8 and 88% respectively. One of the models with 4 predictors has the smallest value of Mallows' Cp, 5.8. A model with 2 predictors and a model with 3 predictors both have the highest predicted R2, which is approximately 81.4%.

Before the technicians choose a final model, they examine the models for violations of the regression assumptions using residual plots and other diagnostic measures.
References:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/best-subsets-regression/before-you-start/overview/
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/best-subsets-regression/before-you-start/example/
2. Nonlinear regression

What is Nonlinear Regression?

• Nonlinear regression generates an equation to describe the nonlinear relationship between a continuous response variable and one or more predictor variables, and predicts new observations.
• Use nonlinear regression instead of ordinary least squares regression when you cannot adequately model the relationship with linear parameters. Parameters are linear when each term in the model is additive and contains only one parameter that multiplies the term.

For example, a scientist wants to understand the relationship between semiconductor electron mobility and the natural log of the density. Because the best linear model provides a biased fit, the scientist uses a nonlinear model.

To perform nonlinear regression, choose Stat > Regression > Nonlinear Regression.
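As a sketch of what a nonlinear fit does under the hood, the code below fits y = a·e^(bx) by Gauss-Newton iteration (hypothetical data; an illustration of the general idea, not Minitab's exact algorithm). Note that nonlinear fits need starting values; here a log-linear fit supplies them.

```python
import numpy as np

# Hypothetical data from the nonlinear model y = a * exp(b * x).
rng = np.random.default_rng(7)
x = np.linspace(0, 2, 40)
y = 3.0 * np.exp(0.8 * x) + rng.normal(scale=0.1, size=x.size)

# Starting values from a log-linear fit: log y ≈ log a + b x.
slope, intercept = np.polyfit(x, np.log(y), 1)
a, b = np.exp(intercept), slope

# Gauss-Newton: linearize the model around the current estimates
# and solve a least-squares step until the estimates converge.
for _ in range(50):
    fitted = a * np.exp(b * x)
    residuals = y - fitted
    # Jacobian of the model with respect to (a, b).
    J = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])
    step, *_ = np.linalg.lstsq(J, residuals, rcond=None)
    a, b = a + step[0], b + step[1]
    if np.linalg.norm(step) < 1e-10:
        break
print(a, b)  # estimates should be close to the true 3.0 and 0.8
```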
Example:
Researchers for the NIST (National Institute of Standards and Technology) want to understand the relationship between the coefficient of thermal expansion for copper and the temperature in degrees Kelvin.

Previous research indicates that a nonlinear model with 7 parameters provides an adequate fit. The researchers use nonlinear regression to estimate the parameters in the model.
Interpret the results
The fitted line plot shows that the fitted line follows the observed values, which visually indicates that the model fits the data. The p-value for the lack-of-fit test is 0.679, which provides no evidence that the model fits the data poorly.

The warning about highly correlated parameters indicates that at least one pair of parameters has a correlation greater than an absolute value of 0.99. However, because previous studies indicate that a nonlinear model with 7 parameters provides an adequate fit to the data, the researchers do not change the model.
References:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/nonlinear-regression/before-you-start/overview/
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/nonlinear-regression/before-you-start/example/
3. Logistic regression

• Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).
• Like all regression analyses, logistic regression is a predictive analysis.
• Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Major Assumptions
• The dependent variable should be dichotomous in nature (e.g., present vs. absent).
• There should be no outliers in the data, which can be assessed by converting the continuous predictors to standardized scores and removing values below -3.29 or greater than 3.29.
• There should be no high correlations (multicollinearity) among the predictors.
• At the center of the logistic regression analysis is the task of estimating the log odds of an event.
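A minimal sketch of estimating those log odds by maximum likelihood, here via Newton-Raphson on made-up pass/fail data (NumPy only; real analyses would use a statistics package):

```python
import numpy as np

# Hypothetical dichotomous outcome: pass/fail as a function of
# hours studied, generated from a known logistic model.
rng = np.random.default_rng(8)
n = 200
hours = rng.uniform(0, 10, n)
true_logit = -2.0 + 0.5 * hours           # true log odds of passing
p_true = 1 / (1 + np.exp(-true_logit))
passed = rng.binomial(1, p_true)

# Newton-Raphson iterations maximize the binomial log-likelihood.
X = np.column_stack([np.ones(n), hours])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))       # current fitted probabilities
    W = p * (1 - p)                       # weights
    grad = X.T @ (passed - p)             # gradient of log-likelihood
    hess = X.T @ (X * W[:, None])         # (negative) Hessian
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # estimated intercept and slope of the log odds
odds_ratio = np.exp(beta[1])  # multiplicative change in odds per hour
print(odds_ratio)
```

The exponentiated slope is the odds ratio: the multiplicative change in the odds of the event per unit increase in the predictor.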

Major Assumptions (cont.)

• Overfitting. When selecting the model for the logistic regression analysis, another important consideration is the model fit. Adding independent variables to a logistic regression model will always increase the amount of variance explained in the log odds (typically expressed as R²). However, adding more and more variables to the model can result in overfitting, which reduces the generalizability of the model beyond the data on which the model is fit.

• Reporting the R². Numerous pseudo-R² values have been developed for binary logistic regression. These should be interpreted with extreme caution as they have many computational issues which cause them to be artificially high or low. A better approach is to present any of the goodness-of-fit tests available; Hosmer-Lemeshow is a commonly used measure of goodness of fit based on the chi-square test.

References:
https://www.statisticssolutions.com/what-is-logistic-regression/
4. Partial least squares regression

• Use Partial Least Squares Regression (PLS) to describe the relationship between a set of predictors and one or more continuous responses.
• Use PLS when your predictors are highly collinear, or when you have more predictors than observations.
• PLS is also appropriate to use when the predictors are not fixed and are measured with error.
• PLS reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components, instead of on the original data.
• If you perform the analysis with correlated response variables, PLS can detect multivariate response patterns and weaker relationships than are possible with a separate analysis for each response.

For example, a chemical spectrography company uses PLS to model the relationship between spectral measurements (NIR, IR, UV), because these models include many variables that are correlated with one another.

To perform partial least squares regression, choose Stat > Regression > Partial Least Squares.
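The component-extraction step can be sketched with a bare-bones NIPALS-style PLS1 loop (illustrative NumPy code on made-up collinear "spectral" data; a real analysis would use a tested implementation). Each weight vector maximizes the covariance between the X scores and the response, and deflation keeps successive components uncorrelated.

```python
import numpy as np

# Hypothetical: 20 samples, 50 highly collinear "spectral" predictors
# driven by one underlying factor, more predictors than observations.
rng = np.random.default_rng(9)
n_samples, n_wavelengths = 20, 50
factor = rng.normal(size=n_samples)
X = np.outer(factor, rng.normal(size=n_wavelengths))
X += 0.05 * rng.normal(size=X.shape)          # measurement noise
y = 2.0 * factor + 0.1 * rng.normal(size=n_samples)

# Center the data, then extract PLS components one at a time.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

n_components = 2
scores = []                        # the uncorrelated X components
for _ in range(n_components):
    w = Xc.T @ yc
    w /= np.linalg.norm(w)                # weight vector
    t = Xc @ w                            # component scores
    p = Xc.T @ t / (t @ t)                # X loadings
    Xc = Xc - np.outer(t, p)              # deflate X
    yc = yc - t * (t @ yc) / (t @ t)      # deflate y
    scores.append(t)

# Ordinary least squares of the response on the extracted components.
T = np.column_stack(scores)
coef, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
fitted = T @ coef + y.mean()
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)  # the components recover most of the variation in y
```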
Example:
A scientist at a food chemistry laboratory analyzes 60 soybean flour samples. For each sample, the scientist determines the moisture and fat content, and records near-infrared (NIR) spectral data at 88 wavelengths. The scientist randomly selects 54 of the 60 samples and estimates the relationship between the responses (moisture and fat) and the predictors (the 88 NIR wavelengths) using PLS regression. The scientist uses the remaining 6 samples as a test data set to evaluate the predictive ability of the model.

Interpret the results

• The p-values for both responses are approximately 0.000, which is less than the significance level of 0.05. These results indicate that at least one coefficient in the model is different from zero.
• The test R2 value for moisture is approximately 0.9. The test R2 value for fat is almost 0.8. The test R2 statistics indicate that the models predict well.
• The analysis of each response individually would provide different results.

References:
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/partial-least-squares/before-you-start/overview/
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/partial-least-squares/before-you-start/example-with-a-test-data-set/

• Why do different regression methods provide different values for R-squared, adjusted R-squared, and S for
the same model?

You can get different results for the same model if your data set contains missing values for any predictors.

When you perform:


Stat > Regression > Regression > Fit Regression Model > Stepwise or Stat > Regression > Regression > Best
Subsets

Minitab removes all rows that contain missing values for any predictors that are in the list of predictors. Minitab
removes the rows whether or not the predictors are in the model. If you change the lists of predictors, the results
can change because of the missing values even though the model is the same.

For example, suppose the data set has the response in C1, the predictors in C2-C4, and one missing value in
C4. You perform an analysis and list all of the predictors. Then the row with the missing value is not used to
calculate the statistics, even for the model that contains only C2 and C3 as predictors. However, if you redo the
analysis and list only C2 and C3 as predictors, the entire data set is used to calculate the statistics. Therefore,
R-squared, adjusted R-squared, and S will differ for the same model.
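The effect can be demonstrated directly (illustrative NumPy sketch with hypothetical columns C2-C4 and one missing value in C4): the same C2+C3 model produces different R² values depending on whether C4 is in the predictor list, because listing C4 drops the row with the missing value.

```python
import numpy as np

def r_squared(X, y):
    # R^2 of the least-squares fit of y on X (with intercept).
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    sse = np.sum((y - Xd @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

# Hypothetical worksheet: response plus predictors C2-C4,
# with one missing value in C4.
rng = np.random.default_rng(10)
n = 25
c2 = rng.normal(size=n)
c3 = rng.normal(size=n)
c4 = rng.normal(size=n)
c4[0] = np.nan                      # the missing value
y = 1.0 + 2.0 * c2 - 1.0 * c3 + rng.normal(scale=0.5, size=n)

# Listing C2-C4: the row with the missing C4 value is dropped
# before fitting, even for the sub-model that only uses C2 and C3.
keep = ~np.isnan(c4)
r2_with_c4_listed = r_squared(np.column_stack([c2, c3])[keep], y[keep])

# Listing only C2 and C3: all 25 rows are used.
r2_only_c2_c3 = r_squared(np.column_stack([c2, c3]), y)

print(r2_with_c4_listed, r2_only_c2_c3)  # same model, different R^2
```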
Reference:
https://support.minitab.com/en-us/minitab/19/help-and-how-to/statistical-modeling/regression/supporting-topics/goodness-of-fit-statistics/why-do-different-regression-methods-provide-different-results/
