Chapter 11 Analysis of Variance and Regression


ANALYSIS OF VARIANCE AND REGRESSION
• Tarique Zaryab Hassan.
• School of Mathematics and Statistics.
• Nanjing University of Science and Technology.
11.1 INTRODUCTION
Analysis of variance (ANOVA) and regression are two fundamental
statistical techniques used to evaluate relationships between
variables and to analyze the differences among group means.
ANOVA is a statistical method used to determine whether there
are statistically significant differences between the means of three
or more independent groups. The fundamental premise of ANOVA is
to analyze the variance within and between groups to understand if
the differences observed are genuine or merely due to random
chance.
Regression analysis is a statistical technique used to understand
the relationship between a dependent variable and one or more
independent variables. It is especially useful for predicting the
value of the dependent variable based on the values of
independent variables, making it an essential tool in data analysis.
11.2 ONE-WAY ANALYSIS OF
VARIANCE
One-way analysis of variance (ANOVA) is a statistical technique
used to determine whether there are statistically significant
differences between the means of three or more independent
groups. This method is particularly useful when analyzing
experiments involving one independent variable, which can also be
referred to as a factor.
One-way ANOVA operates by comparing the variance between the
group means to the variance within each group. In practical terms,
it partitions the total variability in the data into two components:
variation due to the treatment or factor (between-group variability)
and variation due to random error (within-group variability). The
F-statistic is then calculated as the ratio of these variances:
F = MS_between / MS_within, where MS_between = SS_between / (k − 1),
MS_within = SS_within / (N − k), k is the number of groups, and N is
the total number of observations.
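As a concrete illustration, the minimal Python sketch below (hypothetical data; only NumPy and SciPy assumed) computes the between-group and within-group mean squares and the resulting F-statistic and p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical samples from three independent groups.
groups = [np.array([4.1, 5.0, 5.6, 4.8]),
          np.array([6.2, 5.9, 7.1, 6.5]),
          np.array([5.0, 4.4, 5.3, 4.9])]

k = len(groups)                          # number of groups
N = sum(len(g) for g in groups)          # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # treatment mean square
ms_within = ss_within / (N - k)          # error mean square
F = ms_between / ms_within
p_value = stats.f.sf(F, k - 1, N - k)    # upper-tail F probability

print(f"F = {F:.3f}, p = {p_value:.4f}")
```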
11.2.1 MODEL
DISTRIBUTIONS AND
ASSUMPTIONS
Model distributions refer to the mathematical functions that
describe how the values of a random variable are distributed.
These distributions are used to represent various types of data and
the relationships among them, underpinning many statistical
analyses and machine learning algorithms.
•Assumptions:
1. Normality.
2. Independence of Observations.
3. Linearity.
4. Homoscedasticity (Equality of Variance).
5. No Multicollinearity.
11.2.2 THE CLASSIC ANOVA
HYPOTHESIS
ANOVA (Analysis of Variance) is a statistical method used to
analyze differences among group means.
Hypotheses in ANOVA:
•Null Hypothesis (H0): All group means are equal.
•Alternative Hypothesis (Ha): At least one group mean is different.

Types of ANOVA:
•One-Way ANOVA: Tests differences among three or more groups
based on one independent variable.
•Two-Way ANOVA: Tests differences based on two independent
variables and their interaction.
11.2.3 INFERENCES
REGARDING LINEAR
COMBINATIONS OF MEANS
Linear combinations of means are essential in statistical analysis for
making inferences about group effects and comparisons.
Linear Combination:
•A weighted sum of means from different groups: L = c1·μ1 + c2·μ2 + ... + ck·μk.
•Example: L = μ1 − μ2 (the difference between two group means), or
L = (μ1 + μ2)/2 − μ3.
Statistical Inference:
•Hypothesis Testing:
1. Assessing whether a specific linear combination of means is significantly
different from a hypothesized value (see the sketch below).
2. Null Hypothesis: H0: L = L0.
3. Alternative Hypothesis: H1: L ≠ L0.
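A minimal Python sketch of this test follows; the data, contrast weights, and hypothesized value L0 are hypothetical, and the standard error uses the pooled within-group mean square (MSE) from a one-way ANOVA.

```python
import numpy as np
from scipy import stats

# Hypothetical group samples and contrast weights (weights sum to zero for a contrast).
groups = [np.array([4.1, 5.0, 5.6, 4.8]),
          np.array([6.2, 5.9, 7.1, 6.5]),
          np.array([5.0, 4.4, 5.3, 4.9])]
c = np.array([0.5, 0.5, -1.0])            # L = (mu1 + mu2)/2 - mu3
L0 = 0.0                                  # hypothesized value under H0

means = np.array([g.mean() for g in groups])
n = np.array([len(g) for g in groups])
N, k = n.sum(), len(groups)

# Pooled within-group variance (MSE) from the one-way ANOVA decomposition.
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

L_hat = c @ means                         # estimated linear combination
se = np.sqrt(mse * np.sum(c ** 2 / n))    # standard error of the estimate
t = (L_hat - L0) / se
p_value = 2 * stats.t.sf(abs(t), df=N - k)

print(f"L_hat = {L_hat:.3f}, t = {t:.3f}, p = {p_value:.4f}")
```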
11.2.4 THE ANOVA F TEST
The ANOVA F Test is a statistical method used to determine whether there are
significant differences between the means of three or more independent groups.
Purpose: Tests the null hypothesis that all group means are equal. Assesses
whether observed differences among group means are due to random chance or
reflect true differences.
Components:
•F-Statistic: Calculated as the ratio of variance between group means to variance
within groups.

Assumptions:
1. Normality: Each group should be approximately normally distributed.
2. Independence: Observations must be independent.
3. Homogeneity of variance: Variances among the groups should be equal.
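Assuming SciPy is available, the F test need not be computed by hand: scipy.stats.f_oneway performs a one-way ANOVA directly. The sketch below uses hypothetical data.

```python
from scipy import stats

# Hypothetical measurements from three independent groups.
group_a = [4.1, 5.0, 5.6, 4.8]
group_b = [6.2, 5.9, 7.1, 6.5]
group_c = [5.0, 4.4, 5.3, 4.9]

# One-way ANOVA: H0 is that all group means are equal.
result = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

A small p-value (for example, below 0.05) suggests that at least one group mean differs from the others.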
11.2.5 SIMULTANEOUS ESTIMATION OF
CONTRASTS
Simultaneous estimation of contrasts refers to comparing specific group means in
a statistical context while controlling for type I error across multiple comparisons.
Key Concepts:
• Contrasts: Linear combinations of group means designed to test specific
hypotheses about their differences.
• Purpose: To evaluate the differences between treatment levels while maintaining
the overall significance level.
Types of Contrasts:
• Simple Contrasts: Compare selected pairs of means.
• Complex Contrasts: Involve multiple means and groups.
Methods for Estimation:
• Tukey’s Honestly Significant Difference (HSD): Adjusts for multiple comparisons by
controlling the familywise error rate.
• Dunnett's Test: Used for comparing multiple treatments against a control group.
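As one illustration, Tukey's HSD is available in the statsmodels package via pairwise_tukeyhsd; the sketch below assumes statsmodels is installed and uses hypothetical data and group labels.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical responses and their group labels.
values = np.array([4.1, 5.0, 5.6, 4.8,
                   6.2, 5.9, 7.1, 6.5,
                   5.0, 4.4, 5.3, 4.9])
labels = np.repeat(["A", "B", "C"], 4)

# All pairwise comparisons with the familywise error rate held at 5%.
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result.summary())
```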
11.2.6 PARTITIONING SUMS
OF SQUARES
Partitioning the sums of squares is a statistical method that separates
the total variability in a dataset into components attributable to different
sources or factors.
Key Concepts:
•Total Sum of Squares (SST): Represents the total variation in the
dependent variable, calculated as the sum of the squared differences
between each observation and the overall mean.
•Explained Sum of Squares (ESS): Quantifies the variation explained by
the independent variables in the model, reflecting how well the model
accounts for the variability in the dependent variable.
•Residual Sum of Squares (RSS): Represents the variation that cannot be
explained by the model, calculated as the sum of the squared differences
between observed values and the values predicted by the model.
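The identity SST = ESS + RSS can be checked numerically. The Python sketch below fits a straight line to hypothetical data with NumPy and partitions the total variation into the three components defined above.

```python
import numpy as np

# Hypothetical data for a simple linear regression.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3])

# Least-squares fit y_hat = b0 + b1 * x (np.polyfit returns the slope first).
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = ((y - y.mean()) ** 2).sum()      # total sum of squares
ess = ((y_hat - y.mean()) ** 2).sum()  # explained sum of squares
rss = ((y - y_hat) ** 2).sum()         # residual sum of squares

print(f"SST = {sst:.3f}, ESS = {ess:.3f}, RSS = {rss:.3f}")
print(f"ESS + RSS = {ess + rss:.3f}")  # equals SST up to rounding
```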
Applications:
•Widely used in ANOVA to assess the effect of one or more factors
on a response variable.
•Essential for model evaluation in regression analysis to determine
goodness of fit.
Visuals:
•Consider including a diagram visualizing the partitioning of sums of
squares (e.g., a flow chart or pie chart illustrating the components
of total variability).
11.3 SIMPLE LINEAR
REGRESSION
Simple linear regression is a statistical technique used to model
the relationship between two variables by fitting a linear equation
to observed data.
Key Components:
•Dependent Variable (Y): The outcome or response variable.
•Independent Variable (X): The predictor or explanatory variable.
•Regression Equation: Y = b0 + b1X
b0: Intercept of the regression line.
b1: Slope of the regression line, indicating the change in Y for a
one-unit change in X.
Features:
•Assumes a linear relationship between the variables.
•Utilizes the Least Squares method to minimize the sum of squared
errors between observed and predicted values.
Assumptions:
•Normality: Residuals should be normally distributed.
•Independence: Observations should be independent.
•Homoscedasticity: Constant variance of residuals across levels of
the independent variable.
Applications:
•Used in various fields such as economics, psychology, and natural
sciences for prediction and trend analysis.
11.3.1 LEAST SQUARES: A
MATHEMATICAL SOLUTION
A statistical method used to find the best-fitting curve to a given set of data
points by minimizing the sum of the squares of the vertical distances
(residuals) between the observed values and the values predicted by the
model.
Key Concepts:
•Objective: Minimize the sum of squared residuals:
Minimize S = ∑ (yi − ŷi)²
where yi are the observed values and ŷi are the predicted values.
Steps in Least Squares:
1. Model Selection: Choose a model (linear, quadratic, etc.).
2. Data Collection: Gather data points for fitting.
3. Calculate Coefficients: Use formulas or algorithms (e.g., the Normal Equations).
For simple linear regression, b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² and b0 = ȳ − b1·x̄.
4. Evaluate Fit: Assess goodness-of-fit using R-squared or adjusted R-squared.
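A minimal sketch of steps 3 and 4 in Python, using the closed-form formulas above (hypothetical data; only NumPy assumed):

```python
import numpy as np

# Hypothetical data points for fitting a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3])

# Step 3: coefficients from the closed-form least-squares formulas.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Step 4: goodness of fit via R-squared.
y_hat = b0 + b1 * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, R^2 = {r_squared:.3f}")
```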


11.3.2 BEST LINEAR UNBIASED
ESTIMATORS:
A STATISTICAL SOLUTION
Best Linear Unbiased Estimators (BLUE) are statistical estimators that
provide the best (minimum variance) linear unbiased estimates of a
parameter in a linear regression model.
Key Properties:
•Linearity: The estimator is a linear function of the observations.
•Unbiasedness: The expected value of the estimator equals the true
parameter value, i.e., E[β̂] = β.
•Efficiency: BLUE achieves the minimum variance among all linear unbiased
estimators.
Applications:
•Used in various fields such as economics, social sciences, and engineering
for reliable parameter estimation in linear regression contexts.
Mathematical Foundation:
•The Gauss-Markov theorem states that in a linear regression model
satisfying the assumptions of linearity, independence, and homoscedasticity,
the ordinary least squares estimator is BLUE; in matrix form it is given by
β̂ = (XᵀX)⁻¹ Xᵀ y.
Assumptions for BLUE:


•Linearity: The relationship between independent and dependent
variables is linear.
•Independence: Observations are independent of each other.
•Homoscedasticity: Variances of errors are constant across all levels of
the independent variable.
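A short sketch of this estimator, assuming NumPy and hypothetical data (in practice a numerically stabler solver such as np.linalg.lstsq is usually preferred):

```python
import numpy as np

# Hypothetical design matrix (intercept column plus one predictor) and response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3])
X = np.column_stack([np.ones_like(x), x])

# OLS / BLUE estimate from the normal equations: beta_hat = (X'X)^(-1) X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(f"intercept = {beta_hat[0]:.3f}, slope = {beta_hat[1]:.3f}")
```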
11.3.3 MODELS AND
DISTRIBUTION
ASSUMPTIONS
Models Used:
1. Analysis of Variance (ANOVA):
•Tests for differences in means across multiple groups.
•One-way ANOVA for one independent variable.
•Two-way ANOVA for two independent variables, analyzing
interactions.
2. Regression Analysis:
•Explores relationships between dependent and independent
variables.
•Types include linear regression (one predictor) and multiple
regression (multiple predictors).
Key Distribution Assumptions :
1. Normality: Data should be normally distributed within each
group (especially in ANOVA).
2. Independence: Observations must be independent to avoid
bias.
3. Homoscedasticity: Equal variances across groups (ANOVA)
or across residuals (regression).
4. Linearity: Assumes a linear relationship between
independent and dependent variables (regression).
Importance of Assumptions :
•Ensures validity and reliability of statistical inferences.
•Violation can lead to incorrect conclusions and affect
decision-making.
11.3.4 ESTIMATION AND
TESTING WITH NORMAL
ERRORS
Introduction to Normal Errors:
•Normal errors refer to the assumption that the error terms in
regression models are normally distributed.
•Important for valid parameter estimation and hypothesis testing.
Key Concepts:
•Maximum Likelihood Estimation: Estimates parameters by maximizing
the likelihood that the observed data occurred given the parameter
values.
•Confidence Intervals: Provides a range of probable values for the
population parameters based on sample statistics.
Hypothesis Testing:
•Null Hypothesis (H0): Errors follow a normal distribution.
•Alternative Hypothesis (HA): Errors do not follow a normal
distribution.
•Assessing normality through residual plots and formal tests.
Common Tests for Normality:
1. Anderson-Darling Test:
•Measures how well the data follows a specified distribution.
•Small test statistic values indicate a better fit.
2. Shapiro-Wilk Test:
•Tests if a sample comes from a normally distributed population.
•Small p-values imply rejection of the null hypothesis.
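Assuming SciPy is available, both tests can be run on a vector of residuals as sketched below (the residuals here are simulated, hypothetical values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=50)  # hypothetical residuals

# Shapiro-Wilk: a small p-value suggests rejecting H0 of normality.
sw_stat, sw_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W = {sw_stat:.3f}, p = {sw_p:.4f}")

# Anderson-Darling: compare the statistic with the critical values returned.
ad = stats.anderson(residuals, dist="norm")
print(f"Anderson-Darling: A^2 = {ad.statistic:.3f}")
print(f"critical values (15%, 10%, 5%, 2.5%, 1%): {ad.critical_values}")
```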
11.3.5 SIMULTANEOUS
ESTIMATION AND
CONFIDENCE BANDS
Simultaneous estimation provides a framework for estimating
multiple parameters at the same time, allowing for the assessment
of uncertainty across several values simultaneously.
Confidence bands are graphical representations that indicate the
uncertainty around an estimated function, providing regions within
which the true function value is expected to lie with a certain
probability.
Purpose:
•To deliver valid inference over a range of values rather than for
individual points.
•Ensures that the overall coverage probability across multiple
estimates meets the desired confidence level.
Key Features:
•Type: Can be pointwise or simultaneous; simultaneous bands
offer wider intervals to account for multiple comparisons.
•Methods: Common methods include Bonferroni and Scheffé
approaches to control for family-wise error rates.
Applications:
•Widely used in areas such as regression analysis, survival
analysis, and time series analysis to visualize and account for
uncertainty in predictions and estimates.
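As a simple illustration of the Bonferroni approach, the sketch below widens each of m confidence intervals so that the family of intervals holds jointly at roughly the 95% level (hypothetical group data; SciPy assumed). The Scheffé method would instead use an F-based critical value.

```python
import numpy as np
from scipy import stats

# Hypothetical samples whose means are estimated simultaneously.
groups = {"A": np.array([4.1, 5.0, 5.6, 4.8]),
          "B": np.array([6.2, 5.9, 7.1, 6.5]),
          "C": np.array([5.0, 4.4, 5.3, 4.9])}

alpha = 0.05
m = len(groups)            # number of simultaneous intervals
alpha_adj = alpha / m      # Bonferroni adjustment

for name, g in groups.items():
    n = len(g)
    se = g.std(ddof=1) / np.sqrt(n)
    # Adjusted critical value is wider than the pointwise one.
    t_crit = stats.t.ppf(1 - alpha_adj / 2, df=n - 1)
    lo, hi = g.mean() - t_crit * se, g.mean() + t_crit * se
    print(f"group {name}: mean = {g.mean():.2f}, simultaneous CI = ({lo:.2f}, {hi:.2f})")
```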
