How Do You Use This Module?: Module 4 - Chapter 4: Test For Contingency Tables
Remember to:
Read and understand the Specific Learning Outcome(s). These tell you
what you should know and be able to do at the end of this module.
Work through all the information and complete the activities in each
section.
After reading every discussion, test yourself on how much you learned by
means of the Written Works. Use the White Book to write your answers.
Note: You need to complete this module before you can perform the next
module.
A correlation exists between two variables when one of them is related to the other in some way. A scatterplot is the best place to start. A scatterplot (or scatter diagram) is a graph of the paired (x, y) sample data with a horizontal x-axis and a vertical y-axis. Each individual (x, y) pair is plotted as a single point.

Linear relationships can be either positive or negative. Positive relationships have points that incline upwards to the right. As x values increase, y values increase. As x values decrease, y values decrease. For example, when studying plants, height typically increases as diameter increases.
Non-linear relationships have an apparent pattern, just not a linear one. For example, as age increases, height increases up to a point, then levels off after reaching a maximum height.
Because visual examinations are largely subjective, we need a more precise and objective measure to define the correlation between the two variables. To quantify the strength and direction of the relationship between two variables, we use the linear correlation coefficient:

r = Σ [ ((xi − x̄)/sx) · ((yi − ȳ)/sy) ] / (n − 1)

Properties of r:
It is always between -1 and +1.
It is a unitless measure, so “r” would be the same value whether you measured the two variables in pounds and inches or in grams and centimeters.
Positive values of “r” are associated with positive relationships.
Negative values of “r” are associated with negative relationships.
where x̄ and sx are the sample mean and sample standard deviation of the x’s,
and ȳ and sy are the mean and standard deviation of the y’s. The sample size is n.
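As a minimal sketch of this formula in code (using Python and made-up diameter/height numbers purely for illustration; none of these values come from the chapter):

```python
import statistics

def linear_correlation(xs, ys):
    """r = sum of the products of the paired z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical plant data: diameter (cm) paired with height (m)
diameter = [1.0, 2.0, 3.0, 4.0, 5.0]
height = [2.1, 3.9, 6.2, 7.8, 10.1]
r = linear_correlation(diameter, height)  # close to +1: strong positive relationship
```

Because r is unitless, rescaling the diameters (say, converting cm to inches) leaves r unchanged, which matches the property listed above.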
When you investigate the relationship between two variables, always begin
with a scatterplot. This graph allows you to look for patterns (both linear and non-
linear). The next step is to quantitatively describe the strength and direction of the
linear relationship using “r”. Once you have established that a linear relationship
exists, you can take the next step in model building.
Our model will take the form ŷ = b0 + b1x, where b0 is the y-intercept, b1 is the slope, x is the predictor variable, and ŷ is an estimate of the mean value of the response variable for any value of the predictor variable.

The y-intercept is the predicted value for the response (y) when x = 0. The slope describes the change in y for each one-unit change in x. Let’s look at an example to clarify the interpretation of the slope and intercept.
This simple model is the line of best fit for our sample data. The regression line
does not go through every point; instead it balances the difference between all
data points and the straight-line model. The difference between the observed
data value and the predicted value (the value on the straight line) is the error or
residual. The criterion to determine the line that best describes the relation
between two variables is based on the residuals.
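A short Python sketch of fitting this line and computing the residuals, again on hypothetical data. The coefficient formulas b1 = r·(sy/sx) and b0 = ȳ − b1·x̄ are the standard least-squares results; they are assumed here rather than taken from this excerpt:

```python
import statistics

def least_squares_fit(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - x_bar) / s_x * (y - y_bar) / s_y
            for x, y in zip(xs, ys)) / (len(xs) - 1)
    b1 = r * s_y / s_x        # slope (standard least-squares formula, assumed)
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x-bar, y-bar)
    return b0, b1

# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares_fit(xs, ys)

# Residual = observed value minus predicted value
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
```

Note that the residuals sum to (essentially) zero: the least-squares line balances the points above it against the points below it, exactly as described.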
Suppose, for example, that the model predicted a chest girth of 64.8 in. for a bear that weighed 120 lb., but the measured chest girth (observed value) for that bear was actually 62.1 in.

The residual would be 62.1 − 64.8 = −2.7 in.

A negative residual indicates that the model is over-predicting. A positive residual indicates that the model is under-predicting. In this instance, the model over-predicted the chest girth of a bear that actually weighed 120 lb.

After we fit our regression line (compute b0 and b1), we usually wish to know how well the model fits our data. To determine this, we need to think back to the idea of analysis of variance. In ANOVA, we partitioned the variation using sums of squares so we could identify a treatment effect as opposed to random variation that occurred in our data. The idea is the same for regression. We want to partition the total variability into two parts: the variation due to the regression and the variation due to random error. And we are again going to compute sums of squares to help us do this.
Suppose the total variability in the sample measurements about the sample mean is denoted by SST = Σ(yi − ȳ)², called the sums of squares of total variability about the mean (SST). The sum of the squared differences between the predicted values ŷi and the sample mean, SSR = Σ(ŷi − ȳ)², is called the sums of squares due to regression (SSR). The SSR represents the variability explained by the regression line. Finally, the variability which cannot be explained by the regression line is called the sums of squares due to error (SSE) and is denoted by SSE = Σ(yi − ŷi)². SSE is the sum of the squared residuals.
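These sums of squares can be checked numerically. The sketch below (Python, with the same kind of made-up data used above; the direct slope formula b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² is a standard result assumed here) computes SST, SSR, and SSE and confirms the partition SST = SSR + SSE:

```python
import statistics

# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least-squares coefficients via the direct formulas (assumed, not from this excerpt)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]  # predicted values

sst = sum((y - y_bar) ** 2 for y in ys)               # total variability
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained by the line
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # sum of squared residuals

# The partition holds (up to floating-point rounding): sst == ssr + sse
```

Partitioning this way is exactly the ANOVA idea: SSR plays the role of the treatment (model) variation and SSE the role of random error.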
The Coefficient of Determination and the linear correlation coefficient are related mathematically:

R² = r²

However, they have two very different meanings: r is a measure of the strength and direction of a linear relationship between two variables; R² describes the percent variation in “y” that is explained by the model.
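A quick numerical check of this relationship, in Python on the same kind of hypothetical data as before (nothing here comes from the chapter itself):

```python
import statistics

# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)

# Linear correlation coefficient r
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y
        for x, y in zip(xs, ys)) / (len(xs) - 1)

# Fit the line (standard least-squares formulas, assumed) and partition variation
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]
explained = sum((yh - y_bar) ** 2 for yh in y_hat)  # SSR
total = sum((y - y_bar) ** 2 for y in ys)           # SST
r_squared = explained / total

# r_squared and r ** 2 agree (up to floating-point rounding)
```

So squaring r gives exactly the proportion of explained variation: here both quantities are about 0.997, i.e. roughly 99.7% of the variation in y is explained by the model.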
r² = Explained Variation / Total Variation

Reference:
https://milnepublishing.geneseo.edu/natural-resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-regression/