
UNIT-IV DESCRIBING DATA ANALYSIS–II 21AD1402

Normal distributions – z scores – normal curve problems – finding proportions – finding scores – more about z scores – correlation – scatter plots – correlation coefficient for quantitative data – computational formula for correlation coefficient – regression – regression line – least squares regression line – standard error of estimate – interpretation of r2 – Population – Analysis of variance.

4.1 NORMAL DISTRIBUTIONS AND STANDARD (z) SCORES

The idealized normal curve has been superimposed on the original distribution of heights for 3,091 men. Irregularities in the original distribution, most likely due to chance, are ignored by the smooth normal curve. Accordingly, any generalizations based on the smooth normal curve will tend to be more accurate than those based on the original distribution.

Interpreting the Shaded Area


The total area under the normal curve can be identified with
all FBI applicants. Viewed relative to the total area, the shaded
area represents the proportion of applicants who will be eligible
because they are shorter than exactly 66 inches. This new, more
accurate proportion will differ from that obtained from the
original histogram (.165) because of discrepancies between the
two distributions.

Finding a Proportion for the Shaded Area

To find this new proportion, we cannot rely on the vertical


scale. It describes as proportions the areas in the rectangular
bars of histograms, not the areas in the various curved sectors of
the normal curve.

Properties of the Normal Curve


 Obtained from a mathematical equation, the normal curve is a theoretical curve defined for a continuous variable and noted for its symmetrical bell-shaped form.
 Because the normal curve is symmetrical, its lower half is the mirror image of its upper half.

 Being bell shaped, the normal curve peaks above a point midway along the horizontal spread and then tapers off gradually in either direction from the peak (without actually touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely far).

 The values of the mean, median (or 50th percentile), and


mode, located at a point midway along the horizontal spread,
are the same for the normal curve.

Normal curve superimposed on the distribution of heights.

4.2 Z SCORES

A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how many standard deviations a score is above or below the mean of its distribution:

z = (X − μ) / σ

where X is the original score and μ and σ are the mean and the standard deviation, respectively.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the
mean; and
2. a number indicating the size of its deviation from the mean in standard deviation units.

For the FBI applicants, replace X with 66 (the maximum permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation of heights), and solve for z as follows:

z = (66 − 69) / 3 = −3 / 3 = −1.00
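As a quick check, the same calculation can be scripted. A minimal Python sketch, using the height values from the FBI example above:

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies above or below the mean."""
    return (x - mu) / sigma

# FBI applicant example: X = 66, mu = 69, sigma = 3
print(z_score(66, 69, 3))  # -1.0
```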

4.3 THE NORMAL CURVE

Importance of Mean and Standard Deviation

When using the normal curve, two bits of information are


indispensable: values for the mean and the standard deviation.
For example, before the normal curve can be used to answer the
question about eligible FBI applicants, it must be established
that, for the original distribution of 3091 men, the mean height
equals 69 inches and the standard deviation equals 3 inches.

Different Normal Curves

Because this particular normal curve has a mean of 69 inches and a standard deviation of 3 inches, we can't arbitrarily change these values: any change in the value of either the mean or the standard deviation (or both) would create a new normal curve that no longer describes the original distribution of heights. Nevertheless, as a theoretical exercise, it is instructive to note the various types of normal curves that are produced by an arbitrary change in the value of either the mean (μ) or the standard deviation (σ).
For example, changing the mean height from 69 to 79 inches produces a new normal curve that, as shown in panel A, is displaced 10 inches to the right of the original curve. Dramatically new normal curves are produced by changing the value of the standard deviation: changing the standard deviation from 3 to 1.5 inches produces a more peaked normal curve with smaller variability, whereas changing the standard deviation from 3 to 6 inches produces a shallower normal curve with greater variability. Obvious differences in appearance among normal curves are less important than you might suspect. Because of their common mathematical origin, every normal curve can be interpreted in exactly the same way once any distance from the mean is expressed in standard deviation units.

When the normal curve is used to describe a complete set of observations, or a population, the symbols μ and σ represent the mean and standard deviation of the population, respectively.

STANDARD NORMAL CURVE

If the original distribution approximates a normal curve, then the


shift to standard or z scores will always produce a new
distribution that approximates the standard normal curve. This
is the one normal curve for which a table is actually available. It
is a mathematical fact—not proven in this book—that the
standard normal curve always has a mean of 0 and a standard
deviation of 1. However, to verify (rather than prove) that the mean of a standard normal distribution equals 0, replace X in the z score formula with μ, the mean of any (nonstandard) normal distribution, and then solve for z:

z = (μ − μ) / σ = 0

To verify that the standard deviation equals 1, replace X in the z score formula with μ + 1σ, the value corresponding to one standard deviation above the mean for any (nonstandard) normal distribution, and then solve for z:

z = (μ + 1σ − μ) / σ = σ / σ = 1

Although there are an infinite number of different normal


curves, each with its own mean and standard deviation, there is
only one standard normal curve, with a mean of 0 and a
standard deviation of 1.

Converting three normal curves to the standard normal curve.

Standard Normal Table

The standard normal table consists of columns of z scores coordinated with columns of proportions. In a typical problem, access to the table is gained through a z score, such as –1.00, and the answer is read as a proportion, such as the proportion of eligible FBI applicants.

Using the Top Legend of the table


The entries in column A are z scores, beginning with 0.00
and ending (in the full-length table of Appendix C) with 4.00.
Given a z score of zero or more, columns B and C indicate how
the z score splits the area in the upper half of the normal curve.
As suggested by the shading in the top legend, column B
indicates the proportion of area between the mean and the z
score, and column C indicates the proportion of area beyond the
z score, in the upper tail of the standard normal curve.

Using the Bottom Legend of the Table

Because of the symmetry of the normal curve, the entries in Table A of Appendix C also can refer to the lower half of the normal curve. Now the columns are designated as A′, B′, and C′ in the legend at the bottom of the table. When using the bottom legend, all entries refer to the lower half of the standard normal curve. Imagine that the nonzero entries in column A′ are negative z scores, beginning with –0.01 and ending (in the full-length table of Appendix C) with –4.00. Given a negative z score, columns B′ and C′ indicate how


that z score splits the lower half of the normal curve. As
suggested by the shading in the bottom legend of the table,
column B′ indicates the proportion of area between the mean and

the negative z score, and column C′ indicates the proportion of


area beyond the negative z score, in the lower tail of the standard
normal curve.

SOLVING NORMAL CURVE PROBLEMS

There are two main types of normal curve problems. In the first type of problem, we use a known score (or scores) to find an unknown proportion. For instance, we use the known score of 66 inches to find the unknown proportion of eligible FBI applicants. In the second type of problem, the procedure is reversed: now we use a known proportion to find an unknown score (or scores).

Interpretation of Table A, Appendix C.

When using the standard normal table, it is important to


remember that for any z score, the corresponding proportions in
columns B and C (or columns B′ and C′) always sum to .5000.
Similarly, the total area under the normal curve always equals
1.0000, the sum of the proportions in the lower and upper halves,
that is, .5000 + .5000. Finally, although a z score can be either
positive or negative, the proportions of area under the curve are
always positive or zero but never negative (because an area
cannot be negative).

4.4 FINDING PROPORTIONS


1. Sketch a normal curve and shade in the target area, as in the left part of the figure. Being less than the mean of 69, 66 is located to the left of the mean. Furthermore, since the unknown proportion represents those applicants who are shorter than 66 inches, the shaded target sector is located to the left of 66.
2. Plan your solution according to the normal table. Decide
precisely how you will find the value of the target area. In the
present case, the answer will be obtained from column C′ of
the standard normal table, since the target area coincides with
the type of area identified with column C′, that is, the area in
the lower tail beyond a negative z.
3. Convert X to z. Express 66 as a z score:

z = (X − μ) / σ = (66 − 69) / 3 = −1.00

4. Find the target area. Refer to the standard normal table, using the bottom legend, as the z score is negative. The arrows in Table 5.1 show how to read the table. Look up column A′ to 1.00 (representing a z score of –1.00), and note the corresponding proportion of .1587 in column C′: this is the answer, as suggested in the right part of the figure. It can be concluded that only .1587 (or about .16) of all of the FBI applicants will be shorter than 66 inches.
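The same proportion can be read directly from the standard normal cumulative distribution function. A small sketch using scipy, with the values from the FBI example:

```python
from scipy.stats import norm

# Proportion of applicants shorter than 66 inches, given mu = 69, sigma = 3
z = (66 - 69) / 3
print(norm.cdf(z))          # 0.1587 (area below z = -1.00)
print(norm.cdf(66, 69, 3))  # same result without standardizing first
```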

Finding Proportions between Two Scores


Assume that, when not interrupted artificially, the gestation
periods for human foetuses approximate a normal curve with a
mean of 270 days (9 months) and a standard deviation of 15
days. What proportion of gestation periods will be between 245
and 255 days?
 Sketch a normal curve and shade in the target area, as in the
top panel to the shaded area represents just those gestation
periods between 245 and 255 days.
 Plan your solution according to the normal table. This type of problem requires more effort to solve because the value of the target area cannot be read directly from Table A. As suggested in the bottom two panels of Figure 5.7, the basic idea is to identify the target area with the difference between two overlapping areas whose values can be read from column C′ of Table A. The larger area (less than 255 days) contains two sectors: the target area (between 245 and 255 days) and a remainder (less than 245 days). The smaller area contains only the remainder (less than 245 days). Subtracting the smaller area (less than 245 days) from the larger area (less than 255 days), therefore, eliminates the common remainder (less than 245 days), leaving only the target area (between 245 and 255 days).
 Convert X to z by expressing 255 as

z = (255 − 270) / 15 = −15 / 15 = −1.00

and by expressing 245 as

z = (245 − 270) / 15 = −25 / 15 = −1.67

 Find the target area. Look up column A′ to a z score of –1.00 (remember, you must imagine the negative sign), and note the corresponding proportion of .1587 in column C′. Likewise, look up column A′ to a z score of –1.67, and note the corresponding proportion of .0475 in column C′. Subtract the smaller proportion from the larger proportion to obtain the answer, .1112. Thus, only .11, or 11 percent, of all gestation periods will be between 245 and 255 days.
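The subtraction of overlapping areas maps directly onto code. A minimal sketch, using the gestation figures above:

```python
from scipy.stats import norm

mu, sigma = 270, 15
# P(245 < X < 255) = P(X < 255) - P(X < 245)
proportion = norm.cdf(255, mu, sigma) - norm.cdf(245, mu, sigma)
print(round(proportion, 4))  # about .1112
```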

Finding Proportions beyond Two Scores


Assume that high school students’ IQ scores approximate a
normal distribution with a mean of 105 and a standard
deviation of 15. What proportion of IQs are more than 30 points
either above or below the mean?
1. Sketch a normal curve and shade in the two target
areas, as in the top panel of Figure.
2. Plan your solution according to the normal table. The solution to this type of problem is straightforward because each of the target areas can be read directly from Table A.

3. Convert X to z by expressing IQ scores of 135 and 75 as

z = (135 − 105) / 15 = 30 / 15 = 2.00
z = (75 − 105) / 15 = −30 / 15 = −2.00

4. Find the target area. In Table A, locate a z score of 2.00 in


column A, and note the corresponding proportion of .0228 in
column C. Because of the symmetry of the normal curve, you
need not enter the table again to find the proportion below a z
score of –2.00. Instead, merely double the above proportion of
.0228 to obtain .0456, which represents the proportion of
students with IQs more than 30 points either above or below the
mean.
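Symmetry lets us double a single tail area; in code, the two tails can also be summed explicitly. A short sketch with the IQ figures above:

```python
from scipy.stats import norm

mu, sigma = 105, 15
# P(X > 135) + P(X < 75); by symmetry this is twice one tail
two_tails = norm.sf(135, mu, sigma) + norm.cdf(75, mu, sigma)
print(round(two_tails, 4))                     # about .0456
print(round(2 * norm.sf(135, mu, sigma), 4))   # same answer via doubling
```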

4.5 FINDING SCORES

Table A must be consulted to find the unknown proportion (of area) associated with some known score or pair of known scores. For instance, given a GRE score of 650, we found that the unknown proportion of scores larger than 650 equals .07. Now we will concentrate on the opposite type of normal curve problem, for which Table A must be consulted to find the unknown score or scores associated with some known proportion. For instance, given that a GRE score must be in the upper 25 percent of the distribution (in order for an applicant to be considered for admission to graduate school), we must find the unknown minimum GRE score.

Finding One Score


Exam scores for a large psychology class approximate a normal
curve with a mean of 230 and a standard deviation of 50.
Furthermore, students are graded “on a curve,” with only the
upper 20 percent being awarded grades of A. What is the lowest
score on the exam that receives an A?
1. Sketch a normal curve and, on the correct side of the
mean, draw a line representing the target score, as in Figure.
This is often the most difficult step, and it involves semantics
rather than statistics. It’s often helpful to visualize the target.

2. Plan your solution according to the normal table. Because the target score is on the right side of the mean, concentrate on the area in the upper half of the normal curve, as described in columns B and C. The right panel of Figure 5.9 indicates that either column B or C can be used to locate a z score in column A. It is crucial, however, to search for the single value (.3000) that is valid for column B or the single value (.2000) that is valid for column C. Note that we look in column B for .3000, not for .8000. Table A is not designed for sectors, such as the lower .8000, that span the mean of the normal curve.

3. Find z. The entry in column C closest to .2000 is .2005, and the corresponding z score in column A equals 0.84. Verify this by checking Table A. Also note that exactly the same z score of 0.84 would have been identified if column B had been searched to find the entry (.2995) nearest to .3000. The z score of 0.84 represents the point that separates the upper 20 percent of the area from the rest of the area under the normal curve.

4. Convert z to the target score. Finally, convert the z score of 0.84 into an exam score, given a distribution with a mean of 230 and a standard deviation of 50. You'll recall that a z score indicates how many standard deviations the original score is above or below its mean. In the present case, the target score must be located .84 of a standard deviation above its mean. The distance of the target score above its mean equals 42 (from .84 × 50), which, when added to the mean of 230, yields a value of 272. Therefore, 272 is the lowest score on the exam that receives an A.
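Finding a score from a proportion is an inverse lookup; the percent-point function plays the role of reading Table A backwards. A sketch with the exam figures above:

```python
from scipy.stats import norm

mu, sigma = 230, 50
# z that cuts off the upper 20 percent (i.e., the 80th percentile)
z = norm.ppf(0.80)
print(round(z, 2))                # about 0.84
print(round(mu + z * sigma))      # about 272, the lowest A
print(norm.ppf(0.80, mu, sigma))  # same score in one call
```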

Finding Two Scores


Assume that the annual rainfall in the San Francisco
area approximates a normal curve with a mean of 22 inches and a
standard deviation of 4 inches. What are the rainfalls for the
more atypical years, defined as the driest 2.5 percent of all years
and the wettest 2.5 percent of all years?

1. Sketch a normal curve. On either side of the mean,


draw two lines representing the two target scores,
The smaller (driest) target score splits the total area into .0250
to the left and .9750 to the right, and the larger (wettest) target
score does the exact opposite.

2. Plan your solution according to the normal table. The target z score can be found by scanning either column B′ for .4750 or column C′ for .0250.

3. Find z, then convert z to the target score using

X = μ + (z)(σ)

where X is the target score, expressed in original units of measurement; μ and σ are the mean and the standard deviation, respectively, for the original normal curve; and z is the standard score read from column A or A′.
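Both atypical rainfall cutoffs follow from the same inverse lookup, one in each tail. A minimal sketch with the San Francisco figures above:

```python
from scipy.stats import norm

mu, sigma = 22, 4
driest = norm.ppf(0.025, mu, sigma)   # cuts off the lowest 2.5 percent
wettest = norm.ppf(0.975, mu, sigma)  # cuts off the highest 2.5 percent
print(round(driest, 1), round(wettest, 1))  # about 14.2 and 29.8 inches
```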


4.6 MORE ABOUT Z SCORES

z Scores for Non-normal Distributions

If the original distribution is positively skewed, the distribution of z scores also will be positively skewed. Regardless of the shape of the distribution, the shift to z scores always produces a distribution of standard scores with a mean of 0 and a standard deviation of 1.

Interpreting Test Scores

The evaluation of a student's test performance is greatly facilitated by converting her raw scores into the z scores listed in the final column of the table. A glance at the z scores suggests that although she did relatively well on the math test, her performance on the English test was only slightly above average, as indicated by a z score of 0.50, and her performance on the psychology test was slightly below average, as indicated by a z score of –0.67. The use of z scores can help you identify a person's relative strengths and weaknesses.

Standard Score
Whenever any unit-free scores are expressed relative to a
known mean and a known standard deviation, they are referred
to as standard scores. Although z scores qualify as standard
scores because they are unit-free and expressed relative to a
known mean of 0 and a known standard deviation of 1, other
scores also qualify as standard scores.

Transformed Standard Scores


Being by far the most important standard score, z scores
are often viewed as synonymous with standard scores. For
convenience, particularly when reporting test results to a wide
audience, z scores can be changed to transformed standard
scores, other types of unit-free standard scores that lack
negative signs and decimal points.

Converting to Transformed Standard Scores

To convert a z score, use Formula 5.3:

z′ = desired mean + (z)(desired standard deviation)

where z′ (called z prime) is the transformed standard score and z is the original standard score. For instance, if you wish to convert a z score of –1.50 into a new distribution of z′ scores for which the desired mean equals 500 and the desired standard deviation equals 100, substitute these numbers into Formula 5.3 to obtain

z′ = 500 + (–1.50)(100) = 500 – 150 = 350

The change from a z score of –1.50 to a z′ score of 350 eliminates negative signs and decimal points without distorting the relative location of the original score, expressed as a distance from the mean in standard deviation units.

Substitute Pairs of Convenient Numbers

The substitution of other arbitrary pairs of numbers serves no purpose; indeed, because of their peculiarity, they might make the new distribution, even though it lacks the negative signs and decimal points common to z scores, slightly less comprehensible to people who have been exposed to the traditional pairs of numbers.
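The conversion is a single linear transformation. A minimal sketch, using the 500/100 pair from the example (the 50/10 pair is the familiar T-score convention):

```python
def transformed_score(z, desired_mean, desired_sd):
    """Rescale a z score onto a distribution with a friendlier mean and SD."""
    return desired_mean + z * desired_sd

print(transformed_score(-1.50, 500, 100))  # 350.0
print(transformed_score(-1.50, 50, 10))    # 35.0, the classic T-score pair
```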

4.7 CORRELATION
Two variables are related if pairs of scores show an
orderliness that can be depicted graphically with a scatterplot
and numerically with a correlation coefficient.

AN INTUITIVE APPROACH
If the suspected relationship does exist between cards sent and cards received, then an inspection of the data might reveal, as one possibility, a tendency for "big senders" to be "big receivers" and for "small senders" to be "small receivers." More generally, there is a tendency for pairs of scores to occupy similar relative positions in their respective distributions.

Positive Relationship

If relatively low values are paired with relatively low values, and relatively high values are paired with relatively high values, the relationship is positive. This relationship implies "You get what you give."

Negative Relationship

If relatively low values are paired with relatively high values, and relatively high values are paired with relatively low values, the relationship is negative. "You get the opposite of what you give."

Little or No Relationship

If low and high values pair haphazardly, little, if any, relationship exists between the two variables: "What you get has no bearing on what you give."

4.7 SCATTERPLOTS
A scatterplot is a graph containing a cluster of dots that
represents all pairs of scores. With a little training, you can use

any dot cluster as a preview of a fully measured relationship.


Two variables are positively related if pairs of scores tend to
occupy similar relative positions (high with high and low with
low) in their respective distributions, and they are negatively
related if pairs of scores tend to occupy dissimilar relative
positions (high with low and vice versa) in their respective
distributions.

Scatterplot for greeting card exchange


The example involving greeting cards has shown the basic idea of correlation and the construction of a scatterplot.

The first step is to note the tilt or slope, if any, of a dot cluster. A
dot cluster that has a slope from the lower left to the upper
right, as in panel A of Figure 6.2, reflects a positive
relationship. Small values of one variable are paired with small
values of the other variable, and large values are paired with
large values. In panel A, short people tend to be light, and tall
people tend to be heavy.
On the other hand, a dot cluster that has a slope from the upper left to the lower right, as in panel B of the figure, reflects a negative relationship. Small values of one variable tend to be paired with large values of the other variable, and vice versa. Finally, a dot cluster that lacks any apparent slope, as in panel C of Figure 6.2, reflects little or no relationship. Small values of one variable are just as likely to be paired with small, medium, or large values of the other variable.

Strong or Weak Relationship?

Having established that a relationship is either positive or negative, note how closely the dot cluster approximates a straight line. The more closely the dot cluster approximates a straight line, the stronger (the more regular) the relationship.

Perfect Relationship
A dot cluster that equals (rather than merely approximates) a
straight line reflects a perfect relationship between two variables.
In practice, perfect relationships are most unlikely.

Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates
a straight line and, therefore, reflects a linear relationship. But
this is not always the case. Sometimes a dot cluster approximates
a bent or curved line, as in Figure 6.4.


Look again at the scatterplot in Figure 6.1 for the greeting card data. Although the small number of dots hinders any interpretation, the dot cluster appears to approximate a straight line, stretching from the lower left to the upper right. This suggests a positive relationship between greeting cards sent and received, in agreement with the earlier intuitive analysis of these data.

4.8 A CORRELATION COEFFICIENT FOR QUANTITATIVE DATA: r

A correlation coefficient is a number between –1 and 1 that describes the relationship between pairs of variables. The type of correlation coefficient designated as r describes the linear relationship between pairs of variables for quantitative data. Many other types of correlation coefficients have been introduced to handle specific types of data, including ranked and qualitative data.

Key Properties of r
The Pearson correlation coefficient, r, can equal any value
between –1.00 and +1.00. Furthermore, the following two
properties apply:
1. The sign of r indicates the type of linear relationship, whether positive or
negative.
2. The numerical value of r, without regard to sign, indicates
the strength of the linear relationship.

Sign of r
A number with a plus sign (or no sign) indicates a positive
relationship, and a number with a minus sign indicates a
negative relationship.

Numerical Value of r
The more closely a value of r approaches either –1.00 or +1.00,
the stronger (more regular) the relationship. Conversely, the more
closely the value of r approaches 0, the weaker (less regular) the
relationship. For example, an r of –.90 indicates a stronger
relationship than does an r of –.70, and an r of –.70 indicates
a stronger relationship than does an r of .50. (Remember, if no sign appears, it is understood to be plus.) The value of r is a measure of how well a straight line (representing the linear relationship) describes the cluster of dots in the scatterplot.

Interpretation of r
Located along a scale from –1.00 to +1.00, the value of r supplies
information about the direction of a linear relationship—whether
positive or negative—and, generally, information about the
relative strength of a linear relationship—whether relatively weak
(and a poor describer of the data) because r is in the vicinity of 0,
or relatively strong (and a good describer of the data) because r
deviates from 0 in the direction of either +1.00 or –1.00.

r Is Independent of Units of Measurement


A positive value of r reflects a tendency for pairs of scores to
occupy similar relative locations (high with high and low with
low) in their respective distributions, while a negative value of r
reflects a tendency for pairs of scores to occupy dissimilar relative
locations (high with low and vice versa) in their respective
distributions.

Effect of range restriction on the value of r.


The value of r can’t be interpreted as a proportion or percentage
of some perfect relationship.

4.9 DETAILS: COMPUTATION FORMULA FOR CORRELATION COEFFICIENT

Calculate a value for r by using the following computation formula (Formula 6.1):

r = SPxy / √(SSx · SSy)

where the two sum of squares terms in the denominator are defined as

SSx = Σ(X − X̄)²   and   SSy = Σ(Y − Ȳ)²

The sum of the products term in the numerator, SPxy, is defined as

SPxy = Σ(X − X̄)(Y − Ȳ)

In the case of SPxy, instead of summing the squared deviation


scores for either X or Y, as with SSx and SSy, we find the sum of
the products for each pair of deviation scores. Notice in Formula
6.1 that, since the terms in the denominator must be positive,
only the sum of the products, SPxy, determines whether the
value of r is positive or negative. Furthermore, the size of SPxy
mirrors the strength of the relationship; stronger relationships
are associated with larger positive or negative sums of products.
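A direct translation of Formula 6.1 into code makes the roles of SPxy, SSx, and SSy explicit. A minimal sketch; the paired card counts below are assumed values chosen to be consistent with the worked results quoted later in this unit (r = .80), not a table from the source:

```python
import math

def pearson_r(xs, ys):
    """Compute r = SPxy / sqrt(SSx * SSy) from deviation scores."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sp_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Assumed cards-sent / cards-received pairs for the five friends
print(round(pearson_r([1, 5, 7, 9, 13], [6, 10, 12, 18, 14]), 2))  # 0.8
```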

Calculation of r

OUTLIERS
Outliers were defined as very extreme scores that require special
attention because of their potential impact on a summary of data.
This is also true when outliers appear among sets of paired
scores. Although quantitative techniques can be used to detect
these outliers, we simply focus on dots in scatterplots that
deviate conspicuously from the main dot cluster.

OTHER TYPES OF CORRELATION COEFFICIENTS


There are many other types of correlation coefficients,
but we will discuss only several that are direct descendants of the
Pearson correlation coefficient. Although designed originally for
use with quantitative data, the Pearson r has been extended,
sometimes under the guise of new names and customized
versions of Formula 6.1, to other kinds of situations. For
example, to describe the correlation between ranks assigned
independently by two judges to a set of science projects, simply
substitute the numerical ranks into Formula 6.1, then solve for a value of the Pearson r (also referred to as Spearman's rho coefficient for ranked or ordinal data).

Interpreting a Larger Correlation Matrix

Three of the six shaded correlations in Table 6.5 involve


GENDER. GENDER qualifies for a correlation analysis once
arbitrary numerical codes (1 for male and 2 for female) have been
assigned. Looking across the bottom row, GENDER is positively
correlated with AGE (.0813); with COLLEGE GPA (.2069); and
with HIGH SCHOOL GPA (.2981). Looking
across the next row, HIGH SCHOOL GPA is negatively correlated with AGE (–.0376) and positively correlated with COLLEGE GPA (.2521). Lastly, COLLEGE GPA is positively correlated with AGE (.2228).

Computational formula for correlation coefficient

The formula for the sample correlation coefficient is

r = Cov(x, y) / √(sx² · sy²)

where Cov(x, y) is the covariance of x and y, defined as

Cov(x, y) = Σ(x − x̄)(y − ȳ) / (n − 1)

and sx² and sy² are the sample variances of x and y, defined as

sx² = Σ(x − x̄)² / (n − 1)   and   sy² = Σ(y − ȳ)² / (n − 1)
The variances of x and y measure the variability of the x scores
and y scores around their respective sample means of X and Y
considered separately. The covariance measures the variability of
the (x,y) pairs around the mean of x and mean of y, considered
simultaneously.

To compute the sample correlation coefficient, we need to


compute the variance of gestational age, the variance of birth
weight, and also the covariance of gestational age and birth
weight.

We first summarize the gestational age data. The mean gestational age is

x̄ = 652.1 / 17 = 38.4 weeks

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.

Infant ID   Gestational Age (weeks)   Deviation (x − x̄)   Squared Deviation
1           34.7                      −3.7                 13.69
2           36.0                      −2.4                  5.76
3           29.3                      −9.1                 82.81
4           40.1                       1.7                  2.89
5           35.7                      −2.7                  7.29
6           42.4                       4.0                 16.00
7           40.3                       1.9                  3.61
8           37.3                      −1.1                  1.21
9           40.9                       2.5                  6.25
10          38.3                      −0.1                  0.01
11          38.5                       0.1                  0.01
12          41.4                       3.0                  9.00
13          39.7                       1.3                  1.69
14          39.7                       1.3                  1.69
15          41.1                       2.7                  7.29
16          38.0                      −0.4                  0.16
17          38.7                       0.3                  0.09
Total                                                      159.45

The variance of gestational age is

sx² = 159.45 / 16 = 9.97

Next, we summarize the birth weight data. The mean birth weight is

ȳ = 49,334 / 17 = 2902 grams

The variance of birth weight is computed just as we did for


gestational age as shown in the table below.

Infant ID   Birth Weight (grams)   Deviation (y − ȳ)   Squared Deviation
1           1895                   −1007               1,014,049
2           2030                    −872                 760,384
3           1440                   −1462               2,137,444
4           2835                     −67                   4,489
5           3090                     188                  35,344
6           3827                     925                 855,625
7           3260                     358                 128,164
8           2690                    −212                  44,944
9           3285                     383                 146,689
10          2920                      18                     324
11          3430                     528                 278,784
12          3657                     755                 570,025
13          3685                     783                 613,089
14          3345                     443                 196,249
15          3260                     358                 128,164
16          2680                    −222                  49,284
17          2005                    −897                 804,609
Total                                                  7,767,660

The variance of birth weight is

sy² = 7,767,660 / 16 = 485,478.75

Next we compute the covariance. To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, that is,

(x − x̄)(y − ȳ)

The computations are summarized below. Notice that we simply


copy the deviations from the mean gestational age and birth
weight from the two tables above into the table below and
multiply.

Infant ID   (x − x̄)   (y − ȳ)   Product
1           −3.7       −1007       3725.9
2           −2.4        −872       2092.8
3           −9.1       −1462     13,304.2
4            1.7         −67       −113.9
5           −2.7         188       −507.6
6            4.0         925       3700.0
7            1.9         358        680.2
8           −1.1        −212        233.2
9            2.5         383        957.5
10          −0.1          18         −1.8
11           0.1         528         52.8
12           3.0         755       2265.0
13           1.3         783       1017.9
14           1.3         443        575.9
15           2.7         358        966.6
16          −0.4        −222         88.8
17           0.3        −897       −269.1
Total                             28,768.4

The covariance of gestational age and birth weight is

Cov(x, y) = 28,768.4 / 16 = 1798.0

Finally, we can now compute the sample correlation coefficient:

r = 1798.0 / √(9.97 × 485,478.75) = 1798.0 / 2200.1 = 0.82

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation between gestational age and birth weight.
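The whole computation can be verified in a few lines. A sketch using numpy with the seventeen (gestational age, birth weight) pairs from the tables above:

```python
import numpy as np

age = np.array([34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7])
weight = np.array([1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                   2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005])

cov = np.cov(age, weight, ddof=1)[0, 1]            # sample covariance, about 1798
r = cov / (age.std(ddof=1) * weight.std(ddof=1))   # about 0.82
print(round(cov, 1), round(r, 2))
print(round(np.corrcoef(age, weight)[0, 1], 2))    # same r computed directly
```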

4.10 Regression

A predictive modeling technique that evaluates the relation between a dependent variable (i.e., the target variable) and independent variables is known as regression analysis. Regression analysis can be used for forecasting, time series modeling, or finding the relation between variables and predicting continuous values. For example, the relationship between household location and the household's power bill is best studied through regression.

We can analyze data and perform data modeling using regression analysis. Here, we fit a line or curve to the data points such that the overall distance of the data points from the fitted line or curve is minimized.

Need for Regression techniques


Understanding the applications of regression analysis, the advantages of linear regression, and the benefits of regression-based forecasting can help a small business, and indeed any business, build a better picture of the variables (or factors) that can impact its success in the coming weeks, months, and years.
Data define the complete picture of a business. Regression analysis helps analyze those numbers and helps firms and businesses make better decisions. Regression forecasting means analyzing the relationships between data points, which can help you peek into the future.
9 Types of Regression Analysis
The types of regression analysis that we are going to study here are:

1. Simple Linear Regression


2. Multiple Linear Regression
3. Polynomial Regression
4. Logistic Regression
5. Ridge Regression
6. Lasso Regression
7. Bayesian Linear Regression

There are also some algorithms, listed last, that we use to train a regression model to create predictions with continuous values:

8. Decision Tree Regression
9. Random Forest Regression

There are various types of regression models for creating predictions. These techniques are mostly driven by three prime attributes: the number of independent variables, the type of dependent variable, and the shape of the regression line.

1) Simple Linear Regression

Linear regression is the most basic form of regression algorithm in machine learning. The model assumes a linear relationship between a single independent variable and the dependent variable. When the number of independent variables increases, it is called a multiple linear regression model.
We denote simple linear regression by the following equation:

y = mx + c + e

where m is the slope of the line, c is the intercept, and e represents the error in the model.

The best-fit line is determined by varying the values of m and c over different combinations. The difference between an observed value and the predicted value is called the predictor error. The values of m and c are selected so as to minimize the predictor error.
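In practice, a least-squares fit picks m and c in closed form rather than by trying combinations. A minimal sketch with a small made-up dataset (the numbers are illustrative only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

m, c = np.polyfit(x, y, deg=1)    # least-squares slope and intercept
predictions = m * x + c
errors = y - predictions           # the predictor errors
print(round(m, 2), round(c, 2))    # about 1.93 and 0.27
```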

2) Multiple Linear Regression


Simple linear regression allows a data scientist or data analyst to make predictions about one variable by training the model on another variable. In a similar way, a multiple linear regression model extends this to more than one independent variable.
Simple linear regression uses the following linear function to predict the value of a target variable y from an independent variable x1:

y = b0 + b1x1

After fitting the linear equation to the observed data, we obtain the parameters b0 and b1 that best fit the data by minimizing the squared error.

3) Polynomial Regression

In a polynomial regression, the power of the independent variable is more than 1. For example, the equation below represents a polynomial equation of degree 2:

y = a + bx²

In this regression technique, the best-fit line is not a straight line. It is rather a curve that fits the data points.

4) Logistic Regression

Logistic regression is a regression technique used when the dependent variable is discrete, for example 0 or 1, or true or false. This means the target variable can take only two values, and a sigmoid function describes the relation between the target variable and the independent variable.
The logistic function is used in logistic regression to create a relation between the target variable and the independent variables. The equation below denotes logistic regression in log-odds form:

logit(p) = ln(p / (1 − p)) = b0 + b1x

where p is the probability of occurrence of the feature.

5) Ridge Regression

Ridge regression is another type of regression in machine learning and is usually used when there is a high correlation between the predictor variables. With moderate collinearity, the least squares estimates remain unbiased, but as the collinearity becomes very high, their variance grows large and predictions become unstable. Therefore, we deliberately introduce a bias term (the ridge penalty) into the equation of ridge regression. It is a powerful regression method in which the model is less susceptible to overfitting.

Below is the equation used for ridge regression, in which λ (lambda) resolves the multicollinearity issue:

β = (XᵀX + λI)⁻¹ Xᵀy

6) Lasso Regression

Lasso regression performs regularization along with feature selection. It penalizes the absolute size of the regression coefficients. This drives coefficient values nearer to zero, a property that differs from ridge regression. Therefore, lasso regression effectively carries out feature selection: only the required parameters are kept, and the rest are made exactly zero. This helps avoid overfitting in the model. But if independent variables are highly collinear, lasso regression chooses only one variable from each group and shrinks the others to zero.
The equation below represents the lasso regression objective:

N⁻¹ Σᵢ₌₁ᴺ f(xᵢ, yᵢ, α, β)
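The contrast between the two penalties is easiest to see on the fitted coefficients. A sketch using scikit-learn on randomly generated data (all values illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter here.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print(ridge.coef_.round(2))  # all coefficients shrunk, none exactly zero
print(lasso.coef_.round(2))  # irrelevant coefficients driven exactly to zero
```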

7) Bayesian Linear Regression


Bayesian Regression is used to find out the value of
regression coefficients. In Bayesian linear regression, the
posterior distribution of the features is determined instead of
finding the least-squares. Bayesian Linear Regression is a
combination of Linear Regression and Ridge Regression but is
more stable than simple Linear Regression.

Now, we will learn some types of regression analysis which can be


used to train regression models to create predictions with
continuous values.

8) Decision Tree Regression

The decision tree, as the name suggests, works on the principle of conditions. It is efficient and has strong algorithms used for predictive analysis. Its main components are internal nodes, branches, and terminal (leaf) nodes.
Every internal node holds a "test" on an attribute, branches hold the conclusions of the test, and every leaf node holds a class label or predicted value. Decision trees are used for both classification and regression, which are both supervised learning tasks. Decision trees are extremely sensitive to the data they are trained on: small changes to the training set can result in fundamentally different tree structures.

9) Random Forest Regression

Random forest, as its name suggests, comprises a large number of individual decision trees that operate as a group, or as they say, an ensemble. Every individual decision tree in the random forest produces a prediction, and the prediction with the most support (for regression, the average of the trees' outputs) is taken as the model's prediction.
Random forest achieves its variety by permitting every individual tree to randomly sample from the dataset with replacement, producing different trees. This is known as bagging.
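A short scikit-learn sketch of both tree-based regressors on toy data (dataset and settings are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
# 100 bagged trees; each tree sees a bootstrap sample of the data.
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
print(tree.predict([[5.0]]), forest.predict([[5.0]]))
```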

4.11 Regression line

All five dots contribute to the more precise prediction,


illustrated in Figure 4.13.1, that Emma will receive 15.20 cards.
Look more closely at the solid line designated as the regression
line in Figure 4.13.1, which guides the string of arrows,
beginning at 11, toward the predicted value of 15.20. The
regression line is a straight line rather than a curved line because
of the linear relationship between cards sent and cards received.
As will become apparent, it can be used repeatedly to predict
cards received. Regardless of whether Emma decides to send 5,
15, or 25 cards, it will guide a new string of arrows, beginning at
5 or 15 or 25, toward a new predicted value along the Y axis.

Placement of Line

For the time being, forget about any prediction for Emma and
concentrate on how the five dots dictate the placement of the
regression line. If all five dots had defined a single straight line,
placement of the regression line would have been simple; merely
let it pass through all dots. When the dots fail to define a single
straight line, as in the scatterplot for the five friends, placement
of the regression line represents a compromise. It passes through
the main cluster, possibly touching some dots but missing
others.

Predictive Errors

Figure 4.13.3 illustrates the predictive errors that would have occurred if the regression line had been used to predict the number of cards received by the five friends. Solid dots reflect the actual number of cards received, and open dots, always located along the regression line, reflect the predicted number of cards received.

Figure 4.13.2 Prediction of 15.20 for Emma (using the regression line).

Figure 4.13.3 Predictive errors.


(To avoid clutter in Figure 4.13.3, the strings of arrows have been
omitted. However, you might find it helpful to imagine a string of
arrows, ending along the Y axis, for each dot, whether solid or
open.) The largest predictive error, shown as a broken vertical line,
occurs for Steve, who sent 9 cards. Although he actually received
18 cards, he should have received slightly fewer than 14 cards,
according to the regression line. The smallest predictive error—
none whatsoever—occurs for Mike, who sent 7 cards. He actually
received the 12 cards that he should have received, according to
the regression line.

Total Predictive Error

Engage in the seemingly silly activity of predicting what is


known already for the five friends to check the adequacy of our
predictive effort. The smaller the total for all predictive errors in
Figure 4.13.3, the more favorable will be the prognosis for our
predictions. Clearly, it is desirable for the regression line to be
placed in a position that minimizes the total predictive error, that
is, that minimizes the total of the vertical discrepancies between
the solid and open dots shown in Figure 4.13.3.

Progress Check *4.13.1 To check your understanding of the first


part of this chapter, make predictions using the following graph.

(a) Predict the approximate rate of inflation, given an unemployment rate


of 5 percent.
(b) Predict the approximate rate of inflation, given an unemployment rate of 15 percent.

4.12 Least squares regression line

To avoid the arithmetic standoff of zero always produced by


adding positive and negative predictive errors (associated with
errors above and below the regression line, respectively), the
placement of the regression line minimizes not the total
predictive error but the total squared predictive error, that is,
the total for all squared predictive errors. When located in this
fashion, the regression line is often referred to as the least
squares regression line. Although more difficult to visualize,
this approach is consistent with the original aim—to minimize
the total predictive error or some version of the total
predictive error, thereby providing a more favorable prognosis for
our predictions.

Need a Mathematical Solution

Without the aid of mathematics, the search for a least squares


regression line would be frustrating. Scatterplots would be
proving grounds cluttered with tentative regression lines,
discarded because of their excessively large totals for squared
discrepancies. Even the most time-consuming, conscientious
effort would culminate in only a close approximation to the least
squares regression line.
Least Squares Regression Equation

Happily, an equation pinpoints the exact least squares regression line for any scatterplot. Most generally, this equation reads:

Y′ = bX + a    (1)

where Y′ represents the predicted value (the predicted number of cards that will be received by any new friend, such as Emma); X represents the known value (the known number of cards sent by any new friend); and b and a represent numbers calculated from the original correlation analysis, as described next.

Finding Values of b and a

To obtain a working regression equation, solve each of the following expressions, first for b and then for a, using data from the original correlation analysis. The expression for b reads:

b = r √(SSy / SSx)    (2)

where r represents the correlation between X and Y (cards sent and received by the five friends); SSy represents the sum of squares for all Y scores (the cards received by the five friends); and SSx represents the sum of squares for all X scores (the cards sent by the five friends).

The expression for a reads:

a = Ȳ − bX̄    (3)

where Ȳ and X̄ refer to the sample means for all Y and X scores, respectively, and b is defined by the preceding expression. The values of all terms in the expressions for b and a can be obtained from the original correlation analysis either directly, as with the value of r, or indirectly, as with the values of the remaining terms: SSy, SSx, Ȳ, and X̄.

Table 4.14.1 illustrates the computational sequence that produces a least squares regression equation for the greeting card example, namely,

Y′ = .80(X) + 6.40

where .80 and 6.40 represent the values computed for b and a, respectively.
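The whole computational sequence for b and a fits in a few lines. A sketch using paired card counts that are assumed (chosen to reproduce the worked results r = .80, b = .80, a = 6.40), not copied from Table 4.14.1:

```python
import math

sent = [1, 5, 7, 9, 13]         # X: cards sent by the five friends (assumed)
received = [6, 10, 12, 18, 14]  # Y: cards received (assumed)

mx = sum(sent) / len(sent)
my = sum(received) / len(received)
ss_x = sum((x - mx) ** 2 for x in sent)
ss_y = sum((y - my) ** 2 for y in received)
sp_xy = sum((x - mx) * (y - my) for x, y in zip(sent, received))

r = sp_xy / math.sqrt(ss_x * ss_y)
b = r * math.sqrt(ss_y / ss_x)   # Formula (2)
a = my - b * mx                  # Formula (3)
print(r, b, a)                   # 0.8 0.8 6.4 (up to float rounding)
```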

4.13 Standard error of estimate, sy|x

Although we predicted that Emma’s investment of 11 cards will


yield a return of 15.20 cards, we would be surprised if she
actually received 15 cards. It is more likely that because of the
imperfect relationship between cards sent and cards received,
Emma’s return will be some number other than 15. Although
designed to minimize predictive error, the least squares equation
does not eliminate it. Therefore, our next task is to estimate the
amount of error associated with our predictions. The smaller the
estimated error is, the better the prognosis will be for our
predictions.

Finding the Standard Error of Estimate

The estimate of error for new predictions reflects our failure to predict the number of cards received by the original five friends, as depicted by the discrepancies between solid and open dots in Figure 4.13.3. Known as the standard error of estimate and symbolized as sy|x, this estimate of predictive error complies with the general format for any sample standard deviation, that is, the square root of a sum of squares term divided by its degrees of freedom. (See Formula 4.10 on page 76.) The formula for sy|x reads:

sy|x = √(SSy|x / (n − 2))    (4)

where the sum of squares term in the numerator, SSy|x,


represents the sum of the squares for predictive errors, Y − Y′,
and the degrees of freedom term in the denominator, n − 2,
reflects the loss of two degrees of freedom because any straight
line, including the regression line, can be made to coincide with
two data points. The symbol sy|x is read as “s sub y given x.”
Although we can estimate the overall predictive error by dealing directly with predictive errors, Y − Y′, it is more efficient to use the following computation formula:

sy|x = √(SSy(1 − r²) / (n − 2))    (5)

where SSy is the sum of squares for Y scores, that is,

SSy = Σ(Y − Ȳ)²

and r is the correlation coefficient.
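Both routes to sy|x, the direct sum of squared errors and the computation formula, should agree. A sketch continuing with the assumed card counts from the earlier code block:

```python
import math

sent = [1, 5, 7, 9, 13]
received = [6, 10, 12, 18, 14]
n, b, a, r = len(sent), 0.80, 6.40, 0.80

# Direct route: sum the squared predictive errors Y - Y'.
ss_y_given_x = sum((y - (b * x + a)) ** 2 for x, y in zip(sent, received))
print(math.sqrt(ss_y_given_x / (n - 2)))  # Formula (4)

# Computation formula route, using SSy and r.
my = sum(received) / n
ss_y = sum((y - my) ** 2 for y in received)
print(math.sqrt(ss_y * (1 - r ** 2) / (n - 2)))  # Formula (5), same value
```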

4.14 Interpretation of r2:

The squared correlation coefficient, r2, provides us


with not only a key interpretation of the correlation coefficient
but also a measure of predictive accuracy that supplements the
standard error of estimate, sy|x. (Remember, we engage in the
seemingly silly activity of predicting that which we already
know not as an end-in-itself, but as a way to check the
adequacy of our predictive effort.) Paradoxically, even though
our ultimate goal is to show the relationship between r2 and
predictive accuracy, we will initially concentrate on two kinds of
predictive errors—those due to the repetitive prediction of the
mean and those due to the regression equation.

Repetitive Prediction of the Mean

For the sake of the present argument, pretend that we know the Y scores but not the corresponding X scores. Lacking information about the relationship between X and Y scores, under these circumstances statisticians recommend repetitive predictions of the mean, Ȳ, for a variety of reasons, including the fact that, although the predictive error for any individual might be quite large, the sum of all of the resulting five predictive errors (deviations of Y scores about Ȳ) always equals zero, as you may recall from Section 3.3. Most important for our purposes, using the repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us with a frame of reference against which to evaluate our customary predictive effort based on the correlation between cards sent (X) and cards received (Y).

Predictive Errors

Panel A of Figure 4.16 shows the predictive errors for all five friends when the mean for all five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their five Y scores. Panel B shows the corresponding predictive errors for all five friends when a series of different Y′ values, obtained from the least squares equation, is used to predict each of their five Y scores. For example, panel A of Figure 4.16 shows the error for John when the mean for all five friends, Ȳ, of 12 is used to predict his Y score of 6. Shown as a broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6) indicates that Ȳ overestimates John's Y score by 6 cards. Panel B shows a smaller error of −1.20 for John when a Y′ value of 7.20 is used to predict the same Y score of 6. This Y′ value of 7.20 is obtained from the least squares equation,

Y′ = .80(1) + 6.40 = 7.20

where the number of cards sent by John, 1, has been substituted for X. Positive and negative errors indicate that Y scores are either above or below their corresponding predicted scores. Overall, as expected, errors are smaller when customized predictions of Y′ from the least squares equation can be used (because X scores are known) than when only the repetitive prediction of Ȳ can be used (because X scores are ignored). As with most statistical phenomena, there are exceptions: the predictive error for Doris is slightly larger when the least squares equation is used.

Error Variability (Sum of Squares)

To more precisely evaluate the accuracy of our two predictive efforts, we need some measure of the collective errors produced by each effort. It probably will not surprise you that the sum of squares qualifies for this role. The sum of squares of any set of deviations, now called errors, can be calculated by first squaring each error (to eliminate negative signs), then summing all squared errors. The error variability for the repetitive prediction of the mean can be designated as SSy, since each Y score is expressed as a squared deviation from Ȳ and then summed, that is,

SSy = Σ(Y − Ȳ)²

Using the errors for the five friends shown in Panel A of Figure 4.16, this becomes

SSy = 80

The error variability for the customized predictions from the least squares equation can be designated as SSy|x, since each Y score is expressed as a squared deviation from its corresponding Y′ and then summed, that is,

SSy|x = Σ(Y − Y′)²

Using the errors for the five friends shown in Panel B of Figure 4.16, we obtain

SSy|x = 28.8

Figure 4.16 Predictive errors for five friends.

Proportion of Predicted Variability

If you think about it, SSy measures the total variability of Y scores that occurs after only primitive predictions based on Ȳ are made (because X scores are ignored), while SSy|x measures the residual variability of Y scores that remains after customized least squares predictions are made (because X scores are used). The error variability of 28.8 for the least squares predictions is much smaller than the error variability of 80 for the repetitive prediction of Ȳ, confirming the greater accuracy of the least squares predictions apparent in Figure 4.16. To obtain an SS measure of the actual gain in accuracy due to the least squares predictions, subtract the residual variability from the total variability, that is, subtract SSy|x from SSy, to obtain

SSy − SSy|x = 80 − 28.8 = 51.2

To express this difference, 51.2, as a gain in accuracy relative to the original error variability for the repetitive prediction of Ȳ, divide the above difference by SSy, that is,

(SSy − SSy|x) / SSy = 51.2 / 80 = .64

This result, .64 or 64 percent, represents the proportion or


percent gain in predictive accuracy when the repetitive prediction
of Y is replaced by a series of customized Y′ predictions based on
the least squares equation. In other words, .64 or 64 percent
represents the proportion or percent of the total variability of SSy
that is predictable from its relationship with the X variable. To
the delight of statisticians, when squared, the value of the
correlation coefficient equals this proportion of predictable
variability. Recalling that an r of .80 was obtained for the
correlation between cards sent and cards received by the five
friends, we can verify that r2 = (.80)(.80) = .64, which, of course,
also is the proportion of predictable variability. Given this
perspective,

The square of the correlation coefficient, r2, always indicates the proportion of total variability in one variable that is predictable from its relationship with the other variable. Expressing the equation for r2 in symbols, we have:

r² = SSy′ / SSy = (SSy − SSy|x) / SSy    (6)

where the one new sum of squares term, SSy′, is simply the variability explained by or predictable from the regression equation, that is,

SSy′ = Σ(Y′ − Ȳ)²

Accordingly, r2 provides us with a straightforward measure of


the worth of our least squares predictive effort.*
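The variability decomposition can be checked numerically. A sketch continuing with the assumed card counts (SSy = 80, SSy|x = 28.8):

```python
sent = [1, 5, 7, 9, 13]
received = [6, 10, 12, 18, 14]
b, a = 0.80, 6.40
my = sum(received) / len(received)

ss_y = sum((y - my) ** 2 for y in received)               # total variability: 80
ss_y_given_x = sum((y - (b * x + a)) ** 2
                   for x, y in zip(sent, received))        # residual: 28.8
r_squared = (ss_y - ss_y_given_x) / ss_y
print(r_squared)  # about .64, matching r**2 = (.80)**2
```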

4.15 POPULATIONS
Any complete set of observations (or potential
observations) may be characterized as a population. Accurate
descriptions of populations specify the nature of the
observations to be taken. For example, a population might be
described as “attitudes toward abortion of currently enrolled
students at Bucknell University” or as “SAT critical reading
scores of currently enrolled students at Rutgers University.”

Real Populations
Pollsters, such as the Gallup Organization, deal with real
populations. A real population is one in which all potential
observations are accessible at the time of sampling. Examples of
real populations include the two described in the previous
paragraph, as well as the ages of all visitors to Disneyland on a
given day, the ethnic backgrounds of all current employees of
the U.S. Postal Department, and presidential preferences of all
currently registered voters in the United States. Incidentally,
federal law requires that a complete survey be taken every 10
years of the real population of all U.S. households—at
considerable expense, involving thousands of data collectors—as
a means of revising election districts for the House of
Representatives. (An estimated undercount of millions of people,
particularly minorities, in both the 2000 and 2010 censuses has revived a suggestion, long endorsed by statisticians, that the entire U.S. population could be estimated more accurately if a highly trained group of data collectors focused only on a random sample of households.)

Hypothetical Populations

Insofar as research workers concern themselves with populations, they often invoke the notion of a hypothetical population. A hypothetical population is one in which all potential observations are not accessible at the time of sampling. In most experiments, subjects are selected from very small, uninspiring real populations: the lab rats housed in the local animal colony or student volunteers from general psychology classes. Experimental subjects often are viewed, nevertheless, as a sample from a much larger hypothetical population, loosely described as “the scores of all similar animal subjects (or student volunteers) who could conceivably undergo the present experiment.”

According to the rules of inferential statistics, generalizations should be made only to real populations that, in fact, have been sampled. Generalizations to hypothetical populations should be viewed, therefore, as provisional conclusions based on the wisdom of the researcher rather than on any logical or statistical necessity. In effect, it’s an open question—often answered only by additional experimentation—whether or not a given experimental finding merits the generality assigned to it by the researcher.

4.16 ANOVA

When data are quantitative, an overall test of the null hypothesis for more than two population means requires a new statistical procedure known as analysis of variance, which is often abbreviated as ANOVA (from ANalysis Of VAriance).

One-Factor ANOVA
This section describes the simplest type of analysis of variance. Often referred to as a one-factor (or one-way) ANOVA, it tests whether differences exist among population means categorized by only one factor or independent variable, such as hours of sleep deprivation, with measures on different subjects. The ANOVA techniques described here presume that all scores are independent. In other words, each subject contributes just one score to the overall analysis.

Two Possible Outcomes


To simplify computations, unrealistically small, numerically friendly samples are used here. In practice, samples that are either unduly small or excessively large should be avoided. Let’s assume that the psychologist randomly assigns only three subjects to each of the three levels of sleep deprivation. Subsequently, subjects’ aggression scores reflect their behavior in a controlled social situation. Table 16.1 shows two fictitious experimental outcomes that, when analyzed with ANOVA, produce different decisions about the null hypothesis: It is retained for one outcome but rejected for the other. Before reading on, predict which outcome would cause the null hypothesis to be retained and which would cause it to be rejected. You are correct if you predicted that Outcome A would cause the null hypothesis to be retained, while Outcome B would cause the null hypothesis to be rejected.

Mean Differences Still Important

Your predictions for Outcomes A and B most likely were based on the relatively small differences between group means for Outcome A and the relatively large differences between group means for Outcome B. Observed mean differences have been a major ingredient in previous t tests, and these differences are just as important in ANOVA. It is easy to lose sight of this fact because observed mean differences appear, somewhat disguised, as one type of variability in ANOVA. It takes extra effort to view ANOVA—with its emphasis on the analysis of several sources of variability—as related to previous t tests.

TWO SOURCES OF VARIABILITY


Differences between Group Means

First, without worrying about computational details, look more closely at one source of variability in Outcomes A and B: the differences between group means. The group means in Outcome A are 5, 6, and 4, and these relatively small differences might reflect only chance. Even though the null hypothesis is true (because sleep deprivation does not affect the subjects’ aggression scores), group means tend to differ merely because of chance sampling variability. It’s reasonable to expect, therefore, that the null hypothesis for Outcome A should not be rejected. There appears to be a lack of evidence that sleep deprivation affects the subjects’ aggression scores in Outcome A. On the other hand, the group means for Outcome B are 2, 5, and 8, and these relatively large differences might not be attributable to chance. Instead, they indicate that the null hypothesis probably is false (because sleep deprivation affects the subjects’ aggression scores). It’s reasonable to expect, therefore, that the null hypothesis for Outcome B should be rejected. There appears to be evidence of a treatment effect, that is, the existence of at least one difference between the population means defined by the independent variable (sleep deprivation).

Variability between Groups


Variability among scores of subjects who, being in
different groups, receive different experimental treatments.

Variability within Groups


Variability among scores of subjects who, being in the same group, receive the same experimental treatment.

A
more definitive decision about the null hypothesis views the
differences between group means as one source of variability to
be compared with a second source of variability. An estimate of
variability between groups, that is, the variation among scores of
subjects who, being in different groups, receive different
experimental treatments, must be compared with another,
completely independent estimate of variability within groups,
that is, the variation among scores of subjects who, being in the
same group, receive the same experimental treatment. As will be
seen, the more that the variability between groups exceeds the
variability within groups, the more suspect will be the null
hypothesis. Let’s focus on the second source of variability—the
variability within groups for subjects treated similarly. Referring
to Table 16.1, focus on the differences among the scores of 3, 5,
and 7 for the three subjects who are treated similarly in the first
group. Continue this procedure, one group at a time, to obtain
an overall impression of variability within groups for all three
groups in Outcome A and for all three groups in Outcome B.
Notice the relative stability of the differences among the three
scores within each of the various groups, regardless of whether
the group happens to be in Outcome A or Outcome B. For
instance, one crude measure of variability, the range, equals
either 3 or 4 for each group shown in Table 16.1. A key point is
that the variability within each group depends entirely on the
scores of subjects treated similarly (exposed to the same sleep
deprivation period), and it never involves the scores of subjects
treated differently (exposed to different sleep deprivation
periods). In contrast to the variability between groups, the
variability within groups never reflects the presence of a
treatment effect. Regardless of whether the null hypothesis is
true or false, the variability within groups reflects only random
error, that is, the combined effects on the scores of individual
subjects of all uncontrolled factors, such as individual
differences among subjects, slight variations in experimental
conditions, and errors in measurement. In ANOVA, the within-group estimate often is referred to simply as the error term, and it is analogous to the pooled variance estimate (sp²) in the t test for two independent samples.
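
To make the two sources concrete, here is a minimal sketch for an Outcome A-like result. Only the first group’s scores (3, 5, and 7) and the three group means (5, 6, and 4) are stated above; since Table 16.1 is not reproduced here, the remaining scores are hypothetical values chosen to match those means:

```python
# Minimal sketch of the two sources of variability for an Outcome A-like
# result. Only the first group's scores (3, 5, 7) and the group means
# (5, 6, 4) appear in the text; the other scores are hypothetical.

groups = [[3, 5, 7], [4, 8, 6], [2, 4, 6]]  # one list per sleep-deprivation level

all_scores = [x for g in groups for x in g]
grand_mean = sum(all_scores) / len(all_scores)

# Variability between groups: squared deviations of group means
# from the grand mean, weighted by group size.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Variability within groups: squared deviations of each score from its
# own group mean (never mixing scores of subjects treated differently).
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

print(ss_between, ss_within)  # 6.0 and 24.0 for these hypothetical scores
```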

EXAMPLE:
Imagine a simple experiment with three groups, each
containing four observations. For each of the following outcomes,
indicate whether there is variability between groups and also
whether there is variability within groups.


F TEST
The null hypothesis has been tested with a t ratio. In the two-sample case, t reflects the ratio between the observed difference between the two sample means in the numerator and the estimated standard error in the denominator. For three or more samples, the null hypothesis is tested with a new ratio, the F ratio. Essentially, F reflects the ratio of the observed differences between all sample means (measured as variability between groups) in the numerator and the estimated error term or pooled variance estimate (measured as variability within groups) in the denominator, that is,

F = variability between groups / variability within groups

If Null Hypothesis Is True


If the null hypothesis is true (because there is no treatment effect due to different sleep deprivation periods), the two estimates of variability (between and within groups) would reflect only random error. In this case

F = random error / random error ≈ 1
If Null Hypothesis Is False


If the null hypothesis is false (because there is a treatment effect due to different sleep deprivation periods), both estimates still would reflect random error, but the estimate for between groups would also reflect the treatment effect. In this case

F = (random error + treatment effect) / random error > 1
When the null hypothesis is false, the presence of a treatment effect tends to cause a chain reaction: The observed differences between group means tend to be large, as does the variability between groups. Accordingly, the numerator term tends to exceed the denominator term, producing an F whose value is larger than 1. When the null hypothesis is false because of a large treatment effect, there is an even more pronounced chain reaction, beginning with very large observed differences between group means and ending with an F whose value tends to be considerably larger than 1.
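
This chain reaction can be illustrated with a minimal sketch. Because Table 16.1 is not reproduced here, the scores below are hypothetical, chosen only to match the group means stated above (5, 6, and 4 for Outcome A; 2, 5, and 8 for Outcome B) and the within-group ranges of 3 or 4:

```python
# Minimal sketch of the F ratio for one-factor ANOVA. With k groups and
# N scores in total, MS_between = SS_between / (k - 1) and
# MS_within = SS_within / (N - k); F is their ratio.

def f_ratio(groups):
    scores = [x for g in groups for x in g]
    grand_mean = sum(scores) / len(scores)
    k, big_n = len(groups), len(scores)

    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)    # reflects treatment effect + random error
    ms_within = ss_within / (big_n - k)  # reflects random error only
    return ms_between / ms_within

outcome_a = [[3, 5, 7], [4, 8, 6], [2, 4, 6]]    # hypothetical; means 5, 6, 4
outcome_b = [[0, 2, 4], [3, 5, 7], [6, 8, 10]]   # hypothetical; means 2, 5, 8

print(f_ratio(outcome_a))  # 0.75 — near 1, consistent with retaining H0
print(f_ratio(outcome_b))  # 6.75 — well above 1, consistent with rejecting H0
```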
