Unit III - The Normal Curve
The idealized normal curve has been superimposed on the original distribution for 3091 men. Irregularities in the original distribution, most likely due to chance, are ignored by the smooth normal curve. Accordingly, any generalizations based on the smooth normal curve will tend to be more accurate than those based on the original distribution.
Interpreting the Shaded Area
The total area under the normal curve can be identified with all FBI applicants. Viewed relative to the total area, the shaded area represents the proportion of applicants who will be eligible because they are shorter than exactly 66 inches. This new, more accurate proportion will differ from that obtained from the original histogram (.165) because of discrepancies between the two distributions.
Finding a Proportion for the Shaded Area
To find this new proportion, we cannot rely on the vertical scale. It describes as proportions the areas in the rectangular bars of histograms, not the areas in the various curved sectors of the normal curve.
Properties of the Normal Curve
The mean, mode and median are all equal.
The curve is symmetric about the center (i.e., around the mean, μ).
Exactly half of the values are to the left of center and exactly half the values are to the right.
The total area under the curve is 1.
4.1 THE NORMAL CURVE
When using the normal curve, two bits of information are indispensable: values for the mean and the standard deviation. For example, before the normal curve can be used to answer the question about eligible FBI applicants, it must be established that, for the original distribution of 3091 men, the mean height equals 69 inches and the standard deviation equals 3 inches.
Different Normal Curves
For example, changing the mean height from 69 to 79 inches produces a new normal curve that, as shown in panel A, is displaced 10 inches to the right of the original curve. Dramatically new normal curves are produced by changing the value of the standard deviation: changing the standard deviation from 3 to 1.5 inches produces a more peaked normal curve with smaller variability, whereas changing the standard deviation from 3 to 6 inches produces a shallower normal curve with greater variability.
Obvious differences in appearance among normal curves are less important than you might suspect. Because of their common mathematical origin, every normal curve can be interpreted in exactly the same way once any distance from the mean is expressed in standard deviation units.
When the normal curve is used to describe a complete set of observations, or a population, the symbols μ and σ represent the mean and standard deviation of the population, respectively.
4.2 z SCORES
A z score is defined as

z = (X − μ) / σ

where X is the original score and μ and σ are the mean and the standard deviation, respectively. A z score consists of two parts:
1. a positive or negative sign indicating whether it's above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation units.
Example:
You have a test score of 190. The test has a mean (μ) of 150 and a standard deviation (σ) of 25. Assuming a normal distribution, your z score would be:

z = (X − μ) / σ = (190 − 150) / 25 = 1.60
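As a quick check, the conversion can be carried out in a few lines of Python. This is a minimal sketch of the formula above, not a required implementation:

```python
def z_score(x, mu, sigma):
    """Convert an original score x to a z score."""
    return (x - mu) / sigma

# Test score example: mean 150, standard deviation 25.
print(z_score(190, 150, 25))  # 1.6
```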
4.3 STANDARD NORMAL CURVE
If the original distribution approximates a normal curve, then the shift to standard or z scores will always produce a new distribution that approximates the standard normal curve. This is the one normal curve for which a table is actually available. It is a mathematical fact, not proven in this book, that the standard normal curve always has a mean of 0 and a standard deviation of 1. However, you can verify these values. To check that the mean of a standard normal distribution equals 0, replace X in the z-score formula with μ, the mean of any (nonstandard) normal distribution, and solve for z:

z = (μ − μ) / σ = 0

To check that the standard deviation equals 1, replace X in the z-score formula with μ + 1σ, the value corresponding to one standard deviation above the mean, and solve for z:

z = ((μ + 1σ) − μ) / σ = 1
Although there are infinitely many different normal curves, each with its own mean and standard deviation, there is only one standard normal curve, with a mean of 0 and a standard deviation of 1.
Figure: Converting three normal curves to the standard normal curve.
Standard Normal Table
The standard normal table consists of columns of z scores coordinated with columns of proportions. In a typical problem, access to the table is gained through a z score, such as –1.00, and the answer is read as a proportion, such as the proportion of eligible FBI applicants.
Using the Top Legend of the Table
The entries in column A are z scores, beginning with 0.00 and ending (in the full-length table of Appendix C) with 4.00. Given a z score of zero or more, columns B and C indicate how the z score splits the area in the upper half of the normal curve. As suggested by the shading in the top legend, column B indicates the proportion of area between the mean and the z score, and column C indicates the proportion of area beyond the z score, in the upper tail of the standard normal curve.
Using the Bottom Legend of the Table
Because of the symmetry of the normal curve, the entries in Table A of Appendix C also can refer to the lower half of the normal curve. Now the columns are designated as A′, B′, and C′ in the legend at the bottom of the table. When using the bottom legend, all entries refer to the lower half of the standard normal curve. Imagine that the nonzero entries in column A′ are negative z scores, beginning with –0.01 and ending (in the full-length table of Appendix C) with –4.00. Given a negative z score, columns B′ and C′ indicate how that z score splits the lower half of the normal curve. As suggested by the shading in the bottom legend of the table, column B′ indicates the proportion of area between the mean and the negative z score, and column C′ indicates the proportion of area beyond the negative z score, in the lower tail of the standard normal curve.
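The proportions in columns B, C, B′, and C′ can all be reproduced with the standard normal cumulative distribution function. The sketch below uses scipy.stats.norm, one convenient choice among many, to mimic the table's two legends:

```python
from scipy.stats import norm

def table_entries(z):
    """Mimic the standard normal table for a nonnegative z score.

    Column B: area between the mean and z.
    Column C: area beyond z, in the upper tail.
    """
    b = norm.cdf(z) - 0.5   # area between the mean (cdf = .5) and z
    c = 1 - norm.cdf(z)     # area in the upper tail beyond z
    return b, c

# By symmetry, a negative z score yields the same two proportions,
# now read as columns B' (mean to z) and C' (lower tail beyond z).
print(table_entries(1.00))  # (0.3413..., 0.1586...)
```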
SOLVING NORMAL CURVE PROBLEMS
There are two main types of normal curve problems. In the first type of problem, we use a known score (or scores) to find an unknown proportion. For instance, we use the known score of 66 inches to find the unknown proportion of eligible FBI applicants. In the second type of problem, the procedure is reversed. Now we use a known proportion to find an unknown score (or scores).
Interpretation of Table A, Appendix C
When using the standard normal table, it is important to remember that for any z score, the corresponding proportions in columns B and C (or columns B′ and C′) always sum to .5000. Similarly, the total area under the normal curve always equals 1.0000, the sum of the proportions in the lower and upper halves, that is, .5000 + .5000. Finally, although a z score can be either positive or negative, the proportions of area under the curve are always positive or zero but never negative (because an area cannot be negative).
4.4 FINDING PROPORTIONS
1. Sketch a normal curve and shade in the target area, as in the left part of the figure. Being less than the mean of 69, 66 is located to the left of the mean. Furthermore, since the unknown proportion represents those applicants who are shorter than 66 inches, the shaded target sector is located to the left of 66.
2. Plan your solution according to the normal table. Decide precisely how you will find the value of the target area. In the present case, the answer will be obtained from column C′ of the standard normal table, since the target area coincides with the type of area identified with column C′, that is, the area in the lower tail beyond a negative z.
3. Convert X to z. Express 66 as a z score:

z = (X − μ) / σ = (66 − 69) / 3 = −1.00
4. Find the target area. Refer to the standard normal table, using the bottom legend, as the z score is negative. The arrows in Table 5.1 show how to read the table. Look up column A′ to 1.00 (representing a z score of –1.00), and note the corresponding proportion of .1587 in column C′. This is the answer, as suggested in the right part of the figure. It can be concluded that only .1587 (or .16) of all of the FBI applicants will be shorter than 66 inches.
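The same answer can be checked numerically. A brief sketch, assuming scipy is available, with the stated mean of 69 and standard deviation of 3:

```python
from scipy.stats import norm

# Proportion of applicants shorter than 66 inches,
# given heights that are normal with mean 69 and sd 3.
proportion = norm.cdf(66, loc=69, scale=3)
print(round(proportion, 4))  # 0.1587
```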
Finding Proportions between Two Scores
Assume that, when not interrupted artificially, the gestation periods for human foetuses approximate a normal curve with a mean of 270 days (9 months) and a standard deviation of 15 days. What proportion of gestation periods will be between 245 and 255 days?
1. Sketch a normal curve and shade in the target area, as in the top panel of the figure. The shaded area represents just those gestation periods between 245 and 255 days.
2. Plan your solution according to the normal table. This type of problem requires more effort to solve because the value of the target area cannot be read directly from Table A. As suggested in the bottom two panels of Figure 5.7, the basic idea is to identify the target area with the difference between two overlapping areas whose values can be read from column C′ of Table A. The larger area (less than 255 days) contains two sectors: the target area (between 245 and 255 days) and a remainder (less than 245 days). The smaller area contains only the remainder (less than 245 days). Subtracting the smaller area (less than 245 days) from the larger area (less than 255 days), therefore, eliminates the common remainder (less than 245 days), leaving only the target area (between 245 and 255 days).
3. Convert X to z by expressing 255 as

z = (255 − 270) / 15 = −15 / 15 = −1.00

and by expressing 245 as

z = (245 − 270) / 15 = −25 / 15 = −1.67
4. Find the target area. Look up column A′ to a negative z score of –1.00 (remember, you must imagine the negative sign), and note the corresponding proportion of .1587 in column C′. Likewise, look up column A′ to a z score of –1.67, and note the corresponding proportion of .0475 in column C′. Subtract the smaller proportion from the larger proportion to obtain the answer, .1112. Thus, only .11, or 11 percent, of all gestation periods will be between 245 and 255 days.
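A numerical check of the same subtraction, again assuming scipy is available:

```python
from scipy.stats import norm

# Gestation periods: normal with mean 270 days and sd 15 days.
larger = norm.cdf(255, loc=270, scale=15)   # area below 255 days
smaller = norm.cdf(245, loc=270, scale=15)  # area below 245 days
print(round(larger - smaller, 4))  # 0.1109; the table answer, .1112, uses z rounded to -1.67
```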
Finding Proportions beyond Two Scores
Assume that high school students' IQ scores approximate a normal distribution with a mean of 105 and a standard deviation of 15. What proportion of IQs are more than 30 points either above or below the mean?
1. Sketch a normal curve and shade in the two target areas, as in the top panel of the figure.
2. Plan your solution according to the normal table. The solution to this type of problem is straightforward because each of the target areas can be read directly from Table A.
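For the IQ example, each target area lies two standard deviations from the mean (30/15 = 2), so the two tails can be checked as follows, assuming scipy:

```python
from scipy.stats import norm

# IQ scores: normal with mean 105 and sd 15.
# "More than 30 points above or below the mean" means beyond z = +2 or z = -2.
upper_tail = 1 - norm.cdf(135, loc=105, scale=15)
lower_tail = norm.cdf(75, loc=105, scale=15)
print(round(upper_tail + lower_tail, 4))  # 0.0455
```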
4.5 FINDING SCORES
So far, Table A has been consulted to find the unknown proportion (of area) associated with some known score or pair of known scores. For instance, given a GRE score of 650, we found that the unknown proportion of scores larger than 650 equals .07. Now we will concentrate on the opposite type of normal curve problem, for which Table A must be consulted to find the unknown score or scores associated with some known proportion. For instance, given that a GRE score must be in the upper 25 percent of the distribution (in order for an applicant to be considered for admission to graduate school), we must find the unknown minimum GRE score.
Finding One Score
Exam scores for a large psychology class approximate a normal curve with a mean of 230 and a standard deviation of 50. Furthermore, students are graded "on a curve," with only the upper 20 percent being awarded grades of A. What is the lowest score on the exam that receives an A?
1. Sketch a normal curve and, on the correct side of the mean, draw a line representing the target score, as in the figure. This is often the most difficult step, and it involves semantics rather than statistics. It's often helpful to visualize the target.
2. Plan your solution according to the normal table. Since the target score is on the right side of the mean, concentrate on the area in the upper half of the normal curve, as described in columns B and C. The right panel of Figure 5.9 indicates that either column B or C can be used to locate a z score in column A. It is crucial, however, to search for the single value (.3000) that is valid for column B or the single value (.2000) that is valid for column C. Note that we look in column B for .3000, not for .8000. Table A is not designed for sectors, such as the lower .8000, that span the mean of the normal curve.
3. Find z. The entry in column C closest to .2000 is .2005, and the corresponding z score in column A equals 0.84. Verify this by checking Table A. Also note that exactly the same z score of 0.84 would have been identified if column B had been searched to find the entry (.2995) nearest to .3000. The z score of 0.84 represents the point that separates the upper 20 percent of the area from the rest of the area under the normal curve.
4. Convert z to the target score. Finally, convert the z score of 0.84 into an exam score, given a distribution with a mean of 230 and a standard deviation of 50. You'll recall that a z score indicates how many standard deviations the original score is above or below its mean. In the present case, the target score must be located .84 of a standard deviation above its mean. The distance of the target score above its mean equals 42 (from .84 × 50), which, when added to the mean of 230, yields a value of 272. Therefore, 272 is the lowest score on the exam that receives an A.
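The same score can be recovered with the inverse of the cumulative distribution function (the percent point function in scipy). A sketch, assuming that library:

```python
from scipy.stats import norm

# Lowest exam score in the upper 20 percent,
# given scores that are normal with mean 230 and sd 50.
z = norm.ppf(0.80)             # about 0.8416 (the table rounds this to 0.84)
print(round(230 + z * 50, 1))  # about 272.1, matching the table-based 272
```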
Finding Two Scores
Assume that the annual rainfall in the San Francisco area approximates a normal curve with a mean of 22 inches and a standard deviation of 4 inches. What are the rainfalls for the more atypical years, defined as the driest 2.5 percent of all years and the wettest 2.5 percent of all years?
1. Sketch a normal curve. On either side of the mean, draw two lines representing the two target scores. The smaller (driest) target score splits the total area into .0250 to the left and .9750 to the right, and the larger (wettest) target score does the exact opposite.
2. Plan your solution according to the normal table. The target z score can be found by scanning either column B′ for .4750 or column C′ for .0250.
3. Convert z to the target score using

X = μ + (z)(σ)

where X is the target score, expressed in original units of measurement; μ and σ are the mean and the standard deviation, respectively, for the original normal curve; and z is the standard score read from column A or A′.
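A sketch of the rainfall calculation, assuming scipy; the two target scores sit about 1.96 standard deviations on either side of the mean:

```python
from scipy.stats import norm

# Annual rainfall: normal with mean 22 inches and sd 4 inches.
driest = 22 + norm.ppf(0.025) * 4   # cuts off the lowest 2.5 percent
wettest = 22 + norm.ppf(0.975) * 4  # cuts off the highest 2.5 percent
print(round(driest, 2), round(wettest, 2))  # about 14.16 and 29.84 inches
```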
4.6 MORE ABOUT z SCORES
z Scores for Non-normal Distributions
If the original distribution is positively skewed, the distribution of z scores also will be positively skewed. Regardless of the shape of the distribution, the shift to z scores always produces a distribution of standard scores with a mean of 0 and a standard deviation of 1.
Interpreting Test Scores
The evaluation of her test performance is greatly facilitated by converting her raw scores into the z scores listed in the final column. A glance at the z scores suggests that although she did relatively well on the math test, her performance on the English test was only slightly above average, as indicated by a z score of 0.50, and her performance on the psychology test was slightly below average, as indicated by a z score of –0.67. The use of z scores can help you identify a person's relative strengths and weaknesses.
Standard Scores
Whenever any unit-free scores are expressed relative to a known mean and a known standard deviation, they are referred to as standard scores. Although z scores qualify as standard scores because they are unit-free and expressed relative to a known mean of 0 and a known standard deviation of 1, other scores also qualify as standard scores.
Transformed Standard Scores
Being by far the most important standard score, z scores are often viewed as synonymous with standard scores. For convenience, particularly when reporting test results to a wide audience, z scores can be changed to transformed standard scores, other types of unit-free standard scores that lack negative signs and decimal points.
Figure: Common transformed standard scores associated with normal curves.
Converting to Transformed Standard Scores
To convert, use

z′ = desired mean + (z)(desired standard deviation)

where z′ (called z prime) is the transformed standard score and z is the original standard score. For instance, if you wish to convert a z score of –1.50 into a new distribution of z′ scores for which the desired mean equals 500 and the desired standard deviation equals 100, substitute these numbers into Formula 5.3 to obtain

z′ = 500 + (−1.50)(100) = 500 − 150 = 350
The change from a z score of −1.50 to a z′ score of 350 eliminates negative signs and decimal points without distorting the relative location of the original score, expressed as a distance from the mean in standard deviation units.
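A small sketch of this conversion, using the 500/100 scale from the example above:

```python
def to_transformed(z, desired_mean=500, desired_sd=100):
    """Convert a z score to a transformed standard score (z prime)."""
    return desired_mean + z * desired_sd

print(to_transformed(-1.50))  # 350.0
```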
Substitute Pairs of Convenient Numbers
The substitution of other arbitrary pairs of numbers serves no purpose; indeed, because of their peculiarity, they might make the new distribution, even though it lacks the negative signs and decimal points common to z scores, slightly less comprehensible to people who have been exposed to the traditional pairs of numbers.
4.7 CORRELATION
Two variables are related if pairs of scores show an orderliness that can be depicted graphically with a scatterplot and numerically with a correlation coefficient.
Positive Relationship
When relatively low values are paired with relatively low values, and relatively high values are paired with relatively high values, the relationship is positive. This relationship implies "You get what you give."
Negative Relationship
When relatively low values are paired with relatively high values, and relatively high values are paired with relatively low values, the relationship is negative. This relationship implies "You get the opposite of what you give."
Little or No Relationship
If little, if any, relationship exists between the two variables, then "What you get has no bearing on what you give."
4.8 SCATTERPLOTS
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores. With a little training, you can use any dot cluster as a preview of a fully measured relationship. Two variables are positively related if pairs of scores tend to occupy similar relative positions (high with high and low with low) in their respective distributions, and they are negatively related if pairs of scores tend to occupy dissimilar relative positions (high with low and vice versa) in their respective distributions.
Scatterplot for Greeting Card Exchange
The example involving greeting cards has shown the basic idea of correlation and the construction of a scatterplot.
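The scatterplot itself takes only a few lines of matplotlib. The five (sent, received) pairs below are inferred from values quoted later in this unit (John 1/6, Mike 7/12, Steve 9/18, Doris 13/14, plus a fifth pair chosen to match the reported means and sums of squares), so treat them as illustrative:

```python
import matplotlib.pyplot as plt

# (cards sent, cards received); pairs inferred from values quoted in this unit.
sent = [1, 5, 7, 9, 13]
received = [6, 10, 12, 18, 14]

plt.scatter(sent, received)
plt.xlabel("Cards sent (X)")
plt.ylabel("Cards received (Y)")
plt.title("Scatterplot for greeting card exchange")
plt.show()
```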
The first step is to note the tilt or slope, if any, of a dot cluster. A dot cluster that has a slope from the lower left to the upper right, as in panel A of Figure 6.2, reflects a positive relationship. Small values of one variable are paired with small values of the other variable, and large values are paired with large values. In panel A, short people tend to be light, and tall people tend to be heavy.
On the other hand, a dot cluster that has a slope from the upper left to the lower right, as in panel B of the figure, reflects a negative relationship. Small values of one variable tend to be paired with large values of the other variable, and vice versa.
Finally, a dot cluster that lacks any apparent slope, as in panel C of Figure 6.2, reflects little or no relationship. Small values of one variable are just as likely to be paired with small, medium, or large values of the other variable.
Strong or Weak Relationship?
Having established that a relationship is either positive or negative, note how closely the dot cluster approximates a straight line. The more closely the dot cluster approximates a straight line, the stronger (the more regular) the relationship.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect relationship between two variables.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight line and, therefore, reflects a linear relationship. But this is not always the case. Sometimes a dot cluster approximates a bent or curved line, as in Figure 6.4, and therefore reflects a curvilinear relationship.
Look again at the scatterplot in the figure for the greeting card data. Although the small number of dots in Figure 6.1 hinders any interpretation, the dot cluster appears to approximate a straight line, stretching from the lower left to the upper right. This suggests a positive relationship between greeting cards sent and received, in agreement with the earlier intuitive analysis of these data.
Sign of r
A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign indicates a negative relationship.
Numerical Value of r
The more closely a value of r approaches either –1.00 or +1.00, the stronger (more regular) the relationship. Conversely, the more closely the value of r approaches 0, the weaker (less regular) the relationship. For example, an r of –.90 indicates a stronger relationship than does an r of –.70, and an r of –.70 indicates a stronger relationship than does an r of .50. (Remember, if no sign appears, it is understood to be plus.) The value of r is a measure of how well a straight line (representing the linear relationship) describes the cluster of dots in the scatterplot.
Interpretation of r
Located along a scale from –1.00 to +1.00, the value of r supplies information about the direction of a linear relationship, whether positive or negative, and, generally, information about the relative strength of a linear relationship, whether relatively weak (and a poor describer of the data) because r is in the vicinity of 0, or relatively strong (and a good describer of the data) because r deviates from 0 in the direction of either +1.00 or –1.00.
r Is Independent of Units of Measurement
A positive value of r reflects a tendency for pairs of scores to occupy similar relative locations (high with high and low with low) in their respective distributions, while a negative value of r reflects a tendency for pairs of scores to occupy dissimilar relative locations (high with low and vice versa) in their respective distributions.
Figure: Effect of range restriction on the value of r.
The value of r can't be interpreted as a proportion or percentage of some perfect relationship.
4.9 DETAILS: COMPUTATION FORMULA FOR CORRELATION COEFFICIENT
Calculate a value for r by using the following computation formula:

r = SPxy / √(SSx · SSy)

where the two sum of squares terms in the denominator are defined as SSx = Σ(X − X̄)² and SSy = Σ(Y − Ȳ)², and the sum of products term in the numerator is SPxy = Σ(X − X̄)(Y − Ȳ).
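A sketch of this computation in plain Python, using the greeting card pairs inferred earlier in this unit; it reproduces the r of .80 reported later:

```python
from math import sqrt

# (cards sent, cards received); pairs inferred from values quoted in this unit.
x = [1, 5, 7, 9, 13]
y = [6, 10, 12, 18, 14]

mx, my = sum(x) / len(x), sum(y) / len(y)
ss_x = sum((xi - mx) ** 2 for xi in x)                      # sum of squares for X
ss_y = sum((yi - my) ** 2 for yi in y)                      # sum of squares for Y
sp_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # sum of products

r = sp_xy / sqrt(ss_x * ss_y)
print(r)  # 0.8
```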
OUTLIERS
Outliers were defined as very extreme scores that require special attention because of their potential impact on a summary of data. This is also true when outliers appear among sets of paired scores. Although quantitative techniques can be used to detect these outliers, we simply focus on dots in scatterplots that deviate visibly from the main dot cluster.
4.10 OTHER TYPES OF CORRELATION COEFFICIENTS
There are many other types of correlation coefficients, but we will discuss only several that are direct descendants of the Pearson correlation coefficient. Although designed originally for use with quantitative data, the Pearson r has been extended, sometimes under the guise of new names and customized versions of Formula 6.1, to other kinds of situations. For example, to describe the correlation between ranks assigned independently by two judges to a set of science projects, simply substitute the numerical ranks into the formula, then solve for a value of the Pearson r (also referred to as Spearman's rho coefficient for ranked or ordinal data).
Computational Formula for the Correlation Coefficient
The formula for the sample correlation coefficient is:

r = Cov(x, y) / (sx · sy)

where Cov(x, y) is the covariance of x and y, and sx and sy are the sample standard deviations of x and y. The variances of x and y measure the variability of the x scores and y scores around their respective sample means, considered separately. The covariance measures the variability of the (x, y) pairs around the mean of x and mean of y, considered simultaneously.
Example:
To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight. To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.
Table: Gestational age (weeks) by infant ID, with squared deviations from the mean gestational age.
Next, we summarize the birth weight data and compute the mean birth weight. The variance of birth weight is computed just as we did for gestational age, as shown in the table below.

Table: Birth weight by infant ID, with squared deviations from the mean birth weight.
To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, that is, to form the product (x − x̄)(y − ȳ) for each pair. The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.

Table: Products of the paired deviations from the mean gestational age and mean birth weight, by infant ID. Total = 28,768.4.
Finally, we can now compute the sample correlation coefficient. Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
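Because the worked table values did not survive in this copy, the sketch below uses a small hypothetical set of (gestational age, birth weight) pairs simply to show the covariance route to r:

```python
import statistics as st  # statistics.covariance requires Python 3.10+

# Hypothetical (gestational age in weeks, birth weight in grams) pairs,
# used only to illustrate the covariance formula for r.
age = [34, 36, 38, 39, 40, 41]
weight = [1895, 2030, 3130, 2900, 3550, 3700]

cov = st.covariance(age, weight)              # sample covariance
r = cov / (st.stdev(age) * st.stdev(weight))  # r = Cov(x, y) / (sx * sy)
print(round(r, 2))  # 0.96, a strong positive correlation
```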
4.11 Regression
A predictive modeling technique that evaluates the relation between dependent (i.e., the target) variables and independent variables is known as regression analysis. Regression analysis can be used for forecasting, time series modeling, or finding the relation between variables and predicting continuous values. For example, the relationship between household location and the power bill of the household is best studied through regression.
We can analyze data and perform data modeling using regression analysis. Here, we create a decision boundary/line according to the data points, such that the differences between the distances of data points from the curve or line are minimized.
Need for Regression Techniques
The applications of regression analysis, the advantages of linear regression, and the benefits of the regression method of forecasting can help a small business, and indeed any business, create a better understanding of the variables (or factors) that can impact its success in the coming weeks, months, and years into the future.
Data are essential figures that define the complete business. Regression analysis helps to analyze these numbers and helps big firms and businesses make better decisions. Regression forecasting is analyzing the relationships between data points, which can help you to peek into the future.
9 Types of Regression Analysis
The types of regression analysis that we are going to study here are:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Logistic Regression
5. Ridge Regression
6. Lasso Regression
7. Bayesian Linear Regression
There are also some algorithms we use to train a regression model to create predictions with continuous values:
8. Decision Tree Regression
9. Random Forest Regression
There are various types of regression models for creating predictions. These techniques are mostly driven by three prime attributes: first, the number of independent variables; second, the type of dependent variable; and lastly, the shape of the regression line.
1) Simple Linear Regression
Linear regression is the most basic form of regression algorithm in machine learning. The model assumes a linear relationship between a single independent variable and the dependent variable. When the number of independent variables increases, the model is called multiple linear regression.
y = mx + c + e

where m is the slope, c is the intercept, and e is the error term. The best-fit line is determined by varying the values of m and c over different combinations. The difference between the observed values and the predicted values is called the prediction error. The values of m and c are selected so as to minimize the prediction error.
2) Multiple Linear Regression
Simple linear regression allows a data scientist or data analyst to make predictions about one variable by training the model on another variable. In a similar way, a multiple regression model extends to more than one predictor variable.
Simple linear regression uses the following linear function to predict the value of a target variable y from an independent variable x:

y = b0 + b1·x

To minimize the squared error, we obtain the parameters b0 and b1 that best fit the data after fitting the linear equation to observed data.
3) Polynomial Regression
In a polynomial regression, the power of the independent variable is more than 1. The equation below represents a polynomial equation:

y = a + b·x²
In this regression technique, the best-fit line is not a straight line. It is rather a curve that fits into the data points.
4) Logistic Regression
Logistic regression is a type of regression technique used when the dependent variable is discrete, for example 0 or 1, or true or false. This means the target variable can have only two values, and a sigmoid function shows the relation between the target variable and the independent variable. The logistic function is used in logistic regression to create a relation between the target variable and the independent variables. The equation below denotes the logistic regression, the sigmoid function applied to a linear combination of the inputs:

p = 1 / (1 + e^−(β0 + β1·x))
5) Ridge Regression
Ridge regression is another type of regression in machine learning and is usually used when there is a high correlation between the parameters. This is because, as the correlation increases, the least squares estimates remain unbiased but their variance grows. If the collinearity is very high, some bias is therefore deliberately introduced: we add a bias matrix to the equation of ridge regression. It is a powerful regression method where the model is less susceptible to overfitting. Below is the equation used to denote ridge regression, where λ (lambda) resolves the multicollinearity issue:
β = (X^T X + λI)^{-1} X^T y
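The closed form above translates directly to numpy; a sketch with hypothetical data:

```python
import numpy as np

# Hypothetical design matrix X (with an intercept column) and targets y.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.9, 4.1, 6.0, 8.2])
lam = 0.1  # lambda, the ridge penalty

# beta = (X^T X + lambda * I)^(-1) X^T y, computed via a linear solve.
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta)
```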
6) Lasso Regression
Lasso regression performs regularization along with feature selection. It penalizes the absolute size of the regression coefficients. This results in coefficient values getting nearer to zero, a property that differs from ridge regression. Therefore we get feature selection in lasso regression. In the case of lasso regression, only the required parameters are used, and the rest are made zero. This helps avoid overfitting in the model. But if independent variables are highly collinear, then lasso regression chooses only one variable and makes the other variables reduce to zero. The equation below represents the lasso regression method:
N^{-1} Σ_{i=1}^{N} f(x_i, y_i, α, β)
7) Bayesian Linear Regression
Bayesian regression is used to find out the value of the regression coefficients. In Bayesian linear regression, the posterior distribution of the features is determined instead of finding the least squares estimates. Bayesian linear regression is a combination of linear regression and ridge regression but is more stable than simple linear regression.
Now, we will learn some types of regression analysis which can be used to train regression models to create predictions with continuous values.
8) Decision Tree Regression
The decision tree, as the name suggests, works on the principle of conditions. It is efficient and has strong algorithms used for predictive analysis. It mainly has attributes that include internal nodes, branches, and terminal (leaf) nodes. Every internal node holds a "test" on an attribute, branches hold the conclusions of the tests, and every leaf node means a class label. It is used for both classification and regression, which are both supervised learning tasks. Decision trees are extremely sensitive to the data they are trained on: little changes to the training set can bring about fundamentally different tree structures.
9) Random Forest Regression
Random forest builds on decision trees by permitting every individual tree to randomly sample from the dataset with replacement, bringing about various trees. This is known as bagging.
4.12 Regression
A correlation analysis of the exchange of greeting cards by five friends for the most recent holiday season suggests a strong positive relationship between cards sent and cards received. When informed of these results, another friend, Emma, who enjoys receiving greeting cards, asks you to predict how many cards she will receive during the next holiday season, assuming that she plans to send 11 cards.
TWO ROUGH PREDICTIONS
Predict “Relatively Large Number”
You could offer Emma a very rough prediction by recalling that cards sent and received tend
to occupy similar relative locations in their respective distributions. Therefore, Emma can
expect to receive a relatively large number of cards, since she plans to send a relatively large
number of cards.
Predict “between 14 and 18 Cards”
To obtain a slightly more precise prediction for Emma, refer to the scatter plot for the original
five friends shown in Figure 7.1. Notice that Emma’s plan to send 11 cards locates her along
the X axis between the 9 cards sent by Steve and the 13 sent by Doris. Using the dots for
Steve and Doris as guides, construct two strings of arrows, one beginning at 9 and ending at
18 for Steve and the other beginning at 13 and ending at 14 for Doris. [The direction of the
arrows reflects our attempt to predict cards received (Y) from cards sent (X). Although not
required, it is customary to predict from X to Y.] Focusing on the interval along the Y axis
between the two strings of arrows, you could predict that Emma’s return should be between
14 and 18 cards, the numbers received by Doris and Steve.
Figure: A rough prediction for Emma (using dots for Steve and Doris)
Regression Line
A regression line is a line which is used to describe the behavior of a set of data.
Placement of Line
For the time being, forget about any prediction for Emma and concentrate on how the five dots dictate the placement of the regression line. If all five dots had defined a single straight line, placement of the regression line would have been simple; merely let it pass through all dots. When the dots fail to define a single straight line, as in the scatterplot for the five friends, placement of the regression line represents a compromise. It passes through the main cluster, possibly touching some dots but missing others.
Predictive Errors
Figure 4.13.2 illustrates the predictive errors that would have occurred if the regression line had been used to predict the number of cards received by the five friends. Solid dots reflect the actual number of cards received, and open dots, always located along the regression line, reflect the predicted number of cards received. The largest predictive error, shown as a broken vertical line, occurs for Steve, who sent 9 cards. Although he actually received 18 cards, he should have received slightly fewer than 14 cards, according to the regression line. The smallest predictive error, none whatsoever, occurs for Mike, who sent 7 cards. He actually received the 12 cards that he should have received, according to the regression line.

Figure 4.13.2: Prediction of 15.20 for Emma (using the regression line).
Figure 4.13.3: Predictive errors.
Total Predictive Error
Engage in the seemingly silly activity of predicting what is known already for the five friends to check the adequacy of our predictive effort. The smaller the total for all predictive errors in Figure 4.13.3, the more favorable will be the prognosis for our predictions. Clearly, it is desirable for the regression line to be placed in a position that minimizes the total predictive error, that is, that minimizes the total of the vertical discrepancies between the solid and open dots shown in Figure 4.13.3.
Progress Check *4.13.1 To check your understanding of the first part of this chapter, make predictions using the following graph.
To avoid the arithmetic standoff of zero always produced by adding positive and negative predictive errors (associated with errors above and below the regression line, respectively), the placement of the regression line minimizes not the total predictive error but the total squared predictive error, that is, the total for all squared predictive errors. When located in this fashion, the regression line is often referred to as the least squares regression line. Although more difficult to visualize, this approach is consistent with the original aim: to minimize the total predictive error or some version of the total predictive error, thereby providing a more favorable prognosis for our predictions.
Need a Mathematical Solution
Without the aid of mathematics, the search for a least squares regression line would be frustrating. Scatterplots would be proving grounds cluttered with tentative regression lines, discarded because of their excessively large totals for squared discrepancies. Even the most time-consuming, conscientious effort would culminate in only a close approximation to the least squares regression line.
Least Squares Regression Equation
Happily, an equation pinpoints the exact least squares regression line for any scatterplot. Most generally, this equation reads:

Y′ = bX + a    (1)

where Y′ represents the predicted value (the predicted number of cards that will be received by any new friend, such as Emma); X represents the known value (the known number of cards sent by any new friend); and b and a represent numbers calculated from the original correlation analysis, as described next.
Finding Values of b and a
To obtain a working regression equation, solve each of the following expressions, first for b and then for a, using data from the original correlation analysis. The expression for b reads:

b = r · √(SSy / SSx)    (2)

where r represents the correlation between X and Y (cards sent and received by the five friends); SSy represents the sum of squares for all Y scores (the cards received by the five friends); and SSx represents the sum of squares for all X scores (the cards sent by the five friends).
The expression for a reads:

a = Ȳ − bX̄    (3)

where Ȳ and X̄ refer to the sample means for all Y and X scores, respectively, and b is defined by the preceding expression.
The values of all terms in the expressions for b and a can be obtained from the original correlation analysis either directly, as with the value of r, or indirectly, as with the values of the remaining terms: SSy, SSx, Ȳ, and X̄. Table 4.14.1 illustrates the computational sequence that produces a least squares regression equation for the greeting card example, namely,

Y′ = .80(X) + 6.40

where .80 and 6.40 represent the values computed for b and a, respectively.
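These two expressions are easy to verify in Python. The five pairs below are inferred from values quoted in this unit, and they reproduce b = .80 and a = 6.40:

```python
from math import sqrt

# (cards sent, cards received); pairs inferred from values quoted in this unit.
x = [1, 5, 7, 9, 13]
y = [6, 10, 12, 18, 14]

mx, my = sum(x) / len(x), sum(y) / len(y)
ss_x = sum((xi - mx) ** 2 for xi in x)
ss_y = sum((yi - my) ** 2 for yi in y)
sp_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

r = sp_xy / sqrt(ss_x * ss_y)
b = r * sqrt(ss_y / ss_x)  # slope of the least squares line
a = my - b * mx            # intercept
print(b, a)                # 0.8 and 6.4
print(b * 11 + a)          # 15.2, the prediction for Emma's 11 cards
```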
4.13 Standard Error of Estimate, sy|x
Although we predicted that Emma's investment of 11 cards will yield a return of 15.20 cards, we would be surprised if she actually received 15 cards. It is more likely that, because of the imperfect relationship between cards sent and cards received, Emma's return will be some number other than 15. Although designed to minimize predictive error, the least squares equation does not eliminate it. Therefore, our next task is to estimate the amount of error associated with our predictions. The smaller the estimated error is, the better the prognosis will be for our predictions.
Finding the Standard Error of Estimate
The estimate of error for new predictions reflects our failure to predict the number of cards received by the original five friends, as depicted by the discrepancies between solid and open dots in Figure 7.15. Known as the standard error of estimate and symbolized as sy|x, this estimate of predictive error complies with the general format for any sample standard deviation, that is, the square root of a sum of squares term divided by its degrees of freedom. (See Formula 4.10 on page 76.) The formula for sy|x reads:

sy|x = √(SSy|x / (n − 2))    (4)

where the sum of squares term in the numerator, SSy|x, represents the sum of the squares for predictive errors, Y − Y′, and the degrees of freedom term in the denominator, n − 2, reflects the loss of two degrees of freedom because any straight line, including the regression line, can be made to coincide with two data points. The symbol sy|x is read as "s sub y given x."
Although we can estimate the overall predictive error by dealing directly with predictive errors, Y − Y′, it is more efficient to use the following computation formula:

sy|x = √(SSy(1 − r²) / (n − 2))    (5)
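Both routes give the same value for the card data. The sketch below checks the computation formula against the direct definition, using the same inferred pairs as before:

```python
from math import sqrt

x = [1, 5, 7, 9, 13]
y = [6, 10, 12, 18, 14]
b, a, r = 0.8, 6.4, 0.8  # values computed above
n = len(y)

# Direct definition: square root of SSy|x divided by (n - 2).
ss_y_given_x = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
print(sqrt(ss_y_given_x / (n - 2)))  # about 3.10

# Computation formula: uses SSy and r instead of the individual errors.
my = sum(y) / n
ss_y = sum((yi - my) ** 2 for yi in y)      # 80 for these data
print(sqrt(ss_y * (1 - r ** 2) / (n - 2)))  # about 3.10, the same value
```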
4.14 Interpretation of r²
The squared correlation coefficient, r², provides us with not only a key interpretation of the correlation coefficient but also a measure of predictive accuracy that supplements the standard error of estimate, sy|x. (Remember, we engage in the seemingly silly activity of predicting that which we already know not as an end in itself, but as a way to check the adequacy of our predictive effort.) Paradoxically, even though our ultimate goal is to show the relationship between r² and predictive accuracy, we will initially concentrate on two kinds of predictive errors: those due to the repetitive prediction of the mean and those due to the regression equation.
For the sake of the present argument, pretend that we know the Y scores but not the corresponding X scores. Lacking information about the relationship between X and Y scores, under these circumstances statisticians recommend repetitive predictions of the mean, Ȳ, for a variety of reasons, including the fact that, although the predictive error for any individual might be quite large, the sum of all of the resulting five predictive errors (deviations of Y scores about Ȳ) always equals zero, as you may recall from Section 3.3. Most important for our purposes, using the repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us with a frame of reference against which to evaluate our customary predictive effort based on the correlation between cards sent (X) and cards received (Y). Any predictive effort that capitalizes on an existing correlation between X and Y should be able to generate a smaller error variability, and, conversely, more accurate predictions of Y, than a primitive effort based only on the repetitive prediction of Ȳ.

Figure 4.16: Violation of homoscedasticity assumption. (Dots lack equal variability about all line segments.)
Predictive Errors
Panel A of Figure 4.16 shows the predictive errors for all five friends when the mean for all five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their five Y scores. Panel B shows the corresponding predictive errors for all five friends when a series of different Y′ values, obtained from the least squares equation, is used to predict each of their five Y scores. For example, panel A shows the error for John when the mean for all five friends, Ȳ, of 12 is used to predict his Y score of 6. Shown as a broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6) indicates that Ȳ overestimates John's Y score by 6 cards. Panel B shows a smaller error of −1.20 for John when a Y′ value of 7.20 is used to predict the same Y score of 6. This Y′ value of 7.20 is obtained from the least squares equation, where the number of cards sent by John, 1, has been substituted for X. Positive and negative errors indicate that Y scores are either above or below their corresponding predicted scores. Overall, as expected, errors are smaller when customized predictions of Y′ from the least squares equation can be used (because X scores are known) than when only the repetitive prediction of Ȳ can be used (because X scores are ignored). As with most statistical phenomena, there are exceptions: the predictive error for Doris is slightly larger when the least squares equation is used.
Error Variability (Sum of Squares)
To more precisely evaluate the accuracy of our two predictive efforts, we need some measure of the collective errors produced by each effort. It probably will not surprise you that the sum of squares qualifies for this role. The sum of squares of any set of deviations, now called errors, can be calculated by first squaring each error (to eliminate negative signs), then summing all squared errors. The error variability for the repetitive prediction of the mean can be designated as SSy, since each Y score is expressed as a squared deviation from Ȳ and then summed, that is,

SSy = Σ(Y − Ȳ)²
The error variability for the customized predictions from the least squares equation can be designated as SSy|x, since each Y score is expressed as a squared deviation from its corresponding Y′ and then summed, that is,

SSy|x = Σ(Y − Y′)²
Proportion of Predicted Variability
If you think about it, SSy measures the total variability of Y scores that occurs after only primitive predictions based on Ȳ are made (because X scores are ignored), while SSy|x measures the residual variability of Y scores that remains after customized least squares predictions are made (because X scores are used). The error variability of 28.8 for the least squares predictions is much smaller than the error variability of 80 for the repetitive prediction of Ȳ, confirming the greater accuracy of the least squares predictions apparent in Figure 4.16. To obtain an SS measure of the actual gain in accuracy due to the least squares predictions, subtract the residual variability from the total variability, that is, subtract SSy|x from SSy, to obtain

SSy − SSy|x = 80 − 28.8 = 51.2

To express this difference, 51.2, as a gain in accuracy relative to the original error variability for the repetitive prediction of Ȳ, divide the above difference by SSy, that is,

(SSy − SSy|x) / SSy = 51.2 / 80 = .64
This result, .64 or 64 percent, represents the proportion or percent gain in predictive accuracy when the repetitive prediction of Ȳ is replaced by a series of customized Y′ predictions based on the least squares equation. In other words, .64 or 64 percent represents the proportion or percent of the total variability of SSy that is predictable from its relationship with the X variable. To the delight of statisticians, when squared, the value of the correlation coefficient equals this proportion of predictable variability. Recalling that an r of .80 was obtained for the correlation between cards sent and cards received by the five friends, we can verify that r² = (.80)(.80) = .64, which, of course, also is the proportion of predictable variability. Given this perspective,

The square of the correlation coefficient, r², always indicates the proportion of total variability in one variable that is predictable from its relationship with the other variable.
Expressing the equation for r² in symbols, we have:

r² = SSy′ / SSy    (4.16)

where the one new sum of squares term, SSy′, is simply the variability explained by or predictable from the regression equation, that is,

SSy′ = Σ(Y′ − Ȳ)²

Accordingly, r² provides us with a straightforward measure of the worth of our least squares predictive effort.
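A quick numeric check of this identity with the same inferred card data:

```python
x = [1, 5, 7, 9, 13]
y = [6, 10, 12, 18, 14]
b, a = 0.8, 6.4
my = sum(y) / len(y)

ss_y = sum((yi - my) ** 2 for yi in y)                          # total variability: 80
ss_res = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))  # residual: 28.8
ss_pred = sum(((b * xi + a) - my) ** 2 for xi in x)             # predicted: 51.2

print(ss_pred / ss_y)          # 0.64, which equals r squared
print((ss_y - ss_res) / ss_y)  # 0.64 again, the same proportion
```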
r² Does Not Apply to Individual Scores
Do not attempt to apply the variability interpretation of r² to individual scores. For instance, the fact that 64 percent of the variability in cards received by the five friends (Y) is predictable from their cards sent (X) does not signify, therefore, that 64 percent of the five friends' Y scores can be predicted perfectly. As can be seen in panel B of the figure, only one of the Y scores for the five friends, the 12 cards received by Mike, was predicted perfectly (because it coincides with the regression line for the least squares equation), and even this perfect prediction is not guaranteed just because r² equals .64. To the contrary, the 64 percent must be interpreted as applying to the variability for the entire set of Y scores. The total variability of all Y scores, as measured by SSy, can be reduced by 64 percent when each Y score is replaced by its corresponding predicted Y′ score and then expressed as a squared deviation from the mean of all observed scores. Thus, the 64 percent represents a reduction in the total variability for the five Y scores when they are replaced by a succession of predicted scores, given the least squares equation and various values of X.
Small Values of r²
When transposed from r to r², Cohen's guidelines, mentioned on page 114, state that a value of r² in the vicinity of .01, .09, or .25 reflects a weak, moderate, or strong relationship, respectively. Do not expect to routinely encounter large values of r² in behavioral and educational research. In these areas, where measures of complex phenomena, such as intellectual aptitude, psychopathic tendency, or self-esteem, fail to correlate highly with any single variable, values of r² larger than about .25 are most unlikely. However, even values of r² close to zero might merit our attention. For instance, if just .04 (or 4 percent) of the variability of mental health scores of sixth graders actually could be predicted from a single variable, such as differences in weaning age, many investigators would probably view this as an important finding, worthy of additional investigation.
r² Doesn't Ensure Cause-Effect
The question of cause-effect, raised in Section 6.3, cannot be resolved merely by squaring the correlation coefficient to obtain a value of r². If the correlation between mental health scores of sixth graders and their weaning ages as infants equals .20, we cannot claim, therefore, that (.20)(.20) = .04 or 4 percent of the total variability in mental health scores is caused by the differences in weaning ages. Instead, it is possible that this correlation reflects some more basic factor or factors, such as, for example, a tendency for more economically secure, less stressed mothers both to create a family environment that perpetuates good mental health and, coincidentally, to nurse their infants longer. Certainly, in the absence of additional evidence, it would be foolhardy to encourage mothers, regardless of their circumstances, to postpone weaning because of its projected effect on mental health scores. Although we have consistently referred to r² as indicating the proportion or percent of predictable variability, you also might encounter references to r² as indicating the proportion or percent of explained variability. In this context, "explained" signifies only predictability, not causality. Thus, you could assert that .04, or 4 percent, of the variability in mental health scores is "explained" by differences in weaning age, insofar as .04, or 4 percent, is predictable from, or statistically attributable to, differences in weaning age.
4.16 Multiple Regression Equations
Any serious predictive effort usually culminates in a more complex equation that contains not just one but several X, or predictor, variables. For instance, a serious effort to predict college GPA might culminate in the following equation:

Y′ = .410(X1) + .005(X2) + .001(X3) + 1.03

where Y′ represents predicted college GPA and X1, X2, and X3 refer to high school GPA, IQ score, and SAT score, respectively. By capitalizing on the combined predictive power of several predictor variables, these multiple regression equations supply more accurate predictions for Y′ (often referred to as the criterion variable) than could be obtained from a simple regression equation.
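Fitting such an equation is a direct extension of the least squares idea. A sketch with numpy's least squares solver; the student data are hypothetical:

```python
import numpy as np

# Hypothetical predictors: high school GPA, IQ, SAT (one row per student).
X = np.array([
    [3.2, 110, 1200],
    [2.8, 105, 1050],
    [3.9, 125, 1400],
    [3.5, 115, 1300],
    [2.5, 100, 980],
])
y = np.array([3.0, 2.6, 3.8, 3.4, 2.3])  # college GPA (criterion variable)

# Append a column of ones so the solver also estimates the constant term.
X1 = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # three b weights followed by the constant
```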
Common Features
Although more difficult to visualize, multiple regression equations possess many features in common with their simple counterparts. For instance, they still qualify as least squares equations, since they minimize the sum of the squared predictive errors. By the same token, they are accompanied by standard errors of estimate that roughly measure the average amounts of predictive error. Be assured, therefore, that this chapter will serve as a good point of departure if, sometime in the future, you must deal with multiple regression equations.
4.17 Regression toward the Mean
Regression toward the mean refers to a tendency for scores, particularly extreme scores, to shrink toward the mean. This tendency often appears among subsets of observations whose values are extreme and at least partly due to chance. For example, because of regression toward the mean, we would expect that students who made the top five scores on the first statistics exam would not make the top five scores on the second statistics exam. Although all five students might score above the mean on the second exam, some of their scores would regress back toward the mean.
Most likely, the top five scores on the first exam reflect two components. One relatively permanent component reflects the fact that these students are superior because of good study habits, a strong aptitude for quantitative reasoning, and so forth. The other relatively transitory component reflects the fact that, on the day of the exam, at least some of these students were very lucky because all sorts of little chance factors, such as restful sleep, a pleasant commute to campus, etc., worked in their favor. On the second test, even though the scores of these five students continue to reflect an above-average permanent component, some of their scores will suffer because of less good luck or even bad luck. The net effect is that the scores of at least some of the original five top students will drop below the top five scores, that is, regress back toward the mean, on the second exam.
(When significant regression toward the mean occurs after a spectacular performance by, for example, a rookie athlete or a first-time author, the term sophomore jinx often is invoked.) There is good news for those students who made the five lowest scores on the first exam. Although all five students might score below the mean on the second exam, some of their scores probably will regress up toward the mean. On the second exam, some of them will not be as unlucky. The net effect is that the scores of at least some of the original five lowest-scoring students will move above the bottom five scores, that is, regress up toward the mean, on the second exam.
Appears in Many Distributions
Regression toward the mean appears among subsets of extreme observations for a wide variety of distributions. Incidentally, it is not true that, viewed as a group, all major league hitters are headed toward mediocrity. Hitters among the top 10 in 2014, who were not among the top 10 in 2015, were replaced by other mostly above-average hitters, who also were very lucky during 2015. Observed regression toward the mean occurs for individuals or subsets of individuals, not for entire groups.
The Regression Fallacy
The regression fallacy is committed whenever regression toward the mean is interpreted as a real, rather than a chance, effect. A classic example of the regression fallacy occurred in an Israeli Air Force study of pilot training.