Unit III - The Normal Curve
The idealized normal curve has been superimposed on the original distribution for 3091 men. Irregularities in the original distribution, most likely due to chance, are ignored by the smooth normal curve. Accordingly, any generalizations based on the smooth normal curve will tend to be more accurate than those based on the original distribution.
Interpreting the Shaded Area
The total area under the normal curve can be identified with all FBI applicants. Viewed relative to the total area, the shaded area represents the proportion of applicants who will be eligible because they are shorter than exactly 66 inches. This new, more accurate proportion will differ from that obtained from the original histogram (.165) because of discrepancies between the two distributions.
Finding a Proportion for the Shaded Area
To find this new proportion, we cannot rely on the vertical scale. It describes as proportions the areas in the rectangular bars of histograms, not the areas in the various curved sectors of the normal curve.
Properties of the Normal Curve
The mean, mode and median are all equal.
The curve is symmetric about the center (i.e., around the mean, μ).
Exactly half of the values are to the left of center and exactly half the values are to the right.
The total area under the curve is 1.
4.1 THE NORMAL CURVE
When using the normal curve, two bits of information are indispensable: values for the mean and the standard deviation. For example, before the normal curve can be used to answer the question about eligible FBI applicants, it must be established that, for the original distribution of 3091 men, the mean height equals 69 inches and the standard deviation equals 3 inches.
Different Normal Curves
For example, changing the mean height from 69 to 79 inches produces a new normal curve that, as shown in panel A, is displaced 10 inches to the right of the original curve. Dramatically new normal curves are produced by changing the value of the standard deviation: changing the standard deviation from 3 to 1.5 inches produces a more peaked normal curve with smaller variability, whereas changing the standard deviation from 3 to 6 inches produces a shallower normal curve with greater variability.
Obvious differences in appearance among normal curves are less important than you might suspect. Because of their common mathematical origin, every normal curve can be interpreted in exactly the same way once any distance from the mean is expressed in standard deviation units.
When the normal curve is used to describe a complete set of observations, or a population, the symbols μ and σ represent the mean and standard deviation of the population, respectively.
4.2 z SCORES
A z score is defined as

z = (X − μ) / σ

where X is the original score and μ and σ are the mean and the standard deviation, respectively. A z score consists of two parts:
1. a positive or negative sign indicating whether it's above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation units.
Example:
You have a test score of 190. The test has a mean (μ) of 150 and a standard deviation (σ) of 25. Assuming a normal distribution, your z score would be:

z = (X − μ) / σ = (190 − 150) / 25 = 1.60
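As a quick check, the conversion can be carried out in a few lines of Python. This is a minimal sketch of the formula above, not a required implementation:

```python
def z_score(x, mu, sigma):
    """Convert an original score x to a z score."""
    return (x - mu) / sigma

# Test score example: mean 150, standard deviation 25.
print(z_score(190, 150, 25))  # 1.6
```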
4.3 STANDARD NORMAL CURVE
If the original distribution approximates a normal curve, then the shift to standard or z scores will always produce a new distribution that approximates the standard normal curve. This is the one normal curve for which a table is actually available. It is a mathematical fact, not proven in this book, that the standard normal curve always has a mean of 0 and a standard deviation of 1. However, you can verify these values. To check that the mean of a standard normal distribution equals 0, replace X in the z-score formula with μ, the mean of any (nonstandard) normal distribution, and solve for z:

z = (μ − μ) / σ = 0

To check that the standard deviation equals 1, replace X in the z-score formula with μ + 1σ, the value corresponding to one standard deviation above the mean, and solve for z:

z = ((μ + 1σ) − μ) / σ = 1
Although there are infinitely many different normal curves, each with its own mean and standard deviation, there is only one standard normal curve, with a mean of 0 and a standard deviation of 1.
Figure: Converting three normal curves to the standard normal curve.
Standard Normal Table
The standard normal table consists of columns of z scores coordinated with columns of proportions. In a typical problem, access to the table is gained through a z score, such as –1.00, and the answer is read as a proportion, such as the proportion of eligible FBI applicants.
Using the Top Legend of the Table
The entries in column A are z scores, beginning with 0.00 and ending (in the full-length table of Appendix C) with 4.00. Given a z score of zero or more, columns B and C indicate how the z score splits the area in the upper half of the normal curve. As suggested by the shading in the top legend, column B indicates the proportion of area between the mean and the z score, and column C indicates the proportion of area beyond the z score, in the upper tail of the standard normal curve.
Using the Bottom Legend of the Table
Because of the symmetry of the normal curve, the entries in Table A of Appendix C also can refer to the lower half of the normal curve. Now the columns are designated as A′, B′, and C′ in the legend at the bottom of the table. When using the bottom legend, all entries refer to the lower half of the standard normal curve. Imagine that the nonzero entries in column A′ are negative z scores, beginning with –0.01 and ending (in the full-length table of Appendix C) with –4.00. Given a negative z score, columns B′ and C′ indicate how that z score splits the lower half of the normal curve. As suggested by the shading in the bottom legend of the table, column B′ indicates the proportion of area between the mean and the negative z score, and column C′ indicates the proportion of area beyond the negative z score, in the lower tail of the standard normal curve.
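The proportions in columns B, C, B′, and C′ can all be reproduced with the standard normal cumulative distribution function. The sketch below uses scipy.stats.norm, one convenient choice among many, to mimic the table's two legends:

```python
from scipy.stats import norm

def table_entries(z):
    """Mimic the standard normal table for a nonnegative z score.

    Column B: area between the mean and z.
    Column C: area beyond z, in the upper tail.
    """
    b = norm.cdf(z) - 0.5   # area between the mean (cdf = .5) and z
    c = 1 - norm.cdf(z)     # area in the upper tail beyond z
    return b, c

# By symmetry, a negative z score yields the same two proportions,
# now read as columns B' (mean to z) and C' (lower tail beyond z).
print(table_entries(1.00))  # (0.3413..., 0.1586...)
```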
SOLVING NORMAL CURVE PROBLEMS
There are two main types of normal curve problems. In the first type of problem, we use a known score (or scores) to find an unknown proportion. For instance, we use the known score of 66 inches to find the unknown proportion of eligible FBI applicants. In the second type of problem, the procedure is reversed. Now we use a known proportion to find an unknown score (or scores).
Interpretation of Table A, Appendix C
When using the standard normal table, it is important to remember that for any z score, the corresponding proportions in columns B and C (or columns B′ and C′) always sum to .5000. Similarly, the total area under the normal curve always equals 1.0000, the sum of the proportions in the lower and upper halves, that is, .5000 + .5000. Finally, although a z score can be either positive or negative, the proportions of area under the curve are always positive or zero but never negative (because an area cannot be negative).
4.4 FINDING PROPORTIONS
1. Sketch a normal curve and shade in the target area, as in the left part of the figure. Being less than the mean of 69, 66 is located to the left of the mean. Furthermore, since the unknown proportion represents those applicants who are shorter than 66 inches, the shaded target sector is located to the left of 66.
2. Plan your solution according to the normal table. Decide precisely how you will find the value of the target area. In the present case, the answer will be obtained from column C′ of the standard normal table, since the target area coincides with the type of area identified with column C′, that is, the area in the lower tail beyond a negative z.
3. Convert X to z. Express 66 as a z score:

z = (X − μ) / σ = (66 − 69) / 3 = −1.00
4. Find the target area. Refer to the standard normal table, using the bottom legend, as the z score is negative. The arrows in Table 5.1 show how to read the table. Look up column A′ to 1.00 (representing a z score of –1.00), and note the corresponding proportion of .1587 in column C′. This is the answer, as suggested in the right part of the figure. It can be concluded that only .1587 (or .16) of all of the FBI applicants will be shorter than 66 inches.
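The same answer can be checked numerically. A brief sketch, assuming scipy is available, with the stated mean of 69 and standard deviation of 3:

```python
from scipy.stats import norm

# Proportion of applicants shorter than 66 inches,
# given heights that are normal with mean 69 and sd 3.
proportion = norm.cdf(66, loc=69, scale=3)
print(round(proportion, 4))  # 0.1587
```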
Finding Proportions between Two Scores
Assume that, when not interrupted artificially, the gestation periods for human foetuses approximate a normal curve with a mean of 270 days (9 months) and a standard deviation of 15 days. What proportion of gestation periods will be between 245 and 255 days?
1. Sketch a normal curve and shade in the target area, as in the top panel of the figure. The shaded area represents just those gestation periods between 245 and 255 days.
2. Plan your solution according to the normal table. This type of problem requires more effort to solve because the value of the target area cannot be read directly from Table A. As suggested in the bottom two panels of Figure 5.7, the basic idea is to identify the target area with the difference between two overlapping areas whose values can be read from column C′ of Table A. The larger area (less than 255 days) contains two sectors: the target area (between 245 and 255 days) and a remainder (less than 245 days). The smaller area contains only the remainder (less than 245 days). Subtracting the smaller area (less than 245 days) from the larger area (less than 255 days), therefore, eliminates the common remainder (less than 245 days), leaving only the target area (between 245 and 255 days).
3. Convert X to z by expressing 255 as

z = (255 − 270) / 15 = −15 / 15 = −1.00

and by expressing 245 as

z = (245 − 270) / 15 = −25 / 15 = −1.67
4. Find the target area. Look up column A′ to a negative z score of –1.00 (remember, you must imagine the negative sign), and note the corresponding proportion of .1587 in column C′. Likewise, look up column A′ to a z score of –1.67, and note the corresponding proportion of .0475 in column C′. Subtract the smaller proportion from the larger proportion to obtain the answer, .1112. Thus, only .11, or 11 percent, of all gestation periods will be between 245 and 255 days.
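A numerical check of the same subtraction, again assuming scipy is available:

```python
from scipy.stats import norm

# Gestation periods: normal with mean 270 days and sd 15 days.
larger = norm.cdf(255, loc=270, scale=15)   # area below 255 days
smaller = norm.cdf(245, loc=270, scale=15)  # area below 245 days
print(round(larger - smaller, 4))  # 0.1109; the table answer, .1112, uses z rounded to -1.67
```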
Finding Proportions beyond Two Scores
Assume that high school students' IQ scores approximate a normal distribution with a mean of 105 and a standard deviation of 15. What proportion of IQs are more than 30 points either above or below the mean?
1. Sketch a normal curve and shade in the two target areas, as in the top panel of the figure.
2. Plan your solution according to the normal table. The solution to this type of problem is straightforward because each of the target areas can be read directly from Table A.
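For the IQ example, each target area lies two standard deviations from the mean (30/15 = 2), so the two tails can be checked as follows, assuming scipy:

```python
from scipy.stats import norm

# IQ scores: normal with mean 105 and sd 15.
# "More than 30 points above or below the mean" means beyond z = +2 or z = -2.
upper_tail = 1 - norm.cdf(135, loc=105, scale=15)
lower_tail = norm.cdf(75, loc=105, scale=15)
print(round(upper_tail + lower_tail, 4))  # 0.0455
```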
4.5 FINDING SCORES
So far, Table A has been consulted to find the unknown proportion (of area) associated with some known score or pair of known scores. For instance, given a GRE score of 650, we found that the unknown proportion of scores larger than 650 equals .07. Now we will concentrate on the opposite type of normal curve problem, for which Table A must be consulted to find the unknown score or scores associated with some known proportion. For instance, given that a GRE score must be in the upper 25 percent of the distribution (in order for an applicant to be considered for admission to graduate school), we must find the unknown minimum GRE score.
Finding One Score
Exam scores for a large psychology class approximate a normal curve with a mean of 230 and a standard deviation of 50. Furthermore, students are graded "on a curve," with only the upper 20 percent being awarded grades of A. What is the lowest score on the exam that receives an A?
1. Sketch a normal curve and, on the correct side of the mean, draw a line representing the target score, as in the figure. This is often the most difficult step, and it involves semantics rather than statistics. It's often helpful to visualize the target.
2. Plan your solution according to the normal table. Since the target score is on the right side of the mean, concentrate on the area in the upper half of the normal curve, as described in columns B and C. The right panel of Figure 5.9 indicates that either column B or C can be used to locate a z score in column A. It is crucial, however, to search for the single value (.3000) that is valid for column B or the single value (.2000) that is valid for column C. Note that we look in column B for .3000, not for .8000. Table A is not designed for sectors, such as the lower .8000, that span the mean of the normal curve.
3. Find z. The entry in column C closest to .2000 is .2005, and the corresponding z score in column A equals 0.84. Verify this by checking Table A. Also note that exactly the same z score of 0.84 would have been identified if column B had been searched to find the entry (.2995) nearest to .3000. The z score of 0.84 represents the point that separates the upper 20 percent of the area from the rest of the area under the normal curve.
4. Convert z to the target score. Finally, convert the z score of 0.84 into an exam score, given a distribution with a mean of 230 and a standard deviation of 50. You'll recall that a z score indicates how many standard deviations the original score is above or below its mean. In the present case, the target score must be located .84 of a standard deviation above its mean. The distance of the target score above its mean equals 42 (from .84 × 50), which, when added to the mean of 230, yields a value of 272. Therefore, 272 is the lowest score on the exam that receives an A.
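The same score can be recovered with the inverse of the cumulative distribution function (the percent point function in scipy). A sketch, assuming that library:

```python
from scipy.stats import norm

# Lowest exam score in the upper 20 percent,
# given scores that are normal with mean 230 and sd 50.
z = norm.ppf(0.80)             # about 0.8416 (the table rounds this to 0.84)
print(round(230 + z * 50, 1))  # about 272.1, matching the table-based 272
```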
Finding Two Scores
Assume that the annual rainfall in the San Francisco area approximates a normal curve with a mean of 22 inches and a standard deviation of 4 inches. What are the rainfalls for the more atypical years, defined as the driest 2.5 percent of all years and the wettest 2.5 percent of all years?
1. Sketch a normal curve. On either side of the mean, draw two lines representing the two target scores. The smaller (driest) target score splits the total area into .0250 to the left and .9750 to the right, and the larger (wettest) target score does the exact opposite.
2. Plan your solution according to the normal table. The target z score can be found by scanning either column B′ for .4750 or column C′ for .0250.
3. Convert z to the target score using

X = μ + (z)(σ)

where X is the target score, expressed in original units of measurement; μ and σ are the mean and the standard deviation, respectively, for the original normal curve; and z is the standard score read from column A or A′.
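A sketch of the rainfall calculation, assuming scipy; the two target scores sit about 1.96 standard deviations on either side of the mean:

```python
from scipy.stats import norm

# Annual rainfall: normal with mean 22 inches and sd 4 inches.
driest = 22 + norm.ppf(0.025) * 4   # cuts off the lowest 2.5 percent
wettest = 22 + norm.ppf(0.975) * 4  # cuts off the highest 2.5 percent
print(round(driest, 2), round(wettest, 2))  # about 14.16 and 29.84 inches
```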
4.6 MORE ABOUT z SCORES
z Scores for Non-normal Distributions
If the original distribution is positively skewed, the distribution of z scores also will be positively skewed. Regardless of the shape of the distribution, the shift to z scores always produces a distribution of standard scores with a mean of 0 and a standard deviation of 1.
Interpreting Test Scores
The evaluation of her test performance is greatly facilitated by converting her raw scores into the z scores listed in the final column. A glance at the z scores suggests that although she did relatively well on the math test, her performance on the English test was only slightly above average, as indicated by a z score of 0.50, and her performance on the psychology test was slightly below average, as indicated by a z score of –0.67. The use of z scores can help you identify a person's relative strengths and weaknesses.
Standard Scores
Whenever any unit-free scores are expressed relative to a known mean and a known standard deviation, they are referred to as standard scores. Although z scores qualify as standard scores because they are unit-free and expressed relative to a known mean of 0 and a known standard deviation of 1, other scores also qualify as standard scores.
Transformed Standard Scores
Being by far the most important standard score, z scores are often viewed as synonymous with standard scores. For convenience, particularly when reporting test results to a wide audience, z scores can be changed to transformed standard scores, other types of unit-free standard scores that lack negative signs and decimal points.
Figure: Common transformed standard scores associated with normal curves.
Converting to Transformed Standard Scores
To convert, use

z′ = desired mean + (z)(desired standard deviation)

where z′ (called z prime) is the transformed standard score and z is the original standard score. For instance, if you wish to convert a z score of –1.50 into a new distribution of z′ scores for which the desired mean equals 500 and the desired standard deviation equals 100, substitute these numbers into Formula 5.3 to obtain

z′ = 500 + (−1.50)(100) = 500 − 150 = 350
The change from a z score of −1.50 to a z′ score of 350 eliminates negative signs and decimal points without distorting the relative location of the original score, expressed as a distance from the mean in standard deviation units.
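A small sketch of this conversion, using the 500/100 scale from the example above:

```python
def to_transformed(z, desired_mean=500, desired_sd=100):
    """Convert a z score to a transformed standard score (z prime)."""
    return desired_mean + z * desired_sd

print(to_transformed(-1.50))  # 350.0
```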
Substitute Pairs of Convenient Numbers
The substitution of other arbitrary pairs of numbers serves no purpose; indeed, because of their peculiarity, they might make the new distribution, even though it lacks the negative signs and decimal points common to z scores, slightly less comprehensible to people who have been exposed to the traditional pairs of numbers.
4.7 CORRELATION
Two variables are related if pairs of scores show an orderliness that can be depicted graphically with a scatterplot and numerically with a correlation coefficient.
Positive Relationship
When relatively low values are paired with relatively low values, and relatively high values are paired with relatively high values, the relationship is positive. This relationship implies "You get what you give."
Negative Relationship
When relatively low values are paired with relatively high values, and relatively high values are paired with relatively low values, the relationship is negative. This relationship implies "You get the opposite of what you give."
Little or No Relationship
If little, if any, relationship exists between the two variables, then "What you get has no bearing on what you give."
4.8 SCATTERPLOTS
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores. With a little training, you can use any dot cluster as a preview of a fully measured relationship. Two variables are positively related if pairs of scores tend to occupy similar relative positions (high with high and low with low) in their respective distributions, and they are negatively related if pairs of scores tend to occupy dissimilar relative positions (high with low and vice versa) in their respective distributions.
Scatterplot for Greeting Card Exchange
The example involving greeting cards has shown the basic idea of correlation and the construction of a scatterplot.
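The scatterplot itself takes only a few lines of matplotlib. The five (sent, received) pairs below are inferred from values quoted later in this unit (John 1/6, Mike 7/12, Steve 9/18, Doris 13/14, plus a fifth pair chosen to match the reported means and sums of squares), so treat them as illustrative:

```python
import matplotlib.pyplot as plt

# (cards sent, cards received); pairs inferred from values quoted in this unit.
sent = [1, 5, 7, 9, 13]
received = [6, 10, 12, 18, 14]

plt.scatter(sent, received)
plt.xlabel("Cards sent (X)")
plt.ylabel("Cards received (Y)")
plt.title("Scatterplot for greeting card exchange")
plt.show()
```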
The first step is to note the tilt or slope, if any, of a dot cluster. A dot cluster that has a slope from the lower left to the upper right, as in panel A of Figure 6.2, reflects a positive relationship. Small values of one variable are paired with small values of the other variable, and large values are paired with large values. In panel A, short people tend to be light, and tall people tend to be heavy.
On the other hand, a dot cluster that has a slope from the upper left to the lower right, as in panel B of the figure, reflects a negative relationship. Small values of one variable tend to be paired with large values of the other variable, and vice versa.
Finally, a dot cluster that lacks any apparent slope, as in panel C of Figure 6.2, reflects little or no relationship. Small values of one variable are just as likely to be paired with small, medium, or large values of the other variable.
Strong or Weak Relationship?
Having established that a relationship is either positive or negative, note how closely the dot cluster approximates a straight line. The more closely the dot cluster approximates a straight line, the stronger (the more regular) the relationship.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect relationship between two variables.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight line and, therefore, reflects a linear relationship. But this is not always the case. Sometimes a dot cluster approximates a bent or curved line, as in Figure 6.4, and therefore reflects a curvilinear relationship.
Look again at the scatterplot in the figure for the greeting card data. Although the small number of dots in Figure 6.1 hinders any interpretation, the dot cluster appears to approximate a straight line, stretching from the lower left to the upper right. This suggests a positive relationship between greeting cards sent and received, in agreement with the earlier intuitive analysis of these data.
Sign of r
A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign indicates a negative relationship.
Numerical Value of r
The more closely a value of r approaches either –1.00 or +1.00, the stronger (more regular) the relationship. Conversely, the more closely the value of r approaches 0, the weaker (less regular) the relationship. For example, an r of –.90 indicates a stronger relationship than does an r of –.70, and an r of –.70 indicates a stronger relationship than does an r of .50. (Remember, if no sign appears, it is understood to be plus.) The value of r is a measure of how well a straight line (representing the linear relationship) describes the cluster of dots in the scatterplot.
Interpretation of r
Located along a scale from –1.00 to +1.00, the value of r supplies information about the direction of a linear relationship, whether positive or negative, and, generally, information about the relative strength of a linear relationship, whether relatively weak (and a poor describer of the data) because r is in the vicinity of 0, or relatively strong (and a good describer of the data) because r deviates from 0 in the direction of either +1.00 or –1.00.
r Is Independent of Units of Measurement
A positive value of r reflects a tendency for pairs of scores to occupy similar relative locations (high with high and low with low) in their respective distributions, while a negative value of r reflects a tendency for pairs of scores to occupy dissimilar relative locations (high with low and vice versa) in their respective distributions.
Figure: Effect of range restriction on the value of r.
The value of r can't be interpreted as a proportion or percentage of some perfect relationship.
4.9 DETAILS: COMPUTATION FORMULA FOR CORRELATION COEFFICIENT
Calculate a value for r by using the following computation formula:

r = SPxy / √(SSx · SSy)

where the two sum of squares terms in the denominator are defined as SSx = Σ(X − X̄)² and SSy = Σ(Y − Ȳ)², and the sum of products term in the numerator is SPxy = Σ(X − X̄)(Y − Ȳ).
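A sketch of this computation in plain Python, using the greeting card pairs inferred earlier in this unit; it reproduces the r of .80 reported later:

```python
from math import sqrt

# (cards sent, cards received); pairs inferred from values quoted in this unit.
x = [1, 5, 7, 9, 13]
y = [6, 10, 12, 18, 14]

mx, my = sum(x) / len(x), sum(y) / len(y)
ss_x = sum((xi - mx) ** 2 for xi in x)                      # sum of squares for X
ss_y = sum((yi - my) ** 2 for yi in y)                      # sum of squares for Y
sp_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # sum of products

r = sp_xy / sqrt(ss_x * ss_y)
print(r)  # 0.8
```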
OUTLIERS
Outliers were defined as very extreme scores that require special attention because of their potential impact on a summary of data. This is also true when outliers appear among sets of paired scores. Although quantitative techniques can be used to detect these outliers, we simply focus on dots in scatterplots that deviate visibly from the main dot cluster.
4.10 OTHER TYPES OF CORRELATION COEFFICIENTS
There are many other types of correlation coefficients, but we will discuss only several that are direct descendants of the Pearson correlation coefficient. Although designed originally for use with quantitative data, the Pearson r has been extended, sometimes under the guise of new names and customized versions of Formula 6.1, to other kinds of situations. For example, to describe the correlation between ranks assigned independently by two judges to a set of science projects, simply substitute the numerical ranks into the formula, then solve for a value of the Pearson r (also referred to as Spearman's rho coefficient for ranked or ordinal data).
Computational Formula for the Correlation Coefficient
The formula for the sample correlation coefficient is:

r = Cov(x, y) / (sx · sy)

where Cov(x, y) is the covariance of x and y, and sx and sy are the sample standard deviations of x and y. The variances of x and y measure the variability of the x scores and y scores around their respective sample means, considered separately. The covariance measures the variability of the (x, y) pairs around the mean of x and mean of y, considered simultaneously.
Example:
To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight. To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.
Table: Gestational age (weeks) by infant ID, with squared deviations from the mean gestational age.
Next, we summarize the birth weight data and compute the mean birth weight. The variance of birth weight is computed just as we did for gestational age, as shown in the table below.

Table: Birth weight by infant ID, with squared deviations from the mean birth weight.
To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, that is, to form the product (x − x̄)(y − ȳ) for each pair. The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.

Table: Products of the paired deviations from the mean gestational age and mean birth weight, by infant ID. Total = 28,768.4.
Finally, we can now compute the sample correlation coefficient. Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
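Because the worked table values did not survive in this copy, the sketch below uses a small hypothetical set of (gestational age, birth weight) pairs simply to show the covariance route to r:

```python
import statistics as st  # statistics.covariance requires Python 3.10+

# Hypothetical (gestational age in weeks, birth weight in grams) pairs,
# used only to illustrate the covariance formula for r.
age = [34, 36, 38, 39, 40, 41]
weight = [1895, 2030, 3130, 2900, 3550, 3700]

cov = st.covariance(age, weight)              # sample covariance
r = cov / (st.stdev(age) * st.stdev(weight))  # r = Cov(x, y) / (sx * sy)
print(round(r, 2))  # 0.96, a strong positive correlation
```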
4.11 Regression
A predictive modeling technique that evaluates the relation between dependent (i.e., the target) variables and independent variables is known as regression analysis. Regression analysis can be used for forecasting, time series modeling, or finding the relation between variables and predicting continuous values. For example, the relationship between household location and the power bill of the household is best studied through regression.
We can analyze data and perform data modeling using regression analysis. Here, we create a decision boundary/line according to the data points, such that the differences between the distances of data points from the curve or line are minimized.
Need for Regression Techniques
The applications of regression analysis, the advantages of linear regression, and the benefits of the regression method of forecasting can help a small business, and indeed any business, create a better understanding of the variables (or factors) that can impact its success in the coming weeks, months, and years into the future.
Data are essential figures that define the complete business. Regression analysis helps to analyze these numbers and helps big firms and businesses make better decisions. Regression forecasting is analyzing the relationships between data points, which can help you to peek into the future.
9 Types of Regression Analysis
The types of regression analysis that we are going to study here are:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Logistic Regression
5. Ridge Regression
6. Lasso Regression
7. Bayesian Linear Regression
There are also some algorithms we use to train a regression model to create predictions with continuous values:
8. Decision Tree Regression
9. Random Forest Regression
There are various types of regression models for creating predictions. These techniques are mostly driven by three prime attributes: first, the number of independent variables; second, the type of dependent variable; and lastly, the shape of the regression line.
1) Simple Linear Regression
Linear regression is the most basic form of regression algorithm in machine learning. The model assumes a linear relationship between a single independent variable and the dependent variable. When the number of independent variables increases, the model is called multiple linear regression.
y = mx + c + e

where m is the slope, c is the intercept, and e is the error term. The best-fit line is determined by varying the values of m and c over different combinations. The difference between the observed values and the predicted values is called the prediction error. The values of m and c are selected so as to minimize the prediction error.
2) Multiple Linear Regression
Simple linear regression allows a data scientist or data analyst to make predictions about one variable by training the model on another variable. In a similar way, a multiple regression model extends to more than one predictor variable.
Simple linear regression uses the following linear function to predict the value of a target variable y from an independent variable x:

y = b0 + b1·x

To minimize the squared error, we obtain the parameters b0 and b1 that best fit the data after fitting the linear equation to observed data.
3) Polynomial Regression
In a polynomial regression, the power of the independent variable is more than 1. The equation below represents a polynomial equation:

y = a + b·x²
In this regression technique, the best-fit line is not a straight line. It is rather a curve that fits into the data points.
4) Logistic Regression
Logistic regression is a type of regression technique used when the dependent variable is discrete, for example 0 or 1, or true or false. This means the target variable can have only two values, and a sigmoid function shows the relation between the target variable and the independent variable. The logistic function is used in logistic regression to create a relation between the target variable and the independent variables. The equation below denotes the logistic regression, the sigmoid function applied to a linear combination of the inputs:

p = 1 / (1 + e^−(β0 + β1·x))
5) Ridge Regression
Ridge regression is another type of regression in machine learning and is usually used when there is a high correlation between the parameters. This is because, as the correlation increases, the least squares estimates remain unbiased but their variance grows. If the collinearity is very high, some bias is therefore deliberately introduced: we add a bias matrix to the equation of ridge regression. It is a powerful regression method where the model is less susceptible to overfitting. Below is the equation used to denote ridge regression, where λ (lambda) resolves the multicollinearity issue:
β = (X^T X + λI)^{-1} X^T y
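The closed form above translates directly to numpy; a sketch with hypothetical data:

```python
import numpy as np

# Hypothetical design matrix X (with an intercept column) and targets y.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.9, 4.1, 6.0, 8.2])
lam = 0.1  # lambda, the ridge penalty

# beta = (X^T X + lambda * I)^(-1) X^T y, computed via a linear solve.
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta)
```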
6) Lasso Regression
Lasso regression performs regularization along with feature selection. It penalizes the absolute size of the regression coefficients. This results in coefficient values getting nearer to zero, a property that differs from ridge regression. Therefore we get feature selection in lasso regression. In the case of lasso regression, only the required parameters are used, and the rest are made zero. This helps avoid overfitting in the model. But if independent variables are highly collinear, then lasso regression chooses only one variable and makes the other variables reduce to zero. The equation below represents the lasso regression method:
N^{-1} Σ_{i=1}^{N} f(x_i, y_i, α, β)
7) Bayesian Linear Regression
Bayesian regression is used to find out the value of the regression coefficients. In Bayesian linear regression, the posterior distribution of the features is determined instead of finding the least squares estimates. Bayesian linear regression is a combination of linear regression and ridge regression but is more stable than simple linear regression.
Now, we will learn some types of regression analysis which can be used to train regression models to create predictions with continuous values.
8) Decision Tree Regression
The decision tree, as the name suggests, works on the principle of conditions. It is efficient and has strong algorithms used for predictive analysis. It mainly has attributes that include internal nodes, branches, and terminal (leaf) nodes. Every internal node holds a "test" on an attribute, branches hold the conclusions of the tests, and every leaf node means a class label. It is used for both classification and regression, which are both supervised learning tasks. Decision trees are extremely sensitive to the data they are trained on: little changes to the training set can bring about fundamentally different tree structures.
9) Random Forest Regression
Random forest builds on decision trees by permitting every individual tree to randomly sample from the dataset with replacement, bringing about various trees. This is known as bagging.
4.12 Regression
A correlation analysis of the exchange of greeting cards by five friends for the most recent holiday season suggests a strong positive relationship between cards sent and cards received. When informed of these results, another friend, Emma, who enjoys receiving greeting cards, asks you to predict how many cards she will receive during the next holiday season, assuming that she plans to send 11 cards.
TWO ROUGH PREDICTIONS
Predict “Relatively Large Number”
You could offer Emma a very rough prediction by recalling that cards sent and received tend
to occupy similar relative locations in their respective distributions. Therefore, Emma can
expect to receive a relatively large number of cards, since she plans to send a relatively large
number of cards.
Predict “between 14 and 18 Cards”
To obtain a slightly more precise prediction for Emma, refer to the scatter plot for the original
five friends shown in Figure 7.1. Notice that Emma’s plan to send 11 cards locates her along
the X axis between the 9 cards sent by Steve and the 13 sent by Doris. Using the dots for
Steve and Doris as guides, construct two strings of arrows, one beginning at 9 and ending at
18 for Steve and the other beginning at 13 and ending at 14 for Doris. [The direction of the
arrows reflects our attempt to predict cards received (Y) from cards sent (X). Although not
required, it is customary to predict from X to Y.] Focusing on the interval along the Y axis
between the two strings of arrows, you could predict that Emma’s return should be between
14 and 18 cards, the numbers received by Doris and Steve.
Figure: A rough prediction for Emma (using dots for Steve and Doris)
Regression Line
A regression line is a line which is used to describe the behavior of a set of data.
Placement of Line
For the time being, forget about any prediction for Emma and concentrate on how the five dots dictate the placement of the regression line. If all five dots had defined a single straight line, placement of the regression line would have been simple; merely let it pass through all dots. When the dots fail to define a single straight line, as in the scatterplot for the five friends, placement of the regression line represents a compromise. It passes through the main cluster, possibly touching some dots but missing others.
Predictive Errors
Figure 4.13.2 illustrates the predictive errors that would have occurred if the regression line had been used to predict the number of cards received by the five friends. Solid dots reflect the actual number of cards received, and open dots, always located along the regression line, reflect the predicted number of cards received. The largest predictive error, shown as a broken vertical line, occurs for Steve, who sent 9 cards. Although he actually received 18 cards, he should have received slightly fewer than 14 cards, according to the regression line. The smallest predictive error, none whatsoever, occurs for Mike, who sent 7 cards. He actually received the 12 cards that he should have received, according to the regression line.

Figure 4.13.2: Prediction of 15.20 for Emma (using the regression line).
Figure 4.13.3: Predictive errors.
Total Predictive Error
Engage in the seemingly silly activity of predicting what is known already for the five friends to check the adequacy of our predictive effort. The smaller the total for all predictive errors in Figure 4.13.3, the more favorable will be the prognosis for our predictions. Clearly, it is desirable for the regression line to be placed in a position that minimizes the total predictive error, that is, that minimizes the total of the vertical discrepancies between the solid and open dots shown in Figure 4.13.3.
Progress Check *4.13.1 To check your understanding of the first part of this chapter, make predictions using the following graph.
To avoid the arithmetic standoff of zero always produced by adding positive and negative predictive errors (associated with errors above and below the regression line, respectively), the placement of the regression line minimizes not the total predictive error but the total squared predictive error, that is, the total for all squared predictive errors. When located in this fashion, the regression line is often referred to as the least squares regression line. Although more difficult to visualize, this approach is consistent with the original aim: to minimize the total predictive error or some version of the total predictive error, thereby providing a more favorable prognosis for our predictions.
Need a Mathematical Solution
Without the aid of mathematics, the search for a least squares regression line would be frustrating. Scatterplots would be proving grounds cluttered with tentative regression lines, discarded because of their excessively large totals for squared discrepancies. Even the most time-consuming, conscientious effort would culminate in only a close approximation to the least squares regression line.
Least Squares Regression Equation
Happily, an equation pinpoints the exact least squares regression line for any scatterplot. Most generally, this equation reads:

Y′ = bX + a    (1)

where Y′ represents the predicted value (the predicted number of cards that will be received by any new friend, such as Emma); X represents the known value (the known number of cards sent by any new friend); and b and a represent numbers calculated from the original correlation analysis, as described next.
Finding Values of b and a
To obtain a working regression equation, solve each of the following expressions, first for b and then for a, using data from the original correlation analysis. The expression for b reads:

b = r · √(SSy / SSx)    (2)

where r represents the correlation between X and Y (cards sent and received by the five friends); SSy represents the sum of squares for all Y scores (the cards received by the five friends); and SSx represents the sum of squares for all X scores (the cards sent by the five friends).
The expression for a reads:

a = Ȳ − bX̄    (3)

where Ȳ and X̄ refer to the sample means for all Y and X scores, respectively, and b is defined by the preceding expression.
The values of all terms in the expressions for b and a can be obtained from the original correlation analysis either directly, as with the value of r, or indirectly, as with the values of the remaining terms: SSy, SSx, Ȳ, and X̄. Table 4.14.1 illustrates the computational sequence that produces a least squares regression equation for the greeting card example, namely,

Y′ = .80(X) + 6.40

where .80 and 6.40 represent the values computed for b and a, respectively.
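These two expressions are easy to verify in Python. The five pairs below are inferred from values quoted in this unit, and they reproduce b = .80 and a = 6.40:

```python
from math import sqrt

# (cards sent, cards received); pairs inferred from values quoted in this unit.
x = [1, 5, 7, 9, 13]
y = [6, 10, 12, 18, 14]

mx, my = sum(x) / len(x), sum(y) / len(y)
ss_x = sum((xi - mx) ** 2 for xi in x)
ss_y = sum((yi - my) ** 2 for yi in y)
sp_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

r = sp_xy / sqrt(ss_x * ss_y)
b = r * sqrt(ss_y / ss_x)  # slope of the least squares line
a = my - b * mx            # intercept
print(b, a)                # 0.8 and 6.4
print(b * 11 + a)          # 15.2, the prediction for Emma's 11 cards
```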
4.13 Standard Error of Estimate, sy|x
Although we predicted that Emma's investment of 11 cards will yield a return of 15.20 cards, we would be surprised if she actually received 15 cards. It is more likely that, because of the imperfect relationship between cards sent and cards received, Emma's return will be some number other than 15. Although designed to minimize predictive error, the least squares equation does not eliminate it. Therefore, our next task is to estimate the amount of error associated with our predictions. The smaller the estimated error is, the better the prognosis will be for our predictions.
Finding the Standard Error of Estimate
The estimate of error for new predictions reflects our failure to predict the number of cards received by the original five friends, as depicted by the discrepancies between solid and open dots in Figure 7.15. Known as the standard error of estimate and symbolized as sy|x, this estimate of predictive error complies with the general format for any sample standard deviation, that is, the square root of a sum of squares term divided by its degrees of freedom. (See Formula 4.10 on page 76.) The formula for sy|x reads:

sy|x = √(SSy|x / (n − 2))    (4)

where the sum of squares term in the numerator, SSy|x, represents the sum of the squares for predictive errors, Y − Y′, and the degrees of freedom term in the denominator, n − 2, reflects the loss of two degrees of freedom because any straight line, including the regression line, can be made to coincide with two data points. The symbol sy|x is read as "s sub y given x."
Although we can estimate the overall predictive error by dealing directly with predictive errors, Y − Y′, it is more efficient to use the following computation formula:

sy|x = √(SSy(1 − r²) / (n − 2))    (5)
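Both routes give the same value for the card data. The sketch below checks the computation formula against the direct definition, using the same inferred pairs as before:

```python
from math import sqrt

x = [1, 5, 7, 9, 13]
y = [6, 10, 12, 18, 14]
b, a, r = 0.8, 6.4, 0.8  # values computed above
n = len(y)

# Direct definition: square root of SSy|x divided by (n - 2).
ss_y_given_x = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
print(sqrt(ss_y_given_x / (n - 2)))  # about 3.10

# Computation formula: uses SSy and r instead of the individual errors.
my = sum(y) / n
ss_y = sum((yi - my) ** 2 for yi in y)      # 80 for these data
print(sqrt(ss_y * (1 - r ** 2) / (n - 2)))  # about 3.10, the same value
```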
4.14 Interpretation of r²
The squared correlation coefficient, r², provides us with not only a key interpretation of the correlation coefficient but also a measure of predictive accuracy that supplements the standard error of estimate, sy|x. (Remember, we engage in the seemingly silly activity of predicting that which we already know not as an end in itself, but as a way to check the adequacy of our predictive effort.) Paradoxically, even though our ultimate goal is to show the relationship between r² and predictive accuracy, we will initially concentrate on two kinds of predictive errors: those due to the repetitive prediction of the mean and those due to the regression equation.
For the sake of the present argument, pretend that we know the Y scores but not the corresponding X scores. Lacking information about the relationship between X and Y scores, under these circumstances statisticians recommend repetitive predictions of the mean, Ȳ, for a variety of reasons, including the fact that, although the predictive error for any individual might be quite large, the sum of all of the resulting five predictive errors (deviations of Y scores about Ȳ) always equals zero, as you may recall from Section 3.3. Most important for our purposes, using the repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us with a frame of reference against which to evaluate our customary predictive effort based on the correlation between cards sent (X) and cards received (Y). Any predictive effort that capitalizes on an existing correlation between X and Y should be able to generate a smaller error variability, and, conversely, more accurate predictions of Y, than a primitive effort based only on the repetitive prediction of Ȳ.

Figure 4.16: Violation of homoscedasticity assumption. (Dots lack equal variability about all line segments.)
Predictive Errors
Panel A of Figure 4.16 shows the predictive errors for all five friends when the mean for all five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their five Y scores. Panel B shows the corresponding predictive errors for all five friends when a series of different Y′ values, obtained from the least squares equation, is used to predict each of their five Y scores. For example, panel A shows the error for John when the mean for all five friends, Ȳ, of 12 is used to predict his Y score of 6. Shown as a broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6) indicates that Ȳ overestimates John's Y score by 6 cards. Panel B shows a smaller error of −1.20 for John when a Y′ value of 7.20 is used to predict the same Y score of 6. This Y′ value of 7.20 is obtained from the least squares equation, where the number of cards sent by John, 1, has been substituted for X. Positive and negative errors indicate that Y scores are either above or below their corresponding predicted scores. Overall, as expected, errors are smaller when customized predictions of Y′ from the least squares equation can be used (because X scores are known) than when only the repetitive prediction of Ȳ can be used (because X scores are ignored). As with most statistical phenomena, there are exceptions: the predictive error for Doris is slightly larger when the least squares equation is used.
Error Variability (Sum of Squares)
To more precisely evaluate the accuracy of our two predictive efforts, we need some measure of the collective errors produced by each effort. It probably will not surprise you that the sum of squares qualifies for this role. The sum of squares of any set of deviations, now called errors, can be calculated by first squaring each error (to eliminate negative signs), then summing all squared errors. The error variability for the repetitive prediction of the mean can be designated as SSy, since each Y score is expressed as a squared deviation from Ȳ and then summed, that is,

SSy = Σ(Y − Ȳ)²
The error variability for the customized predictions from the least squares equation can be designated as SSy|x, since each Y score is expressed as a squared deviation from its corresponding Y′ and then summed, that is,

SSy|x = Σ(Y − Y′)²
Proportion of Predicted Variability
If you think about it, SSy measures the total variability of Y scores that occurs after only primitive predictions based on Ȳ are made (because X scores are ignored), while SSy|x measures the residual variability of Y scores that remains after customized least squares predictions are made (because X scores are used). The error variability of 28.8 for the least squares predictions is much smaller than the error variability of 80 for the repetitive prediction of Ȳ, confirming the greater accuracy of the least squares predictions apparent in Figure 4.16. To obtain an SS measure of the actual gain in accuracy due to the least squares predictions, subtract the residual variability from the total variability, that is, subtract SSy|x from SSy, to obtain

SSy − SSy|x = 80 − 28.8 = 51.2

To express this difference, 51.2, as a gain in accuracy relative to the original error variability for the repetitive prediction of Ȳ, divide the above difference by SSy, that is,

(SSy − SSy|x) / SSy = 51.2 / 80 = .64
This result, .64 or 64 percent, represents the proportion or percent gain in predictive accuracy when the repetitive prediction of Ȳ is replaced by a series of customized Y′ predictions based on the least squares equation. In other words, .64 or 64 percent represents the proportion or percent of the total variability of SSy that is predictable from its relationship with the X variable. To the delight of statisticians, when squared, the value of the correlation coefficient equals this proportion of predictable variability. Recalling that an r of .80 was obtained for the correlation between cards sent and cards received by the five friends, we can verify that r² = (.80)(.80) = .64, which, of course, also is the proportion of predictable variability. Given this perspective,

The square of the correlation coefficient, r², always indicates the proportion of total variability in one variable that is predictable from its relationship with the other variable.
Expressing the equation for r² in symbols, we have:

r² = SSy′ / SSy    (4.16)

where the one new sum of squares term, SSy′, is simply the variability explained by or predictable from the regression equation, that is,

SSy′ = Σ(Y′ − Ȳ)²

Accordingly, r² provides us with a straightforward measure of the worth of our least squares predictive effort.
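A quick numeric check of this identity with the same inferred card data:

```python
x = [1, 5, 7, 9, 13]
y = [6, 10, 12, 18, 14]
b, a = 0.8, 6.4
my = sum(y) / len(y)

ss_y = sum((yi - my) ** 2 for yi in y)                          # total variability: 80
ss_res = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))  # residual: 28.8
ss_pred = sum(((b * xi + a) - my) ** 2 for xi in x)             # predicted: 51.2

print(ss_pred / ss_y)          # 0.64, which equals r squared
print((ss_y - ss_res) / ss_y)  # 0.64 again, the same proportion
```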
r² Does Not Apply to Individual Scores
Do not attempt to apply the variability interpretation of r² to individual scores. For instance, the fact that 64 percent of the variability in cards received by the five friends (Y) is predictable from their cards sent (X) does not signify, therefore, that 64 percent of the five friends' Y scores can be predicted perfectly. As can be seen in panel B of the figure, only one of the Y scores for the five friends, the 12 cards received by Mike, was predicted perfectly (because it coincides with the regression line for the least squares equation), and even this perfect prediction is not guaranteed just because r² equals .64. To the contrary, the 64 percent must be interpreted as applying to the variability for the entire set of Y scores. The total variability of all Y scores, as measured by SSy, can be reduced by 64 percent when each Y score is replaced by its corresponding predicted Y′ score and then expressed as a squared deviation from the mean of all observed scores. Thus, the 64 percent represents a reduction in the total variability for the five Y scores when they are replaced by a succession of predicted scores, given the least squares equation and various values of X.
Small Values of r²
When transposed from r to r², Cohen's guidelines, mentioned on page 114, state that a value of r² in the vicinity of .01, .09, or .25 reflects a weak, moderate, or strong relationship, respectively. Do not expect to routinely encounter large values of r² in behavioral and educational research. In these areas, where measures of complex phenomena, such as intellectual aptitude, psychopathic tendency, or self-esteem, fail to correlate highly with any single variable, values of r² larger than about .25 are most unlikely. However, even values of r² close to zero might merit our attention. For instance, if just .04 (or 4 percent) of the variability of mental health scores of sixth graders actually could be predicted from a single variable, such as differences in weaning age, many investigators would probably view this as an important finding, worthy of additional investigation.
r² Doesn't Ensure Cause-Effect
The question of cause-effect, raised in Section 6.3, cannot be resolved merely by squaring the correlation coefficient to obtain a value of r². If the correlation between mental health scores of sixth graders and their weaning ages as infants equals .20, we cannot claim, therefore, that (.20)(.20) = .04 or 4 percent of the total variability in mental health scores is caused by the differences in weaning ages. Instead, it is possible that this correlation reflects some more basic factor or factors, such as, for example, a tendency for more economically secure, less stressed mothers both to create a family environment that perpetuates good mental health and, coincidentally, to nurse their infants longer. Certainly, in the absence of additional evidence, it would be foolhardy to encourage mothers, regardless of their circumstances, to postpone weaning because of its projected effect on mental health scores. Although we have consistently referred to r² as indicating the proportion or percent of predictable variability, you also might encounter references to r² as indicating the proportion or percent of explained variability. In this context, "explained" signifies only predictability, not causality. Thus, you could assert that .04, or 4 percent, of the variability in mental health scores is "explained" by differences in weaning age, insofar as .04, or 4 percent, is predictable from, or statistically attributable to, differences in weaning age.
4.16 Multiple Regression Equations
Any serious predictive effort usually culminates in a more complex equation that contains not just one but several X, or predictor, variables. For instance, a serious effort to predict college GPA might culminate in the following equation:

Y′ = .410(X1) + .005(X2) + .001(X3) + 1.03

where Y′ represents predicted college GPA and X1, X2, and X3 refer to high school GPA, IQ score, and SAT score, respectively. By capitalizing on the combined predictive power of several predictor variables, these multiple regression equations supply more accurate predictions for Y′ (often referred to as the criterion variable) than could be obtained from a simple regression equation.
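Fitting such an equation is a direct extension of the least squares idea. A sketch with numpy's least squares solver; the student data are hypothetical:

```python
import numpy as np

# Hypothetical predictors: high school GPA, IQ, SAT (one row per student).
X = np.array([
    [3.2, 110, 1200],
    [2.8, 105, 1050],
    [3.9, 125, 1400],
    [3.5, 115, 1300],
    [2.5, 100, 980],
])
y = np.array([3.0, 2.6, 3.8, 3.4, 2.3])  # college GPA (criterion variable)

# Append a column of ones so the solver also estimates the constant term.
X1 = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # three b weights followed by the constant
```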
Common Features
Although more difficult to visualize, multiple regression equations possess many features in common with their simple counterparts. For instance, they still qualify as least squares equations, since they minimize the sum of the squared predictive errors. By the same token, they are accompanied by standard errors of estimate that roughly measure the average amounts of predictive error. Be assured, therefore, that this chapter will serve as a good point of departure if, sometime in the future, you must deal with multiple regression equations.
4.17 Regression toward the Mean
Regression toward the mean refers to a tendency for scores, particularly extreme scores, to shrink toward the mean. This tendency often appears among subsets of observations whose values are extreme and at least partly due to chance. For example, because of regression toward the mean, we would expect that students who made the top five scores on the first statistics exam would not make the top five scores on the second statistics exam. Although all five students might score above the mean on the second exam, some of their scores would regress back toward the mean.
Most likely, the top five scores on the first exam reflect two components. One relatively permanent component reflects the fact that these students are superior because of good study habits, a strong aptitude for quantitative reasoning, and so forth. The other relatively transitory component reflects the fact that, on the day of the exam, at least some of these students were very lucky because all sorts of little chance factors, such as restful sleep, a pleasant commute to campus, etc., worked in their favor. On the second test, even though the scores of these five students continue to reflect an above-average permanent component, some of their scores will suffer because of less good luck or even bad luck. The net effect is that the scores of at least some of the original five top students will drop below the top five scores, that is, regress back toward the mean, on the second exam.
(When significant regression toward the mean occurs after a spectacular performance by, for example, a rookie athlete or a first-time author, the term sophomore jinx often is invoked.) There is good news for those students who made the five lowest scores on the first exam. Although all five students might score below the mean on the second exam, some of their scores probably will regress up toward the mean. On the second exam, some of them will not be as unlucky. The net effect is that the scores of at least some of the original five lowest-scoring students will move above the bottom five scores, that is, regress up toward the mean, on the second exam.
Appears in Many Distributions
Regression toward the mean appears among subsets of extreme observations for a wide variety of distributions. Incidentally, it is not true that, viewed as a group, all major league hitters are headed toward mediocrity. Hitters among the top 10 in 2014, who were not among the top 10 in 2015, were replaced by other mostly above-average hitters, who also were very lucky during 2015. Observed regression toward the mean occurs for individuals or subsets of individuals, not for entire groups.
The Regression Fallacy
The regression fallacy is committed whenever regression toward the mean is interpreted as a real, rather than a chance, effect. A classic example of the regression fallacy occurred in an Israeli Air Force study of pilot training.