
Stat 231 Final Slides

This document provides an outline for the STAT 231 Final. It covers 6 chapters: 1) Data types, graphical and numerical representations of data, and bivariate data 2) Probability distributions and random variables 3) Binomial model, response model, regression model, and maximum likelihood estimation 4) Sampling distributions, confidence intervals, hypothesis testing, and the likelihood function 5) Testing independence with categorical variables and model checking 6) Comparison tests, causality, prediction, and examples of confidence intervals and hypothesis testing using the likelihood function.

Uploaded by

Rachel L
Copyright
© Attribution Non-Commercial (BY-NC)

STAT 231 Final

Outline
Chapter 1
Data types (discrete, continuous, categorical)
Problem (3 different aspects)
Populations (target, study, sample)
Representations of data
Graphical: histograms, CDFs, boxplots
Numerical: mean, standard deviation, IQR

Bivariate Data
Relative risk
Correlation coefficient

Outline
Chapter 2
Review of probability distributions
Random PPDAC examples

Outline
Chapter 3
Binomial Model
Response Model
Regression Model
Maximum Likelihood Estimation

Outline
Chapter 4
Sampling distributions for estimators
Introduction to new distributions
Gaussian
Chi-squared
t
Confidence Interval
Hypothesis Testing
Confidence Intervals and Hypothesis Testing with the likelihood function

Outline
Chapter 5
Testing for independence with categorical variates
Model checking and assessment for assumptions

Outline
Chapter 6
Comparison
2 sample t-tests
Paired t-test
Causality
Testing for association
Blocking
Randomization and repetition
Matching
Prediction
Prediction intervals for response
Prediction intervals for regression

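Confidence Intervals using the Relative Likelihood Function
Define the likelihood function:

$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$

Define the relative likelihood function as:

$R(\theta) = \dfrac{L(\theta)}{L(\hat\theta)}$

Confidence Intervals using the Relative Likelihood Function
Graph the relative likelihood function.

Draw a horizontal line at 0.1; the x-coordinates of its two intersections with the curve form an approximate 95% confidence interval (more precisely, the 10% likelihood interval has approximate coverage $P(\chi^2_1 \le -2\ln 0.1) \approx 0.97$).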
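The graphical recipe above can also be carried out numerically. The sketch below is only an illustration: it assumes a Binomial model, uses hypothetical data (40 successes in 100 trials), and finds the interval by a simple grid search. The function names are made up for this example.

```python
import math

def binomial_log_likelihood(theta, y, n):
    """Log-likelihood of a Binomial(n, theta) model with y observed successes.
    The binomial coefficient is dropped because it cancels in R(theta)."""
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

def relative_likelihood(theta, y, n):
    """R(theta) = L(theta) / L(theta_hat), with theta_hat = y / n."""
    theta_hat = y / n
    return math.exp(binomial_log_likelihood(theta, y, n)
                    - binomial_log_likelihood(theta_hat, y, n))

def likelihood_interval(y, n, level=0.1, grid_size=10_000):
    """Grid search for the set of theta values with R(theta) >= level."""
    grid = [k / grid_size for k in range(1, grid_size)]
    inside = [t for t in grid if relative_likelihood(t, y, n) >= level]
    return min(inside), max(inside)

# Hypothetical data (not from the slides): 40 successes in 100 trials.
y_obs, n_trials = 40, 100
lower, upper = likelihood_interval(y_obs, n_trials)
print(f"10% likelihood interval: ({lower:.3f}, {upper:.3f})")
```

The endpoints are where the curve R(theta) crosses the horizontal line at 0.1, so they play the role of the two x-coordinates read off the graph.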

Hypothesis Testing using the Likelihood Function
1) Define the null hypothesis, define the alternate hypothesis
2) Define the test statistic, identify the distribution, calculate the observed value
3) Calculate the p-value

The test statistic: $D = 2[l(\tilde\theta) - l(\theta_0)]$
Distribution of D: $D \sim \chi^2_{n-p}$

Hypothesis Testing using the Likelihood Function
Observed value of D: $d = 2[l(\hat\theta) - l(\theta_0)]$
P-value: $P(D \ge d)$

Example
The observed value of the test statistic:

$d = 2[l(\hat\theta) - l(\theta_0)]$

where

$l(\theta) = n \ln(\theta + 1) + \theta \sum_{i=1}^{n} \ln x_i$
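A minimal sketch of this calculation follows. It rests on several assumptions that the slides do not state: the data values are hypothetical, the null value theta0 = 1 is arbitrary, and D is taken to be chi-squared with 1 degree of freedom (one parameter tested), which lets the p-value be written with the standard library's `erfc`.

```python
import math

def log_likelihood(theta, xs):
    """l(theta) = n ln(theta + 1) + theta * sum(ln x_i), as on the slide."""
    n = len(xs)
    return n * math.log(theta + 1) + theta * sum(math.log(x) for x in xs)

def mle(xs):
    """Maximizer of l(theta): solves n / (theta + 1) + sum(ln x_i) = 0."""
    return -len(xs) / sum(math.log(x) for x in xs) - 1

def likelihood_ratio_test(xs, theta0):
    """Return (theta_hat, d, p_value) for H0: theta = theta0."""
    theta_hat = mle(xs)
    d = 2 * (log_likelihood(theta_hat, xs) - log_likelihood(theta0, xs))
    d = max(d, 0.0)  # guard against tiny negative values from rounding
    # Assumption: D ~ chi-squared with 1 df, so P(D >= d) = erfc(sqrt(d / 2)).
    p_value = math.erfc(math.sqrt(d / 2))
    return theta_hat, d, p_value

# Hypothetical observations in (0, 1); not taken from the slides.
xs = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.92, 0.75]
theta_hat, d, p_value = likelihood_ratio_test(xs, theta0=1.0)
print(f"theta_hat = {theta_hat:.3f}, d = {d:.3f}, p-value = {p_value:.4f}")
```

A small p-value would be read as evidence against H0: theta = theta0, in line with the three steps listed earlier.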

Model Assessment
We've been assuming our data collected fits to a specific model (Binomial, Response, etc.)
With these models come many assumptions, including independence
In this chapter, we analyze our data to actually see if we're able to use these models to fit our data

Independence with Binary Variates
We want to see if we can assume two binary variates (represented by 2 random variables X and Y) are independent
This is essentially another type of hypothesis testing
Since a binary variate is just a categorical variate with 2 categories, this test can be extended to two categorical variates

Independence with Binary Variates
Define:
Let X represent the binary variate gender (Male = 0, Female = 1)
Let Y represent the binary variate smoker (Non-Smoker = 0, Smoker = 1)
Let n be the sample size
Let us collect our observed data and present it in the following frequency table:

                    Male (X=0)    Female (X=1)    Total
Non-Smoker (Y=0)    a             b               a+b
Smoker (Y=1)        c             d               c+d
Total               a+c           b+d             n=a+b+c+d

Independence with Binary Variates
If X and Y are independent then:
Expected frequency of male smokers is
$n \times P(X = 0) \times P(Y = 1)$

Expected frequency of male non-smokers is
$n \times P(X = 0) \times P(Y = 0)$

Expected frequency of female smokers is
$n \times P(X = 1) \times P(Y = 1)$

Expected frequency of female non-smokers is
$n \times P(X = 1) \times P(Y = 0)$

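Independence with Binary Variates
Using the observed frequency table

                    Male (X=0)    Female (X=1)    Total
Non-Smoker (Y=0)    a             b               a+b
Smoker (Y=1)        c             d               c+d
Total               a+c           b+d             n=a+b+c+d

we estimate each probability from the marginal totals:

$P(X = 0) = \dfrac{a+c}{n}$

$P(Y = 0) = \dfrac{a+b}{n}$

$P(X = 1) = \dfrac{b+d}{n}$

$P(Y = 1) = \dfrac{c+d}{n}$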

Independence with Binary Variates
Creating our expected frequency table

                    Male (X=0)                  Female (X=1)                Total
Non-Smoker (Y=0)    e1 = n P(X=0) P(Y=0)        e2 = n P(X=1) P(Y=0)        a+b
Smoker (Y=1)        e3 = n P(X=0) P(Y=1)        e4 = n P(X=1) P(Y=1)        c+d
Total               a+c                         b+d                         n=a+b+c+d

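Independence with Binary Variates
As with any other hypothesis testing question, we need to define the test statistic.
Test Statistic:

$S = \sum_{i} \dfrac{(O_i - e_i)^2}{e_i}$

where the sum runs over the cells of the frequency table, $O_i$ is the observed frequency in cell i and $e_i$ is the expected frequency.

Distribution of the test statistic: $S \sim \chi^2_{(r-1)(c-1)}$, where r and c are the numbers of rows and columns of the table
Observed value:

$s = \sum_{i} \dfrac{(o_i - e_i)^2}{e_i}$

Independence with Binary Variates
p-value $= P(S \ge s)$
Make your conclusion:
Reject: there is evidence that X and Y are not independent
Do not reject: the data are consistent with X and Y being independent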
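The whole procedure for a 2x2 table can be sketched in a few lines. The cell counts below are hypothetical, chosen only to make the arithmetic easy to follow. With r = c = 2 the statistic has (2-1)(2-1) = 1 degree of freedom, which is why the p-value can be computed with `math.erfc`; larger tables would need a general chi-squared tail function.

```python
import math

def independence_test_2x2(a, b, c, d):
    """Chi-squared test of independence for the 2x2 layout used above:
    a = male non-smokers, b = female non-smokers,
    c = male smokers,     d = female smokers."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Marginal probability estimates from the observed table.
    p_x0, p_x1 = (a + c) / n, (b + d) / n
    p_y0, p_y1 = (a + b) / n, (c + d) / n
    # Expected frequencies under independence, in the order e1, e2, e3, e4.
    expected = [n * p_x0 * p_y0, n * p_x1 * p_y0,
                n * p_x0 * p_y1, n * p_x1 * p_y1]
    s = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # 1 degree of freedom: P(S >= s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(s / 2))
    return expected, s, p_value

# Hypothetical counts (not from the slides).
expected, s, p_value = independence_test_2x2(a=30, b=40, c=20, d=10)
print("expected:", expected)
print(f"s = {s:.4f}, p-value = {p_value:.4f}")
```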

Example
Observed value:

$s = \sum_{i} \dfrac{(o_i - e_i)^2}{e_i}$

Example
P-value: $P(S \ge s)$

Model Assessment
For the regression model, we have the following assumptions when fitting our data
1) The expectation of Y is a linear function of the explanatory variate
2) The model used is Gaussian
3) The $Y_i$'s are independent
4) The model has a constant variance

Model Assessment
The expectation of Y is a linear function of the explanatory variate
The model assumes that $E[Y_i]$ is a linear function of $x_i$
If we plot $Y_i$ vs. $x_i$ we should see a linear relationship

Model Assessment
The model used is Gaussian
In the model, we assume $R \sim G(0, \sigma)$ and thus $Y \sim G(\alpha + \beta x, \sigma)$
How do we check if this assumption is reasonable? Residuals
Rearranging the model, $R = Y - (\alpha + \beta x)$
A realization of R becomes $r_i = y_i - (\alpha + \beta x_i)$
An estimated residual is $\hat r_i = y_i - (\hat\alpha + \hat\beta x_i) = y_i - \hat y_i$
Graphically, $\hat r_i$ is the vertical distance from the line of best fit to our observed response variate

Model Assessment
We can check the Gaussian assumption by plotting a QQ plot
Plot the sample quantiles of the estimated residuals against the theoretical Gaussian quantiles; if the points lie roughly along a straight line, then the Gaussian assumption is reasonable

Model Assessment
The $Y_i$'s are independent
We will check this assumption by plotting the estimated residuals $\hat r_i$ against the fitted response $\hat y_i = \hat\alpha + \hat\beta x_i$
If our assumptions are true, we should see a random pattern centered around 0

Model Assessment
The $Y_i$'s have Constant Variance
If the $Y_i$'s have constant variance, we should see residuals evenly distributed around zero

Non-constant variance: funnel shaped
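The quantities behind these diagnostic plots can be computed directly. The sketch below borrows the five height/weight pairs from the regression prediction example later in these slides and computes fitted values, estimated residuals, and the pairs of points a QQ plot would show. The plotting positions (k - 0.5)/n are one common convention and an assumption here, since the slides do not say which one is used.

```python
from statistics import NormalDist, mean

# Height (cm) and weight (kg) pairs from the regression example in the slides.
x = [172, 162, 180, 170, 174]
y = [60, 54, 72, 65, 64]

x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar  # intercept of Y = alpha + beta x + R

fitted = [alpha_hat + beta_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# QQ plot coordinates: sorted residuals vs. standard Gaussian quantiles.
n = len(residuals)
theoretical = [NormalDist().inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
sample = sorted(residuals)

for q, r in zip(theoretical, sample):
    print(f"theoretical {q:6.3f}   sample {r:6.3f}")
```

Plotting `residuals` against `fitted` gives the residual plot used for the independence and constant-variance checks; plotting `sample` against `theoretical` gives the QQ plot.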

Comparison
Recall in Chapter 1 we learned there were three different aspects (types of problem)
Descriptive
Causative
Predictive
Chapter 6 looks at techniques for solving each of the 3 problems

Comparison
The descriptive aspect of the problem could involve looking at and comparing two different populations
In this section, we will learn how to conduct hypothesis tests that will allow us to conclude whether there's a difference between 2 populations
The question asked is: is there a difference between the mean values of the 2 populations?
Essentially, the hypothesis tested is whether the parameter for each population is equal
$H_0: \mu_1 = \mu_2$

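Comparison
2-sample t-tests (Response Model)
Two populations:

$Y_{1j} = \mu_1 + R_{1j}$

$Y_{2j} = \mu_2 + R_{2j}$

The estimator for each population is

$\tilde\mu_1 = \dfrac{1}{n_1} \sum_{j=1}^{n_1} Y_{1j}$

$\tilde\mu_2 = \dfrac{1}{n_2} \sum_{j=1}^{n_2} Y_{2j}$

The sampling distribution for each estimator is

$\tilde\mu_1 \sim G\left(\mu_1, \dfrac{\sigma}{\sqrt{n_1}}\right)$

$\tilde\mu_2 \sim G\left(\mu_2, \dfrac{\sigma}{\sqrt{n_2}}\right)$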

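Comparison
In the hypothesis tests, we want to see if the two parameters $\mu_1$ and $\mu_2$ are equal, so let's look at the r.v. $\tilde\mu_1 - \tilde\mu_2$
What is the sampling distribution of $\tilde\mu_1 - \tilde\mu_2$ under the assumption $\mu_1 = \mu_2$?

$\tilde\mu_1 \sim G\left(\mu_1, \dfrac{\sigma}{\sqrt{n_1}}\right)$

$\tilde\mu_2 \sim G\left(\mu_2, \dfrac{\sigma}{\sqrt{n_2}}\right)$

Comparison
$\tilde\mu_1 - \tilde\mu_2 \sim G\left(0, \sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}\right)$

Standardize:

$\dfrac{\tilde\mu_1 - \tilde\mu_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim G(0, 1)$

Replace $\sigma$ with its estimator $\tilde\sigma$:

$\dfrac{\tilde\mu_1 - \tilde\mu_2}{\tilde\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$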

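Comparison
The pooled estimate of the variance is

$\hat\sigma^2 = \dfrac{(n_1 - 1)\hat\sigma_1^2 + (n_2 - 1)\hat\sigma_2^2}{n_1 + n_2 - 2}$

and the test statistic is

$T = \dfrac{\tilde\mu_1 - \tilde\mu_2}{\tilde\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$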

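Example
$\hat\mu_1 = 71.3$, $\hat\mu_2 = 68.7$
$\hat\sigma_1 = 10.2$, $\hat\sigma_2 = 11.3$
$n_1 = 47$, $n_2 = 36$

Example
$\hat\sigma^2 = \dfrac{(n_1 - 1)\hat\sigma_1^2 + (n_2 - 1)\hat\sigma_2^2}{n_1 + n_2 - 2} = \dfrac{(47 - 1)10.2^2 + (36 - 1)11.3^2}{47 + 36 - 2} = 10.6892^2$

$t = \dfrac{71.3 - 68.7}{10.6892\sqrt{\dfrac{1}{47} + \dfrac{1}{36}}} \approx 1.098$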
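The arithmetic in this example is easy to check with a short script. The helper name `pooled_two_sample_t` is made up for this sketch; it implements the pooled-variance formulas from the preceding slides and takes the summary statistics quoted in the example as input.

```python
import math

def pooled_two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled standard deviation, observed t statistic and degrees of
    freedom for a two-sample t-test that assumes a common sigma."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    t_obs = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return pooled_sd, t_obs, n1 + n2 - 2

# Summary statistics from the example.
pooled_sd, t_obs, df = pooled_two_sample_t(71.3, 10.2, 47, 68.7, 11.3, 36)
print(f"pooled sd = {pooled_sd:.4f}, t = {t_obs:.3f}, df = {df}")
```

The observed value comes out close to 1.1 on 81 degrees of freedom, which would then be compared with the t distribution to obtain the p-value.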

Paired T-Tests
In the prior pages, we looked at two-sample t-tests
A stronger test is called the paired t-test
This test only works if the two samples we collect are actually data for the same group of n units, but at different times
The paired t-test involves simplifying the two data sets into one by finding the difference of each pair of data, and working with this single data set
Then we conduct a usual t-test / hypothesis test on this single data set of differences

Causation
The causative aspect of a problem looks at the relationship between the explanatory and response variates
Recall in Chapter 1 we looked at 2 concepts that describe the relationship between X and Y
Relative Risk
Association

Association involves calculating the correlation coefficient

$r = \dfrac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \dfrac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2 \sum_{i=1}^{n} (y_i - \bar y)^2}}$
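As a quick numerical illustration of this formula, the sketch below computes r for the five height/weight pairs used in the regression prediction example later in these slides. The function name is chosen for this example only.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Sample correlation coefficient r = S_XY / sqrt(S_XX * S_YY)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

heights = [172, 162, 180, 170, 174]  # x_i (cm)
weights = [60, 54, 72, 65, 64]       # y_i (kg)
r = correlation(heights, weights)
print(f"r = {r:.4f}")
```

A value of r near +1 or -1 indicates a strong linear association; a value near 0 indicates little linear association.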

Causation
In this course, we only have the skills to test for association
This involves testing the hypothesis $H_0: \beta = 0$ in the regression model
If $\beta = 0$, then we can say there is no association between X and Y

Example

$t = \dfrac{\hat\beta - 0}{SE(\tilde\beta)}$
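A sketch of how this observed value might be computed is given below. It uses the height/weight data from the regression prediction example later in these slides, takes the standard error to be sigma_hat / sqrt(S_XX) (from the sampling distribution of the slope estimator given in the prediction section), and estimates sigma with an n - 2 divisor to match the t distribution with n - 2 degrees of freedom used there.

```python
import math
from statistics import mean

def slope_t_statistic(xs, ys):
    """Return (beta_hat, se_beta, t) for testing H0: beta = 0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    # Residual sum of squares around the fitted line (shifted form).
    residual_ss = sum((y - (y_bar + beta_hat * (x - x_bar))) ** 2
                      for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(residual_ss / (n - 2))
    se_beta = sigma_hat / math.sqrt(s_xx)
    return beta_hat, se_beta, (beta_hat - 0) / se_beta

heights = [172, 162, 180, 170, 174]
weights = [60, 54, 72, 65, 64]
beta_hat, se_beta, t_obs = slope_t_statistic(heights, weights)
print(f"beta_hat = {beta_hat:.4f}, SE = {se_beta:.4f}, t = {t_obs:.3f}")
```

The observed t would be compared with a t distribution on n - 2 = 3 degrees of freedom; a large absolute value is evidence against H0: beta = 0, that is, evidence of an association.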

Causation
Association does NOT imply causation
The course notes talk about why this is the case and how we can avoid making the wrong assumption using three techniques
Blocking
Repetition and Randomization
Matching

Causation
Confounding
Association does not imply causation
There could be a third hidden variate that is related to both the explanatory and response variates and creates the appearance of a causal relationship: this is called confounding
The difficulty with confounding variates is identifying them in the first place; if we miss them, we will make a wrong conclusion about the relationship between the explanatory and response variates
If we can identify the confounding variates, then there are tools we can use when designing experimental plans to account for these variates

Causation
Blocking
If we've identified the confounding variate, we neutralize its effect by collecting samples where the units have the same value for the confounding variate
The Chicken Example:
Response variate: growth rate of chickens
Explanatory variate: protein in diet
Confounding variate: gender of the chickens
Blocking: look at samples of only male chickens and samples of only female chickens
This eliminates the gender effect and the experimenter is able to look at the effects of protein in diet on the growth rate of chickens

Causation
Replication and Randomization
If we cannot identify or control the confounding variate, we can also try to neutralize its effects by randomly allocating our controlled variate in the experimental plan
The Medicine Example:
Response variate: survival rate
Explanatory variate: type of treatment
Confounding variates: medical history / health of the patient
Using randomization and replication to assign the treatment type to each unit will result in two very balanced groups in terms of their health / medical history
This will eliminate the confounding variates as much as possible

Causation
Matching and Observational Plans
In observational plans, the experimenter cannot control the variates
The method of matching is used where the units that are being observed are compared with a control unit that has very similar characteristics to the unit in the plan (this is similar to blocking)
Thus if there is a difference in the value observed between the sampled unit and the control unit, the difference must be legitimate

Prediction
The predictive aspect of a problem involves using our collected data to estimate a value for a unit to be randomly selected from the population
We will look at prediction intervals for
Response
Regression

Prediction
The Model

$Y = \mu + R$

$Y \sim G(\mu, \sigma)$

The predicted unit: $Y_0$
Since $Y_0$ follows the response model, then

$Y_0 \sim G(\mu, \sigma)$

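Prediction
What would be a logical choice to use as our predicted value? The average $\tilde\mu$
We need the estimator for the mean parameter (from MLE):

$\tilde\mu = \dfrac{1}{n} \sum_{i=1}^{n} Y_i$

Sampling distribution:

$\tilde\mu \sim G\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right)$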

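Prediction
If we look at the difference between the unit to be predicted and our estimator of the population average, then we have the random variable

$Y_0 - \tilde\mu$

$Y_0 \sim G(\mu, \sigma)$
$\tilde\mu \sim G\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right)$

Prediction
$Y_0 - \tilde\mu \sim G\left(0, \sigma\sqrt{1 + \dfrac{1}{n}}\right)$

Standardizing gives

$\dfrac{Y_0 - \tilde\mu}{\sigma\sqrt{1 + \dfrac{1}{n}}} \sim G(0, 1)$

Replacing $\sigma$ with an estimator gives

$\dfrac{Y_0 - \tilde\mu}{\tilde\sigma\sqrt{1 + \dfrac{1}{n}}} \sim t_{n-1}$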

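Prediction
Constructing a 95% Prediction Interval for $Y_0$ ($\sigma$ unknown)
Our ultimate goal: $a \le Y_0 \le b$
Since

$\dfrac{Y_0 - \tilde\mu}{\tilde\sigma\sqrt{1 + \dfrac{1}{n}}} \sim t_{n-1}$

we can make the probability statement:

$P\left(\left|\dfrac{Y_0 - \tilde\mu}{\tilde\sigma\sqrt{1 + \dfrac{1}{n}}}\right| \le c\right) = 0.95$

Prediction
$P\left(-c \le \dfrac{Y_0 - \tilde\mu}{\tilde\sigma\sqrt{1 + \dfrac{1}{n}}} \le c\right) = 0.95$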

Example
Let Y be the response variate representing body weight (kg). The following sample is collected: 60 54 72 65 64
Construct a 95% prediction interval for the body weight of someone we randomly select from the population.

$\hat\mu \pm c\,\hat\sigma\sqrt{1 + \dfrac{1}{n}}$
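One way to work through this example is sketched below. The constant c = 2.776 is the 0.975 quantile of the t distribution with n - 1 = 4 degrees of freedom, taken from a standard t table; it is hard-coded here as an assumption so that the sketch needs only the standard library.

```python
import math
from statistics import mean, stdev

weights = [60, 54, 72, 65, 64]  # sample from the example (kg)
n = len(weights)

mu_hat = mean(weights)
sigma_hat = stdev(weights)  # sample standard deviation, n - 1 divisor

# Assumption: c is the 0.975 quantile of t with n - 1 = 4 df (t table value).
c = 2.776

half_width = c * sigma_hat * math.sqrt(1 + 1 / n)
lower, upper = mu_hat - half_width, mu_hat + half_width
print(f"mu_hat = {mu_hat}, sigma_hat = {sigma_hat:.4f}")
print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the interval is roughly 43 kg to 83 kg, which is wide because the sample is small and the interval must cover a single new observation rather than the mean.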

Prediction
The Model

$Y = \alpha + \beta x_i + R$

But for our purposes, we will use a shifted version of the model

$Y = \alpha + \beta(x_i - \bar x) + R$

Prediction
The Model

$Y = \alpha + \beta(x_i - \bar x) + R$

The predicted unit: $Y_0$
We want to predict $Y_0$ given the subgroup $x_i = x_0$
Since $Y_0$ follows the regression model, then

$Y_0 \sim G(\alpha + \beta(x_0 - \bar x), \sigma)$

Prediction
What would be a logical choice to use as our predicted value?
The average given the subgroup $x_i = x_0$, which we will denote $\tilde\mu(x_0)$

Regression model: $Y = \alpha + \beta(x_i - \bar x) + R$

Average of the subgroup $x_i = x_0$:

$\tilde\mu(x_0) = E[Y \mid x_0] = \tilde\alpha + \tilde\beta(x_0 - \bar x)$

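Prediction
Using Maximum Likelihood Estimation we obtain the estimators

$\tilde\alpha = \dfrac{1}{n} \sum_{i=1}^{n} Y_i$

$\tilde\beta = \dfrac{\sum_{i=1}^{n} (Y_i - \bar Y)(x_i - \bar x)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \dfrac{S_{XY}}{S_{XX}}$

The sampling distributions of these two estimators are

$\tilde\alpha \sim G\left(\alpha, \dfrac{\sigma}{\sqrt{n}}\right)$

$\tilde\beta \sim G\left(\beta, \dfrac{\sigma}{\sqrt{S_{XX}}}\right)$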

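Prediction
What is the sampling distribution of $\tilde\mu(x_0) = \tilde\alpha + \tilde\beta(x_0 - \bar x)$?

$\tilde\alpha \sim G\left(\alpha, \dfrac{\sigma}{\sqrt{n}}\right)$

$\tilde\beta \sim G\left(\beta, \dfrac{\sigma}{\sqrt{S_{XX}}}\right)$

$\tilde\mu(x_0) \sim G\left(\alpha + \beta(x_0 - \bar x), \sigma\sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}\right)$

Prediction
If we look at the difference between the unit to be predicted and our estimator of the subgroup average, then we have the random variable

$Y_0 - \tilde\mu(x_0)$

$Y_0 \sim G(\alpha + \beta(x_0 - \bar x), \sigma)$

$\tilde\mu(x_0) \sim G\left(\alpha + \beta(x_0 - \bar x), \sigma\sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}\right)$

The obvious next step would be to determine the sampling distribution of $Y_0 - \tilde\mu(x_0)$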

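Prediction
$Y_0 - \tilde\mu(x_0) \sim G\left(0, \sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}\right)$

Standardizing gives

$\dfrac{Y_0 - \tilde\mu(x_0)}{\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}} \sim G(0, 1)$

Estimating sigma gives

$\dfrac{Y_0 - \tilde\mu(x_0)}{\tilde\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}} \sim t_{n-2}$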

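Prediction
Constructing a 95% Prediction Interval for $Y_0$ ($\sigma$ unknown)
Our ultimate goal: $a \le Y_0 \le b$
Since

$\dfrac{Y_0 - \tilde\mu(x_0)}{\tilde\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}} \sim t_{n-2}$

we can make the probability statement:

$P\left(\left|\dfrac{Y_0 - \tilde\mu(x_0)}{\tilde\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}}\right| \le c\right) = 0.95$

Prediction
$P\left(-c \le \dfrac{Y_0 - \tilde\mu(x_0)}{\tilde\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}} \le c\right) = 0.95$

$P\left(-c\,\tilde\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}} \le Y_0 - \tilde\mu(x_0) \le c\,\tilde\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}\right) = 0.95$

$P\left(\tilde\mu(x_0) - c\,\tilde\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}} \le Y_0 \le \tilde\mu(x_0) + c\,\tilde\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}\right) = 0.95$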

Prediction
Upper and lower bounds of a regression prediction interval:

$\hat\alpha + \hat\beta(x_0 - \bar x) \pm c\,\hat\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}$

Example
Let Y be the response variate representing body weight (kg) and X be the explanatory variate representing body height (cm). The following sample is collected:

i      1     2     3     4     5
x_i    172   162   180   170   174
y_i    60    54    72    65    64

Construct a 95% prediction interval for the body weight of someone we randomly select from the population whose height is 175 cm. Use $\hat\sigma = 2.97$

Example
$\hat\alpha + \hat\beta(x_0 - \bar x) \pm c\,\hat\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{XX}}}$
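The regression example can be worked through in the same way. In this sketch, c = 3.182 is the 0.975 quantile of the t distribution with n - 2 = 3 degrees of freedom, taken from a standard t table and hard-coded as an assumption; sigma_hat = 2.97 is the value the example says to use.

```python
import math
from statistics import mean

x = [172, 162, 180, 170, 174]  # heights (cm)
y = [60, 54, 72, 65, 64]       # weights (kg)
x0 = 175                       # height of the unit to be predicted
sigma_hat = 2.97               # given in the example

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

alpha_hat = y_bar          # shifted model: alpha_hat is the mean response
beta_hat = s_xy / s_xx
prediction = alpha_hat + beta_hat * (x0 - x_bar)

# Assumption: c is the 0.975 quantile of t with n - 2 = 3 df (t table value).
c = 3.182

half_width = c * sigma_hat * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
lower, upper = prediction - half_width, prediction + half_width
print(f"predicted weight at {x0} cm: {prediction:.2f} kg")
print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the interval runs from roughly 55.5 kg to 76.8 kg, noticeably narrower than the response-model interval because height explains much of the variation in weight.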

Outline
Chapter 1
Data types (discrete, continuous, categorical)
Problem (3 different aspects)
Populations (target, study, sample)
Representations of data
Graphical: histograms, CDFs, boxplots
Numerical: mean, standard deviation, IQR

Bivariate Data
Relative risk
Correlation coefficient

Chapter 2
Review of probability distributions
Random PPDAC examples

PPDAC

Draw a frequency histogram of the Flash data, with bins given by the intervals (45 - 49.9), (50 - 54.9), etc.
First make a frequency table with the bin widths

Interval       Frequency
(45 - 49.9)    1
(50 - 54.9)    1
(55 - 59.9)    2
(60 - 64.9)    5
(65 - 69.9)    5
(70 - 74.9)    1
(75 - 79.9)    1
(80 - 84.9)    1
(85 - 89.9)    2
(90 - 94.9)    1
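To see the shape of the histogram without drawing it by hand, the frequency table can be rendered as a rough text histogram. This is only a sketch: the raw Flash data are not reproduced in these slides, so the script starts from the tabulated frequencies, and the helper name `text_histogram` is made up for the illustration.

```python
# Bin labels and frequencies copied from the frequency table above.
intervals = ["45-49.9", "50-54.9", "55-59.9", "60-64.9", "65-69.9",
             "70-74.9", "75-79.9", "80-84.9", "85-89.9", "90-94.9"]
frequencies = [1, 1, 2, 5, 5, 1, 1, 1, 2, 1]

def text_histogram(labels, counts):
    """Return one line per bin; the bar length equals the bin frequency."""
    return [f"{label:>8} | {'#' * count} ({count})"
            for label, count in zip(labels, counts)]

for line in text_histogram(intervals, frequencies):
    print(line)
```

Because every bin has the same width, bar length proportional to frequency gives the same picture as a frequency histogram turned on its side.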

Concept Review
From the previous example:
Target population, study population, sample, unit
Response vs. explanatory variates
Aspects
Descriptive
Causative
Predictive

Histograms
Bin width
Frequency histogram

Outline
Chapter 3
Binomial Model
Response Model
Regression Model
Maximum Likelihood Estimation

MLE

$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$

MLE
$l(\theta) = n \ln(\theta + 1) + \theta \sum_{i=1}^{n} \ln(x_i)$

Concept Review
From the previous example:
Maximum Likelihood Estimation Method
Define likelihood function
Define log-likelihood function
Differentiate with respect to the parameter
Set to zero
Solve for the parameter
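As a worked instance of these five steps, the derivation below applies them to the log-likelihood shown on the MLE slide. The density $f(x; \theta) = (\theta + 1)x^{\theta}$ on $0 < x < 1$ is an assumption inferred from that log-likelihood; the slides themselves do not state it.

```latex
\begin{align*}
L(\theta) &= \prod_{i=1}^{n} (\theta + 1)\, x_i^{\theta}
  && \text{(likelihood, assumed density)} \\
l(\theta) &= n \ln(\theta + 1) + \theta \sum_{i=1}^{n} \ln x_i
  && \text{(log-likelihood)} \\
\frac{dl}{d\theta} &= \frac{n}{\theta + 1} + \sum_{i=1}^{n} \ln x_i
  && \text{(differentiate)} \\
0 &= \frac{n}{\hat\theta + 1} + \sum_{i=1}^{n} \ln x_i
  && \text{(set to zero)} \\
\hat\theta &= -\frac{n}{\sum_{i=1}^{n} \ln x_i} - 1
  && \text{(solve)}
\end{align*}
```

Since every $x_i$ lies in (0, 1), the sum of logarithms is negative, so $\hat\theta > -1$ and the estimate stays inside the region where the log-likelihood is defined.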

Outline
Chapter 4
Sampling distributions for estimators
Introduction to new distributions
Gaussian
Chi-squared
t
Confidence Interval
Hypothesis Testing
Confidence Intervals and Hypothesis Testing with the likelihood function

Confidence Intervals

Concepts Review
From the previous example:
Confidence intervals for the response model, sigma unknown
Structure of a symmetric confidence interval

Hypothesis Testing
For a paired t-test, we create a new set of data

i     Diff      i     Diff
1     0.48      9     0.46
2     0.53      10    0.76
3     0.52      11    3.09
4     0.21      12    0.26
5     -0.05     13    0.34
6     0.44      14    0.32
7     0.41      15    -0.07
8     0.68      16    0.33

Hypothesis Testing
Test statistic:

$T = \dfrac{\tilde\mu_D - 0}{\tilde\sigma_D / \sqrt{n}} \sim t_{n-1}$
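The observed value of this statistic for the 16 differences in the table can be computed as in the sketch below, which uses only the standard library. It assumes the null hypothesis is that the mean difference is 0, as the "- 0" in the numerator suggests.

```python
import math
from statistics import mean, stdev

# Differences from the table, in order i = 1, ..., 16.
diffs = [0.48, 0.53, 0.52, 0.21, -0.05, 0.44, 0.41, 0.68,
         0.46, 0.76, 3.09, 0.26, 0.34, 0.32, -0.07, 0.33]

n = len(diffs)
mu_d_hat = mean(diffs)
sigma_d_hat = stdev(diffs)  # sample standard deviation, n - 1 divisor

# Observed value of T under H0: mean difference = 0.
t_obs = (mu_d_hat - 0) / (sigma_d_hat / math.sqrt(n))
print(f"mean diff = {mu_d_hat:.4f}, sd = {sigma_d_hat:.4f}, t = {t_obs:.3f}")
```

The observed t is then compared with a t distribution on n - 1 = 15 degrees of freedom to obtain the p-value. The single large difference (3.09) inflates both the mean and the standard deviation, so it is worth checking whether that observation is recorded correctly.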

Hypothesis Testing
P-value

Hypothesis Testing
For a 2-sample t-test, we have two populations, with 2 sets of data

Hypothesis Testing
Test statistic:

$T = \dfrac{\tilde\mu_1 - \tilde\mu_2}{\tilde\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$

Hypothesis Testing
$\hat\sigma^2 = \dfrac{(n_1 - 1)\hat\sigma_1^2 + (n_2 - 1)\hat\sigma_2^2}{n_1 + n_2 - 2} = \dfrac{(16 - 1)2.48^2 + (16 - 1)2.91^2}{16 + 16 - 2} = 2.704^2$

Observed value of the test statistic:

$t = \dfrac{\hat\mu_1 - \hat\mu_2}{\hat\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

Hypothesis Testing
P-value

Concepts Review
From the previous example:
Hypothesis Testing
Define the null hypothesis
Define the test statistic, identify the distribution, calculate the observed value of the test statistic
Calculate the p-value

2-sample t-test
Paired t-test
