
Discrete and count models

Suryakant Yadav, IIPS, Mumbai


Types of Discrete data
• Nominal (e.g., gender, ethnic background, religious or political
affiliation)
• Ordinal (e.g., extent of agreement, school letter grades)
• Quantitative variables with relatively few values (e.g., number of
times married)
• Did you get the flu? (Yes or No) -- is a binary nominal categorical
variable
• What was the severity of your flu? (Low, Medium, or High) -- is an
ordinal categorical variable
Nominal variables
Discrete Distributions
• Bernoulli distribution
$f(x) = \begin{cases} \pi & \text{for } x = 1 \\ 1 - \pi & \text{for } x = 0 \\ 0 & \text{otherwise} \end{cases}$

Equivalently, $f(x) = \pi^{x}(1-\pi)^{1-x}$ for $x = 0, 1$

$E(X) = 1\cdot\pi + 0\cdot(1-\pi) = \pi$

$V(X) = E(X^{2}) - [E(X)]^{2} = \pi(1-\pi)$
Binomial Distribution
Suppose that X1, X2, …, Xn are independent and identically distributed
(iid) Bernoulli random variables, each having the distribution
$f(x) = \pi^{x}(1-\pi)^{1-x}$ for $x = 0, 1$ and $0 \le \pi \le 1$.
Let $X = X_1 + \cdots + X_n$; then $X \sim \mathrm{Bin}(n, \pi)$.
The binomial distribution has PMF
$f(x) = \dfrac{n!}{x!\,(n-x)!}\,\pi^{x}(1-\pi)^{n-x}$, for $x = 0, 1, 2, 3, \ldots, n$ and $0 \le \pi \le 1$
$E(X) = n\pi$ and $V(X) = n\pi(1-\pi)$
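As a quick numerical check (not part of the original slides; a minimal Python sketch assuming numpy and scipy are available), the moments of Bin(n, π) can be compared with the formulas above and with a simulated sum of iid Bernoulli draws:

import numpy as np
from scipy import stats

n, pi = 20, 0.3
X = stats.binom(n, pi)
print(X.mean(), n * pi)                 # E(X) = n*pi
print(X.var(), n * pi * (1 - pi))       # V(X) = n*pi*(1 - pi)

# A sum of n iid Bernoulli(pi) draws behaves like a single Bin(n, pi) draw.
rng = np.random.default_rng(0)
sums = rng.binomial(1, pi, size=(100_000, n)).sum(axis=1)
print(sums.mean(), sums.var())          # close to n*pi and n*pi*(1 - pi)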
Hypergeometric distribution
• Suppose there's a population of n objects with $n_1$ of type 1 (success) and $n_2 = n - n_1$ of type 2 (failure), and m (less than n) objects are sampled without replacement from this population. Then the number of successes X among the sample is a hypergeometric random variable with PMF
$f(x) = \dfrac{\binom{n_1}{x}\binom{n_2}{m-x}}{\binom{n}{m}}, \quad x \in [\max(0,\, m - n_2),\ \min(n_1,\, m)]$
$E(X) = \dfrac{m\,n_1}{n}$ and $\mathrm{Var}(X) = m\,\dfrac{n_1}{n}\,\dfrac{n_2}{n}\,\dfrac{n-m}{n-1}$
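The same kind of check works for the hypergeometric formulas (a minimal sketch, not from the slides, assuming scipy; note that scipy's hypergeom uses M for the population size, n for the number of type-1 objects, and N for the number of draws):

from scipy import stats

n, n1, m = 50, 20, 10                   # population size, type-1 objects, draws
n2 = n - n1
H = stats.hypergeom(M=n, n=n1, N=m)
print(H.mean(), m * n1 / n)                                   # E(X)
print(H.var(), m * (n1 / n) * (n2 / n) * (n - m) / (n - 1))   # Var(X)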
Poisson distribution
• The PMF of a Poisson distribution is given by
$f(x) = P(X = x) = \dfrac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots, \ \text{and } \lambda > 0$
• Poisson is also the limiting case of the binomial.
Suppose that X∼Bin (n,π) and let n→∞ and π→0 in such a way
that nπ→λ where λ is a constant.
Then, in the limit, X∼Poisson(λ).
• That is, if n is large and π is small, then
$\dfrac{n!}{x!\,(n-x)!}\,\pi^{x}(1-\pi)^{n-x} \approx \dfrac{\lambda^{x} e^{-\lambda}}{x!}$, where $\lambda = n\pi$.
• Another interesting property of the Poisson distribution is
that E(X)=V(X)=λ.
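The limiting relationship can be seen numerically (a minimal sketch, not from the slides, assuming scipy): as n grows with λ = nπ held fixed, the binomial PMF approaches the Poisson PMF.

from scipy import stats

lam, x = 2.0, 3
for n in (10, 100, 10_000):
    pi = lam / n
    # binomial probability of x successes vs the Poisson(lam) probability
    print(n, stats.binom.pmf(x, n, pi), stats.poisson.pmf(x, lam))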
Negative-Binomial distribution
• The PMF of the Negative-Binomial distribution is
$f(x) = \dbinom{x + k - 1}{x}\,\pi^{k}(1-\pi)^{x}$, for $x = 0, 1, \ldots$
where the expectation and variance are given by
$E(X) = \dfrac{k(1-\pi)}{\pi} = \mu$ and $V(X) = \mu + \dfrac{\mu^{2}}{k}$
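A minimal sketch (not from the slides, assuming scipy; the size parameter k above was not legible in the source and is treated here as the standard negative-binomial size parameter): scipy's nbinom(k, π) counts failures before the k-th success, and its variance satisfies V(X) = μ + μ²/k.

from scipy import stats

k, pi = 5, 0.4
NB = stats.nbinom(k, pi)
mu = k * (1 - pi) / pi
print(NB.mean(), mu)                 # E(X) = k(1 - pi)/pi = mu
print(NB.var(), mu + mu**2 / k)      # V(X) = mu + mu^2/k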
Multinomial distribution

The PMF of a Multinomial distribution is:
$f(x_1, \ldots, x_k) = \dfrac{n!}{x_1!\,x_2!\cdots x_k!}\,\pi_1^{x_1}\pi_2^{x_2}\cdots\pi_k^{x_k}$, where $x = (x_1, \ldots, x_k)$.

In addition to the mean and variance of $X_j$, given by
$E(X_j) = n\pi_j$ and $V(X_j) = n\pi_j(1-\pi_j)$,
there is also a covariance between different outcomes $X_i$ and $X_j$:
$\mathrm{Cov}(X_i, X_j) = -n\pi_i\pi_j$
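The negative covariance between cells can be checked by simulation (a minimal sketch, not from the slides, assuming numpy):

import numpy as np

rng = np.random.default_rng(0)
n = 30
pi = np.array([0.2, 0.3, 0.5])
draws = rng.multinomial(n, pi, size=200_000)
print(np.cov(draws[:, 0], draws[:, 1])[0, 1])   # empirical Cov(X1, X2)
print(-n * pi[0] * pi[1])                       # theoretical -n*pi_1*pi_2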
Sampling Schemes

The following sampling methods correspond to the distributions considered:
• Unrestricted sampling (corresponds to the Poisson distribution)
• Sampling with fixed total sample size (corresponds to the Binomial or Multinomial distributions)
Poisson Sampling
• Poisson sampling assumes that the random mechanism to generate the
data can be described by a Poisson distribution. It is useful for
modeling counts or events that occur randomly over a fixed period of
time or in a fixed space.
• Let X be the number of goals scored in a professional soccer game. We may model this as X ∼ Poisson(λ):
• $P(X = x) = \dfrac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$
• The parameter λ represents the expected number of goals in the game
or the long-run average among all possible such games.
The Poisson Model (distribution) Assumptions:
• Independence: Events must be independent (e.g. the number of goals
scored by a team should not make the number of goals scored by
another team more or less likely.)
• Homogeneity: The mean number of goals scored is assumed to be the
same for all teams.
• Time period (or space) must be fixed
• Note: mean and variance of Poisson distribution are the same;
E(X)=Var(X)=λ.
• However, in practice, the observed variance is usually larger than the
theoretical variance and in the case of Poisson, larger than its mean.
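A simple empirical check of the E(X) = Var(X) assumption on a vector of observed counts (a minimal sketch, not from the slides, assuming numpy; the goal counts here are simulated, not real data):

import numpy as np

rng = np.random.default_rng(0)
goals = rng.poisson(lam=2.5, size=500)   # hypothetical goal counts per game
print(goals.mean(), goals.var(ddof=1))   # roughly equal if the Poisson model holds
# A sample variance well above the sample mean suggests overdispersion.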
Binomial Sampling
• Data are collected on a pre-determined number of units and are then classified according to two levels of a categorical variable; thus a binomial sampling scheme emerges.
• Binomial distributions are characterized by two parameters:
• n, which is fixed, where n denotes the number of trials or the total sample size
• 𝜋 , which usually denotes a probability of "success".
• Binomial Model (distribution) is based on three assumptions:
• Fixed n: the total number of trials/events, (or total sample size) is fixed.
• Each event has two possible outcomes: referred to as "success" or "failure"
• Independent and Identical Events/Trials:
• Identical trials mean that the probability of success is the same for each trial.
• Independent means that the outcome of one trial does not affect the outcome
of the other.
Multinomial Sampling

• Multinomial sampling is a generalization of Binomial sampling.
• Data are collected on a pre-determined number of individuals or trials and classified into one of k categorical outcomes.
• Multinomial Model (distribution) assumptions:
• the n trials are independent
• the parameter vector π remains constant from trial to trial.
Note: these assumptions are violated when there is clustering in the data.
Maximum Likelihood Estimation

• Maximum Likelihood Estimation (MLE): a statistical method used to estimate the parameters of a probability distribution by maximizing the likelihood function.
• Identifies the values of parameters that make the observed data most
probable.
• The likelihood function is essentially the distribution of a random
variable (or joint distribution of all values if a sample of the random
variable is obtained) viewed as a function of the parameter(s).
Bernoulli and Binomial Likelihoods
• Consider a random sample of n Bernoulli random variables, $X_1, \ldots, X_n$, each with PMF
• $f(x_i) = \pi^{x_i}(1-\pi)^{1-x_i}; \quad x_i = 0, 1$
• The likelihood function is the joint distribution of these sample values, which we can write by independence as
• $\ell(\pi) = f(x_1, \ldots, x_n; \pi) = \pi^{\sum_i x_i}\,(1-\pi)^{\,n - \sum_i x_i}$
• where $\ell(\pi)$ is the probability of observing $X_1, \ldots, X_n$ as a function of $\pi$, and the maximum likelihood estimate (MLE) of $\pi$ is the value of $\pi$ that maximizes this probability function.
• Equivalently, L(𝜋)=log ℓ(𝜋) is maximized at the same value and can be used
interchangeably.
The likelihood function for the sample of Bernoulli random variables depends only on their sum, which we can write as $Y = \sum_i X_i$.
Since Y has a binomial distribution with n trials and success probability $\pi$, we can write its log-likelihood function as
• $L(\pi) = \log\!\left[\binom{n}{y}\,\pi^{y}(1-\pi)^{n-y}\right]$
• The only difference between this log-likelihood function and that for the Bernoulli sample is the presence of the binomial coefficient $\binom{n}{y}$.
• But since that coefficient doesn't depend on $\pi$, it has no influence on the MLE and may be neglected.
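A minimal sketch (not from the slides, assuming numpy and scipy): because the log-likelihood depends on the data only through y = Σ xᵢ, maximizing it numerically recovers the closed-form MLE π̂ = y/n.

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])   # hypothetical Bernoulli sample
n, y = len(x), x.sum()

def neg_loglik(pi):
    # negative Bernoulli/binomial log-likelihood (binomial coefficient omitted)
    return -(y * np.log(pi) + (n - y) * np.log(1 - pi))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y / n)                            # numerical maximizer vs y/n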
Goodness-of-Fit Test
• A goodness-of-fit test, is done to measure how well the observed data
correspond to the fitted (assumed) model.
• Like in linear regression, the goodness-of-fit test compares the
observed values to the expected (fitted or predicted) values.
• A goodness-of-fit statistic tests the following hypothesis:
• $H_0$: the model $M_0$ fits
• $H_A$: the model $M_0$ does not fit (or, some other model $M_A$ fits)
• Most often the observed data represent the fit of the saturated
model, the most complex model possible with the given data.
• Example: Consider 30 throws of a die. We want to test the hypothesis that the six faces are equally probable by comparing the observed frequencies to those expected under the assumed model: $X \sim \mathrm{Multi}(n = 30, \pi_0)$, where $\pi_0 = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)$.
• This can be thought of as simultaneously testing whether the probability in each cell is equal to a specified value.
• In this case, $H_0: \pi = \pi_0$, where the alternative hypothesis is that any of these elements differ from the null value.
Test Statistics
• Pearson Goodness-of-fit Test Statistic
• The Pearson goodness-of-fit statistic is: $X^2 = \sum_j \dfrac{(O_j - E_j)^2}{E_j}$
• Likelihood-ratio Test Statistic: $G^2 = -2\log\dfrac{\ell_0}{\ell_1} = -2(L_0 - L_1)$
• Note: $X^2$ and $G^2$ are both functions of the observed data X and a vector of probabilities $\pi_0$.
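For the die example, both statistics and their chi-square p-values can be computed directly (a minimal sketch, not from the slides, assuming numpy and scipy; the observed counts are invented for illustration):

import numpy as np
from scipy import stats

observed = np.array([3, 7, 5, 10, 2, 3])                 # hypothetical counts, n = 30
expected = observed.sum() * np.full(6, 1 / 6)             # E_j = n * pi_0j = 5 per cell

X2 = ((observed - expected) ** 2 / expected).sum()         # Pearson statistic
G2 = 2 * (observed * np.log(observed / expected)).sum()    # likelihood-ratio statistic
df = len(observed) - 1                                     # k - 1 = 5
print(X2, stats.chi2.sf(X2, df))
print(G2, stats.chi2.sf(G2, df))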
Testing the Goodness-of-Fit
• $X^2$ and $G^2$ both measure how closely the model, in this case $\mathrm{Mult}(n, \pi_0)$, "fits" the observed data. Both have an approximate chi-square distribution with $k - 1$ degrees of freedom when $H_0$ is true. This allows us to use the chi-square distribution to find critical values and p-values for establishing statistical significance.
• If the sample proportions $\hat{\pi}_j$ (i.e., the saturated model) are exactly equal to the model's $\pi_{0j}$ for cells $j = 1, 2, \ldots, k$, then $O_j = E_j$ for all j, and both $X^2$ and $G^2$ will be zero. That is, the model fits perfectly.
• If the sample proportions $\hat{\pi}_j$ deviate from the $\pi_{0j}$'s, then $X^2$ and $G^2$ are both positive. Large values of $X^2$ and $G^2$ mean that the data do not agree well with the assumed/proposed model $M_0$.
Residuals
Pearson Residuals
• Pearson Goodness-of-fit Test Statistic
• The Pearson goodness-of-fit statistic can be written as $X^2 = \sum_j r_j^2$,
where $r_j = \dfrac{O_j - E_j}{\sqrt{E_j}}$ is called the Pearson residual for cell j, and it compares the observed with the expected counts.
• The sign (positive or negative) indicates whether the observed frequency in cell j is higher or lower than the value implied under the null model, and the magnitude indicates the degree of departure.
Deviance Residuals
• Although not as intuitive as the $X^2$ statistic, the deviance statistic $G^2 = \sum_j d_j^2$ can be regarded as the sum of squared deviance residuals, where
• $d_j = \mathrm{sign}(X_j - n\pi_{0j})\,\sqrt{\left|\,2 X_j \log\dfrac{X_j}{n\pi_{0j}}\right|}$
• where the sign function can take three values:
• $-1$ if $(X_j - n\pi_{0j}) < 0$,
• $0$ if $(X_j - n\pi_{0j}) = 0$, or
• $+1$ if $(X_j - n\pi_{0j}) > 0$.
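Continuing the same hypothetical die counts (a minimal sketch, not from the slides, assuming numpy): the squared Pearson residuals sum to X² and the squared deviance residuals sum to G².

import numpy as np

observed = np.array([3, 7, 5, 10, 2, 3])
expected = observed.sum() * np.full(6, 1 / 6)

r = (observed - expected) / np.sqrt(expected)              # Pearson residuals
d = np.sign(observed - expected) * np.sqrt(
    np.abs(2 * observed * np.log(observed / expected)))    # deviance residuals
print(r, (r ** 2).sum())   # sum of squares equals X^2
print(d, (d ** 2).sum())   # sum of squares equals G^2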
Model Diagnostics
The goodness-of-fit statistics tell us how well a particular model fits the data, but they don't tell us much about why a model may fit poorly. To assess the lack of fit, we need to look at regression diagnostics.
The standard linear regression model is given by
$y_i \sim N(\mu_i, \sigma^2)$, with $\mu_i = x_i^{T}\beta$.
The two crucial features of this model are
1. the assumed mean structure, $\mu_i = x_i^{T}\beta$, and
2. the assumed constant variance $\sigma^2$ (homoscedasticity).
• The most common diagnostic tool is the residuals, the
difference between the estimated and observed values of the
dependent variable.

• The most common way to check these assumptions is to fit the model and then plot the residuals versus the fitted values $\hat{y}_i = x_i^{T}\hat{\beta}$.
Overdispersion
• Overdispersion is an important concept in the analysis of discrete data.
Many times, data admit more variability than expected under the
assumed distribution. The extra variability not predicted by the
generalized linear model random component reflects overdispersion.
• Overdispersion occurs because the mean and variance components of
a GLM are related and depend on the same parameter that is being
predicted through the predictor set.
• Overdispersion is not an issue in ordinary linear regression.
• In a linear regression model $y_i \sim N(x_i^{T}\beta, \sigma^2)$, the variance $\sigma^2$ is estimated independently of the mean function $x_i^{T}\beta$. With discrete response variables, the possibility for overdispersion exists because the commonly used distributions specify particular relationships between the variance and the mean.
• In the context of logistic regression, overdispersion occurs when the discrepancies between the observed responses $y_i$ and their predicted values $\hat{\mu}_i = n_i\hat{\pi}_i$ are larger than what the binomial model would predict.
• Overdispersion arises when the $n_i$ Bernoulli trials that are summarized in a line of the dataset are
• not identically distributed (i.e., the success probabilities vary from
one trial to the next), or
• not independent (i.e., the outcome of one trial influences the
outcomes of other trials).
• In practice, it is impossible to distinguish non-identically distributed
trials from non-independence.
Adjusting for Overdispersion
• Adjusting for overdispersion comes from the theory of quasi-
likelihood.
• Quasilikelihood has come to play a very important role in modern
statistics.
• e.g., Generalized Estimating Equations (GEE) for longitudinal data, because such methods do not require the specification of a full parametric model.
• In the quasilikelihood approach, we must first specify the "mean
function" which determines how μ = 𝐸 Y is related to the covariates.
• In the context of logistic regression, the mean function is
$\mu_i = n_i\,\dfrac{\exp(x_i^{T}\beta)}{1 + \exp(x_i^{T}\beta)}$,
which implies $\log\dfrac{\pi_i}{1-\pi_i} = x_i^{T}\beta$.
Note:
• we must specify the "variance function," which determines the
relationship between the variance of the response variable and its
mean.
• There is no overdispersion for ungrouped data; overdispersion is not possible if $n_i = 1$.
• If $y_i$ only takes values 0 and 1, then it must be distributed as Bernoulli($\pi_i$), and its variance must be $\pi_i(1-\pi_i)$.
• There is no other distribution with support {0,1}. Therefore, with
ungrouped data, we should always assume scale=1 and not try to
estimate a scale parameter and adjust for overdispersion.
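A minimal sketch (not from the slides, assuming statsmodels; the grouped data below are invented): for grouped binomial responses, refitting the GLM with scale='X2' estimates the dispersion from Pearson's chi-square and inflates the standard errors accordingly, while scale stays fixed at 1 for ungrouped 0/1 data.

import numpy as np
import statsmodels.api as sm

y = np.array([3, 8, 15, 20, 28])                  # successes y_i
n = np.array([30, 30, 30, 30, 30])                # group sizes n_i
x = sm.add_constant(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))

model = sm.GLM(np.column_stack([y, n - y]), x, family=sm.families.Binomial())
fit_fixed = model.fit()                # scale fixed at 1 (pure binomial)
fit_x2 = model.fit(scale="X2")         # scale estimated as Pearson chi2 / df
print(fit_fixed.bse)                   # binomial-model standard errors
print(fit_x2.bse)                      # inflated if the data are overdispersed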
Receiver Operating Characteristic Curve (ROC)
• A Receiver Operating Characteristic Curve (ROC) is a standard
technique for summarizing classifier performance over a range of trade-
offs between true positive (TP) and false positive (FP) error rates.

• The ROC curve is a plot of sensitivity (the ability of the model to predict an event correctly) versus (1 − specificity) for the possible cut-off classification probability values $\pi_0$.
• For logistic regression we can create a 2×2 classification table of predicted values from the model for the response, $\hat{y}_i = 0$ or 1, versus the true value of $y_i = 0$ or 1.
• The prediction $\hat{y}_i = 1$ depends on some cut-off probability $\pi_0$.
• For example, $\hat{y}_i = 1$ if $\hat{\pi}_i > \pi_0$ and $\hat{y}_i = 0$ if $\hat{\pi}_i \le \pi_0$.
• The most common value is $\pi_0 = 0.5$. Then sensitivity $= P(\hat{y}_i = 1 \mid y_i = 1)$ and specificity $= P(\hat{y}_i = 0 \mid y_i = 0)$.
• The ROC curve is more informative than the classification table since it summarizes the predictive power for all possible $\pi_0$.
• The position of the ROC on the graph reflects the accuracy of the
diagnostic test. It covers all possible thresholds (cut-off points).
• The ROC of random guessing lies on the diagonal line.
• The ROC of a perfect diagnostic technique is a point at the upper left
corner of the graph, where the TP proportion is 1.0 and the FP
proportion is 0.
• The Area Under the Curve (AUC), also referred to as the index of accuracy (A) or the concordance index, c, is an accepted traditional performance metric for a ROC curve.
• The higher the area under the curve, the better the prediction power of the model. c = 0.8 can be interpreted to mean that a randomly selected individual from the positive group has a test value larger than that for a randomly chosen individual from the negative group 80 percent of the time.
• In the example plotted here, the area under the curve of c = 0.746 indicates good predictive power of the model.
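A minimal sketch (not from the slides, assuming scikit-learn and numpy; the outcomes and fitted probabilities are invented): roc_curve sweeps the cut-off π₀ over all values, and the AUC is the concordance index c.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])           # observed y_i
pi_hat = np.array([0.10, 0.40, 0.35, 0.80, 0.20,
                   0.70, 0.65, 0.30, 0.90, 0.50])            # fitted probabilities
fpr, tpr, cutoffs = roc_curve(y_true, pi_hat)    # 1 - specificity, sensitivity
print(roc_auc_score(y_true, pi_hat))             # c; 0.5 corresponds to random guessing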
Generalized Linear Model
• The term "general" linear model (GLM) usually refers to conventional
linear regression models for a continuous response variable given
continuous and/or categorical predictors. It includes multiple linear
regression, as well as ANOVA and ANCOVA (with fixed effects only).
The form is $y_i \sim N(x_i^{T}\beta, \sigma^2)$, where $x_i$ contains known covariates and $\beta$ contains the coefficients to be estimated. These models are fit by least squares and weighted least squares.
GLM
• The term "generalized" linear model (GLIM or GLM) refers to a larger
class of models popularized by McCullagh and Nelder (1982, 2nd
edition 1989). In these models, the response variable yi is assumed to
follow an exponential family distribution with mean μi, which is
assumed to be some (often nonlinear) function of xiTβ. Some would
call these “nonlinear” because μi is often a nonlinear function of the
covariates, but McCullagh and Nelder consider them to be linear
because the covariates affect the distribution of yi only through the
linear combination xiTβ.
GLM
• There are three components to any GLM:
• Random Component - specifies the probability distribution of the response
variable; e.g., normal distribution for Y in the classical regression model, or
binomial distribution for Y in the binary logistic regression model. This is the only
random component in the model; there is not a separate error term.
• Systematic Component - specifies the explanatory variables (x1,x2,…,xk) in the
model, more specifically, their linear combination; e.g., β0+β1x1+β2x2, as we
have seen in a linear regression and the logistic regression.
• Link Function, η or g(μ) - specifies the link between the random and the systematic components. It indicates how the expected value of the response relates to the linear combination of explanatory variables; e.g., η = g(E(Yi)) = E(Yi) for classical regression, or η = log(π/(1−π)) = logit(π) for logistic regression.
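The three components can be made explicit in code (a minimal sketch, not from the slides, assuming statsmodels and numpy; the data are simulated): a Binomial random component, a systematic component β0 + β1x, and the logit link.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))      # true pi_i under a logit model
y = rng.binomial(1, p)                        # random component: binary response

X = sm.add_constant(x)                        # systematic component: beta0 + beta1*x
family = sm.families.Binomial(link=sm.families.links.Logit())   # link function
fit = sm.GLM(y, X, family=family).fit()
print(fit.params)                             # estimates of (beta0, beta1)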
GLM
• Assumptions
• The data Y1,Y2,…,Yn are independently distributed, i.e., cases are independent.
• The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a
distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal, etc.).
• A GLM does NOT assume a linear relationship between the response variable and the explanatory
variables, but it does assume a linear relationship between the transformed expected response in
terms of the link function and the explanatory variables; e.g., for binary logistic
regression logit(π)=β0+β1x.
• Explanatory variables can be nonlinear transformations of some original variables.
• The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many
cases given the model structure.
• Errors need to be independent but NOT normally distributed.
• Parameter estimation uses maximum likelihood estimation (MLE) rather than ordinary least
squares (OLS).

Diagnostic analysis
• predict [type] newvarname [if exp] [in range] [, statistic]
• where statistic is
• xb — fitted values; the default
• pr(a,b) — Pr(a < y < b)
• e(a,b) — E(y | a < y < b)
• ystar(a,b) — E(y*)
• (for pr, e, and ystar: a and b may be numbers or variables; a==. means -inf; b==. means +inf)
• cooksd — Cook's distance
• leverage | hat — leverage (diagonal elements of hat matrix)
• residuals — residuals
• rstandard — standardized residuals
• rstudent — Studentized (jackknifed) residuals
• stdp — standard error of the prediction
• stdf — standard error of the forecast
• stdr — standard error of the residual
• (*) covratio — COVRATIO
• (*) dfbeta(varname) — DFBETA for varname
• (*) dfits — DFITS
• (*) welsch — Welsch distance
Diagnostic Test
• Detecting Unusual and Influential Data
• predict — used to create predicted values, residuals, and measures of
influence.
• rvpplot — graphs a residual-versus-predictor plot.
• rvfplot — graphs residual-versus-fitted plot.
• lvr2plot — graphs a leverage-versus-squared-residual plot.
• dfbeta — calculates DFBETAs for all the independent variables in the linear
model.
• avplot — graphs an added-variable plot, a.k.a. partial regression plot.
Diagnostic Test
• Tests for Normality of Residuals
• kdensity — produces kernel density plot with normal distribution overlayed.
• pnorm — graphs a standardized normal probability (P-P) plot.
• qnorm — plots the quantiles of varname against the quantiles of a normal
distribution.
• iqr — resistant normality check and outlier identification.
• swilk — performs the Shapiro-Wilk W test for normality.
Diagnostic Test
• Tests for Heteroscedasticity
• rvfplot — graphs residual-versus-fitted plot.
• hettest — performs Cook and Weisberg test for heteroscedasticity.
• whitetst — computes the White general test for Heteroscedasticity.
• Tests for Multicollinearity
• vif — calculates the variance inflation factor for the independent variables in
the linear model.
• collin — calculates the variance inflation factor and other multicollinearity
diagnostics
Diagnostic Test
• Tests for Non-Linearity
• acprplot — graphs an augmented component-plus-residual plot.
• cprplot — graphs a component-plus-residual plot, a.k.a. partial residual plot.
• Tests for Model Specification
• linktest — performs a link test for model specification.
• ovtest — performs regression specification error test (RESET) for omitted
variables.
