Unit 2
Key Characteristics:
Example:
The Bernoulli distribution is often used as the building block for more complex
distributions, such as the Binomial distribution, which models multiple
Bernoulli trials, like scoring above or below a mark across multiple
assessments.
Link Function
A link function is used in Generalized Linear Models (GLMs) to connect the
linear predictor (i.e., the combination of the independent variables) to the
mean of the dependent variable through a specified transformation. It allows
the dependent variable to follow a non-normal distribution (e.g., binomial,
Poisson).
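1. Logit link:
Maps a probability in (0, 1) to its log-odds; the usual choice for binomial
outcomes.
g(μ)=log(μ/(1−μ))
2. Log link:
Maps a positive mean to the real line; the usual choice for Poisson counts.
g(μ)=log(μ)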
3. Identity link:
The link is simply the identity function, often used in linear regression
where no transformation is needed.
g(μ)=μ
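The links listed here can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names and the value of eta (a stand-in for the linear predictor) are assumptions, not part of the notes.

```python
import math

def logit(mu):
    """Logit link: log-odds of a mean in (0, 1)."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse logit: maps any real eta back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_link(mu):
    """Log link: defined for a positive mean."""
    return math.log(mu)

def inv_log_link(eta):
    """Inverse log link: maps any real eta to a positive mean."""
    return math.exp(eta)

def identity(mu):
    """Identity link: no transformation (its inverse is itself)."""
    return mu

eta = 0.5  # hypothetical value of the linear predictor
for name, inverse in [("logit", inv_logit),
                      ("log", inv_log_link),
                      ("identity", identity)]:
    # Each inverse link turns the same eta into a mean on a different scale.
    print(name, inverse(eta))
```

The same linear predictor gives a probability under the logit link, a positive count-like mean under the log link, and an unrestricted value under the identity link.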
Transformation Function
Purpose: Converts the data into a form that is more appropriate for
analysis, often making the data distribution closer to normal or removing
skewness.
Common use cases: Data preprocessing, improving model fit, handling
non-linearity in regression models, etc.
Key idea: Applies transformations directly to the data, often before the
model is fitted.
Logarithmic transformation:
Used to stabilize variance when data shows exponential growth or right-
skewness.
Box-Cox transformation:
A family of power transformations that tries to find the best transformation
parameter (λ) to normalize the data.
Reciprocal transformation:
Often used to reduce the influence of large values in data.
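A minimal sketch of the three transformations, assuming strictly positive data. The sample values and function names are hypothetical, and the Box-Cox function here applies a given λ rather than searching for the best one.

```python
import math

def log_transform(values):
    """Natural log of each value (values must be positive)."""
    return [math.log(v) for v in values]

def box_cox(values, lam):
    """Box-Cox power transformation for a fixed lambda.

    (x**lam - 1) / lam when lam != 0, and log(x) when lam == 0.
    """
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def reciprocal(values):
    """1 / x for each value (values must be non-zero)."""
    return [1.0 / v for v in values]

data = [1.0, 2.0, 4.0, 8.0, 100.0]  # hypothetical right-skewed sample

print(log_transform(data))
print(box_cox(data, 0.5))
print(reciprocal(data))
```

In practice λ is usually estimated from the data; for example, scipy.stats.boxcox chooses it by maximum likelihood when no λ is supplied.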
Key Differences:
Link Function: Specific to GLMs and used to relate the linear predictor to
the mean of the outcome variable in a way compatible with its
distribution.
Transformation Function: More general, applied to data to make it more
suitable for analysis by stabilizing variance, normalizing, or linearizing
relationships.
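One way to see the difference numerically: a log transformation averages the logs of the data, while a log link takes the log of the average. The sketch below uses made-up values to show that the two do not agree.

```python
import math

y = [1.0, 10.0, 100.0]  # hypothetical positive responses

# Transformation approach: average the log-transformed data.
mean_of_logs = sum(math.log(v) for v in y) / len(y)

# Link-function approach: apply the log to the mean itself, g(mu) = log(mu).
log_of_mean = math.log(sum(y) / len(y))

# Back on the original scale these are the geometric and arithmetic means.
print("exp(mean of logs):", math.exp(mean_of_logs))
print("exp(log of mean): ", math.exp(log_of_mean))
```

Back-transforming the mean of the logs recovers the geometric mean, whereas the log link works with the arithmetic mean, which is why a GLM with a log link is not the same model as a linear regression on log-transformed data.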
Binary responses refer to outcomes that can take on one of two possible
values, typically denoted as "success" and "failure" (or 1 and 0). This type of
response variable is common in various fields such as medicine, social sciences,
and machine learning, where researchers are often interested in predicting the
probability of an event occurring (e.g., whether a patient will recover from a
disease, whether a customer will purchase a product, etc.).
Bernoulli Distribution
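P(X=x)=f(x)
where:
P(X=x) is the probability that the random variable X takes on the value x.
f(x) is the PMF. For a Bernoulli variable with success probability p,
f(x)=p^x(1−p)^(1−x) for x∈{0,1}, so P(X=1)=p and P(X=0)=1−p.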
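As a sketch, the Bernoulli PMF can be coded directly from its standard form: p for x = 1 and 1 − p for x = 0. The symbol p (the success probability), the value 0.3, and the function name are assumptions for illustration.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli variable with success probability p."""
    if x not in (0, 1):
        return 0.0  # only 0 and 1 have positive probability
    return p ** x * (1.0 - p) ** (1 - x)

p = 0.3  # hypothetical success probability

# Mean and variance computed from the PMF over the two outcomes.
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))

print("mean:", mean)          # equals p
print("variance:", variance)  # equals p * (1 - p)
```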
g(p)=log(p/(1−p))
log(p/(1−p))=β0+β1X1+β2X2+…+βnXn
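To turn the log-odds equation into a probability, invert the logit. The coefficients and inputs below are invented for illustration only; they are not estimates from any data.

```python
import math

def predict_probability(intercept, coefs, x):
    """P(Y = 1 | x) under a logistic model with the given coefficients."""
    # Linear predictor: b0 + b1*x1 + ... + bn*xn
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    # Inverse logit maps eta back to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-eta))

intercept = -1.0       # hypothetical beta_0
coefs = [0.8, -0.5]    # hypothetical beta_1, beta_2
x = [2.0, 1.0]         # hypothetical predictor values

p = predict_probability(intercept, coefs, x)
log_odds = math.log(p / (1.0 - p))

print("probability:", p)
print("log-odds:", log_odds)  # recovers the linear predictor
```

Taking the logit of the predicted probability returns the linear predictor, which is the sense in which the coefficients act on the log-odds scale.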
1. Probit Model:
o An alternative to logistic regression, the probit model uses the
cumulative distribution function (CDF) of the standard normal
distribution to model binary outcomes.
o The relationship is expressed as:
P(Y=1∣X)=Φ(β0+β1X1+…+βnXn)
log(−log(1−p))=β0+β1X1+…+βnXn
However, the linear probability model (LPM) can predict probabilities outside
the [0, 1] range, which makes those predictions hard to interpret.
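The binary-response models in this part differ only in how the linear predictor is mapped to a probability. The sketch below compares the inverse logit, the inverse probit (the standard normal CDF Φ), the inverse of the complementary log-log link, and the LPM, which uses the linear predictor as the probability directly. The eta values are hypothetical.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    # Inverse of log(-log(1 - p)) = eta.
    return 1.0 - math.exp(-math.exp(eta))

def linear_probability(eta):
    # LPM: the linear predictor is used as the probability itself.
    return eta

for eta in (-3.0, 0.0, 1.5):  # hypothetical linear predictor values
    print(eta,
          inv_logit(eta),
          inv_probit(eta),
          inv_cloglog(eta),
          linear_probability(eta))
```

The first three always return values strictly between 0 and 1, while the LPM returns −3.0 and 1.5 for the outer inputs, which cannot be read as probabilities.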
g(μ)=log(μ)
log(μ)=β0+β1X1+β2X2+…+βnXn
Where β0,β1,…,βn are the coefficients to be estimated, and X1,X2,…,Xn
are the independent variables.
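Because the model is linear in log(μ), the expected count is the exponential of the linear predictor, and raising one predictor by one unit multiplies the expected count by exp(β). A small sketch with made-up coefficients:

```python
import math

def expected_count(intercept, coefs, x):
    """Expected count mu under a log-link model: mu = exp(linear predictor)."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return math.exp(eta)

intercept = 0.2      # hypothetical beta_0
coefs = [0.3, -0.1]  # hypothetical beta_1, beta_2

mu_base = expected_count(intercept, coefs, [1.0, 2.0])
mu_plus = expected_count(intercept, coefs, [2.0, 2.0])  # X1 raised by one unit

ratio = mu_plus / mu_base
print("mu at baseline:", mu_base)
print("mu after +1 in X1:", mu_plus)
print("ratio:", ratio, "vs exp(beta_1):", math.exp(coefs[0]))
```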
1. Nature of Data:
o The Poisson distribution is used to model discrete count data
where the counts are non-negative integers (0, 1, 2, ...).
o It is suitable for data that represent the number of occurrences of
an event within a defined observation period or area.
2. Assumptions:
o Independence: Events occur independently of one another.
o Constant Rate: The average rate (mean number of occurrences) is
constant throughout the observation period.
3. Probability Mass Function (PMF):
P(X=x)=f(x)
where:
P(X=x) is the probability that the random variable X takes on the value x.
f(x) is the PMF. For a Poisson variable with rate λ,
f(x)=λ^x e^(−λ)/x! for x=0,1,2,…
Mean=λ, Variance=λ
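The equality of mean and variance can be checked numerically from the standard Poisson PMF, λ^x e^(−λ)/x!. In this sketch the value of λ and the cut-off of the support at 60 are assumptions chosen so that the omitted tail is negligible.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with rate lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 3.5           # hypothetical rate
support = range(60)  # truncated support; the tail beyond 59 is negligible here

total = sum(poisson_pmf(x, lam) for x in support)
mean = sum(x * poisson_pmf(x, lam) for x in support)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)

print("total probability:", total)
print("mean:", mean)
print("variance:", variance)
```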
Overdispersion occurs in count data when the observed variance exceeds the
mean. This is a common phenomenon in various fields, including ecology,
epidemiology, and social sciences. In many cases, count data may exhibit
greater variability than what the Poisson distribution assumes, where the
mean and variance are equal.
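A quick, informal check is the ratio of the sample variance to the sample mean, which should be near 1 for Poisson-like data. The counts below are invented for illustration; this is a rough screen, not a formal test.

```python
from statistics import mean, variance

# Hypothetical counts with many zeros and a few large values.
counts = [0, 1, 0, 2, 9, 0, 3, 12, 1, 0, 7, 1]

sample_mean = mean(counts)
sample_variance = variance(counts)  # sample variance (n - 1 denominator)
dispersion_ratio = sample_variance / sample_mean

print("mean:", sample_mean)
print("variance:", sample_variance)
print("variance / mean:", dispersion_ratio)
```

A ratio well above 1, as here, suggests overdispersion relative to the Poisson assumption. With covariates in the model, the comparison should be made on the residual variation rather than on the raw counts.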
Causes of Overdispersion
Applications
Key Concepts
1. Rate Data:
o Rate data refers to the number of occurrences of an event per
unit of time or per unit of population. For example, the number of
accidents per 1,000 vehicles or the number of infections per
100,000 people.
2. Poisson Regression:
o When modeling count data, the Poisson regression model can be
used. However, it assumes that the mean and variance of the
counts are equal, which is not always the case in rate data,
especially if overdispersion is present.
3. Exposure Variable:
o In count regression for rate data, it is essential to include an
exposure variable to account for the amount of time or size of the
population at risk. This allows for the modeling of rates instead of
just counts.
Model Specification
When modeling rate data, the count of events Y can be modeled using the
following approach:
Yi∼Poisson(μi)
where:
log(μi)=β0+β1X1i+…+βnXni+log(Exposurei)
Here, μi is the expected count of events for observation i, X1i,…,Xni are
the independent variables, and log(Exposurei) is included to model the
rate correctly. The logarithm of the exposure accounts for the varying
amounts of time or population across observations.
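The effect of the log(Exposure) term can be sketched as follows: with the same predictors, the expected count scales with exposure while the implied rate stays the same. The coefficients and exposures are hypothetical.

```python
import math

def expected_count_with_offset(intercept, coefs, x, exposure):
    """Expected count when log(exposure) enters with a fixed coefficient of 1."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return math.exp(eta + math.log(exposure))

intercept = -5.0  # hypothetical beta_0
coefs = [0.4]     # hypothetical beta_1
x = [1.0]         # same predictor value for both observations

mu_small = expected_count_with_offset(intercept, coefs, x, exposure=1_000.0)
mu_large = expected_count_with_offset(intercept, coefs, x, exposure=50_000.0)

# Dividing by exposure recovers the rate, exp(beta_0 + beta_1 * x).
rate_small = mu_small / 1_000.0
rate_large = mu_large / 50_000.0

print("expected counts:", mu_small, mu_large)
print("implied rates:  ", rate_small, rate_large)
```

Because the offset has no coefficient to estimate, the remaining coefficients describe the rate per unit of exposure rather than the raw count.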
Yi∼Negative Binomial(μi,ϕ)
log(μi)=β0+β1X1i+…+βnXni+log(Exposurei)
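A simulation sketch of why the negative binomial allows variance above the mean. It assumes one common parameterization, in which the count is Poisson with a gamma-distributed rate and the variance is μ + μ²/ϕ; the notes do not say which parameterization of ϕ is intended, so treat this as an assumption. The sampler, seed, and parameter values are illustrative only.

```python
import math
import random
from statistics import mean, variance

def sample_poisson(lam, rng):
    """Poisson draw by multiplying uniforms (Knuth's method); fine for small lam."""
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_negative_binomial(mu, phi, rng):
    """Gamma-Poisson mixture with mean mu and variance mu + mu**2 / phi (assumed form)."""
    # The rate has a gamma distribution with shape phi and scale mu / phi (mean mu).
    rate = rng.gammavariate(phi, mu / phi)
    return sample_poisson(rate, rng)

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, phi = 4.0, 2.0      # hypothetical mean and dispersion parameter

draws = [sample_negative_binomial(mu, phi, rng) for _ in range(20_000)]
sample_mean = mean(draws)
sample_variance = variance(draws)

print("sample mean:", sample_mean)
print("sample variance:", sample_variance)
print("theoretical variance under the assumed form:", mu + mu ** 2 / phi)
```

Under this parameterization a large ϕ brings the variance back toward the mean, so the Poisson model appears as a limiting case.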
Example Application
Epidemiology