
Chapter One

Introduction
Outline

• Categorical response data
• Probability distributions for categorical data
• The binomial distribution
• The multinomial distribution
• The Poisson distribution
• Statistical inference for a proportion
• More on statistical inference for discrete data
• Wald, likelihood-ratio, and score inference
• Small-sample binomial inference
1. Introduction
• A great advantage of studying categorical data analysis is that many concepts in
statistics become very transparent when discussed in a categorical data context. Any real
data collection procedure can yield only finitely many distinct observations
(categories or measured values), not only in practice but also in theory. The various
relationships that are possible between the observed categories are defined in the
theory of levels of measurement. This chapter reviews the probability distributions
most often used for categorical data, such as the binomial distribution. It also
provides an introduction to methods for analyzing categorical data and introduces
maximum likelihood, the most popular method for estimating parameters.
1.1 Categorical Response Data
• A categorical variable has a measurement scale consisting of a set of categories. For
example, political philosophy may be measured as “liberal,” “moderate,” or
“conservative”. Diagnoses regarding breast cancer based on a mammogram use the
categories normal, benign, probably benign, suspicious, and malignant.
• The development of methods for categorical variables was stimulated by the need to
analyze data generated in research studies in the social and behavioral sciences.
• In the social sciences, categorical variables are pervasive for measuring attitudes and
opinions; in the biomedical sciences, they measure outcomes such as whether a medical
treatment is successful.
Cont’d
• They are by no means restricted to the social and biomedical sciences. They also occur
frequently in the health sciences, for measuring responses such as whether a patient
survives an operation (yes, no), severity of an injury (none, mild, moderate, severe),
and stage of a disease (initial, advanced). They occur in the behavioral sciences (e.g.,
categories “schizophrenia,” “depression,” “neurosis” for diagnosis of type of mental
illness), zoology (e.g., categories “fish,” “invertebrate,” “reptile” for alligators’ primary
food choice), and education (e.g., categories “correct” and “incorrect” for students’
responses to an exam question).
• They even occur in highly quantitative fields such as engineering sciences and
industrial quality control, when items are classified according to whether or not they
conform to certain standards.
Response and Explanatory Variable Distinction
• The response variable is sometimes called the dependent variable or Y variable, and the
explanatory variable is sometimes called the independent variable or X variable.
• Most statistical analyses distinguish between response variables and explanatory
variables. For instance, regression models describe how the distribution of a continuous
response variable, such as annual income, changes according to levels of explanatory
variables, such as number of years of education and number of years of job experience.
• Statistical models for categorical response variables analyze how such responses are
influenced by explanatory variables. For example, a model for political philosophy could
use predictors such as annual income, attained education, religious affiliation, age,
gender, and race. The explanatory variables can be categorical or continuous.
Binary, Nominal/Ordinal Scale Distinction
• Categorical variables that have two categories, given the generic labels “success” and
“failure,” are called binary variables.
• Categorical variables have two main types of measurement scales. Variables having
categories without a natural ordering are said to be measured on a nominal scale and
are called nominal variables. Examples are religious affiliation (categories: Orthodox,
Catholic, Jewish, Protestant, Muslim, and others), primary mode of transportation to
work (automobile, bicycle, bus, subway, walk), favorite type of music (classical,
country, folk, jazz, rock), and favorite place to shop (local mall, local downtown,
Internet, other).
• For nominal variables, the order of listing the categories is irrelevant and the
statistical analysis should not depend on that ordering.
Cont’d……
• Categorical variables whose categories have a natural ordering are called ordinal
variables. Examples are attitude toward legalization of abortion (disapprove in all
cases, approve only in certain cases, approve in all cases), appraisal (assessment) of a
company’s inventory level (too low, about right, too high), response to a medical
treatment (excellent, good, fair, poor), and frequency of feeling symptoms of anxiety
(never, occasionally, often, always).
• Methods designed for nominal variables give the same results no matter how the
categories are listed. Methods designed for ordinal variables utilize the category
ordering. Whether we list the categories from low to high or from high to low is
irrelevant in terms of substantive (practical) conclusions, but the results of ordinal analyses
would change if the categories were reordered in any other way.
Cont’d.....
• Methods designed for ordinal variables cannot be used with nominal variables, since
nominal variables do not have ordered categories. Methods designed for nominal
variables can be used with nominal or ordinal variables, since they only require a
categorical scale. When used with ordinal variables, however, they do not use the
information about that ordering. This can result in serious loss of power. It is usually
best to apply methods appropriate for the actual scale.
• Categorical variables are often referred to as qualitative, to distinguish them from
numerical-valued or quantitative variables such as weight, age, income, and number
of children in a family. However, we will see it is often advantageous to treat ordinal
data in a quantitative manner, for instance by assigning ordered scores to the
categories.
Probability Distributions For Categorical Data
• Inferential statistical analyses require assumptions about the probability
distribution of the response variable. For regression and analysis of variance
(ANOVA) models for continuous data, the normal distribution plays a central role.
This section presents the key distributions for categorical data: the binomial,
multinomial, and Poisson distributions.
Binomial Distribution
Categorical data result from n independent and identical trials with two possible
outcomes for each, referred to as “success” and “failure.” Let y₁, y₂, …, yₙ denote the
observations from the n trials, where yᵢ = 1 denotes “success” and yᵢ = 0 denotes
“failure,” with P(Yᵢ = 1) = π and P(Yᵢ = 0) = 1 − π.
Cont’d…….
• Identical trials mean that the probability of success π is the same for each trial.
Independent trials mean that the response outcomes are independent random variables;
in particular, the outcome of one trial does not affect the outcome of another. These are
often called Bernoulli trials. The total number of successes, Y = y₁ + y₂ + ⋯ + yₙ, has
the binomial distribution with index n and parameter π, denoted by bin(n, π).
• The probability mass function for the possible outcomes y of Y is
P(y) = [n!/(y!(n − y)!)] π^y (1 − π)^(n−y),  y = 0, 1, 2, …, n,
where the binomial coefficient n!/(y!(n − y)!) counts the ways the y successes can be
distributed among the n trials.

Cont’d…….
• The binomial distribution has mean and variance
E(Y) = μ = nπ  and  Var(Y) = σ² = nπ(1 − π).
• Example: suppose a quiz has 10 multiple-choice questions, with five possible answers
for each. A student who is completely unprepared randomly guesses the answer for
each question. Let Y denote the number of correct responses. The probability of a
correct response is 0.20 for a given question, so n = 10 and π = 0.20. The probability of
y = 0 correct responses, and hence n − y = 10 incorrect ones, equals
P(0) = (0.20)^0 (0.80)^10 = 0.107.
• The probability of 1 correct response equals
P(1) = [10!/(1! 9!)] (0.20)^1 (0.80)^9 = 10(0.20)(0.80)^9 = 0.268.

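As a quick check, here is a minimal Python sketch of these calculations (assuming scipy is available):

    from scipy.stats import binom

    n, pi = 10, 0.20                 # 10 questions, P(correct guess) = 1/5
    print(binom.pmf(0, n, pi))       # P(Y = 0) ≈ 0.107
    print(binom.pmf(1, n, pi))       # P(Y = 1) ≈ 0.268
    print(binom.mean(n, pi))         # E(Y) = nπ = 2.0
    print(binom.var(n, pi))          # Var(Y) = nπ(1 − π) = 1.6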
Cont’d……
• The binomial distribution is always symmetric when π = 0.50. For fixed n, it becomes more
skewed as π moves toward 0 or 1. For fixed π, it becomes more bell-shaped as n increases.
When n is large, it can be approximated by a normal distribution with μ = nπ and
σ = √(nπ(1 − π)). A guideline is that the expected numbers of outcomes of the two types, nπ and n(1 − π), should
both be at least about 5. For π = 0.50 this requires only n ≥ 10, whereas π = 0.10 (or π = 0.90)
requires n ≥ 50. When π gets nearer to 0 or 1, larger samples are needed before a symmetric,
bell shape occurs.
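The sketch below (again assuming scipy) illustrates the quality of the approximation right at the borderline of the guideline, where nπ = 5:

    from math import sqrt
    from scipy.stats import binom, norm

    n, pi = 50, 0.10                       # nπ = 5, the borderline case of the guideline
    mu, sigma = n * pi, sqrt(n * pi * (1 - pi))
    print(binom.cdf(8, n, pi))             # exact P(Y ≤ 8) ≈ 0.942
    print(norm.cdf(8.5, mu, sigma))        # continuity-corrected normal approximation ≈ 0.950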
Multinomial Distribution
• Some trials have more than two possible outcomes. For example, the outcome for a
driver in an auto accident might be recorded using the categories “uninjured,”
“injury not requiring hospitalization,” “injury requiring hospitalization,” “fatality.”
When the trials are independent with the same category probabilities for each trial,
the distribution of counts in the various categories is the multinomial.
• Let y_ij = 1 if trial i has outcome in category j, and y_ij = 0 otherwise. Then
y_i = (y_i1, y_i2, …, y_ic) represents a multinomial trial, with Σ_j y_ij = 1; for instance,
(0, 0, 1, 0) denotes outcome in category 3 of four possible categories. Note that y_ic is
redundant, being linearly dependent on the others.
Cont’d…..
• Let n_j = Σ_i y_ij denote the number of trials having outcome in category j. The counts
(n_1, n_2, …, n_c) have the multinomial distribution.
• Let π_j = P(Y_ij = 1) denote the probability of outcome in category j for each trial.
• The multinomial probability mass function is
P(n_1, n_2, …, n_{c−1}) = [n!/(n_1! n_2! ⋯ n_c!)] π_1^{n_1} π_2^{n_2} ⋯ π_c^{n_c}.
• Since Σ_j n_j = n, this is (c − 1)-dimensional, with n_c = n − (n_1 + ⋯ + n_{c−1}). The
binomial distribution is the special case with c = 2.
• For the multinomial distribution,
E(n_j) = nπ_j,  Var(n_j) = nπ_j(1 − π_j),  Cov(n_j, n_k) = −nπ_j π_k.
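A minimal sketch of evaluating such probabilities (assuming scipy; the category probabilities here are hypothetical, not from the text):

    from scipy.stats import multinomial

    # Hypothetical probabilities for the four accident-outcome categories
    probs = [0.60, 0.25, 0.10, 0.05]
    n = 20
    rv = multinomial(n, probs)
    print(rv.pmf([12, 5, 2, 1]))      # P(n1 = 12, n2 = 5, n3 = 2, n4 = 1)
    print([n * p for p in probs])     # expected counts E(n_j) = n * π_j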
Poisson Distribution
• The Poisson distribution is used for counts of events that occur randomly over time or space,
when outcomes in disjoint periods or regions are independent. It also applies as an
approximation for the binomial when n is large and π is small, with μ = nπ. For example,
suppose Y = number of deaths today in auto accidents in Italy (rather than the number of
accidents). Then Y has an upper bound. If each of the 50 million people driving in Italy is an
independent trial with probability 0.0000003 of dying today in an auto accident, then the
number of deaths Y is a bin(50000000, 0.0000003) variate. This is approximately Poisson with
μ = (50000000)(0.0000003) = 15.
• The probability mass function of the Poisson distribution is
P(y) = e^(−μ) μ^y / y!,  y = 0, 1, 2, …
• It satisfies E(Y) = Var(Y) = μ. It is unimodal with mode equal to the integer part of μ.
• Note: A key feature of Poisson distribution is that its variance equals its mean.
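The approximation in this example is easy to check directly, as in the sketch below (assuming scipy):

    from scipy.stats import binom, poisson

    n, pi = 50_000_000, 0.0000003
    mu = n * pi                        # 15
    print(binom.pmf(10, n, pi))        # exact binomial probability of 10 deaths ≈ 0.0486
    print(poisson.pmf(10, mu))         # Poisson approximation ≈ 0.0486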
Statistical Inference For A Proportion
• In practice, the probability distribution assumed for the response variable has
unknown parameter values. Using sample data, we estimate the parameters. This
section introduces the use of sample data to make inferences about the binomial
and multinomial parameters, via the method of maximum likelihood.
Likelihood Function And Maximum Likelihood Estimation
• The parametric approach to statistical modeling assumes a family of probability
distributions for the response variable. For a particular family, we can substitute the
observed data into the formula for the probability function and then view how that
probability depends on the unknown parameter value. For example, in n = 10 trials,
suppose a binomial count equals y = 0. From the binomial formula with parameter
π, the probability of this outcome equals
l(π) = [10!/(0! 10!)] π^0 (1 − π)^10 = (1 − π)^10.
Cont’d

• This probability is defined for all the potential values of π between 0 and 1. The probability of
the observed data, expressed as a function of the parameter, is called the likelihood function.
With y = 0 successes in n = 10 trials, the binomial likelihood function is l(π) = (1 − π)^10,
defined for π between 0 and 1. From the likelihood function, if π = 0.40, for instance, the
probability that Y = 0 is l(0.40) = (0.60)^10 = 0.006. Likewise, if π = 0.20 then
l(0.20) = (0.80)^10 = 0.107, and if π = 0.0 then l(0.0) = (1.0)^10 = 1.0.
• The maximum likelihood estimate of a parameter is the parameter value for which the
probability of the observed data takes its greatest value (It is the parameter value at which the
likelihood function takes its maximum). Thus, when n = 10 trials have y = 0 successes, the
maximum likelihood estimate of π equals 0.0. This means that the result y = 0 in n = 10 trials is
more likely to occur when π = 0.00 than when π equals any other value.
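The sketch below (assuming numpy) evaluates this likelihood over a grid and confirms where it is maximized:

    import numpy as np

    # Binomial likelihood for y = 0 successes in n = 10 trials: l(π) = (1 − π)**10
    pi = np.linspace(0.0, 1.0, 1001)
    likelihood = (1 - pi) ** 10
    print(pi[np.argmax(likelihood)])   # 0.0, the maximum likelihood estimate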
Cont’d
• In general, for the binomial outcome of y successes in n trials, the maximum likelihood
estimate of π equals p = y/n. This is the sample proportion of successes for the n trials. If we
observe y = 6 successes in n = 10 trials, then the maximum likelihood estimate of π equals p =
6/10 = 0.60.
• Denote each success by a 1 and each failure by a 0. Then the sample proportion equals the
sample mean of the results of the individual trials. For instance, for four failures followed by six
successes in 10 trials, the data are 0,0,0,0,1,1,1,1,1,1, and the sample mean is
p = (0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 1)/10 = 0.60.
• Thus, results that apply to sample means with random sampling, such as the Central Limit
Theorem (large-sample normality of its sampling distribution) and the Law of Large Numbers
(convergence to the population mean as n increases) apply also to sample proportions. The
abbreviation ML symbolizes the term maximum likelihood. The ML estimate is often denoted
by the parameter symbol with ˆ (a “hat”) over it; for instance, π̂ denotes the ML estimate of π.
Cont’d…….
• Before we observe the data, the value of the ML estimate is unknown. The estimate is
then a variate having some sampling distribution. We refer to this variate as an
estimator and to its value for the observed data as an estimate. Estimators based on the
method of maximum likelihood are popular because they have good large-sample
behavior. Most importantly, it is not possible to find good estimators that are more
precise, in terms of having smaller large-sample standard errors.
Significance Test About A Binomial Proportion
For the binomial distribution, we now use the ML estimator in statistical inference for
the parameter π. The ML estimator is the sample proportion, p. The sampling
distribution of the sample proportion p has mean and standard error
E(p) = π,  σ(p) = √(π(1 − π)/n).
• As the number of trials n increases, the standard error of p decreases toward zero; that is, the
sample proportion tends to be closer to the parameter value π. The sampling distribution of p
is approximately normal for large n. This suggests large-sample inferential methods for π.
• Consider the null hypothesis H₀: π = π₀ that the parameter equals some fixed value π₀. The test statistic
z = (p − π₀) / √(π₀(1 − π₀)/n)
divides the difference between the sample proportion p and the null hypothesis value π₀ by the
null standard error of p. The null standard error is the one that holds under the assumption
that the null hypothesis is true. For large samples, the null sampling distribution of the z test
statistic is the standard normal – the normal distribution having a mean of 0 and standard
deviation of 1. The z test statistic measures the number of standard errors that the sample
proportion falls from the null hypothesized proportion.
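As an illustration, the sketch below (assuming scipy) applies this z test to the abortion-attitudes data used in the confidence interval example that follows; taking y = 400 successes out of n = 893 is an assumption consistent with the quoted p = 0.448:

    from math import sqrt
    from scipy.stats import norm

    y, n, pi0 = 400, 893, 0.50         # test H0: π = 0.50
    p = y / n                          # sample proportion ≈ 0.448
    z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    print(z)                           # ≈ −3.1
    print(2 * norm.sf(abs(z)))         # two-sided P-value ≈ 0.002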
Confidence Intervals For A Binomial Proportion
• A significance test merely indicates whether a particular value for a parameter (such as 0.50) is
plausible. We learn more by constructing a confidence interval to determine the range of
plausible values. Let SE denote the estimated standard error of p. A large-sample 100(1 − α)%
confidence interval for π has the formula
p ± z_(α/2)(SE), with SE = √(p(1 − p)/n),   (*)
• where z_(α/2) denotes the standard normal percentile having right-tail probability equal to α/2; for
example, for 95% confidence, α = 0.05 and z_(0.025) = 1.96.
• This formula substitutes the sample proportion p for the unknown parameter π in the true
standard error σ(p) = √(π(1 − π)/n).
Cont’d
• Example: For the attitudes about abortion example just discussed, p = 0.448 for n = 893
observations. The 95% confidence interval equals
0.448 ± 1.96√((0.448)(0.552)/893),
that is, 0.448 ± 0.033, or (0.415, 0.481).
• We can be 95% confident that the population proportion of Americans in 2002 who favored
legalized abortion for married pregnant women who do not want more children is between
0.415 and 0.481.
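A minimal sketch of this computation:

    from math import sqrt

    p, n = 0.448, 893
    se = sqrt(p * (1 - p) / n)                   # estimated standard error
    print(p - 1.96 * se, p + 1.96 * se)          # ≈ (0.415, 0.481)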
• Formula (*) is simple. Unless π is close to 0.50, however, it does not work well unless n is very
large. Consider its actual coverage probability, that is, the probability that the method produces
an interval that captures the true parameter value. This may be quite a bit less than the
nominal value (such as 95%). It is especially poor when π is near 0 or 1.
Cont’d……
• Here is a simple alternative interval that approximates this one, having a similar
midpoint in the 95% case but being a bit wider: add 2 to the number of successes
and 2 to the number of failures (and thus 4 to n) and then use the ordinary formula
(*) with the estimated standard error. For example, with nine successes in 10 trials,
you find p̃ = (9 + 2)/(10 + 4) = 0.786 and obtain the confidence interval
0.786 ± 1.96√((0.786)(0.214)/14), which is 0.786 ± 0.215, or (0.57, 1.00). This simple
method, sometimes called the Agresti–Coull confidence interval, works well even for
small samples.
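A sketch of the Agresti–Coull calculation for this example:

    from math import sqrt

    y, n = 9, 10
    y_ac, n_ac = y + 2, n + 4                    # add 2 successes and 2 failures
    p_ac = y_ac / n_ac                           # ≈ 0.786
    se = sqrt(p_ac * (1 - p_ac) / n_ac)
    print(p_ac - 1.96 * se, min(1.0, p_ac + 1.96 * se))   # ≈ (0.57, 1.00)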
More On Statistical Inference For Discrete Data

• We have just seen how to construct a confidence interval for a proportion using an
estimated standard error or by inverting results of a significance test using the null
standard error. In fact, there are three ways of using the likelihood function to
conduct inference (confidence intervals and significance tests) about parameters.
They apply to any parameter in a statistical model, but we will illustrate using the
binomial parameter.
Wald, Likelihood-Ratio, And Score Inference
• Let β denote an arbitrary parameter. Consider a significance test of H₀: β = β₀ (such
as H₀: β = 0, for which β₀ = 0).
• The simplest test statistic uses the large-sample normality of the ML estimator β̂.
• Let SE denote the standard error of β̂, evaluated by substituting the ML estimate for the
unknown parameter in the expression for the true standard error. The first large-sample
inference method has test statistic
z = (β̂ − β₀)/SE,
which, using the estimated standard error, has approximately a standard normal distribution.
Equivalently, z² has approximately a chi-squared distribution with df = 1. This type of statistic,
which uses the standard error evaluated at the ML estimate, is called a Wald statistic. The z or
chi-squared test using this test statistic is called a Wald test. You can refer z to the standard
normal table to get one-sided or two-sided P-values. Equivalently, for the two-sided alternative
Hₐ: β ≠ β₀, z² has a chi-squared distribution with df = 1.
• The P-value is then the right-tail chi-squared probability above the observed value. The two-tail
probability beyond ±z for the standard normal distribution equals the right-tail probability above
z² for the chi-squared distribution with df = 1. For example, the two-tail standard normal
probability of 0.05 that falls below −1.96 and above 1.96 equals the right-tail chi-squared
probability above z² = (1.96)² = 3.84 when df = 1.
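This equivalence is easy to verify numerically (assuming scipy):

    from scipy.stats import chi2, norm

    z = 1.96
    print(2 * norm.sf(z))          # two-tail standard normal probability ≈ 0.05
    print(chi2.sf(z ** 2, df=1))   # right-tail chi-squared probability above 3.84 ≈ 0.05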
• The second general purpose method uses the likelihood function through the ratio of two
maximizations of it:
• (i) The maximum over the possible parameter values that assume the null hypothesis,
• (ii) The maximum over the larger set of possible parameter values, permitting the null or the
alternative hypothesis to be true.
• Let l₀ denote the maximized value of the likelihood function under the null hypothesis, and let l₁
denote the maximized value more generally. For instance, when there is a single parameter β, l₀ is
the likelihood function calculated at β₀, and l₁ is the likelihood function calculated at the ML
estimate β̂. Then l₁ is always at least as large as l₀, because l₁ refers to maximizing over a larger
set of possible parameter values.
• The likelihood-ratio test statistic equals
−2 log(l₀/l₁) = −2(L₀ − L₁),
where L₀ = log l₀ and L₁ = log l₁ denote the maximized log-likelihood functions.


• The test statistic must be nonnegative, and relatively small values of l₀/l₁ yield large values of
−2 log(l₀/l₁) and strong evidence against H₀. The reason for taking the log transform and
doubling is that it yields an approximate chi-squared sampling distribution. Under H₀, the
likelihood-ratio test statistic has a large-sample chi-squared distribution with df = 1. Software
can find the maximized likelihood values and the likelihood-ratio test statistic.
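For the binomial parameter, the statistic can be computed directly from the log-likelihood, as in this sketch (assuming numpy and scipy):

    import numpy as np
    from scipy.stats import chi2

    # Likelihood-ratio test of H0: π = 0.50 with y = 9 successes in n = 10 trials
    y, n, pi0 = 9, 10, 0.50
    def loglik(pi):
        return y * np.log(pi) + (n - y) * np.log(1 - pi)
    lr = -2 * (loglik(pi0) - loglik(y / n))   # ML estimate is p = y/n
    print(lr)                                 # ≈ 7.36
    print(chi2.sf(lr, df=1))                  # P-value ≈ 0.007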
• A third possible test, called the score test (referred to in some literature as the Lagrange
multiplier test), is based on the slope and expected curvature of the log-likelihood function at
the null value β₀.
Cont’d
• It utilizes the size of the score function
u(β) = ∂L(β)/∂β,
evaluated at the null value β₀. For a binomial proportion, the score test statistic simplifies to
z = (p − π₀)/√(π₀(1 − π₀)/n), the z statistic that uses the null rather than the estimated
standard error.
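A sketch of the binomial score statistic for the same example:

    from math import sqrt

    # Score test of H0: π = 0.50 with y = 9 successes in n = 10 trials
    y, n, pi0 = 9, 10, 0.50
    p = y / n
    z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)   # null standard error in the denominator
    print(z)                                    # ≈ 2.53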
• The Wald, likelihood-ratio, and score tests are the three major ways of constructing
significance tests for parameters in statistical models. For ordinary regression
models assuming a normal distribution for Y, the three tests provide identical
results. In other cases, for large samples they have similar behavior when H₀ is true.
Small-Sample Binomial Inference
• For inference about a proportion, the large-sample two-sided z score test and the confidence
interval based on that test (using the null hypothesis standard error) perform reasonably well
when nπ₀ ≥ 5 and n(1 − π₀) ≥ 5. When π₀ is not near 0.50, the normal P-value approximation is
better for the test with a two-sided alternative than for a one-sided alternative; a probability
that is “too small” in one tail tends to be approximately counter-balanced by a probability that
is “too large” in the other tail.
• For small sample sizes, it is safer to use the binomial distribution directly (rather than a
normal approximation) to calculate P-values. To illustrate, consider testing H₀: π = 0.50 against
Hₐ: π > 0.50 when the number of successes is y = 9 in n = 10 trials. The exact P-value, based on
the right tail of the null binomial distribution with π = 0.50, is
P(Y ≥ 9) = P(9) + P(10) = 10(0.50)^9(0.50) + (0.50)^10 = 11/1024 = 0.011.
• For the two-sided alternative Hₐ: π ≠ 0.50, the P-value is
P(Y ≥ 9 or Y ≤ 1) = 2 × P(Y ≥ 9) = 0.021.
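These exact P-values can be obtained directly (assuming scipy ≥ 1.7 for scipy.stats.binomtest):

    from scipy.stats import binom, binomtest

    y, n, pi0 = 9, 10, 0.50
    print(binom.sf(y - 1, n, pi0))                               # P(Y ≥ 9) ≈ 0.011
    print(binomtest(y, n, pi0, alternative='greater').pvalue)    # same one-sided exact test
    print(binomtest(y, n, pi0, alternative='two-sided').pvalue)  # two-sided ≈ 0.021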
