Empirical IO Notes v4-1
Øyvind Thomassen∗
November 2, 2019
∗ [email protected], https://sites.google.com/site/oyvindthomassen, Seoul National University, Department of Economics.
Contents
1 Structural models 3
1.1 Reasons to use a structural model . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Counterfactuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 More on the merged firm’s pricing . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Practical issues in structural modelling . . . . . . . . . . . . . . . . . . . . . 7
2 Basic econometrics 8
2.1 Causal effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Structural equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Randomized experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Instrumental variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 OLS and IV are both method-of-moments estimators . . . . . . . . . . . . . 11
3 Large-sample theory 12
3.1 Convergence in probability and law of large numbers . . . . . . . . . . . . . 12
3.2 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Intuition for central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . 13
5 Discrete-choice demand 30
5.1 Utility maximization subject to a budget constraint . . . . . . . . . . . . . . 31
5.2 The outside good and normalizations . . . . . . . . . . . . . . . . . . . . . . 33
5.3 The role of εij and common choices for its distribution . . . . . . . . . . . . 33
5.4 The role of ξj in fitting the model to the data . . . . . . . . . . . . . . . . . 35
5.5 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.6 Finding ξj . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.7 Random coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.8 Practical issues in estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.9 The firm’s pricing problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6 Berry, Levinsohn and Pakes (1995) [BLP] 51
6.1 Brief overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3 Demand model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.4 Supply model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.5 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.6 Discussion of GMM requirements . . . . . . . . . . . . . . . . . . . . . . . . 55
6.7 Practical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7 Nevo (2001) 58
7.1 Brief overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.5 Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.6 Using implied markups to determine conduct . . . . . . . . . . . . . . . . . . 63
1 Structural models
• In the standard framework of econometrics there is a function f that relates explanatory variables x, parameters θ and unobservables e to a dependent variable y by a claim that at the true parameter value θ0,

  y = f(x, θ0, e).
• Consider the challenge of estimating the demand for J differentiated products (where
each consumer chooses only one of the alternatives).
• A structural model may assume that a consumer i chooses the alternative j that maximizes an indirect utility function (conditional on j)

  uij = xjβ − αpj + eij. (1.1)

• That is (when x and ei gather (xj, pj) and eij, respectively, for all j),

  f(x, θ0, ei) = arg maxⱼ (xjβ − αpj + eij),  θ0 = (α, β). (1.2)

• If dij = 1[consumer i is observed to choose j], we can match the predictions of the model (1.2) to dij to estimate the parameters α and β.
• Alternatively, demand could be modelled directly as a system of log-linear demand equations (where αjk is the elasticity of the demand for j with respect to the price of k):

  ln qj = Σₖ₌₁ᴶ αjk ln pk,  j = 1, . . . , J. (1.3)
• Most empirical research in economics attempts to answer a question of the type ‘how would y change if x changed?’. For example:

  y = test scores,
  x = number of students per teacher.
• There might also be naturally occurring exogenous variation in x that could be used
to estimate this effect even without randomized experiments.
• One could imagine using results from other mergers, either in the same or in other
industries.
• But usually there are so many things that vary across settings that it is unclear whether effects found in one setting really imply anything at all about another setting.
• By contrast, the class size example can use observations for a large number of different
classes that can reasonably be assumed independent.
• The structural approach to the merger effect problem is to dig into the economic structure that underlies the effect of x on y: individual demand functions and firms’ profit-maximizing behaviour.
• By using observations of individual beer demand, we can find out how market demand responds to price changes, i.e. the derivatives of the demand of each beer brand j with respect to its own price, ∂Qj/∂pj, as well as the prices of competing products, ∂Qj/∂pk.
• These derivatives (or elasticities) could be estimated with a model like (1.3), but it is
likely that when the market structure, and therefore prices, change, the price elasticities
also change.
• On the other hand, it seems more plausible that consumer preferences remain stable,
and therefore that a demand model like (1.2) will be applicable also when the ownership
structure changes.
1.2 Counterfactuals
• One of the great advantages of using a structural model is the ability to answer counterfactual questions, like the effect of a merger or a tax reform that has not taken place.
• Once a fully specified structural model has been estimated, provided we believe the
assumptions (such as functional forms etc.), we know everything we need to know
about the market, much like in a numerical example in a microeconomics textbook,
and we can find new equilibrium prices.
• Often solving for new equilibrium prices after a hypothetical change in the economic
structure (merger, tax) is called simply a counterfactual.
• This can be illustrated by looking at the merger example in some more detail.
• Suppose for simplicity that there are two products j = 1, 2 and that initially the
products are owned by two separate single-product firms, named 1 and 2 after their
products.
• For each j, the first-order condition for maximizing profit (pj − mcj)Qj(pj, pk) (assuming constant marginal cost mcj, and that prices are a Nash equilibrium) gives an expression for the markup, the so-called Lerner index, which holds for j = 1, 2:

  (pj − mcj) ∂Qj/∂pj + Qj = 0 (1.4)

  (pj − mcj)/pj = 1 / [−(∂Qj/∂pj)(pj/Qj)], (1.5)
which says that the more responsive demand is to price, the smaller the markup will
be.
• After estimating a demand function like (1.2) we know ∂Qj/∂pj for both j, and pj and Qj are observed.
• We can then solve (1.5) for mcj for j = 1, 2.
• We want to know how the markup changes if the two firms merge.
• The merged firm solves the problem of choosing pj to maximize joint profit, giving for each j the first-order condition

  (pj − mcj) ∂Qj/∂pj + Qj + (pk − mck) ∂Qk/∂pj = 0. (1.6)
• Since mcj for j = 1, 2 are known (recovered from (1.5)), this is a system of two equations—nonlinear, since prices enter Q—in two unknowns, which can usually be solved numerically.
• The solutions, the new equilibrium prices, then answer the question of how prices will
react to the merger.
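These two steps—recovering mcj from the pre-merger first-order condition (1.4) and solving the merged firm's conditions (1.6) for new prices—can be sketched numerically. A minimal illustration with an invented linear demand system (all numbers hypothetical; in practice the estimated demand model (1.2) would be used instead):

```python
# Sketch of the merger counterfactual with hypothetical linear demand
# Qj = aj - bj*pj + c*pk (all parameter values invented for illustration).
import numpy as np
from scipy.optimize import fsolve

a, b, c = np.array([10.0, 8.0]), np.array([2.0, 1.5]), 0.5
p_pre = np.array([3.0, 3.2])          # observed pre-merger prices

def Q(p):
    return a - b * p + c * p[::-1]    # each Qj depends on own and rival price

# Step 1: recover marginal costs from the pre-merger FOC (1.4):
# (pj - mcj) dQj/dpj + Qj = 0, with dQj/dpj = -bj here.
mc = p_pre - Q(p_pre) / b

# Step 2: solve the merged firm's FOCs (1.6) for the new prices:
# (pj - mcj) dQj/dpj + Qj + (pk - mck) dQk/dpj = 0, with dQk/dpj = c.
def merged_foc(p):
    return (p - mc) * (-b) + Q(p) + (p[::-1] - mc[::-1]) * c

p_post = fsolve(merged_foc, p_pre)    # new equilibrium prices (higher than p_pre)
```

With substitute products (c > 0) the solved prices exceed the pre-merger prices, as the text predicts.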
– The margin pk − mck relative to the price pj; the more profitable each unit of k is, the more worthwhile it is to raise pj, since consumers who switch to k will earn the merged firm a higher unit profit.
• As opposed to linear models like (1.3), structural models usually require the solution to some kind of optimization problem just to calculate the value of the model’s prediction (f) at each value of the parameters, like the max operator in (1.2).
• The second issue is that because the model (f ) is a nonlinear function of the parameters,
the estimates cannot be calculated simply using matrix operations on the data.
1. Coming up with a model f that makes economic sense and can be estimated in
practice. Writing a computer program to calculate the value of f for each value
of the parameters.
2. Coming up with sensible restrictions (moment conditions or error distributions)
to form an econometric objective function (GMM or likelihood function) and an
expression for standard errors.
3. Using an algorithm for numerical minimization to minimize the objective function.
4. Using the estimates to answer the economic question of interest, e.g. running a counterfactual.
2 Basic econometrics
• For any two random variables y and x, a common way of predicting y based on x is the linear projection LP(y|x) = β0 + β1x, where

  β1 = Cov(x, y)/Var(x),  β0 = E(y) − β1E(x). (2.1)

• LP(y|x) minimizes the mean squared prediction error E[(y − h(x))²] in the class of all functions h that are linear in x.
• Consider a population of new car models, where y is units sold and x is price.
• Suppose we observe that sales are higher when price is higher: Cov(x, y) > 0.
• Something else is probably going on: cars with higher price have more space, more
powerful engine, leather seats, etc.
2.2 Structural equation
• As the term is used at the beginning of this document, a structural model is always a structural equation, while a structural equation need not be a structural model, i.e. it does not have to explicitly model the optimizing behaviour of economic agents.
y = δ0 + δ1 x + ω. (2.4)
• The error term ω contains all other factors (than x) that affect y (space, engine power
etc.).
• The linear projection can be written in the same linear form as the structural equation:

  y = β0 + β1x + u, (2.5)

  where u = y − LP(y|x).
• But it is only if Cov(x, ω) = 0 and E(ω) = 0 that δ1 = β1 and δ0 = β0 , and that OLS
can be used to obtain the causal effect.
• β1 does not take into account the fact that some of the change in y as x increases
comes from ω, not from x itself.
• To see this:

  β1 = δ1 + Cov(x, ω)/Var(x). (2.6)

  The expression follows since

  Cov(x, y) = Cov(x, δ0 + δ1x + ω) = δ1Var(x) + Cov(x, ω),

  and dividing through by Var(x) gives Cov(x, y)/Var(x) = δ1 + Cov(x, ω)/Var(x).
• Note that the structural equation is not simply a feature of the joint distribution of y
and x.
• The joint distribution of y and x does not, by itself, imply anything about the structural model.
• Only with additional assumptions can we draw conclusions about the structural equa-
tion from the joint distribution of y and x.
• The most standard such assumption is that Cov(x, ω) = 0, also called exogeneity of x.
• Use a computer program to generate an arbitrary price for each car model.
• We have now created a new population (or sample) where Cov(x, ω) = 0 by design.
• Next we offer the cars for sale at these prices, and obtain a value of y corresponding
to each value of x.
• Now the linear projection corresponds to the structural equation, since x is exogenous,
and OLS works.
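This thought experiment can be sketched in a small simulation (all numbers invented): when x and ω are correlated, the projection slope is biased exactly as in (2.6); when x is regenerated at random, the projection recovers δ1.

```python
# Simulation sketch: OLS recovers the linear-projection slope β1, which
# equals the structural δ1 only when Cov(x, ω) = 0. All values invented.
import numpy as np

rng = np.random.default_rng(0)
n, d0, d1 = 100_000, 1.0, -2.0              # hypothetical structural parameters

# Endogenous case: x and ω share a common component, so Cov(x, ω) = 1 > 0.
common = rng.normal(size=n)
x = common + rng.normal(size=n)             # Var(x) = 2
w = common + rng.normal(size=n)             # structural error ω
y = d0 + d1 * x + w
b1_endog = np.cov(x, y)[0, 1] / np.var(x)   # ≈ d1 + Cov(x,ω)/Var(x) = -1.5

# 'Experiment': regenerate x independently of ω, as in the randomized design.
x_exog = rng.normal(size=n)
y_exog = d0 + d1 * x_exog + w
b1_exog = np.cov(x_exog, y_exog)[0, 1] / np.var(x_exog)   # ≈ d1 = -2
```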
• Suppose the price faced by consumers includes a sales tax, and that the tax is somewhat
arbitrary – not correlated with ω.
• This is also a natural experiment: an event that creates exogenous variation in the endogenous variable x.
• We can now use (2.7) and (2.4) to find the causal effect δ1 :
  Cov(z, y) = Cov(z, δ0 + δ1x + ω)
            = δ1Cov(z, x) + Cov(z, ω)
            = δ1Cov(z, x),

  so that

  δ1 = Cov(z, y)/Cov(z, x).
• With a random sample (yi , xi , zi ) we can use the sample versions of the covariances in
the last line to get a good estimator of δ1 (called the IV estimator ).
• The requirements for the linear projection can be rewritten as E(u) = 0 and E(xu) =
Cov(x, u) + E(x)E(u) = Cov(x, u) = 0.
• The OLS estimator (β̂0 , β̂1 ) and the IV estimator (γ̂0 , γ̂1 ) both solve the sample versions
of the respective exogeneity conditions:
  Σᵢ₌₁ⁿ xi(yi − β̂0 − β̂1xi) = 0,   Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi) = 0,
  Σᵢ₌₁ⁿ zi(yi − γ̂0 − γ̂1xi) = 0,   Σᵢ₌₁ⁿ (yi − γ̂0 − γ̂1xi) = 0.
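The sample exogeneity conditions above can be solved directly as linear systems. A sketch on simulated data (all coefficient values invented), writing both estimators as solutions of their respective moment equations:

```python
# Sketch: OLS and IV as method-of-moments estimators (simulated data,
# invented parameters; structural slope is 2).
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)                 # instrument: shifts x, unrelated to ω
w = rng.normal(size=n)                 # structural error ω
x = z + w + rng.normal(size=n)         # endogenous: Cov(x, ω) = 1
y = 1.0 + 2.0 * x + w                  # structural equation, δ1 = 2

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # solves Σ xi(yi - xi'β̂) = 0
gamma_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # solves Σ zi(yi - xi'γ̂) = 0
# beta_ols[1] is biased upward (≈ 2 + 1/3); gamma_iv[1] ≈ 2.
```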
3 Large-sample theory
• Let {xi } = x1 , x2 , . . . be a sequence of L × 1 random vectors that are independent and
identically distributed (i.i.d.) with mean E(xi ) = µ and (L × L) covariance matrix
Var(xi ) = Σ.
• Then the sample mean x̄n = (1/n) Σᵢ₌₁ⁿ xi has mean and variance

  E(x̄n) = (1/n)·n·E(xi) = μ,
  Var(x̄n) = (1/n²)·n·Var(xi) = Σ/n.
• Concretely, the probability that the distance between x̄n and μ is greater than any given small number tends to zero as n goes to infinity. This is called convergence in probability, and written as

  x̄n →p μ.
• The fact that the sample mean across an i.i.d. sample converges in probability to the
population mean is called the Law of Large Numbers.
• The (Lindeberg–Lévy) Central Limit Theorem says that √n(x̄n − μ) converges in distribution to a random vector with a normal distribution as n gets large:

  √n(x̄n − μ) →d N(0, Σ).
3.3 Intuition for central limit theorem
• To get a sense of why this holds (without any formal proof), consider a scalar (L = 1) example, where xi is the number shown on a die when thrown the i-th time.

• Clearly xi has a (discrete) uniform distribution where the probability of each outcome is 1/6.
• How does x̄n for such uniformly distributed random variables get a bell-shaped distri-
bution as n increases?
• On the other hand, for n = 2 a point in the middle of the support, 7/2, can come about in six different ways, as {1, 6}, {2, 5}, {3, 4}, {4, 3}, {5, 2} or {6, 1}.

• Its probability is therefore 6/6² = 1/6.
• Continuing to an arbitrary n, the extreme outcomes 1 and 6 have the low probability 1/6ⁿ, while the probability of outcomes near 3.5 is much higher.
• As n increases, x̄n takes on values only between 1 and 6, and gets increasingly concen-
trated at 3.5 as the variance approaches zero.
• By contrast, √n[x̄n − E(xi)] has a support that approaches (−∞, ∞) as n increases.

• Figure 1 shows histograms representing the distributions of x̄n (left column) and √n[x̄n − E(xi)] (right column) for rolling a die n times, with n = 1, 2, 3, 10, 100, 10000.
• To see what the distribution of x̄n is for a given value of n, we must generate a large
number, M , of samples, each of size n and calculate x̄n for each of the M samples.
• The left column serves to illustrate the Law of Large Numbers, as x̄n concentrates
around 3.5 as n increases.
• The right column illustrates the Central Limit Theorem, as the distribution of √n[x̄n − E(xi)] approaches the bell shape (with centre zero) characteristic of the normal distribution.
Figure 1: Distribution of 20,000 draws of x̄n = (1/n) Σᵢ₌₁ⁿ xi (left column) and √n[x̄n − E(xi)] (right column), for n = 1, 2, 3, 10, 100, 10000, where xi has a discrete uniform distribution with support {1, . . . , 6}.
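The Monte Carlo experiment behind Figure 1 can be sketched in a few lines: draw M samples of n die rolls each, and compare the dispersion of x̄n with that of the CLT-scaled deviation.

```python
# Monte Carlo sketch of Figure 1: sample mean of die rolls, M samples per n.
import numpy as np

rng = np.random.default_rng(2)
M = 20_000                                   # number of simulated samples

for n in (1, 2, 10, 1000):
    rolls = rng.integers(1, 7, size=(M, n))  # M samples of n die throws
    xbar = rolls.mean(axis=1)
    z = np.sqrt(n) * (xbar - 3.5)            # CLT-scaled deviation
    # xbar.std() shrinks toward 0 (LLN); z.std() stays near sqrt(35/12) (CLT).
    print(n, xbar.std(), z.std())
```

Plotting histograms of `xbar` and `z` for each n reproduces the two columns of the figure.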
4 Generalized method of moments
• Let w1 , . . . , wn be a sample from a sequence of data vectors {wi }, and g(wi , θ) an L × 1
vector that depends on wi and a parameter vector θ.
• Suppose the requirements for GMM hold. Then the estimator θ̂ has asymptotic variance Avar(θ̂) = A⁻¹BA⁻¹, where

  A = G′WG,  B = G′WVWG,  V = E[g(wi, θ0)g(wi, θ0)′],

  and Avar(θ̂) is estimated by replacing G and V with

  Ĝ = (1/n) Σᵢ₌₁ⁿ ∂g(wi, θ̂)/∂θ′,  V̂ = (1/n) Σᵢ₌₁ⁿ g(wi, θ̂)g(wi, θ̂)′.

• The standard errors s.e.(θ̂) are the square roots of the diagonal of Avar̂(θ̂).
• We can form 1 − α confidence intervals for each element θk of θ as

  θ̂k ± zα/2 · s.e.(θ̂k),

  where zα/2 is defined by Prob[Z > zα/2] = α/2 for Z ∼ N(0, 1).
• Most commonly, letting α = 0.05, zα/2 = z0.025 = 1.96. Then the confidence interval
does not contain zero if |θ̂k |/s.e.(θ̂k ) > 1.96, i.e. provided that the magnitude of the
parameter estimate is about twice as large as the standard error.
• The optimal weighting matrix (that minimizes the variance of the estimator) is W =
V̂ −1 , where V̂ is a consistent estimator of the covariance matrix V .
• Let θ̃ be a consistent estimator of θ and let V̂ = (1/n) Σᵢ₌₁ⁿ g(wi, θ̃)g(wi, θ̃)′.
• Then, using W = V̂⁻¹ as weighting matrix means that B̂ = Â, and we get a simpler expression for the asymptotic variance:

  Avar̂(θ̂) = (Ĝ′V̂⁻¹Ĝ)⁻¹.

• It can be useful in practice to note that, in the resulting variance estimate for θ̂, all the divisions by n cancel, so we get:

  { [Σᵢ₌₁ⁿ ∂g(wi, θ̂)/∂θ′]′ [Σᵢ₌₁ⁿ g(wi, θ̂)g(wi, θ̂)′]⁻¹ [Σᵢ₌₁ⁿ ∂g(wi, θ̂)/∂θ′] }⁻¹.
• The identification requirement 3. is now E[xi(yi − xi′θ0)] = 0, or

  E[yi − xi′θ0] = 0 and E[xik(yi − xi′θ0)] = 0 for k = 2, . . . , K. (4.2)
• Since ∂g(wi, θ0)/∂θ′ = −xixi′, the rank requirement 4. is that the variance matrix E[xixi′] of xi be of rank K.
where W is a positive definite K × K matrix, which means that the value of (4.4) is
positive unless X 0 (y − Xθ) = 0, in which case its value is zero.
• Then the minimization problem in (4.1) is solved if there is a θ̂ that solves the equations

  X′X θ̂ = X′y, (4.5)

  where X′X is K × K and X′y is K × 1.
• Loosely speaking, since

  X′X/n = (1/n) Σᵢ₌₁ⁿ xixi′

  converges to E[xixi′] as n gets large, X′X has rank K with probability one for large samples.
• In any case, as long as X 0 X has full rank in the sample, it is nonsingular and (4.5),
called the normal equation, has a unique solution,
θ̂ = (X 0 X)−1 X 0 y. (4.6)
• Since W does not enter (4.6), we do not really need to choose a weighting matrix W in
order to calculate the estimates from the sample (all positive definite K × K matrices
W will result in the same estimator).
  and

  V̂ = (1/n) Σᵢ₌₁ⁿ [xi(yi − xi′θ̂)][xi(yi − xi′θ̂)]′ = (1/n) Σᵢ₌₁ⁿ (yi − xi′θ̂)² xixi′.
• Although the weighting matrix plays no role in deriving the OLS estimator, the choice
W = (X 0 X/n)−1 = −Ĝ−1 results in the usual (heteroskedasticity-robust) standard
errors.
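The resulting sandwich formula can be computed directly from the data matrices. A sketch on simulated heteroskedastic data (all values invented; `meat` and `bread` are just hypothetical variable names for the two pieces of the sandwich):

```python
# Sketch: heteroskedasticity-robust OLS standard errors via the sandwich
# (X'X)^{-1} [Σ e_i^2 x_i x_i'] (X'X)^{-1}, on simulated data.
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
e = rng.normal(size=n) * (1 + np.abs(x))        # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + e

theta = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimates
resid = y - X @ theta
meat = (X * resid[:, None] ** 2).T @ X          # Σ e_i^2 x_i x_i'
bread = np.linalg.inv(X.T @ X)
V = bread @ meat @ bread                        # robust variance of θ̂
se = np.sqrt(np.diag(V))                        # robust standard errors
```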
• Under the common textbook assumption of homoskedasticity, we replace each squared residual (yi − xi′θ̂)² with the sample average (1/n) Σᵢ₌₁ⁿ (yi − xi′θ̂)², so that

  Avar̂(θ̂) = (X′X)⁻¹ [ (1/n) Σᵢ₌₁ⁿ (yi − xi′θ̂)² · Σᵢ₌₁ⁿ xixi′ ] (X′X)⁻¹
           = (X′X)⁻¹ [ (1/n) Σᵢ₌₁ⁿ (yi − xi′θ̂)² ] X′X (X′X)⁻¹
           = [ (1/n) Σᵢ₌₁ⁿ (yi − xi′θ̂)² ] (X′X)⁻¹.
• With instruments, the identification requirement is now E[zi(yi − xi′θ0)] = 0, or

  E[zil(yi − xi′θ0)] = 0 for l = 1, . . . , L. (4.7)

• Since ∂g(wi, θ0)/∂θ′ = −zixi′, the rank requirement 4. is that the covariance matrix E(zixi′) be of rank K.
• As in the OLS case, the identification condition is a system of linear equations with θ0
as the vector of unknowns:
E[zi x0i ]θ0 = E[zi yi ]. (4.8)
• If L = K, the K × K matrix E[zixi′] is nonsingular, since it has rank K, so (4.8) has the unique solution θ0 = E[zixi′]⁻¹E[ziyi].
• If L > K, as long as the rank of E[zi x0i ] is K, the system is equivalent to one with K
independent linear equations, so there cannot be multiple solutions to (4.8).
• The sample analogue of (4.8) is the system of L equations

  Z′X θ = Z′y.

• Because of sampling variance, it is likely that these L equations are all at least slightly different, in the sense that the L × (K + 1) augmented matrix [Z′X | Z′y] of the system will have full rank.
• Then if L > K, there is no K × 1 vector θ that exactly satisfies all the L equations.
• We can still try to approximately satisfy each equation, but we need to make a choice
as to how to trade off errors in different equations.
• This choice is determined by the weighting matrix, which for the 2SLS estimator is W = (Z′Z/n)⁻¹, where Z stacks the zi′ vectors vertically, in the same way as X:

  Z = [z1′ ; . . . ; zn′]  (n × L). (4.10)
• This special case of 2SLS (with L = K) is called the instrumental variables (IV) estimator, and is

  θ̂ = (Z′X)⁻¹Z′y.
• Otherwise, we proceed by solving the first-order conditions for minimization for the estimator θ̂:

  0 = ∂/∂θ { [(y − Xθ̂)′Z] W [Z′(y − Xθ̂)] } = −2X′ZW[Z′(y − Xθ̂)]

  X′ZWZ′X θ̂ = X′ZWZ′y (4.12)

  θ̂ = (X′ZWZ′X)⁻¹X′ZWZ′y. (4.13)
  and

  V̂ = (1/n) Σᵢ₌₁ⁿ [zi(yi − xi′θ̂)][zi(yi − xi′θ̂)]′ = (1/n) Σᵢ₌₁ⁿ (yi − xi′θ̂)² zizi′.
• It is easy to see from these results that OLS is a special case of 2SLS where zi = xi .
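The 2SLS formula (4.13) with W = (Z′Z/n)⁻¹ can be computed directly. A sketch on simulated data (invented coefficients; here L = 3 instruments including the constant, K = 2 regressors):

```python
# Sketch of 2SLS, formula (4.13) with W = (Z'Z/n)^{-1}, simulated data.
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
w = rng.normal(size=n)                              # structural error
x = 0.7 * z1 + 0.7 * z2 + w + rng.normal(size=n)    # endogenous regressor
y = 1.0 + 2.0 * x + w                               # true slope is 2

X = np.column_stack([np.ones(n), x])                # K = 2
Z = np.column_stack([np.ones(n), z1, z2])           # L = 3 > K
W = np.linalg.inv(Z.T @ Z / n)

A = X.T @ Z @ W @ Z.T @ X
theta_2sls = np.linalg.solve(A, X.T @ Z @ W @ Z.T @ y)   # (4.13)
```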
• Since E[ Σₜ₌₁ᵀ zit(yit − xit′θ) ] = Σₜ₌₁ᵀ E[zit(yit − xit′θ)], a more easily interpretable assumption that implies (4.14) is

  E[zit(yit − xit′θ0)] = 0 for t = 1, . . . , T.
• The rank requirement 4. is now that E[ Σₜ₌₁ᵀ zitxit′ ] be of rank K.
• We create the following data matrices for a sample of size n. They look like the data matrices in (4.3) and (4.10), but for each observation i there are now T rows instead of only one:

  Z = [z1′ ; . . . ; zn′]  (nT × L),  X = [x1′ ; . . . ; xn′]  (nT × K),  y = [y1 ; . . . ; yn]  (nT × 1), (4.15)

  where zi′ now denotes the T × L block of instruments for observation i, and similarly xi′ is T × K and yi is T × 1.
– the same expression as in (4.11), although the data matrices are now the panel data
versions defined in (4.15). The estimator is derived in the same way as (4.13), and
therefore yields the same expression:
θ̂ = (X 0 ZW Z 0 X)−1 X 0 ZW Z 0 y.
  and

  V̂ = (1/n) Σᵢ₌₁ⁿ [ Σₜ₌₁ᵀ zit(yit − xit′θ̂) ][ Σₜ₌₁ᵀ zit(yit − xit′θ̂) ]′, (4.18)

  which can also be written as (1/n) Σᵢ₌₁ⁿ [zi(yi − xi′θ̂)][zi(yi − xi′θ̂)]′.
• Inspection of (4.16) reveals that the GMM estimator defined here is exactly the same
as if we ignored the panel structure of the data and simply formed a 2SLS estimator,
but where the sample size is nT instead of n.
• The panel structure of the data only shows up in the expression for V̂ in (4.18): the sample average is here over i only, while the covariance matrix for each observation i is between the objects Σₜ₌₁ᵀ zit(yit − xit′θ) that are sums over T.
• Nothing in this section, other than the simplification of the notation, depends on the
assumption that T is the same for each i; it would be possible to let the number of
time periods depend on i (this is called an unbalanced panel).
• No particular use has been made of the fact that panel data normally means that t
represents time periods.
• The estimator as developed so far would work equally well for a clustered sample, for instance where i represents school classes and t individual students in a class.
• The standard errors that will result from the V̂ given in (4.18) correspond to the
clustered standard errors recommended for this case (see Wooldridge (2010), (20.25),
p. 865 for the OLS case).
• The underlying principle in both the panel and clustered sample cases is that while
we are happy to assume that observations are i.i.d. across i, we believe there may be
some form of dependence between t for a given i.
• Both the consistency of the estimator and the variance calculation in (4.18) depend only on the i-observations being i.i.d., and allow for any form of dependence in the t dimension.
• We could use W = (Z 0 Z/n)−1 to get a panel 2SLS estimator, in which case the standard
errors would be as for 2SLS (but with the data matrices defined in (4.15)).
  Avar̂(θ̂) = { X′Z [ Σᵢ₌₁ⁿ ( Σₜ₌₁ᵀ zit(yit − xit′θ̂) )( Σₜ₌₁ᵀ zit(yit − xit′θ̂) )′ ]⁻¹ Z′X }⁻¹.
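The clustered covariance matrix (4.18) sums the moments within each cluster i before taking outer products. A minimal sketch with placeholder residuals (everything simulated; in practice `resid` would be yit − xit′θ̂ from the estimated model):

```python
# Sketch of the clustered moment covariance V̂ in (4.18): sum the moments
# z_it * e_it within each cluster i, then average outer products over i.
import numpy as np

rng = np.random.default_rng(6)
n, T, L = 500, 4, 2
z = rng.normal(size=(n, T, L))            # instruments for cluster i, member t
resid = rng.normal(size=(n, T))           # stand-in for residuals y_it - x_it'θ̂

g_i = (z * resid[:, :, None]).sum(axis=1) # Σ_t z_it e_it, one L-vector per i
V_hat = g_i.T @ g_i / n                   # (1/n) Σ_i g_i g_i'
```

Because the within-cluster sum is formed first, any dependence between the t-observations of a cluster is absorbed into V̂ without further assumptions.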
• Let wi = {{zi1 , . . . , ziM }, {xi1 , . . . , xiM }, (yi1 , . . . , yiM )0 }, where zim is Lm × 1, xim is
Km × 1, with Lm ≥ Km , and yim is a scalar.
• For a given i, different m are different variables rather than the same variables at
different times.
• For instance, with M = 2, yi1 and yi2 could be quantity supplied and demanded of
product i, respectively.
• Each m has its own set of moment conditions, so that the vector of moments is:

  g(wi, θ) = [ zi1(yi1 − xi1′θ1) ; . . . ; ziM(yiM − xiM′θM) ].

• Define

  zi = diag(zi1, zi2, . . . , ziM)  ((Σₘ Lₘ) × M),  yi = (yi1, yi2, . . . , yiM)′  (M × 1), (4.19)

  where zi is block-diagonal and any blank entry of zi is zero.
• We then stack the data in the same way as for the panel data case:

  Z = [z1′ ; . . . ; zn′]  (nM × Σₘ Lₘ),  y = [y1 ; . . . ; yn]  (nM × 1). (4.20)
• Depending on the relationship between the different θm we may wish to organize the
xi in different ways.
• If θm = θ for all m, it follows that Km = K for all m. Then define the K × M matrix
xi = (xi1 , xi2 , . . . , xiM ).
• If each m has its own θm, with no overlap, define the block-diagonal arrangement

  xi = diag(xi1, xi2, . . . , xiM)  ((Σₘ Kₘ) × M). (4.21)
• In intermediate cases where different m have some elements of the coefficient vector θ
in common, but not all, do something like (4.21), but with those elements of different
xim that share a coefficient moved so that they are placed in the same row in xi .
• In all three cases we stack the observations i vertically to get

  X = [x1′ ; . . . ; xn′]  (nM × K), (4.22)

  where K is the length of the θ vector.
• The estimator then takes the same form as before: θ̂ = (X′ZWZ′X)⁻¹X′ZWZ′y.
• The covariance matrix of the moments is also as before (given the new definitions of the data matrices),

  V̂ = (1/n) Σᵢ₌₁ⁿ [zi(yi − xi′θ̂)][zi(yi − xi′θ̂)]′, (4.23)

  and finally

  Avar̂(θ̂) = { X′Z [ Σᵢ₌₁ⁿ (zi(yi − xi′θ̂))(zi(yi − xi′θ̂))′ ]⁻¹ Z′X }⁻¹. (4.24)
• The expression V̂ in (4.23) shows that we allow for an unrestricted (since it is deter-
mined by the data) correlation pattern across the moments of different equations.
• This subsection mentions some alternative choices of weighting matrix, which may
work better in some cases.
⁴ I.e. a consistent estimator of V⁻¹, typically obtaining θ̃ with (Z′Z/n)⁻¹ as the first-stage weighting matrix.
• Instead of the moment covariance matrix V̂ used above, one can use the centred moments to form the covariance matrix:

  V̂ = (1/n) Σᵢ₌₁ⁿ [g(wi, θ̂) − ḡ][g(wi, θ̂) − ḡ]′,

  where ḡ = (1/n) Σᵢ₌₁ⁿ g(wi, θ̂). (See Bruce Hansen: Econometrics, version January 2018, p. 385.)
• One disadvantage of the uncentred weighting matrix is that a moment whose sample value is far from zero at the preliminary estimates will receive a low weight with the resulting (estimated) optimal weighting matrix, even if its variance is not very large. Centring the moments is a remedy for this problem.
• The standard approach is to minimize the GMM objective, update the weighting matrix, minimize again, etc., possibly multiple times. Instead, it is possible to update the weighting matrix continuously, i.e. defining the estimator as:

  θ̂ = arg minθ [ Σᵢ₌₁ⁿ g(wi, θ) ]′ V̂(θ)⁻¹ [ Σᵢ₌₁ⁿ g(wi, θ) ], (4.25)

  where V̂(θ) is the (centred or uncentred) covariance matrix of the moments evaluated at θ. (See Bruce Hansen: Econometrics, version January 2018, p. 392.)
• Very often the moment is the product of instruments and an additive prediction error, i.e. the l-th entry in the L × 1 vector g(wi, θ) is

  zli[yi − f(xi, θ)].

• In this case, letting the weighting matrix be the diagonal matrix whose (l, l) entry is

  1 / ( Σᵢ₌₁ⁿ zli yi )²

  results in the estimator

  θ̂ = arg minθ Σₗ₌₁ᴸ [ Σᵢ₌₁ⁿ zli(yi − f(xi, θ)) / Σᵢ₌₁ⁿ zli yi ]², (4.26)
i.e. that minimizes the sum of the squared percentage deviations of the moments.
• The advantage of (4.26) is that it is intuitive, in the sense that the contribution of
each moment to the total value of the objective function is transparent. This may
for instance reveal that one moment is very large, and suggest ways of improving the
specification of the model so as to fit this moment better.
• This estimator can be used either as a first stage, to obtain estimates for the optimal
weighting matrix, or for the final estimates. (See Low and Meghir (2017): The Use of
Structural Models in Econometrics, Journal of Economic Perspectives, p. 52-53.)
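The objective (4.26) can be sketched for a toy linear f (all data and parameter values invented; with L = K = 2 the moments can be driven to zero, so the minimizer coincides with the exactly identified IV estimate):

```python
# Sketch of the 'percentage deviation' objective (4.26) for a toy linear f.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 5_000
z = rng.normal(size=n)                     # instrument
x = z + rng.normal(size=n)                 # regressor
y = 1.0 + 2.0 * x + rng.normal(size=n)     # invented 'true' parameters (1, 2)
Z = np.column_stack([np.ones(n), z])       # L = 2 moments

def f(theta):
    return theta[0] + theta[1] * x

def objective(theta):
    m = Z.T @ (y - f(theta))               # Σ_i z_li (y_i - f(x_i, θ))
    scale = Z.T @ y                        # Σ_i z_li y_i
    return np.sum((m / scale) ** 2)        # sum of squared % deviations

theta_hat = minimize(objective, x0=np.zeros(2), method="Nelder-Mead").x
```

Printing the individual terms `(m / scale) ** 2` at the estimates shows each moment's contribution to the objective, which is the transparency the text describes.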
• You may find this section useful to get some sense of where the large-sample properties of the GMM estimator come from, but no further use will be made of this material, so it can be skipped.
• By the Central Limit Theorem, √n times the sample average of the moments has an asymptotic normal distribution⁵

  √n [ ḡ(θ0) − 0 ] →d N(0, V ), (4.27)
• Substituting (4.29) into (4.28) gives

  0 ≈ [∂ḡ(θ̂)/∂θ′]′ W ḡ(θ0) + [∂ḡ(θ̂)/∂θ′]′ W [∂ḡ(θ0)/∂θ′] (θ̂ − θ0),

  so that

  √n (θ̂ − θ0) ≈ −{ [∂ḡ(θ̂)/∂θ′]′ W [∂ḡ(θ0)/∂θ′] }⁻¹ [∂ḡ(θ̂)/∂θ′]′ W √n ḡ(θ0)
             ≈ −(G′WG)⁻¹ G′W √n ḡ(θ0),

  where the last line holds for large n, since ∂ḡ(θ̂)/∂θ′ and ∂ḡ(θ0)/∂θ′ both converge in probability to G.
• Finally, by the delta method (which is the asymptotic version of the fact that z ∼ N(0, Σ) implies Cz ∼ N(0, CΣC′), where C is a matrix and z a vector), (4.27) implies

  √n (θ̂ − θ0) →d N(0, (G′WG)⁻¹G′WVWG(G′WG)⁻¹) = N(0, A⁻¹BA⁻¹).
• This section attempts to provide some intuition for Requirement 4. of the GMM
estimator.
• Here ∂g(wi, θ)/∂θ′ = −zixi′, so the requirement 4. is that the following 3 × 3 matrix have full rank:

  E(zixi′) =
    1        E(xi2)       E(xi3)
    E(zi2)   E(zi2xi2)    E(zi2xi3)
    E(zi3)   E(zi3xi2)    E(zi3xi3)
• If we subtract E(zi2) times the first row from the second row, and E(zi3) times the first row from the third row, we get a matrix with the same rank:

    1   E(xi2)          E(xi3)
    0   Cov(zi2, xi2)   Cov(zi2, xi3)
    0   Cov(zi3, xi2)   Cov(zi3, xi3)
• This matrix fails to have rank 3 if, for instance:

  – One of the variables xi2 or xi3 has zero covariance both with zi2 and zi3.
  – One of the instruments zi2 or zi3 has zero covariance both with xi2 and xi3.
– More generally, there is a γ such that Cov(zil , xi3 ) = γCov(zil , xi2 ) for l = 2, 3 —
i.e. both instruments are related to xi2 and xi3 in the same way.
• Loosely speaking, we can sum up the rank condition as requiring there to be at least
as many independent sources of variation — i.e. relationships between a moment and
a parameter — as there are parameters to estimate.
5 Discrete-choice demand
• Consider again the utility model (1.1),

  uij = xjβ − αpj + eij. (5.1)

• If we observe demographics zi for each consumer i, we can let

  uij = xjβ − αpj + g(zi, xj, pj, γ) + ẽij, (5.2)

  where the function g allows for interactions between zi and product attributes (xj, pj), γ are parameters to be estimated, and ẽij is the remaining error term.

• If instead we have no information about individual consumers, we can let eij = ξj + εij, where ξj is a component common to all consumers and εij is a random term, i.i.d. across i, and often but not necessarily i.i.d. across j, so

  uij = xjβ − αpj + ξj + εij. (5.3)
• To use (5.3) for estimation, we need to transform it into something that corresponds to our dependent variable, which is the market share, sj. The model’s predicted probability that a consumer chooses j is

  P(j|θ, ξ) = ∫ 1[j = arg maxₖ (xkβ − αpk + ξk + εik)] f(εi) dεi, (5.4)

  where f is the joint density function of the random variables εi = (εi1, . . . , εiJ)′.
• Since the only consumer-specific term in utility is εi, and f is assumed to be the same for all i, the probability is the same for all consumers, so we have, for the model’s predicted market share,

  sj(θ, ξ) = P(j|θ, ξ). (5.5)
• By contrast, if we observe demographics zi, assume that ẽij = ξj + εij in (5.2), and let θ = (α, β, γ), we get

  P(j|i, θ, ξ) = ∫ 1[j = arg maxₖ (xkβ − αpk + g(zi, xk, pk, γ) + ξk + εik)] f(εi) dεi. (5.6)
• Broadly speaking, estimation will take the form of choosing the value of the parame-
ters θ and the error term ξ so that the model’s predictions match the corresponding
observations.
• If we have only product-level data in the form of market shares sj for each product,
we will choose θ and ξ to make P (j|θ, ξ) close to sj .
• We can frame this problem as a standard case of maximizing utility subject to a budget
constraint.
• Let z be the quantity consumed of all other goods. We normalize the price of z to one.
  max over (q1, . . . , qJ, z) of U(q1, . . . , qJ, z)

  s.t.  Σⱼ₌₁ᴶ pjqj + z = yi,
        qj ∈ {0, 1} for all j,
        Σⱼ₌₁ᴶ qj = 1.
• Suppose the utility function is

  U(q1, q2, . . . , qJ, z) = Σⱼ₌₁ᴶ (xjβ + eij)qj + αz.

• Substituting in the budget constraint z = yi − Σⱼ₌₁ᴶ pjqj, we get, for each j, the conditional (on j) indirect utility functions:

  vij = xjβ + eij + α(yi − pj).
• The final step in the utility maximization problem is then to choose the j that has the highest conditional indirect utility.

• Since the term αyi does not affect this choice, we drop it for each j by defining

  uij = xjβ − αpj + eij.
• For instance, if

  U(q1, q2, . . . , qJ, z) = z^α Πⱼ₌₁ᴶ exp[(xjβ + eij)qj],

  then substituting in the budget constraint and taking logs gives conditional indirect utilities of the form α ln(yi − pj) + xjβ + eij, etc.
• Note that in this case the income term yi no longer cancels from the comparison across
j.
5.2 The outside good and normalizations
• We designate one of the alternatives as the ‘outside good’ denoted j = 0, so that there
is a total of J + 1 alternatives when we count the outside good.
• Usually the outside good is the alternative of not buying any of the products in the
market, but it could also be ’the other products’ aggregated into one alternative if we
only model alternatives above a minimum market share.
• Whether we have individual-level data or not, we always assume that the term ξj has
the value zero for j = 0:
ξ0 = 0.
• It is also common to assume that the remaining parts of uij , other than εij , are also
zero for j = 0, so that ui0 = εi0 .
• Either of these assumptions is a normalization. That is, they do not limit the pattern of choices that the model can predict, but rather ensure that each set of parameters and ξ’s corresponds to a different choice prediction.
• To see this more clearly, suppose we did not impose ξ0 = 0. Then we could add any
constant c to ξ0 as well as to all the other ξj , and predicted choices would be unchanged.
• We could also multiply every utility uij by some c > 0 without changing the model’s
predictions.
• This implies that we can impose another normalization, which is usually achieved by
fixing the variance of εi0 .
• It is also common to impose that all j have the same variance, but this is an actual
restriction, not a normalization.
5.3 The role of εij and common choices for its distribution
• Suppose for a moment that there is no εij -term in our model (so εij = 0 with probability
1 for all i and j).
• Then, with product-level data only, our model’s prediction for whether the consumer chooses j is simply 1(j = arg max_k (xk β − αpk + ξk)).
• Since this indicator function takes on the same value for all i, and equals 1 for the
‘best’ j and 0 for all other j, the model would predict that all consumers choose the
product with the highest value for xj β − αpj + ξj .
• The heterogeneity represented by this random shock is therefore necessary to get pos-
itive choice probabilities for all alternatives j.
• With individual-level data, εij allows for the possibility that there are still consumer-
specific unobservables that affect choices. Otherwise our model would imply that all
consumers with the same zi choose the same alternative.
• The most common choice of distribution for εi = (εi0, . . . , εiJ)' is that they are i.i.d. type-1 extreme value, in which case the integral in (5.4) has the following analytical solution:6
P (j|θ, ξ) = ∫ 1(j = arg max_k (xk β − αpk + ξk + εik)) f (εi) dεi
           = exp(xj β − αpj + ξj) / (1 + Σ_{j'=1}^J exp(xj' β − αpj' + ξj'))   (5.7)
P (0|θ, ξ) = 1 / (1 + Σ_{j'=1}^J exp(xj' β − αpj' + ξj')).   (5.8)
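As a quick numerical illustration (a minimal sketch with made-up numbers, not code from the notes), the logit formulas (5.7)-(5.8) can be computed directly:

```python
import numpy as np

def logit_probs(x, p, xi, alpha, beta):
    """Logit choice probabilities (5.7)-(5.8).

    x: J x K matrix of characteristics, p: J-vector of prices,
    xi: J-vector of unobserved quality. Returns (P0, P1, ..., PJ).
    """
    v = x @ beta - alpha * p + xi        # mean utility of each inside good
    ev = np.exp(v)
    denom = 1.0 + ev.sum()               # the 1 is exp(0) for the outside good
    return np.concatenate(([1.0], ev)) / denom

# Hypothetical data: J = 2 products, K = 2 characteristics
P = logit_probs(np.array([[1.0, 0.0], [0.0, 1.0]]),
                np.array([0.5, 0.4]), np.array([0.0, 0.1]),
                2.0, np.array([1.0, 0.5]))
```

By construction the J + 1 probabilities are strictly positive and sum to one, and raising a product's price lowers its own choice probability.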
• The other common choice of distribution is multivariate normal with mean zero and an estimated (symmetric) covariance matrix (but with one variance normalized to one):
εi = (εi0, εi1, . . . , εiJ)' ∼ N (0, C),
C = [ 1    σ01  . . .  σ0J ;
      σ01  σ11  . . .  σ1J ;
      . . .
      σ0J  σ1J  . . .  σJJ ].
• It is of course possible to restrict some or all of the σjk -parameters to be zero, or equal
to each other, to reduce the number of parameters one needs to estimate.
• Usually the full covariance matrix of a probit model is only estimated in settings where
J is very small, like 2 or 3.
• For the probit model there is no analytical solution to the integral in (5.4), so it must
be approximated by a technique called simulation.
6 See Train (2009): Discrete-choice methods with simulation, 2nd ed., for a proof of this statement, and also for more discussion of normalizations and the distribution of εi.
• Software such as Matlab has inbuilt functions that let us take “draws”, i.e. realizations, from random variables with a chosen distribution.
• Suppose we have taken R (independent) draws νi^(r) from a (J + 1)-dimensional standard normal distribution.
• Then, if AA' = C is the Cholesky decomposition of the covariance matrix C, the R vectors εi^(r) = Aνi^(r) (each of length J + 1) are draws from N (0, C).
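A minimal sketch of this simulation step (the covariance matrix and mean utilities below are hypothetical): draw standard normals, multiply by the Cholesky factor, and average argmax indicators to approximate probit choice probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance matrix C for (eps_i0, eps_i1, eps_i2),
# with the first variance normalized to one
C = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.5, 0.4],
              [0.1, 0.4, 2.0]])
A = np.linalg.cholesky(C)            # A @ A.T reproduces C

R = 100_000
nu = rng.standard_normal((R, 3))     # R draws of a (J+1)-dim standard normal
eps = nu @ A.T                       # each row is a draw from N(0, C)

# Simulated probit probabilities for hypothetical mean utilities v
v = np.array([0.0, 0.5, -0.2])
choices = np.argmax(v + eps, axis=1)          # utility-maximizing j per draw
P_hat = np.bincount(choices, minlength=3) / R
```

The sample covariance of the draws should be close to C, and the simulated probabilities sum to one by construction.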
5.4 The role of ξj in fitting the model to the data
• The ξj -term plays the double role of (i) ensuring that the model fits the data, and (ii) allowing for unobserved demand shocks that are correlated with price. We will return to (ii) later, and focus on (i) in this subsection.
• Suppose we can take repeated draws, i = 1, . . . , n from this distribution, each of which
results in a choice of one j between 1 and J. (We assume that the probabilities ρj
remain unchanged as we take these draws).
• Let Xj be the random variable representing the number of times alternative j is chosen out of a total of n draws. The random variables Xj (for each j) are said to have a multinomial distribution.
• Random variables with a multinomial distribution have mean and variance E(Xj ) =
nρj and Var(Xj ) = nρj (1 − ρj ).
• Consider some differentiated products market, such as that for new cars. Assume there
are J different alternatives and that each buyer chooses exactly one of them.
• Suppose the probability that a randomly picked buyer of a new car chooses alternative
j is ρj = Prob(j).
• Then, if we take a random sample of size n of buyers of new cars, the number of buyers
of alternative j is given by the multinomial random variable Xj .
• Therefore the observed market share sj in a sample of size n is Xj /n and has mean and variance
E(sj) = ρj   and   Var(sj) = ρj (1 − ρj)/n.
• It follows from this that the variance Var(sj ) becomes negligible for large n, so that
the observed market share will in practice be exactly equal to the probability ρj when
n is on the order of several million, for instance.
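This shrinking of the sampling error can be checked by simulation (a sketch with made-up probabilities ρ):

```python
import numpy as np

rng = np.random.default_rng(42)
rho = np.array([0.5, 0.3, 0.2])        # true choice probabilities

# Observed shares s = X/n for a small and a large number of consumers.
# Var(s_j) = rho_j (1 - rho_j) / n, so s is close to rho only for large n.
s_small = rng.multinomial(100, rho) / 100
s_large = rng.multinomial(1_000_000, rho) / 1_000_000
```

With a million consumers the largest deviation of sj from ρj is on the order of the standard deviation sqrt(ρj(1 − ρj)/n) ≈ 0.0005, so the observed shares are in practice equal to the probabilities.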
• With individual-level data, on the other hand, it is clear that the “market share” Xj
for a single individual, which is either 0 or 1, has a substantial variance, called sampling
error, and that it will therefore not equal ρj .
• When modelling a discrete-choice situation, for instance with a model like (5.4), we
are trying to model the probability ρj .
• Concretely, we assume that our model is correct, in the sense that there are ‘true
values’ of (θ, ξ), which we denote (θ0 , ξ0 ), such that
ρj = P (j|θ0 , ξ0 ), (5.10)
• Then, when the observed market shares sj have been generated from a large enough
sample of consumers that Var(sj ) = 0, so sj = ρj , the assumption (5.10) that our
model is correct implies that
sj = P (j|θ0 , ξ0 ). (5.11)
• With individual-level data, the claim that our model (5.6) is correct now amounts to the statement
ρj (zi) = P (j|i, θ0, ξ0).   (5.12)
• But since Var(dij |zi) > 0, dij ≠ E(dij |zi) = ρj (zi), so that even if (5.12) holds (i.e. the model is correct), we get
dij ≠ P (j|i, θ0, ξ0).   (5.13)
• We have shown that when market shares come from a large number of consumers, the
absence of sampling error implies that at the true parameter values, the model must
correctly predict market shares.
• However, no matter what values are chosen for (α, β), a model without the ξj -term would typically not have enough degrees of freedom (enough “moving parts”) to allow
sj = P (j|θ, ξ = 0)
to hold for every j, just as a linear regression model without an error term,
yi = xi θ,
cannot fit every observation exactly, apart from in the extreme case where n is equal to the number of columns (and rank) of xi.
• Another way of saying that we cannot fit the observed market shares when ξ = 0 is that
sj = P (j|θ, ξ = 0) + ωj   (5.14)
for some error terms ωj that are not all zero.
• The problem with this is that it is very hard to justify the presence of ωj , since it
cannot, as we have seen, be explained by sampling error.
• One might instead try to justify ωj as reflecting some tendency for consumers to prefer or dislike j that is not captured by the model. But this kind of justification for ωj would contradict the structural starting point of our model, i.e. the claim that xj β − αpj + εij represents the surplus (indirect utility) the consumer gets from choosing j.
• That is, if there were such a tendency for consumers to prefer or dislike j, it should show up in uij, not be added onto the choice probability.
• And this is precisely the function served by ξj ; it represents a tendency for consumers
to prefer or dislike j, and in line with this structural interpretation, it enters the utility
function (5.3).
• With product-level data, ξj is really just another error term, like ei in the linear regression model yi = xi θ + ei, that is, whatever needs to be added to xi θ to get yi:
ei = yi − xi θ;   (5.15)
so ei is not really just another variable; it is simply whatever is needed to fill in the difference between xi θ and the dependent variable yi.
• In exactly the same way, ξ is whatever is needed to make the model’s predicted choice probabilities match the observed market shares:
sj = P (j|θ, ξ)   (5.16)
• The only difference is that our structural interpretation of uij and P (j|θ, ξ) dictates that we can have an error term inside the former, but not added onto the latter.
• In conclusion, then, ξ is a vector of error terms defined as the solutions to the system
of equations given by (5.16) for each j, just like the vector (e1 , . . . , en ) of error terms
in the linear regression model is the solution to (5.15) for each i.
where sj = (1/n) Σ_{i=1}^n dij.
5.5 Estimation
• In the case of product-level data, by defining ξ through (5.16), we exclude the possibility of estimating θ with GMM using moment conditions of the form
zj [sj − P (j|θ, ξ(θ))],
since the value would be zero for each j and any value of θ.
• The analogy from the linear regression model would be to use zi [yi − (xi θ + ei )] as a
moment. Again it would be zero for all i and all values of θ.
• With individual-level data, on the other hand, we can either
1. estimate both θ and ξ as parameters (i.e. without distinguishing between the two in any particular way) with GMM, using moments
zi [dij − P (j|i, θ, ξ)],
or, alternatively,
2. let ξ(θ) be defined as an implicit function of θ by (5.17), and estimate θ only by GMM, using moments
zi [dij − P (j|i, θ, ξ(θ))],
where dij is one for the particular j chosen by i and zero for all other j.
• An alternative is maximum likelihood. With individual-level data the likelihood function is
L^ind (θ, ξ) = Π_{i=1}^n Π_{j=1}^J P (j|i, θ, ξ)^{dij}.
With product-level data, the number of consumers choosing j is Σ_i dij = n times the market share sj. We then get the simplification
L^prod (θ, ξ) = Π_{j=1}^J P (j|θ, ξ)^{Σ_i dij} = Π_{j=1}^J P (j|θ, ξ)^{n sj}.
• We can take the natural logarithm of the likelihood function to get, for individual-level and product-level data, respectively:
log L^ind (θ, ξ) = Σ_{i=1}^n Σ_{j=1}^J dij log P (j|i, θ, ξ)
log L^prod (θ, ξ) = n Σ_{j=1}^J sj log P (j|θ, ξ),
where the scaling by n can be ignored, since it does not affect the value of θ that
maximizes the function.
• With individual-level data, we can again either
1. estimate both θ and ξ as parameters (i.e. without distinguishing between the two in any particular way) with maximum likelihood, so
(θ̂, ξ̂)_ML = arg max_{θ,ξ} log L^ind (θ, ξ),
or, alternatively,
2. let ξ(θ) be defined as an implicit function of θ by (5.17), and estimate θ only by maximum likelihood:
θ̂_ML = arg max_θ log L^ind (θ, ξ(θ)).
• With product-level data, defining ξ by (5.16) causes the same problem in maximum likelihood estimation as with GMM: the model’s prediction P (j|θ, ξ(θ)) does not depend on the value of θ, since ξ adjusts to ensure that the model’s prediction always equals sj. That is, we get
log L^prod (θ, ξ(θ)) = n Σ_{j=1}^J sj log P (j|θ, ξ(θ)) = n Σ_{j=1}^J sj log sj
for all θ.
• Note that setting ξ = 0 is problematic under maximum likelihood estimation for the
same reason as discussed above for GMM: sampling error cannot explain why (5.16)
does not hold, and nothing else can either.
• However, since maximum likelihood estimation does not rely on explicitly defining an
error term like ωj in (5.14), there is a greater danger that this fact is overlooked.
• But in addition to the discrepancy between observed and predicted market shares, the maximum likelihood estimator
θ̂ = arg max_θ log L^prod (θ, ξ = 0)
is not consistent, because it does not take into account the fact that price is endogenous.
• Intuitively, with ξ = 0, there is nothing that can explain why some products with high
price also have high market shares, other than a low price sensitivity. Therefore α will
be (asymptotically) biased towards zero.
• The same would be true for other restrictions on ξ such as assuming that it is inde-
pendently distributed across j.
• Since the suggested GMM and maximum-likelihood estimators both fail for product-level data, an alternative approach is needed: GMM estimation of θ only, with moments
zj ξj (θ).   (5.20)
• Note how this is again equivalent to the linear regression case. The equation (5.16)
serves only to say that the error terms ξj should ensure exact fit with the dependent
variable, just like the equation yi = xi θ + ei does for the linear regression model.
• In both cases, these definitions of the error term in themselves provide no information
about the value of θ, and additional restrictions are needed, in the form of orthogonality
assumptions between exogenous variables (instruments) and the residuals: E[zj ξj (θ)] =
0 for the structural demand model, and E[zi ei (θ)] = 0 for the linear regression model
(where ei (θ) = yi − xi θ).
5.6 Finding ξj
• The conceptually simplest, but in practice often the least convenient, alternative, is to
use some standard numerical algorithm to solve the system of equations given by the
equations (5.16) for each j. (For an example, see Goolsbee and Petrin (2004): The
Consumer Gains from Direct Broadcast Satellites and the Competition with Cable TV,
Econometrica.)
• If εij has a type-1 extreme value distribution, so the choice probabilities are given by (5.7) and (5.8), there is an analytical solution to (5.16):7
ξj = ln(sj /s0) − (xj β − αpj).
• Another approach is slightly more involved than the previous one, but more widely applicable.
• It relies on the fact that for some commonly used discrete-choice models, for any value of θ, the function g : R^J → R^J defined as
g(ξ) = ξ + ln s − ln P (θ, ξ),
where s = (s1, . . . , sJ)' and P (θ, ξ) = [P (1|θ, ξ), . . . , P (J|θ, ξ)]', is a so-called contraction (or contraction mapping).8 (This is proved in BLP (1995).)
• Being a contraction implies that:
1. g has a unique fixed point ξ*, i.e. a point such that ξ* = g(ξ*), and
2. for any starting point ξ^1 ∈ R^J, the sequence (ξ^t) defined by the recursive relation ξ^{t+1} = g(ξ^t) converges to the fixed point ξ*.
• To verify the analytical solution above, substitute ξj = ln(sj /s0) − (xj β − αpj) into (5.7):
P (j|θ, ξ) = (sj /s0) / (1 + Σ_{j'=1}^J sj' /s0) = sj / (s0 + Σ_{j'=1}^J sj') = sj /1 = sj.
8 See e.g. Rudin: Principles of Mathematical Analysis, 3rd ed., p. 220.
• Let τ > 0 be the desired accuracy. Then there is an integer T such that t > T implies max_j |sj − P (j|θ, ξ^t)| < τ, in other words such that (5.16) is satisfied to whatever level of accuracy we desire.
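A sketch of this fixed-point iteration for the logit case (hypothetical shares and mean utilities; the iteration ξ ← ξ + ln s − ln P(θ, ξ) is the contraction discussed above):

```python
import numpy as np

def logit_P(delta_bar, xi):
    """Logit probabilities (5.7), with delta_bar_j = x_j beta - alpha p_j."""
    ev = np.exp(delta_bar + xi)
    return ev / (1.0 + ev.sum())

def solve_xi(s, delta_bar, tol=1e-12, max_iter=1000):
    """Iterate xi <- xi + ln s - ln P(xi) until (5.16) holds to tolerance tol."""
    xi = np.zeros_like(s)
    for _ in range(max_iter):
        xi_new = xi + np.log(s) - np.log(logit_P(delta_bar, xi))
        if np.max(np.abs(xi_new - xi)) < tol:
            return xi_new
        xi = xi_new
    raise RuntimeError("contraction did not converge")

# Hypothetical market shares (outside share 0.4) and mean utilities
s = np.array([0.2, 0.3, 0.1])
delta_bar = np.array([0.5, -0.1, 0.2])
xi = solve_xi(s, delta_bar)
```

In the logit case the result can be checked against the analytical solution ξj = ln(sj/s0) − (xj β − αpj); for probit or random-coefficients models the same loop applies with the appropriate P.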
• The final approach is conceptually slightly different, because it does not involve finding
ξ(θ) for every value of θ attempted during the optimization of the GMM or likelihood
objective.
• Instead it recasts the problem as a minimization problem with respect to both θ and
ξ, subject to the constraint that (5.16) hold (at the solution), and relies on established
algorithms for solving such problems to solve
(θ̂, ξ̂) = arg min_{θ,ξ} F (θ, ξ)   subject to   s = P (θ, ξ),
where F is the GMM objective function or the negative of the log likelihood. (See
Dube, Fox and Su (2012), Econometrica.)
5.7 Random coefficients
• As mentioned briefly in section 1.3, we are often interested in the extent to which consumers switch their purchases to other firms in response to an increase in price.
• Given the model (5.3) with εi type-1 extreme value, the resulting choice probabilities (5.7) (which for simplicity we now write as Pj = P (j|θ, ξ)) have price derivatives
∂Pj /∂pj = −αPj (1 − Pj),
and for k ≠ j,
∂Pj /∂pk = αPj Pk.
• The cross-price derivatives show that in this model, substitution from j to k is proportional to the market share of k.
• This follows from the fact that εi is i.i.d.: consumers who leave j after its price goes up
have the same probability of preferring k as any randomly chosen consumer. Therefore,
the share of these consumers who will switch to k is equal to Pk .
• This implication of the model is often unrealistic, because we would expect substitution
to be stronger between products that are similar.
• The probit model with estimated covariance matrix is one way of getting this effect,
but when J is large, it may not be feasible.
• An alternative is to relax the assumption that all consumers have the same coefficient
on xj and instead allow it to vary across consumers as indicated by writing βi .
• One way of doing this is to assume that for each consumer, βi is drawn from a multivariate normal distribution, i.i.d. across consumers, with estimated mean β and standard deviations given by the diagonal matrix σ, so that
βi = β + σνi (5.22)
where νi has a multivariate standard (i.e. zero mean and identity covariance matrix)
normal distribution, with p.d.f. denoted by φ.
• The resulting choice probability is the integral
Pj = ∫ Pj (νi) φ(νi) dνi,
where Pj (νi) is the logit probability (5.7) with β replaced by βi = β + σνi.
• As discussed in the context of the probit model, the integral can be simulated as
Pj ≈ (1/R) Σ_{r=1}^R Pj (ν^(r)),   (5.25)
where R is the number of vectors ν^(r) drawn from the standard normal distribution.
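A sketch of the simulator (5.25) with made-up data (all names and numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical market: J = 3 products, K = 2 characteristics
x = np.array([[1.0, 0.2], [1.0, 0.8], [0.0, 0.5]])
p = np.array([1.0, 1.2, 0.8])
xi = np.array([0.1, -0.2, 0.0])
alpha, beta = 1.5, np.array([0.8, 0.5])
sigma = np.array([0.5, 1.0])          # std. devs of the random coefficients

R = 5_000
nu = rng.standard_normal((R, 2))      # draws nu^(r)

def P_of_nu(nu_r):
    """Logit probabilities for one draw, with beta_i = beta + sigma * nu_r (5.22)."""
    v = x @ (beta + sigma * nu_r) - alpha * p + xi
    ev = np.exp(v)
    return ev / (1.0 + ev.sum())

P = np.mean([P_of_nu(nu[r]) for r in range(R)], axis=0)   # simulator (5.25)
```

Each draw produces a valid vector of logit probabilities, so the average is itself a valid probability vector (inside-good shares summing to less than one).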
• We then get:
∂Pj /∂pk ≈ (1/R) Σ_{r=1}^R ∂Pj (ν^(r))/∂pk = (1/R) Σ_{r=1}^R αPj (ν^(r))Pk (ν^(r)).
• The cross-price derivative is thus the average of the products Pj (ν^(r))Pk (ν^(r)), and this average is higher when Pj (ν) and Pk (ν) are positively correlated across draws of ν.
• To see this, first note that a different way of saying that two variables are positively
correlated is that they tend to be either both large or both small.
• For instance, suppose X and Y are two random variables with discrete support {0.1, 0.9}
and probability 0.5 for each outcome.
• If they are always both high or both low, their expected product is E(XY ) = 0.5(0.9)2 +
0.5(0.1)2 = 0.41.
• If they are never both high or both low, their expected product is E(XY ) = 0.5(0.9)(0.1) + 0.5(0.1)(0.9) = 0.09.
• In the same way, if two products j and k have similar product characteristics, xj σνi
and xk σνi will be positively correlated across different values of νi .
• In turn, Pj (νi ) and Pk (νi ) will be positively correlated, giving a high average product
Pj (νi )Pk (νi ) and therefore a high cross-price derivative.
• This explains why random coefficients on the product characteristics xj allow similar
products to have higher cross-price derivatives than products with different attributes.
• Note that a similar effect can be obtained by interacting product characteristics with
observed consumer attributes zi , as in (5.2).
• However, even if such data are available, it is common to include random coefficients,
because it is unlikely that zi includes all determinants of tastes βi , in which case the
random coefficients capture remaining heterogeneity in consumers’ tastes for xj .
• Given a weighting matrix W, we can use a numerical optimization algorithm to solve
min_θ [Σ_{j=1}^J zj ξj (θ)]' W [Σ_{j=1}^J zj ξj (θ)].   (5.26)
• However, we can make our computations more efficient by treating the ‘fixed’ param-
eters (α, β) differently from those, σ, that multiply a quantity, νi , that varies across
consumers.
• Define
δj = xj β − αpj + ξj , (5.27)
i.e. the sum of all utility components that are fixed across consumers.
• So far we have used the equation (5.16) to implicitly define the function ξ(θ).
• But we can just as well define an implicit function δ(σ) with the system of J equations
sj = Pj = ∫ exp(xj σνi + δj) / (1 + Σ_{j'=1}^J exp(xj' σνi + δj')) φ(νi) dνi.   (5.28)
• Since δj absorbs xj β −αpj , α and β do not appear anywhere in this system of equations.
• Therefore the value of δ that solves the system of equations does not depend on α and
β, but only on σ.
• All the techniques discussed in section 5.6 for finding ξ also work for finding δ.
• If we let X be a matrix that vertically stacks the vectors [xj, −pj] for each observation j, and Z does the same for zj, and write θ1 = σ and θ2 = (β', α)', our estimator for (θ̂1, θ̂2) is
(θ̂1, θ̂2) = arg min_{θ1,θ2} [(δ(θ1) − Xθ2)'Z] W [Z'(δ(θ1) − Xθ2)].   (5.29)
• For each value of θ1, the optimal choice of θ2, written as θ̃2, will satisfy the first-order condition
0 = ∂/∂θ2 {[(δ(θ1) − X θ̃2)'Z] W [Z'(δ(θ1) − X θ̃2)]}
  = −2X'ZW [Z'(δ(θ1) − X θ̃2)]
X'ZW Z'X θ̃2 = X'ZW Z'δ(θ1)
θ̃2 (θ1) = (X'ZW Z'X)^{−1} X'ZW Z'δ(θ1),   (5.30)
where in the last line the notation reflects the fact that θ̃2 is a function of θ1.
• Since we now have an analytical expression for θ̃2 in terms of θ1, we can solve the problem (5.29) by using a numerical optimization algorithm to find θ̂1 and then use (5.30) to find θ̂2:
θ̂1 = arg min_{θ1} [(δ(θ1) − X θ̃2 (θ1))'Z] W [Z'(δ(θ1) − X θ̃2 (θ1))]   (5.31)
θ̂2 = θ̃2 (θ̂1).   (5.32)
• Note that (5.30)-(5.32) and (5.29) define the same estimator. All we have done is to
derive an alternative calculation method, sometimes called concentrating out the linear
parameters.
• This analysis also applies to the special case where θ1 = 0, i.e. where there are
no random coefficients (or other coefficients on utility components that vary across
consumers).
• But now there are no longer any nonlinear parameters θ1 that affect δj, which is now simply the solution to
sj = exp(δj) / (1 + Σ_{j'=1}^J exp(δj')).   (5.33)
• The expression (5.30) for the linear parameters θ2 is now modified slightly by the fact that δ is not a function of any θ1, so we get
θ̂2 = (X'ZW Z'X)^{−1} X'ZW Z'δ,   (5.34)
where (5.33) now has the straightforward solution
δj = log(sj /s0).
• Finally, it is useful to note that (5.34) is simply the GMM estimator for the linear model
δj = xj β − αpj + ξj,
with moment conditions E(zj ξj) = 0, as can easily be seen from section 4.4, and (4.13) in particular.
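For the pure logit model, the whole estimator fits in a few lines (hypothetical shares and regressors; setting Z = X and using a square X makes this the exactly-fitting case):

```python
import numpy as np

# Hypothetical product-level data for J = 3 products
s = np.array([0.20, 0.25, 0.15])
s0 = 1.0 - s.sum()                       # outside-good share
delta = np.log(s / s0)                   # analytical solution to (5.33)

# Columns of X: constant, characteristic, minus price
X = np.array([[1.0, 0.5, -1.0],
              [1.0, 2.0, -1.5],
              [1.0, 1.0, -3.0]])
Z = X                                    # pretend price is exogenous here
W = np.eye(Z.shape[1])

XZ = X.T @ Z
theta2 = np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ Z.T @ delta)   # (5.34)
xi = delta - X @ theta2                  # implied demand errors
```

With as many parameters as products the fit is exact, so the implied ξ are zero; plugging δ back into (5.33) reproduces the observed shares.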
• In section 1.2 we saw how the prices set by a firm must satisfy first-order conditions
for profit maximization that depend on which products the firm owns.
• Defining quantity as the number of consumers times the choice probability, Qj = nPj, and noting that the market size n cancels, the first-order conditions (1.4) and (1.6) for single-product and two-product firms, respectively, are
0 = Pj + (pj − mcj) ∂Pj /∂pj,   j = 1, 2   (5.35)
0 = Pj + (pj − mcj) ∂Pj /∂pj + (pk − mck) ∂Pk /∂pj,   j = 1, 2; k ≠ j.   (5.36)
• In a given market for differentiated products, let Ff be the set of products owned by firm f, and let f (j) denote the firm that owns product j.
• When (5.35) is satisfied for j = 1, 2, each firm’s price is its best response to the other
firm’s price — that is, the prices (p1 , p2 ) are a Nash equilibrium in a game where firms
choose prices to maximize profits.
• In principle this game could have more than one Nash equilibrium, but it is a standard
assumption (which does not appear to be violated in practice) that (5.35) has a unique
solution.
• Define the ownership matrix
Own = [ 1(f (1) = f (1))  1(f (2) = f (1)) ;
        1(f (1) = f (2))  1(f (2) = f (2)) ],
whose (j, k) entry is an indicator function for whether product k is owned by the same firm as product j.
• Every diagonal entry (j, j) is necessarily one, and the matrix is symmetric (i.e. the
entries (k, j) and (j, k) are equal).
• When J = 2, the ownership matrix must take one of two forms, corresponding to two single-product (“sp”) firms or one two-product (“tp”) firm, respectively:
Own^sp = [ 1 0 ; 0 1 ],   Own^tp = [ 1 1 ; 1 1 ].
• Next define vectors of prices, marginal cost and choice probabilities, as well as the (Jacobian) matrix of partial derivatives with respect to prices:
p = (p1, p2)',   c = (mc1, mc2)',   P = (P1, P2)',
∇p P = [ ∂P1 /∂p1  ∂P1 /∂p2 ;  ∂P2 /∂p1  ∂P2 /∂p2 ].
• Define Ω^sp = Own^sp ⊙ (∇p P)' and Ω^tp = Own^tp ⊙ (∇p P)', where ⊙ denotes element-by-element multiplication, so that the (j, k) entry of Ω is 1(f (k) = f (j)) ∂Pk /∂pj. The system of first-order conditions (5.35) and (5.36) can now be written in matrix form as
0 = P + Ω^sp (p − c)   and   0 = P + Ω^tp (p − c),
respectively.
• All these matrices (and vectors) extend to arbitrary numbers of products J and ownership structures. For instance, if J = 4 and F1 = {1, 4} and F2 = {2, 3},
Own = [ 1 0 0 1 ;
        0 1 1 0 ;
        0 1 1 0 ;
        1 0 0 1 ],
and the general system of first-order conditions is
0 = P (p) + Ω(p)(p − c),   (5.37)
where the fact that both choice probabilities and their derivatives depend on prices has now been reflected in the notation.
• We can easily solve the system for the markups or marginal cost implied by the observed prices and estimated demand system (as given by P and ∇p P), under the assumption that the observed prices satisfy the first-order conditions:
p − c = −Ω(p)^{−1} P (p).
• Solving for new optimal prices (for instance if Own changes) is more complicated, since the system of equations (5.37) is nonlinear in p. But writing (5.37) in terms of prices is still conceptually useful:
p = c − Ω(p)^{−1} P (p).
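A sketch for a two-product logit market (made-up numbers), using Ω = Own ⊙ (∇p P)', i.e. Ωjk = 1(f (k) = f (j)) ∂Pk /∂pj:

```python
import numpy as np

alpha = 2.0
p = np.array([1.0, 1.2])                      # observed prices
delta = np.array([0.5, 0.3]) - alpha * p      # mean utilities at those prices
ev = np.exp(delta)
P = ev / (1.0 + ev.sum())                     # logit choice probabilities

# Logit Jacobian: dP_j/dp_j = -alpha P_j (1 - P_j), dP_j/dp_k = alpha P_j P_k
Jac = alpha * np.outer(P, P)
np.fill_diagonal(Jac, -alpha * P * (1.0 - P))

def implied_markup(Own):
    Omega = Own * Jac.T                       # Omega_jk = 1(same firm) dP_k/dp_j
    return -np.linalg.solve(Omega, P)         # p - c from 0 = P + Omega (p - c)

m_sp = implied_markup(np.eye(2))              # two single-product firms
m_tp = implied_markup(np.ones((2, 2)))        # one two-product firm
c_sp = p - m_sp                               # implied marginal costs
```

The two-product (merged) firm internalizes substitution between its products, so its implied markups exceed the single-product ones; for logit they reduce to the known closed forms 1/(α(1 − Pj)) and 1/(α s0).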
6 Berry, Levinsohn and Pakes (1995) [BLP]
6.1 Brief overview
• The paper proposes a feasible method for estimating the demand for differentiated products using product-level (market-share) data.
• The contribution of the paper is mainly methodological, so we will focus on how they
estimate their model, and not on their results.
6.2 Data
• 999 different car models in an unbalanced (not every model observed every year) panel
of 20 years (years denoted t) (1971-1990).9
• From other data sources the paper uses the following information:
– Mean mt (for each of the 20 years) and standard deviation σ̂y (common to all
years) of household income.
– Number Mt of households in the US in each year of the data.
9 It is stated in the paper (p. 869) that there are 997 models, but the replication by Gentzkow and Shapiro indicates that the actual number is 999. See https://www.brown.edu/Research/Shapiro/pdfs/blp replication.pdf and https://www.brown.edu/Research/Shapiro/data/blp 1995 replication.zip.
• Define market shares
sjt = qjt /Mt.
Let Jt be the set of all models sold in year t and define the ‘outside good’ j = 0 as the option of not choosing any j in Jt, i.e. of not buying a new car. Then
q0t = Mt − Σ_{j∈Jt} qjt,
or, dividing by Mt,
s0t = 1 − Σ_{j∈Jt} sjt.
where Air is an indicator for air conditioning, Size is width times length and T rend
is simply t, where we assume t = 1, . . . , 20.
• Let K1 be the length of x and K2 the length of w.
• For each jt, let xjt be the value of x, and let xjtk denote the k-th entry of x.
where νi = (νiy , νi1 , . . . , νiK1 )0 is a random vector with a multivariate normal distribu-
tion with mean zero and identity covariance matrix, whose probability density function
we denote by φ. The shocks εijt are i.i.d. type-1 extreme value.10
10 These equations differ from those on p. 868 in the paper in two respects: (1) the paper has ln(yit − pjt), and (2) we have normalized the utility of j = 0 to be zero apart from the logit shock. The latter is a normalization.
• The specification of income yit comes from the assumption that household income has
a lognormal distribution in the population.
• The observed market shares sjt allow us to define δjt as an implicit function of the data and the nonlinear parameters θ1 = (σ1, . . . , σK1, α):
• This is a system of equations whose number of unknowns and equations equals the
number of elements in the set Jt . There is one such system of equations for each year
of data.
• For each t, given a value of θ1 , we find δt (θ1 ) using the contraction method described
above.
• Once δjt (θ1) has been found, we have an expression for ξjt in terms of the data and parameters:
ξjt = δjt (θ1) − xjt β.   (6.6)
This defines the error term for the demand equation in BLP.
• The discussion in section 5.9 shows that the demand function combined with product
ownership has implications for the relationship between price and marginal cost for
each product.
• Defining, for one market t at a time, the combined ownership / price-derivative matrix Ωt as in section 5.9, the implied vector of markups bt = pt − mct is
bt = −Ωt^{−1} Pt,
Regarding (1), income is in many cases smaller than price, in which case the log specification is not defined. Again according to Gentzkow/Shapiro, it seems that the functional form pjt /yit was used in practice in BLP’s code.
where bt , prices pt and Pt are all column vectors of length equal to the number of
elements in Jt , and Ωt is a square matrix of corresponding dimensions. Pt is the vector
of choice probabilities given by (6.5).
• Since bt = pt − mct , we can write the marginal cost implied by the (1) observed prices,
(2) observed product ownership, and (3) choice probabilities (and their derivatives) as:
mct = pt − bt .
• Marginal cost is assumed to be constant (in output quantity) and its log a linear function of product characteristics:
ln mcjt = wjt γ + ωjt.   (6.7)
• Combining the implied markup bjt with this function for marginal cost, we get
ωjt = ln(pjt − bjt) − wjt γ.   (6.8)
This defines the error term for the supply equation in BLP.
6.5 Moments
• Suppose we have two row vectors of instruments z1jt and z2jt , of size 1 × L1 and 1 × L2 ,
respectively.
• Also arrange the error terms ξjt and ωjt, defined in (6.6) and (6.8) respectively, in a 2 × 1 vector:
ujt = (ξjt, ωjt)'.
• The paper estimates two equations using data with a panel structure (J products
observed in multiple years).
• Estimation is by GMM using the following (L1 + L2 ) × 1 two-equation panel moments
(see sections 4.5 and 4.6):
g(Wj, θ) = Σ_{t=1}^T Z'jt ujt,   (6.9)
where Zjt = [ z1jt 0 ; 0 z2jt ] is the 2 × (L1 + L2) matrix that stacks the two instrument vectors block-diagonally.
• The product characteristics xjt and wjt are assumed to be exogenous, so these variables
serve as instruments for themselves.
• Price pjt is clearly endogenous, since higher unobserved quality ξjt will make it optimal
for the firm to set a higher price.
• To form instruments for price, the paper relies on the reasonable claims that markups:
– depend on whether the product has many close substitutes or not, which in turn
depends on whether product characteristics are similar to those of competitors;
– “respond differently to own and rival products”.
• Since the markup is one constituent of price, product characteristics of the firm’s own
products and of other firms’ products will be correlated with price.
• These instruments must also serve to get enough (L ≥ K) moments to estimate the
σ-parameters.
• Although it is not too clear from the discussion in the paper, it seems intuitive that the
moments involving these instruments will also be informative about the σ-parameters,
11 I use an upper case ‘w’ here to distinguish it from wjt.
12 In practice, some of these must be dropped because of collinearity. See the readme file of Gentzkow and Shapiro’s replication.
which are closely related to the differences in substitution patterns between similar
products vs. non-similar products.13
• As is often the case with nonlinear models, the (implicit) claim that the model satisfies requirement 4. for the GMM estimator relies on an intuitive argument that there is a sufficient number of different sources of variation. See also the discussion in section 4.9.
• Requirement 3. for the GMM estimator is that the population mean of the moment is
zero (at the true parameter value).
• Let zt = (xjt , wjt )j∈Jt . BLP’s primitive identifying assumptions are the mean indepen-
dence conditions:
E[ξjt |zt ] = 0
E[ωjt |zt ] = 0.
• That is, it is assumed that neither the characteristics of product j itself, nor those of
other products, contain any information about the expected value of the unobserved
quality ξjt or unobserved cost shifter ωjt .
• It follows that if the instruments z1jt and z2jt are functions of zt , requirement 3. is
satisfied.
• Requirement 1. for the GMM estimator is that W1 , . . . , WJ are i.i.d., although weaker
assumptions may be sufficient, as mentioned in footnote 2.
• It is not obvious that independence holds, since the moment for each j explicitly
depends on the prices and characteristics of all other products j 0 sold in the same
markets, through the choice probability (6.5).
• The discussion in the paper related to this issue (mostly on p. 856) is not very trans-
parent, but at least independence across j seems to be assumed.14
13 For a recent, somewhat formal, discussion of what kind of instruments are needed in BLP-type models, see Berry and Haile (2016): Identification in Differentiated Products Markets, Annual Review of Economics.
14 The large-sample properties of the estimator are discussed in more detail in Berry, Linton and Pakes
6.7 Practical issues
• The method for ‘concentrating out’ the linear parameters (see section 5.8) works in
the BLP setting and should be used.
• Still let θ1 = (σ1, . . . , σK1, α)' and let θ2 = (β', γ')'.
• Define
yjt (θ1) = ( δjt (θ1) ; ln(pjt − bjt (θ1)) )
and the block-diagonal regressor matrix
Xjt = ( xjt 0 ; 0 wjt ),
so that yjt (θ1) = Xjt θ2 + ujt.
• Now let Y (θ1 ), X and Z be the vector and matrices that vertically stack yjt (θ1 ), Xjt
and Zjt .
• Based on the same reasoning as in section 5.8, we now get the following expression for the BLP estimator:
θ̃2 (θ1) = (X'ZW Z'X)^{−1} X'ZW Z'Y (θ1),
θ̂1 = arg min_{θ1} [(Y (θ1) − X θ̃2 (θ1))'Z] W [Z'(Y (θ1) − X θ̃2 (θ1))],
θ̂2 = θ̃2 (θ̂1).
(2004) and in appendix 1 of the NBER version of BLP. In both places the Lyapunov Central Limit Theorem is used to show that the asymptotic distribution of the sample mean of the moments is normal. This theorem requires independence (but not identical distribution) (see Billingsley: Probability and Measure, 3rd ed., Theorem 27.3).
7 Nevo (2001)
7.1 Brief overview
• Paper: Measuring Market Power in the Ready-to-Eat Cereal Industry, 2001, by Aviv
Nevo.
• The model and estimation are very similar to BLP, so other than being a very nice exposition of the BLP approach, the paper’s main contributions are to use:
– a structural demand model to formally test whether firms are colluding, by com-
paring the markups implied by the demand estimates to cost data, and
– product fixed effects to relax the mean independence / orthogonality assumption
in BLP to hold only for the market-specific deviation in unobservable quality.
• The model/estimation in this paper is essentially the same (with only very minor
differences) as in Nevo’s other paper from around the same time, RAND Journal of
Economics, 2000: Mergers with differentiated products: the case of the ready-to-eat
cereal industry.
• Instead of testing conduct using implied markups, the merger paper calculates the
counterfactual price equilibrium under the assumption that the ownership matrix
changes to reflect proposed mergers. We have already discussed how this can be done.
See the paper for the results.
7.2 Data
– 25 brands of breakfast cereal, denoted j. (Those with highest market share in the
last quarter of the data.)
– 20 quarters (1988:Q1 - 1992:Q4).
– 65 cities (the number of cities in the sample increases over time).
– There are 1124 city-quarter combinations, denoted by t.
– All except one brand are present in all 1124 city-quarter combinations. One brand is present only in 1989:Q1 (unclear why it is included).
– Quantity variable, qjt is the total number of servings sold (kilograms sold divided
by serving size).
– Market size Mt is inhabitants × days in a quarter, i.e. one serving per capita per
day.
– Given qjt and Mt , market shares of inside and outside goods are defined as in
BLP.
– Price pjt is total value of sales of the brand at t (deflated by regional urban CPI)
divided by qjt . Price varies across cities as well as quarters.
– The vector of product characteristics xj (where the last three elements are dummies
for products geared to different consumer segments) does not vary across t.
– Brand-specific advertising expenditure ajt (seems to vary across time).
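• The quantity, market-size and share definitions above (one potential serving per capita per day) can be written out directly; the quantities and population below are made up:

```python
import numpy as np

# hypothetical city-quarter: servings sold for three brands
q_t = np.array([2.0e6, 1.5e6, 0.5e6])   # q_jt, servings sold
population, days = 1.0e6, 91            # inhabitants, days in the quarter
M_t = population * days                 # market size: one serving per capita per day

s_t = q_t / M_t                         # inside-good market shares s_jt
s0_t = 1.0 - s_t.sum()                  # outside-good share, as in BLP
```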
• Other data are used to construct instrumental variables in some specifications, but not
in the full model (see section 7.5).
7.3 Model
• Let xjk be the k-th element of the vector xj and let K be the length of xj .
• Let πk be a row vector of parameters (specific to xjk ) of the same length as the column
vector Di .
• Some of the elements of πk are set to zero; check rows 3-5 in Table VI, p. 327 of the
paper to see which parameters are estimated.
• Consumer i at t gets (indirect) utility from choosing alternative j:

uijt = δjt + pjt (π0 Di + σ0 νi0) + Σ_{k=1}^{K} xjk (πk Di + σk νik) + εijt    (7.1)
δjt = dj + dq(t) + γajt − αpjt + ∆ξjt    (7.2)
dj = xj β̄ + ξj    (7.3)
ui0t = εi0t,    (7.4)

where dj and dq(t) are dummy variables for the brand and quarter, and q(t) is the
quarter of city-quarter t (so the quarter effect is common to all cities in that quarter).
• Here νi0 and the νik are standard normal, and εijt is type-1 extreme value.
• In (7.1), utility has been split into several parts, defined in (7.2-7.3), according to how
they are estimated.
• Choice probabilities are

Pjt = ∫∫ Pjt (Di, νi) φ(νi) h(Di) dνi dDi,    (7.5)
where Pjt (Di, νi) is the logit choice probability conditional on (Di, νi), implied by
the type-1 extreme value εijt:

Pjt (Di, νi) = exp(δjt + µijt) / [1 + Σ_{m=1}^{J} exp(δmt + µimt)],

with µijt = pjt (π0 Di + σ0 νi0) + Σ_{k=1}^{K} xjk (πk Di + σk νik) denoting the
individual-specific part of utility, and where the vector δt stacks all the δjt vertically.
• The linear parameters that are ‘concentrated out’ are θ2 = (d1 , . . . , dJ , dq1 , . . . , dq19 , γ, α).
• This means that the data matrix X in (5.30) should now stack, for each observation
(j, t), a row consisting of a full set of brand dummies, quarter dummies, advertising
ajt, and −pjt.
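• As a numerical illustration of (7.5), the integral can be simulated by averaging conditional logit probabilities over random draws. In the sketch below the πk Di terms are set to zero for brevity, so only the ν draws matter, and all parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
J, K, R = 3, 2, 5000                      # brands, characteristics, draws

# hypothetical market: mean utilities, prices, characteristics
delta = np.array([-1.0, -0.5, -2.0])      # delta_jt
p = np.array([3.0, 2.5, 4.0])             # p_jt
x = rng.normal(size=(J, K))               # x_jk
sigma0, sigma = 0.5, np.array([0.3, 0.2]) # random-coefficient scales

nu0 = rng.normal(size=R)                  # nu_i0 (price)
nu = rng.normal(size=(R, K))              # nu_ik

# mu_ijt: individual-specific part of utility (pi * D_i terms omitted)
mu = sigma0 * nu0[:, None] * p[None, :] + (nu * sigma) @ x.T

# logit probabilities conditional on the draws (from the T1EV eps)
e = np.exp(delta[None, :] + mu)
P_cond = e / (1.0 + e.sum(axis=1, keepdims=True))

P_jt = P_cond.mean(axis=0)                # Monte Carlo version of the integral
```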
7.4 Estimation
– The error term that goes into the moments used in estimation is ∆ξjt , i.e. the
only city-quarter specific deviation in unobserved quality ξjt after controlling for
the average (across city-quarters) unobserved quality ξj through the brand fixed
effects dj .
– To decompose dj into the effects of xj and ξj , respectively, a second stage of
estimation is needed, where dj is regressed on xj and the residuals are now the
estimates of ξj .
• The moments used in estimation interact the instruments with this error:
zjt ∆ξjt,
where ∆ξjt = δjt (θ1) − [dj + dq(t) + γajt − αpjt] and zjt is a vector of instruments.
• This means that Nevo assumes E[zjt ∆ξjt ] = 0, but does not need the BLP-type as-
sumption E[zjt ξjt ] = 0 for his main results.
• This seems like a big advantage, since unobserved quality ξjt is likely to be correlated
with product characteristics or other j-specific instruments, while the city-quarter
specific deviation ∆ξjt is much more likely to be uncorrelated with the instruments.
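• The sample analogue of the moment condition E[zjt ∆ξjt] = 0 simply averages the instrument-residual interactions over the (j, t) observations. A sketch, with data simulated so that the condition holds by construction:

```python
import numpy as np

def sample_moments(z, dxi):
    """Sample analogue of E[z_jt * Delta xi_jt]: average the
    instrument-residual interactions over all (j,t) observations."""
    return (z * dxi[:, None]).mean(axis=0)

# simulated data in which Delta xi is independent of the instruments
rng = np.random.default_rng(2)
n, L = 20000, 4                       # observations (j,t pairs), instruments
z = rng.normal(size=(n, L))
dxi = rng.normal(size=n)              # Delta xi_jt
g = sample_moments(z, dxi)            # should be close to zero
```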
• Whereas BLP assumed independence across j but not t, Nevo (implicitly) assumes
independence across each jt-observation.
• Since BLP has 999 j-observations, they can rely on asymptotics in the j-dimension.
• Nevo on the other hand, has only 25 j-observations, so if he defined the moments like
BLP do, there would be only 25 observations of each moment.15
15 Alternatively, he could assume independence across brands and cities, but not across quarters within a
brand-city combination. This would give a large enough number of observations (65 · 25).
• The second-stage regression
dj = xj β̄ + ξj
could be estimated with OLS or the GLS (generalized least squares) estimator used in
the paper (see p. 322).
• However, the parameters β̄ are not needed for any of the analysis in the paper. They
would only be needed for questions that involve changing the value of xj . The stronger
orthogonality assumption therefore plays a very minor role in the results.
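• A sketch of the second stage using OLS (the GLS weighting from p. 322 is omitted); dj, xj and ξj are simulated here for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
J = 25                                    # brands
x = np.column_stack([np.ones(J), rng.normal(size=(J, 3))])  # x_j (with constant)
beta_true = np.array([0.5, 1.0, -0.3, 0.2])
xi = 0.5 * rng.normal(size=J)             # unobserved quality xi_j
d = x @ beta_true + xi                    # brand fixed effects d_j from stage one

beta_hat = np.linalg.lstsq(x, d, rcond=None)[0]  # OLS of d_j on x_j
xi_hat = d - x @ beta_hat                        # residuals = estimates of xi_j
```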
7.5 Instruments
• The instrument vector zjt contains brand and quarter dummy variables and advertising
expenditure.
• The remaining parameters (for which other instruments are needed) are (see Table VI,
p. 327):
– α (1 parameter)
– σ (9 parameters)
– π (10 parameters)
• Product characteristics do not vary across city-quarters (t), so all product character-
istics or functions of product characteristics can be written as linear combinations of
the brand dummy variables.
• Therefore, the BLP instruments do not provide additional restrictions beyond those
involving the brand dummies. (See p. 320 in the paper for a discussion.)
• Nevo therefore uses two other sets of instruments:
– The quarterly average price of j in the other cities in the same region as the city
of city-quarter t. Since there are 20 quarters in the data, and each quarterly
average is included as a separate instrument, this gives 19 instruments when one
is dropped to avoid collinearity (with the full set of brand dummies).
– Proxies for marginal cost: census region dummy variables (9 regions minus 1,
presumably) (transport cost), city density (cost of space) and average city earn-
ings in the supermarket sector (labour cost). This gives 10 additional moment
conditions.
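• The first set of instruments (the quarterly average price of the brand in the other cities of the same region) can be constructed along the following lines, with toy dimensions and made-up prices for a single brand:

```python
import numpy as np

rng = np.random.default_rng(4)
C, Q = 6, 4                               # cities, quarters (toy sizes)
region = np.array([0, 0, 0, 1, 1, 1])     # region of each city
p = rng.uniform(2.0, 4.0, size=(C, Q))    # price of one brand, city x quarter

# leave-one-out average over the other cities in the same region, per quarter
iv = np.empty_like(p)
for c in range(C):
    others = (region == region[c]) & (np.arange(C) != c)
    iv[c] = p[others].mean(axis=0)
```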
• The prices of other cities are correlated with price because of shared components of
marginal cost, but uncorrelated with the demand shock ∆ξjt because these shocks are
assumed to be independent across cities.
• Both sets of instruments should be helpful to estimate the price parameter (if we accept
the exogeneity assumptions).
• The paper does not discuss how the instruments contribute to estimating the σ and π
parameters, but in the same way as in BLP it seems plausible that instruments that
shift price and therefore market shares will provide information about the parameters
(σ and π) that determine substitution patterns. (See also footnote 13.)
• The end goal of the paper is to determine whether firms act as if they maximize profit
(a) of brands individually, (b) of the firm’s portfolio of brands, or (c) jointly of all
brands in the market (act as a cartel).
• Using
– the asymptotic normal distribution of the estimator (with estimated mean and
variance), and
– the fact that each hypothesis about conduct and each value of the parameter
vector imply a markup (p − mc)/p for each brand,
Nevo calculates 95-percent confidence intervals for the median (across brands and city-
quarters) markup.
• The confidence intervals are computed by simulation:
1. Take R = 1500 draws ν r (this is the number used in the paper) from a multivariate
normal distribution with mean zero and identity covariance matrix, of dimension
equal to the number of estimated parameters (only the ones from the first stage
of estimation are needed).
2. Combine the estimated mean (parameter estimates) and covariance matrix of the
estimator with the standard normal draws to get θ̂r = θ̂ + Aν r (where AA′ equals
the covariance of the estimator).
3. For each θ̂r calculate the implied percentage markup for each jt under each of the
three hypotheses (a)-(c), and take the median across all jt.
4. Repeat steps 2. and 3. for all the R draws.
5. Find the 2.5-th and 97.5-th percentiles of the median markup across the R draws,
for each of the three hypotheses. These are the bounds of the confidence intervals.
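• Steps 1-5 can be sketched as follows; the function median_markup below is a made-up stand-in for the conduct-specific markup calculation (p − mc)/p across all jt, and all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(5)

theta_hat = np.array([2.0, 0.5])                  # first-stage estimates (hypothetical)
V = np.array([[0.04, 0.01], [0.01, 0.09]])        # their estimated covariance
A = np.linalg.cholesky(V)                         # A A' = V
R = 1500                                          # draws, as in the paper

def median_markup(theta):
    # stand-in for the markups (p - mc)/p implied for each jt by a conduct
    # hypothesis and the demand parameters; returns the median across jt
    markups = 0.1 * theta[0] + 0.05 * theta[1] + np.linspace(-0.02, 0.02, 11)
    return np.median(markups)

nu_draws = rng.normal(size=(2, R))                # step 1
theta_r = theta_hat[:, None] + A @ nu_draws       # step 2: theta^r = theta_hat + A nu^r
med = np.array([median_markup(theta_r[:, r]) for r in range(R)])  # steps 3-4
lo, hi = np.percentile(med, [2.5, 97.5])          # step 5: CI bounds
```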
• For each hypothesis i = a, b, c, let [li , ui ] denote the resulting 95-percent confidence
interval.
• Then if hypothesis i is true, and m is the true population median markup (assuming
the demand model is correct and the parameters consistently estimated):

Pr( m ∈ [li, ui] | hypothesis i is true ) = 0.95.
• Using direct data on cost (not implied by the demand estimates), Nevo claims that
the percentage markup of a typical firm is between a low estimate of 31% and a high
estimate of 46%. These numbers should be regarded as approximate values for the
true markup m.
• For each i the confidence interval then gives us a test of the null hypothesis that
hypothesis i is true.
• Under the null hypothesis, the relevant confidence interval covers m with probability
0.95, so we will reject the null hypothesis if m is not in [li , ui ].
• Using this method, Nevo can reject hypothesis (c), of joint profit maximization, but
not the other two.
• He then concludes that firms in this industry do not appear to collude, but that their
market power is due to product differentiation (probably to a large extent driven by
marketing).