Source: faculty.washington.edu/bajari/metricstheorysp09/bayesian.pdf
1 Motivation.
• Bayesian discrete choice models
• Bayesian approach offers extremely powerful methods for numerical integration
• These methods facilitate the study of latent variable models
• Discrete choice is a case in point
• Some particularly rich models can only be studied using Bayes
2 Review of Bayesian Statistics
• Let p(D|θ) denote the probability of the data D given parameter θ
• In most applications, p(D|θ) corresponds to a likelihood l(θ)
• Unlike many classical estimators, the Bayesian approach typically requires specification of a parametric likelihood
• Many researchers view this as a drawback
• However, in discrete choice, parametric models are almost always used
• p(θ) is the prior distribution of the parameters
• The choice of a prior can be controversial in applied work
• In some cases, we might put reasonable restrictions on parameters
• Some researchers conduct sensitivity analysis with respect to priors
• Bayes theorem:

p(θ|D) = p(D, θ)/p(D) = p(D|θ)p(θ)/p(D) ∝ p(θ)l(θ)
• p(θ|D) is the posterior distribution of the parameters given the data
• In the Bayesian approach, the researcher updates his/her beliefs about the parameters using the laws of conditional probability
• The posterior is proportional to the prior times the likelihood
• Note that as the sample becomes large, the likelihood will become the dominant term
• In the Bayesian approach, the parameter is a random variable; in classical statistics it is a fixed constant
• Confidence intervals etc. are obtained using an asymptotic approximation
• The choice between Bayes/classical approaches is a matter of dispute in statistics/econometrics
• Under mild regularity conditions, as the sample size n becomes large, the posterior becomes:

p(θ|D) ≈ N(θ̂_MLE, [−H(θ̂_MLE)]⁻¹)

where H is the Hessian of the log-likelihood, evaluated at θ = θ̂_MLE

• Intuitively, this occurs because the posterior closely resembles the likelihood function as the sample size becomes large.
• p(θ) stays fixed with n
• l(θ) increases with n
• The likelihood swamps the prior in large samples
2.1 Predictive distributions
• Df : data that we have not observed
• D: observed data

p(Df |D) = ∫ p(Df |θ) p(θ|D) dθ

• p(Df |θ): probability of Df given θ
• Integrate out parameter uncertainty using the posterior, p(θ|D)
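As an illustrative sketch (the distributions are made up, not from the notes), the predictive distribution can be simulated by first drawing θ from the posterior and then drawing Df given θ:

```python
import numpy as np

# Hedged example: suppose the posterior for a normal mean theta is
# N(1.0, 0.1^2) and future data Df | theta ~ N(theta, 1).
# Both distributions are invented for illustration.
rng = np.random.default_rng(0)
theta = rng.normal(1.0, 0.1, size=100_000)  # draws from p(theta | D)
Df = rng.normal(theta, 1.0)                 # draws from p(Df | theta)
# The predictive variance is the noise variance (1.0) plus the
# parameter uncertainty (0.1^2 = 0.01), i.e. about 1.01.
```

Averaging over the posterior draws of θ is exactly the integral above: parameter uncertainty widens the predictive distribution beyond the pure noise variance.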
2.2 Decision Theory
• L(a, θ): loss function associated with an action a when the parameter is θ
• A Bayesian should choose a to minimize expected loss given parameter uncertainty:

min_a ∫ L(a, θ) p(θ|D) dθ

• Examples: estimation: a corresponds to a choice of parameter
• Non-nested testing: a corresponds to a choice of model
• Profit maximization: target marketing
• Utility maximization
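To make the expected-loss minimization concrete, here is a small numerical sketch (the gamma "posterior" is made up): under squared-error loss L(a, θ) = (a − θ)², the minimizing action is the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.gamma(2.0, 1.0, size=50_000)  # stand-in draws from p(theta | D)

def expected_loss(a):
    # Monte Carlo approximation of the integral of L(a, theta) p(theta | D)
    return np.mean((a - theta) ** 2)

# Minimize expected loss over a grid of candidate actions
grid = np.linspace(0.0, 5.0, 501)
a_star = grid[np.argmin([expected_loss(a) for a in grid])]
# Under squared-error loss, a_star matches the posterior mean (about 2.0 here)
```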
3 Markov Chains.
• Two common ways to conduct MCMC are Gibbs sampling and the Metropolis algorithm.
• A normal random-walk Metropolis works as follows.
• First, the econometrician comes up with a rough guess θ0 at the MLE.
• Second, come up with a rough guess I0 at the information matrix using the Hessian at the MLE.
• A sequence of pseudorandom values θ^(1), ..., θ^(S) is drawn as follows. Given θ^(s):

1. First, draw a candidate value θ̃ ∼ N(θ^(s), I0)
2. Second, compute α = min{ p(θ̃)f(y|θ̃) / [p(θ^(s))f(y|θ^(s))], 1 }
3. Set θ^(s+1) = θ̃ with probability α and θ^(s+1) = θ^(s) with probability 1 − α.

• Implementing this algorithm simply requires the econometrician to evaluate the likelihood repeatedly and draw normal deviates.
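A minimal sketch of the steps above in Python. The function name and the proposal scaling (I take the proposal covariance to be I0⁻¹, the usual asymptotic scaling) are my choices; `log_post` stands for log p(θ) + log f(y|θ) up to a constant:

```python
import numpy as np

def rw_metropolis(log_post, theta0, I0, n_draws, seed=0):
    """Normal random-walk Metropolis; log_post returns the
    log prior plus log likelihood up to a constant."""
    rng = np.random.default_rng(seed)
    k = len(theta0)
    # Proposal covariance taken as I0^{-1} (an assumption of this sketch)
    chol = np.linalg.cholesky(np.linalg.inv(I0))
    draws = np.empty((n_draws, k))
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    for s in range(n_draws):
        cand = theta + chol @ rng.standard_normal(k)   # 1. candidate draw
        lp_cand = log_post(cand)
        # 2.-3. accept with probability alpha = min{exp(lp_cand - lp), 1}
        if np.log(rng.uniform()) < lp_cand - lp:
            theta, lp = cand, lp_cand
        draws[s] = theta
    return draws

# Usage: target a standard normal posterior in one dimension
draws = rw_metropolis(lambda t: -0.5 * float(t @ t), np.zeros(1), np.eye(1), 5000)
```

Note the acceptance ratio is computed on the log scale, which avoids overflow when the likelihood is a product of many terms.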
• A second algorithm for constructing a Markov chain is Gibbs sampling.
• Partition the parameters into d blocks, θ1, ..., θd
• Let pk(θk|θ1, ..., θk−1, θk+1, ..., θd) denote the conditional distribution of the kth block of parameters given the others.
• In some applications, this distribution can be convenient to form even if the entire likelihood is quite complicated!
• Starting with an initial value θ^(0), Gibbs sampling works as follows. Given θ^(s):

1. Draw θ1^(s+1) ∼ p1(θ1 | θ2^(s), ..., θd^(s))
2. Draw θ2^(s+1) ∼ p2(θ2 | θ1^(s+1), θ3^(s), ..., θd^(s))
3. Draw θ3^(s+1) ∼ p3(θ3 | θ1^(s+1), θ2^(s+1), θ4^(s), ..., θd^(s))
...
d. Draw θd^(s+1) ∼ pd(θd | θ1^(s+1), ..., θd−1^(s+1))
d+1. Return to 1.
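A toy illustration of the blocking idea (my example, not from the notes): for a bivariate normal with correlation ρ, each full conditional is a univariate normal, so the cycle above reduces to two easy draws per iteration:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws, seed=0):
    """Gibbs sampler for (x1, x2) bivariate normal with unit variances
    and correlation rho. Each conditional is N(rho * other, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)
    x1, x2 = 0.0, 0.0
    draws = np.empty((n_draws, 2))
    for s in range(n_draws):
        x1 = rng.normal(rho * x2, sd)  # block 1 given current block 2
        x2 = rng.normal(rho * x1, sd)  # block 2 given updated block 1
        draws[s] = (x1, x2)
    return draws

draws = gibbs_bivariate_normal(0.8, 20_000)
```

The stationary distribution of this chain is the joint normal, even though we only ever draw from the two conditionals.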
4 Bayesian Analysis of Regression
• MCMC/Gibbs sampling are particularly powerful in problems with latent variables.
• In discrete choice, utility is latent to the econometrician.
• In a multinomial probit, if utility were observed by the econometrician, estimating parameters would boil down to linear regression.
• For our analysis, it will be useful to consider the Bayesian analysis of linear regression.
• A key step in Rossi et al.'s paper essentially involves the Bayesian analysis of a normal linear regression.
• The analysis here follows Geweke's textbook, Contemporary Bayesian Econometrics (an excellent intro to the subject).
• y is a T × 1 vector of dependent variables
• X is a T × k matrix of covariates (nonstochastic)
• Assume that error terms are normal and homoskedastic.
• y|β, h, X ∼ N(Xβ, h⁻¹IT)

p(y|β, h, X) = (2π)^(−T/2) h^(T/2) exp(−h(y − Xβ)′(y − Xβ)/2)

• h is called the precision parameter (the inverse of the variance)
• IT is the T × T identity matrix, so the covariance matrix is a scalar multiple of the identity
• This is our likelihood function.
• Recall that Bayes theorem implies that the posterior distribution of the model parameters is proportional to the prior times the likelihood.
• We need to specify a prior distribution for our model parameters.
• p(β) = N(β̄, H⁻¹)

p(β) = (2π)^(−k/2) |H|^(1/2) exp(−(β − β̄)′H(β − β̄)/2)

• The prior on β is normal with prior mean β̄ and prior precision H
• The prior distribution on h is s²h ∼ χ²(v)

p(h) = [2^(v/2) Γ(v/2)]⁻¹ (s²)^(v/2) h^((v−2)/2) exp(−s²h/2)

• Remark: this is essentially a gamma distribution, rewritten in a manner that will be convenient below.
• This form for the prior is chosen because of conjugacy, i.e. the posterior distribution can be written in an analytically convenient manner.
• Now recall that the posterior is proportional to the prior times the likelihood.
• That is, p(β, h|X, y) ∝ p(β)p(h)p(y|β, h, X)
• Combining the equations above yields:

p(β, h|X, y) ∝ (2π)^(−T/2) [2^(v/2) Γ(v/2)]⁻¹ |H|^(1/2) (s²)^(v/2) h^((T+v−2)/2) exp(−s²h/2) exp(−(β − β̄)′H(β − β̄)/2 − h(y − Xβ)′(y − Xβ)/2)
• Now recall from Cameron and Trivedi, the idea behind Gibbs sampling is to "block" the parameters into a set of convenient conditional distributions.
• In this case, we will want to block p(β|h, X, y) and p(h|β, X, y)
• Let's first derive p(β|h, X, y).
• It is obvious from the above expression that β is going to be normally distributed.
• We will want to complete the square inside the exp() to rewrite the expression in the form (β − β̃)′H̃(β − β̃)
• Then β̃ will be the posterior mean and H̃ the posterior precision
• Distributing terms and completing the square in β yields:

(β − β̄)′H(β − β̄) + h(y − Xβ)′(y − Xβ) = (β − β̃)′H̃(β − β̃) + Q

where
H̃ = H + hX′X
β̃ = H̃⁻¹(Hβ̄ + hX′Xb)
b is the OLS estimate of β
Q is a constant that does not depend on β
• Note that the posterior precision is the sum of the prior precision and the data precision hX′X
• The posterior mean β̃ is a weighted average of the prior mean and the OLS estimate.
• The weights depend on the posterior precision, the prior precision, and the OLS estimate of the precision.
• As the sample size becomes sufficiently large, the data will "swamp" the prior.
• The number of terms in the likelihood is a function of T and grows with the sample size
• The number of terms in the prior remains fixed.
• Next, we have to derive the posterior in h
• If you look at the prior times the likelihood, it is obvious that the conditional posterior p(h|β, X, y) will be of the form h^α exp(−hω)
• If we can derive α and ω, we can express the posterior
• By some straightforward, albeit tedious, algebra we can write the conditional posterior p(h|β, X, y) as:

s̃²h ∼ χ²(ṽ), where
s̃² = s² + (y − Xβ)′(y − Xβ)
ṽ = v + T
• A Gibbs sampler generates a pseudo-random sequence (h^(s), β^(s)), s = 1, ..., S, using the following Markov chain:

1. Given (h^(s), β^(s)), draw β^(s+1) ∼ p(β|h^(s), X, y)
2. Given β^(s+1), draw h^(s+1) ∼ p(h|β^(s+1), X, y)
3. Return to 1
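The two steps above can be sketched directly. This is a minimal implementation under the conjugate prior β ∼ N(β̄, H⁻¹) and s²h ∼ χ²(v) described earlier; variable names like `beta_bar` and the simulated test data are my choices:

```python
import numpy as np

def gibbs_regression(y, X, beta_bar, H, s2, v, n_draws, seed=0):
    """Two-block Gibbs sampler for y ~ N(X beta, h^{-1} I)."""
    rng = np.random.default_rng(seed)
    T, k = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    h = 1.0  # arbitrary starting value for the precision
    betas, hs = np.empty((n_draws, k)), np.empty(n_draws)
    for s in range(n_draws):
        # Step 1: beta | h ~ N(beta_tilde, H_tilde^{-1}), with
        # H_tilde = H + h X'X and beta_tilde = H_tilde^{-1}(H beta_bar + h X'y)
        H_tilde = H + h * XtX
        beta_tilde = np.linalg.solve(H_tilde, H @ beta_bar + h * Xty)
        L = np.linalg.cholesky(H_tilde)
        beta = beta_tilde + np.linalg.solve(L.T, rng.standard_normal(k))
        # Step 2: h | beta via s_tilde^2 h ~ chi2(v + T)
        resid = y - X @ beta
        h = rng.chisquare(v + T) / (s2 + resid @ resid)
        betas[s], hs[s] = beta, h
    return betas, hs

# Usage on simulated data with true beta = (1, 2) and true h = 1/0.25 = 4
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(200) * 0.5
betas, hs = gibbs_regression(y, X, np.zeros(2), 0.01 * np.eye(2), 1.0, 3.0, 2000)
```

With a nearly flat prior and T = 200, the posterior means land close to the OLS values, illustrating the "data swamps the prior" point above.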
• This example illustrates the importance of conjugacy.
• That is, choosing our prior and likelihood to be the "right" functional forms greatly simplifies the analysis of the posterior distribution.
• Standard texts in Bayesian statistics (e.g. Berger; Bernardo and Smith) have appendices that lay out conjugate distributions.
5 Bayesian Multinomial Probit
• The multinomial probit is closely related to the analysis of the normal linear model.
• The multinomial probit is defined as:

yij = xijβ + εij    (1)
var(εi1, ..., εiJ) = h⁻¹IJ
cij = 1{yij > yij′ for all j′ ≠ j}

• In the above, yij is the utility of person i for alternative j
• εij is the stochastic preference shock
• xij are covariates that enter into i's utility
• cij = 1 if i chooses j
• If the yij were known, then we could use the Gibbs sampler above to estimate β and h
• However, the yij are latent variables, and therefore we use data augmentation.
• The idea behind data augmentation is simple: we integrate out the distribution of the variables that we do not see.
• Following the notation in Cameron and Trivedi, let f(θ|y, y*) denote the posterior conditional on the observed variables y and the latent variables y*.
• Let f(y*|y, θ) denote the distribution of the latent variables conditional on y and the parameters.
• Then the posterior can be written as:

p(θ|y) = ∫ f(θ|y, y*) f(y*|y, θ) dy*
• Taking account of the latent variable simply involves an additional Gibbs step.
• The distribution of the latent utility yij is a truncated normal distribution.
• If cij = 1, yij is a truncated normal with mean parameter xijβ, precision h and lower truncation point max{yij′ : j′ ≠ j}.
• If cij = 0, yij is a truncated normal with mean parameter xijβ, precision h and upper truncation point max{yij′ : j′ ≠ j}.
• The Gibbs sampler for the multinomial probit simply adds the data augmentation step above:
• A Gibbs sampler generates a pseudo-random sequence (h^(s), β^(s), {yij^(s)}i∈I, j∈J), s = 1, ..., S, using the following Markov chain:

1. Given (h^(s), β^(s)), draw β^(s+1) ∼ p(β|h^(s), X, y^(s), C)
2. Given β^(s+1), draw h^(s+1) ∼ p(h|β^(s+1), X, y^(s), C)
3. For each i, draw yi1^(s+1) ∼ p(yi1|β^(s+1), h^(s+1), X, yi2^(s), ..., yiJ^(s), C)
4. Draw yi2^(s+1) ∼ p(yi2|β^(s+1), h^(s+1), X, yi1^(s+1), yi3^(s), ..., yiJ^(s), C)
5. ...
6. Draw yiJ^(s+1) ∼ p(yiJ|β^(s+1), h^(s+1), X, yi1^(s+1), ..., yiJ−1^(s+1), C)
7. Return to 1
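The latent-utility draws in steps 3-6 are truncated normals. A sketch of one such draw using simple rejection sampling (my illustration: in the sampler, `mean` would be xij′β, `sd` would be h^(−1/2), and `bound` the maximum of the rival utilities):

```python
import numpy as np

def draw_latent_utility(mean, sd, bound, chosen, rng):
    """Draw y_ij ~ N(mean, sd^2) truncated below at `bound` if alternative
    j was chosen (c_ij = 1), and truncated above at `bound` otherwise.
    Rejection sampling is fine for illustration but slow far in the tails;
    production code would use inverse-CDF draws."""
    while True:
        y = rng.normal(mean, sd)
        if (y > bound) == chosen:
            return y

rng = np.random.default_rng(0)
# Example: alternative j was chosen and the best rival utility is 0.5,
# so the latent y_ij must lie above 0.5
draws = [draw_latent_utility(0.0, 1.0, 0.5, True, rng) for _ in range(1000)]
```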
6 Target Marketing
• In "The Value of Purchase History Data in Target Marketing", Rossi et al. attempt to estimate household level preference parameters.
• This is of interest as a marketing problem.
• CMI checkout coupon uses purchase information to customize coupons to a particular household.
• In principle, the entire purchase history (from consumer loyalty cards) could be used to customize coupons (and hence prices)
• If a household level preference parameter can be forecasted with high precision, this is essentially first degree price discrimination!
• Even with short purchase histories, they find that profits are increased 2.5 fold through the use of purchase data compared to blanket couponing strategies.
• Even one observation can boost profits from couponing by 50%.
• This application is of interest to economists as well.
• The methods in this paper allow us to account for consumer heterogeneity in a very rich manner.
• This might be useful to examine the distribution of welfare consequences of a policy intervention (e.g. a merger or market regulation).
• Beyond that, these methods demonstrate the power of Bayesian methods in latent variable problems.
7 Random Coefficients Model
• Multinomial probit with panel data on household level choices:

yh,t = Xh,t βh + εh,t,  εh,t ∼ N(0, Λ)
βh = Δzh + vh,  vh ∼ N(0, Vβ)

• Households h = 1, ..., H and time t = 1, ..., T
• Xh,t covariates and zh demographics
• Note that the household specific random coefficients βh remain fixed over time
• Ih,t is the observed choice
• The posterior distributions are derived in Appendix A.
• Formally, the derivations are very close to our multinomial probit model above.
• Gibbs sampling is used to simulate the posterior distribution of Λ, Δ, Vβ
8 Predictive Distributions
• The authors wish to give different coupons to different households.
• A rational (Bayesian) decision maker would form her beliefs about household h's preference parameters given her posterior about the model parameters.
• This will involve, as we show below, forming a predictive distribution for βh given the econometrician's information set.
• As a first case, suppose that the econometrician only knew zh, the demographics of household h
• From our model, p(βh|zh, Δ, Vβ) is N(Δzh, Vβ)
• Given the posterior p(Δ, Vβ|Data), the econometrician's predictive distribution for βh is:

p(βh|zh, Data) = ∫ p(βh|zh, Δ, Vβ) p(Δ, Vβ|Data) dΔ dVβ
• We can simulate p(βh|zh, Data) given our posterior Gibbs simulations Δ^(s), Vβ^(s), s = 1, ..., S:

(1/S) Σs p(βh|zh, Δ^(s), Vβ^(s))
• We could draw random βh from p(βh|zh, Data).
• For each Δ^(s), Vβ^(s), draw βh^(s) from p(βh|zh, Δ^(s), Vβ^(s))
• Given βh^(s), s = 1, ..., S, we could then simulate purchase probabilities.
• Draw εht^(s) from εh,t ∼ N(0, Λ^(s))
• The posterior purchase probability for j, given Xht and zh, is:

(1/S) Σs 1{Xjht βh^(s) + εjht^(s) > Xj′ht βh^(s) + εj′ht^(s) for all j′ ≠ j}

• This would allow us to simulate the purchase response to different couponing strategies for a specific household h.
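The frequency estimator above can be coded directly. A sketch with hypothetical Gibbs output (constant β draws, identity Λ, and made-up covariates; in practice the draws would come from the posterior simulator):

```python
import numpy as np

def purchase_prob(X_ht, beta_draws, chol_Lambda_draws, j, seed=0):
    """Posterior probability that alternative j has the highest utility,
    averaging the indicator over posterior draws and preference shocks.
    X_ht is J x k; beta_draws[s] is length k; chol_Lambda_draws[s] is a
    Cholesky factor of Lambda^(s)."""
    rng = np.random.default_rng(seed)
    wins = 0
    for beta, chol in zip(beta_draws, chol_Lambda_draws):
        eps = chol @ rng.standard_normal(X_ht.shape[0])  # eps ~ N(0, Lambda)
        u = X_ht @ beta + eps                            # utilities for all J brands
        wins += int(np.argmax(u) == j)                   # did brand j win?
    return wins / len(beta_draws)

# Hypothetical output: 3 alternatives, alternative 0 has the highest mean utility
X_ht = np.array([[2.0], [0.0], [0.0]])
S = 2000
p0 = purchase_prob(X_ht, [np.array([1.0])] * S, [np.eye(3)] * S, j=0)
```

Rerunning this with the coupon-adjusted price column in X_ht would give the simulated purchase response to a couponing strategy.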
• The paper runs through different couponing strategies given different information sets (e.g. full purchase history or choice-only information sets).
• The key ideas are similar: form a predictive distribution for h's preferences and simulate purchase behavior in an analogous fashion.
• In the case of a full purchase information history, we could use the raw Gibbs output, since the Markov chain will simulate βh^(s), s = 1, ..., S.
• This could then be used to simulate choice behavior as in the example above (given draws of εht^(s))
9 Data
• AC Nielsen scanner panel data for tuna in Springfield, Missouri.
• 400 households, 1.5 years, 1-61 purchases.
• Brands and covariates in Table 2.
• Demographics in Table 3.
• Table 4: Δ coefficients.
• Poorer people prefer private label.
• Goodness of fit is moderate for the demographic coefficients
• Figures 1 and 2: household level coefficient estimates with different information sets
• Table 5: returns to different marketing strategies.
• Bottom line: you gain 0.5 to 1.0 cents per customer through better estimates.
• With a lot of customers, this could be quite profitable.