# 1 Maximum Likelihood Estimation - huji.ac.il · 2006. 7. 8. · that M k is number of times the...

### Transcript of 1 Maximum Likelihood Estimation - huji.ac.il · 2006. 7. 8. · that M k is number of times the...

heads tails

Figure 1: A simple thumbtack tossing experiment.

0 0.2 0.4 0.6 0.8 1

L(θ

:D)

0 0.2 0.4 0.6 0.8 1

L(θ

:D)

Figure 2: The likelihood function for the sequence of tossesH, T, T, H, H .

1 Maximum Likelihood Estimation

In this section, we describe the basic principles behindmaximum likelihood estimation.

1.1 The Thumbtack Example

We start with what may be considered the simplest learning problem: parameter learn-ing for a single variable. This is a classical “Statistics 101” problem which illustratessome of the issues that we shall encounter in more complex learning problems. Sur-prisingly, this simple problem already contains some interesting issues that we need totackle.

Imagine that we have a thumbtack, and we conduct an experiment whereby we flipthe thumbtack in the air, and it comes to land as either heads or tails, as in Figure 1. Wetoss the coin several times, obtaining a data set consisting of heads or tails outcomes.Based on this data set, we want to estimate the probability with which the next flip willland heads or tails.

In this description, we already made the implicit assumptions that the thumbtacktosses are controlled by an (unknown) parameter θ, which describes the frequency ofheads in thumbtack tosses. In addition, we need to assume that the tosses are indepen-dent of each other. That is, the outcome of a toss is not affected by the outcomes ofprevious tosses: the thumbtack does not “remember” the previous flips. Data instancessatisfying these assumptions are often referred to as independent and identically dis-I.I.D. SAMPLES

tributed (i.i.d.) samples.Assume that we toss the thumbtack 100 times, of which 35 come up heads. What is

our estimate for θ? Our intuition suggests that the best estimate is 0.35. Had θ had been0.1, for example, our chances of seeing 35/100 heads would be much lower. In fact,

1

we examined a similar situation in our discussion of sampling methods in Section ??,where we used samples from a distribution to estimate the probability of a query. As wediscussed, the central limit theorem shows that, as the number of coin tosses grows, it isincreasingly unlikely to sample a sequence of i.i.d. thumbtack flips where the fractionof tosses that come out heads is very far from θ. Thus, for sufficiently large M , thefraction of heads among the tosses is a good estimate with high probability.

To formalize this intuition, assume that we have a set of thumbtack tosses x[1], . . . , x[M ]that are i.i.d., that is, each is sampled independently from the same distribution in whichX [m] is equal toH (heads) or T (tails) with probability θ and 1− θ, respectively. Ourtask is to find a good value for the parameter θ. As in many formulations of learningtasks, we define a hypothesis space Θ— a set of possibilities that we are considering,and a scoring function that tells us how good different hypotheses in the space are rel-ative to our data set D. In this case, our hypothesis space Θ is the set of all parametersθ ∈ [0, 1].

How do we score different possible parameters θ? One way of evaluating θ is byhow well it predicts the data. In other words, if the data is likely given the parameter,the parameter is a good predictor. For example, suppose we observe the sequence ofoutcomes H, T, T, H, H . If we know θ, we could assign a probability to observingthis particular sequence. The probability of the first toss is P (X [1] = H) = θ. Theprobability of the second toss is P (X [2] = T | X [1] = H), but our assumption thatthe coin tosses are independent allows us to conclude that this probability is simplyP (X [2] = T ) = 1 − θ. This is also the probability of the third outcome, and so on.Thus, the probability of the sequence is

P (〈H, T, T, H, H〉 : θ) = θ(1 − θ)(1 − θ)θθ = θ3(1 − θ)2.

As expected, this probability depends on the particular value θ. As we consider differ-ent values of θ, we get different probabilities for the sequence. Thus, we can examinehow the probability of the data changes as a function of θ. We define the likelihoodfunction to be

L(θ : 〈H, T, T, H, H〉) = P (〈H, T, T, H, H〉 : θ) = θ3(1 − θ)2.

Figure 2 plots the likelihood function in our example.Clearly, parameter values with higher likelihood are more likely to generate the

observed sequences. Thus, we can use the likelihood function as our measure of qual-ity for different parameter values, and select the parameter value that maximizes thelikelihood. Such an estimator is called Maximum Likelihood Estimator (MLE). ByMAXIMUM LIKELIHOOD

ESTIMATOR viewing Figure 2 we see that θ̂ = 0.6 = 3/5maximizes the likelihood for the sequenceH, T, T, H, H .

Can we find the MLE for the general case? Assume that our data set D of observa-tions containsMh heads andMt tails. We want to find the value θ̂ that maximizes thelikelihood of θ relative to D. The likelihood function in this case is:

L(θ : D) = θMh(1 − θ)Mt .

It turns out that it is easier to maximize the logarithm of the likelihood function. In ourcase, the log-likelihood function is:LOG-LIKELIHOOD

2

�(θ : D) = Mh log θ + Mt log(1 − θ).

Note that the log-likelihood is monotonically related to the likelihood. Therefore, max-imizing the one is equivalent to maximizing the other. However, the log-likelihood ismore convenient to work with, as products are converted to summations.

Differentiating the log-likelihood, setting the derivative to 0, and solving for θ, weget that the maximum likelihood parameters, which we denote θ̂, is

θ̂ =Mh

Mh + Mt(1)

as expected.As we shall see, the maximum likelihood approach has many advantages. However,

the approach also has some limitations. For example, if we get 3 heads out of 10 tossesthe MLE estimate is 0.3. We get the same estimate if we get 300 heads out of 1000tosses. Clearly, the two experiments are not equivalent. Our intuition is that, in thesecond experiment, we should be more confident of our estimate. Indeed, statisticalestimation theory deals with confidence intervals. These are common in news reports,e.g., when describing the results of election polls, where we often hear that “61%±2%”plan to vote for a certain candidate. The 2% is a confidence interval — the poll isdesigned so as to select enough people so that the MLE estimate will be within 0.02 ofthe true parameter, with high probability. Exercise ?? expands on this topic.

1.2 The Maximum Likelihood Principle

We start by describing the setting of the learning problem. Assume that we observeseveral i.i.d. samples of a set of random variables X from an unknown distributionP ∗(X ). We assume we know in advance the sample space we are dealing with (i.e.,which random variables, and what values they can take). However, we do not makeany additional assumptions about P ∗. We denote the training set of samples as D andTRAINING SET

assume it consists ofM instances of X : ξ[1], . . . ξ[M ].Next, we need to consider what exactly we want to learn. We assume that we

are given a parametric model for which we wish to estimate parameters. Formally, aPARAMETRIC MODEL

PARAMETERS model is a function P (ξ : θ) that, given a set of parameter values θ and an instance ξof X , assigns a probability (or density) to ξ. Of course, we require that for each choiceof parameters θ, P (ξ : θ) is a legal distribution; that is, it is non-negative and∑

ξ

P (ξ : θ) = 1.

In general, for each model, not all parameter values are legal. Thus, we need to definethe parameter space Θ, which is the set of allowable parameters.PARAMETER SPACE

To get some intuition, we consider concrete examples. The model we examined inSection 1.1 has parameter space Θthumbtack = [0, 1], and is defined as

Pthumbtack(x : θ) ={

θ if x = H1 − θ if x = T

There are many additional examples.

3

Example 1.1: Suppose thatX is a multinomial variable that can take values x1, . . . , xK .The simplest representation of a multinomial distribution is as a vector θ ∈ IRK , suchthat

Pmultinomial (x : θ) = θk if x = xk.

The parameter space of this model is

Θmultinomial =

{θ ∈ [0, 1]K :

∑i

θi = 1

}.

Example 1.2: Suppose that X is a continuous variable that can take values in the realline. A Gaussian model forX is

PGaussian(x : μ, σ) =1√2πσ

e−(x−μ)2

2σ2

where θ = 〈μ, σ〉. The parameter space for this model is ΘGaussian = IR × IR+. Thatis, we allow any real value of μ and any positive real value for σ.

The next step in maximum likelihood estimation is defining the likelihood function.LIKELIHOOD

As we saw in our example, the likelihood function, for a given choice of parameters θ,is the probability (or density) the model assigns the training data:

L(θ : D) =∏m

P (ξ[m] : θ)

In the thumbtack example, we have seen that we can write the likelihood functionusing simpler terms. That is, using the counts Mh and Mt, we managed to have acompact description of the likelihood. More precisely, once we knew the values ofMh andMt, we did not need to consider other aspects of training data (e.g., the orderof tosses). These are the sufficient statistics for the thumbtack learning problem. In amore general setting, a sufficient statistic is a function of the data that summarizes therelevant information for computing the likelihood.

Definition 1.3: A function s(ξ) from instances of X to IR� (for some �) is a sufficientSUFFICIENT STATISTICS

statistic if, for any two data sets D and D′ and any θ ∈ Θ, we have that∑ξ[m]∈D

s(ξ[m]) =∑

ξ′[m]∈D′s(ξ′[m]) =⇒ L(θ : D) = L(θ : D′).

We often refer to the tuple∑

ξ[m]∈D s(ξ[m]) as the sufficient statistics of the data setD.

Example 1.4: Let us reconsider the multinomial model of Example 1.1. It is easy tosee that a sufficient statistic for the data set is the tuple of counts 〈M1, . . . , MK〉, such

4

that Mk is number of times the value xk appears in the training data. To obtain thesecounts by summing instance-level statistics, we define s(x) to be a tuple of dimensionK , such that s(x) has a 0 in every position, except at the position k for which x = xk,where its value is 1:

s(xk) = (

k−1︷ ︸︸ ︷0, . . . , 0, 1,

n−k︷ ︸︸ ︷0, . . . , 0).

Given the vector of counts we can write the likelihood function as

L(D : θ) =∏k

θMk

k .

Example 1.5: Let us reconsider the Gaussian model of Example 1.2. In this case, itis less obvious how to construct sufficient statistics. However, if we expand the term(x − μ)2 in the exponent, we can rewrite the model as

PGaussian(x : μ, σ) = e−x2 12σ2 +x μ

σ2 − μ

2σ2 − 12 log(2π)−log(σ)

We then see that the function

sGaussian(x) = 〈1, x, x2〉is a sufficient statistic for this model. Note that the first element in the sufficient statis-tics tuple is “1”, which does not depend on the value of the data item; it serves, as inthe multinomial case, to count the number of data items.

Several comments about the likelihood function. First, we stress that the likelihoodfunction measures the effect of the choice of parameters on the training data. Thus, forexample, if we have two sets of parameters θ and θ′, so that L(θ : D) = L(θ′ : D),then we cannot, given only the data, distinguish between the two choices of parame-ters. Moreover, if L(θ : D) = L(θ′ : D) for all possible choices of D, then the twoparameters are indistinguishable for any outcome. In such a situation, we can say inadvance (i.e., before seeing the data) that some distinctions cannot be resolved basedon the data alone.

Second, since we are maximizing the likelihood function, we usually want it to becontinuous (and preferably smooth) function of θ. To ensure these properties, most ofthe theory of statistical estimation requires that P (ξ : θ) is a continuous and differen-tiable function of θ, and moreover that Θ is a continuous set of points (which is oftenassumed to be convex).

Once we have defined the likelihood function, we can use maximum likelihoodMLE

estimation (MLE) to choose the parameter values. Formally, we state this principle asfollows.

Maximum Likelihood Estimation: Given a data set D, choose parame-ters θ̂ that satisfy

L(θ̂ : D) = maxθ∈Θ

L(θ : D)

5

Example 1.6: Consider estimating the parameters of a multinomial distribution of Ex-ample 1.4. As one might guess, the maximum likelihood is attained when

θ̂k =Mk

M

(see Exercise ??). That is, the probability of each value of X corresponds to its fre-quency in the training data.

Example 1.7: Consider the estimating the parameters of a Gaussian distribution ofExample 1.5. It turns out that the maximum is attained when μ and σ correspond to theempirical mean and variance of the training data:

μ̂ =1M

∑m

x[m]

σ̂ =

√1M

∑m

(x[m] − μ̂)2

(see Exercise ??).

2 Bayesian Estimation

Although the MLE approach seems plausible, it can be overly simplistic in many cases.Assume again that we perform the thumbtack experiment and get 3 heads out of 10. Itmay be quite reasonable to conclude that the parameter θ is 0.3. But what if we do thesame experiment with a dime, and also get 3 heads? We would be much less likely tojump to the conclusion that the parameter of the dime is 0.3. Why? Because we have alot more experience with tossing dimes, so we have a lot more prior knowledge abouttheir behavior. Note that we do not want our prior knowledge to be an absolute guide,but rather a reasonable starting assumption that allows us to counterbalance our currentset of 10 tosses, under the assumption that they may not be typical. However, if weobserve 1,000,000 tosses of the dime, of which 300,000 came out heads, then we maybe more willing to conclude that this is a trick dime, one whose parameter is closer to0.3.

Maximum likelihood allows us to make neither of these distinctions: between athumbtack and a dime, and between 10 tosses and 1,000,000 tosses of the dime. Thereis, however, another approach, the one recommended by Bayesian statistics.

2.0.1 Joint Probabilistic Model

In this approach, we encode our prior knowledge about θ with a probability distri-bution; this distribution represents how likely we are a priori to believe the differentchoices of parameters. Once we quantify our knowledge (or lack thereof) about possi-ble values of θ, we can create a joint distribution over the parameter θ and the data casesthat we are about to observeX [1], . . . , X [M ]. This joint distribution is not arbitrary; itcaptures our assumptions about the experiment.

6

θ

X[1] X[2] X[m].�.�.Figure 3: The Bayesian network for simple Bayesian parameter estimation.

Let us reconsider these assumptions. Recall that we assumed that tosses are in-dependent of each other. Note, however, that this assumption was made when θ wasfixed. If we do not know θ, then the tosses are not marginally independent: Each tosstells us something about the parameter θ, and thereby about the probability of the nexttoss. However, once θ is known, we cannot learn about the outcome of one toss fromobserving the results of others. Thus, we assume that the tosses are conditionally in-dependent given θ. We can describe these assumptions using the Bayesian network ofFigure 3.

Having determined the model structure, it remains to specify the local probabilitymodels in this network. We begin by considering the probabilityP (X [m] | θ). Clearly,

P (x[m] | θ) ={

θ if x[m] = x1

1 − θ if x[m] = x0

Note that since we now treat θ as a random variable, we use the conditioning bar,instead of P (x[m] : θ).

To finish the description of the joint distribution, we need to describe P (θ). Thisis our prior probability distribution over the value of θ. In our case, this is a con-tinuous density over the interval [0, 1]. Before we discuss particular choices for thisdistribution, let us consider how we use it.

The network structure implies that the joint distribution of a particular data set andθ factorizes as

P (x[1], . . . , x[M ], θ) = P (x[1], . . . , x[M ] | θ)P (θ)

= P (θ)M∏

m=1

P (x[m] | θ)

= P (θ)θMh(1 − θ)Mt ,

whereMh is the number of heads in the data, andMt is the number of tails. Note thatthe expression P (x[1], . . . , x[M ] | θ) is simply the likelihood function L(θ : D).

This network specifies a joint probability model over parameters and data. Thereare several ways in which we can use this network. Most obviously, we can take an ob-served data setD ofM outcomes, and use it to instantiate the values of x[1], . . . , x[M ];we can then compute the posterior probability over θ:

P (θ | x[1], . . . , x[M ]) =P (x[1], . . . , x[M ] | θ)P (θ)

P (x[1], . . . , x[M ]).

7

In this posterior, the first term in the numerator is the likelihood, the second isthe prior over parameters, and the denominator is a normalizing factor, that we willnot expand on right now. We see that the posterior is (proportional to) a product ofthe likelihood and the prior. This product is normalized so that it will be a properdensity function. In fact, if the prior is a uniform distribution (that is, P (θ) = 1 for allθ ∈ [0, 1]), then the posterior is just the normalized likelihood function.

2.0.2 Prediction

If we do use a uniform prior, what then is the difference between the Bayesian approachand the MLE approach of the previous section? The main philosophical difference isin the use of the posterior. Instead of selecting from the posterior a single value for theparameter θ, we use it, in its entirety, for predicting the probability over the next toss.

To derive this prediction in a principled fashion, we introduce the value of the nextcoin toss x[M +1] to our network. We can then compute the probability over x[M +1]given the observations of the first M tosses. Note that, in this model, the parameter θis unknown, and we are considering all of its possible values. By reasoning over thepossible values of θ and using the chain rule we see that

P (x[M + 1] | x[1], . . . , x[M ]) =

=∫

P (x[M + 1] | θ, x[1], . . . , x[M ])P (θ | x[1], . . . , x[M ])dθ

=∫

P (x[M + 1] | θ)P (θ | x[1], . . . , x[M ])dθ,

where we use the conditional independence in the network to rewrite P (x[M + 1] |θ, x[1], . . . , x[M ]) as P (x[M +1] | θ). In other words, we are integrating our posteriorover θ to predict the probability of heads for the next toss.

Let us go back to our thumbtack example. Assume that our prior is uniform overθ in the interval [0, 1]. Then P (θ | x[1], . . . , x[M ]) is proportional to the likelihoodP (x[1], . . . , x[M ] | θ) = θMh(1 − θ)Mt . Plugging this into the integral, we need tocompute

P (X [M + 1] = x1 | x[1], . . . , x[M ]) =1

P (x[1], . . . , x[M ])

∫θ · θMh(1 − θ)Mtdθ.

Doing all the math (see Exercise ??), we get (for uniform priors)

P (X [M + 1] = x1 | x[1], . . . , x[M ]) =Mh + 1

Mh + Mt + 2. (2)

This prediction is quite similar to the MLE prediction of Eq. (1), except that it addsone “imaginary” sample to each count. Clearly, as the number of samples grows, theBayesian estimator and the MLE estimator converge to the same value. The particu-lar estimator that corresponds to a uniform prior is often referred to as the Laplace’scorrection.

8

0 0.2 0.4 0.6 0.8 1P(theta)

theta

Beta(1,1)

0 0.2 0.4 0.6 0.8 1P(theta)

theta

Beta(1,1)

0 0.2 0.4 0.6 0.8 1

P(theta)

theta

Beta(2,2)

0 0.2 0.4 0.6 0.8 1

P(theta)

theta

Beta(2,2)

0 0.2 0.4 0.6 0.8 1

P(theta)

theta

Beta(10,10)

0 0.2 0.4 0.6 0.8 1

P(theta)

theta

Beta(10,10)

Beta(1, 1) Beta(2, 2) Beta(10, 10)

0 0.2 0.4 0.6 0.8 1

P(theta)

theta

Beta(3,2)

0 0.2 0.4 0.6 0.8 1

P(theta)

theta

Beta(3,2)

0 0.2 0.4 0.6 0.8 1

P(theta)

theta

Beta(15,10)

0 0.2 0.4 0.6 0.8 1

P(theta)

theta

Beta(15,10)

0 0.2 0.4 0.6 0.8 1

Beta(0.5,0.5)

Beta(3, 2) Beta(15, 10) Beta(12 , 1

2 )

Figure 4: Examples of Beta distributions for different choices of hyperparameters.

2.0.3 Priors

We now want to consider non-uniform priors. The challenge here is to pick a distribu-tion over this continuous space that we can represent compactly (e.g., using an analyticformula), and update efficiently as we get new data. For reasons that we discuss below,an appropriate prior in this case is the Beta distribution.BETA DISTRIBUTION

Definition 2.1: A Beta distribution is parameterized by two hyperparameters αh, αt,HYPERPARAMETERS

which are positive reals. The distribution is defined as follows:

θ ∼ Beta(αh, αt) if p(θ) = γθαh−1(1 − θ)αt−1,

The constant γ is a normalizing constant, defined as follows:

γ =Γ(αh + αt)Γ(αh)Γ(αt)

where Γ(x) =∫ ∞0

tx−1e−tdt is the Gamma function.

Intuitively, the hyperparameters αh and αt correspond to the number of imaginaryheads and tails that we have “seen” before starting the experiment. Figure 4 showsBeta distributions for different values of α.

At first glance, the normalizing constant for the Beta distribution might seem some-what obscure. However, the Gamma function is actually a very natural one: it is sim-ply a continuous generalization of factorials. More precisely, it satisfies the propertiesΓ(1) = 1 and Γ(x + 1) = xΓ(x). As a consequence, we easily see that Γ(n + 1) = n!when n is an integer. This function arises directly from the integral over θ that definesthe normalizing constant, as

∫θxdθ = 1

xθx−1, and the 1/x coefficients accumulate,resulting in the Γ function.

9

Beta distributions have properties that make them particularly useful for parameterestimation. Assume our distribution P (θ) is Beta(αh, αt), and consider a single cointoss X . Let us compute the marginal probability overX , based on P (θ). To computethe marginal probability, we need to integrate out θ; standard integration techniquescan be used to show that:

P (X [1] = x1) =∫ 1

0

P (X [1] = x1 | θ) · P (θ)dθ

=∫ 1

0

θ · P (θ)dθ =αh

αh + αt.

This conclusion supports our intuition that the Beta prior indicates that we have seenαh (imaginary) heads αt (imaginary) tails.

Now, let us see what happens as we get more observations. Specifically, we observeMh heads andMt tails. It follows easily that:

P (θ | x[1], . . . , x[M ]) ∝ P (x[1], . . . , x[M ] | θ)P (θ)∝ θMh(1 − θ)Mt · θαh−1(1 − θ)αt−1

= θαh+Mh−1(1 − θ)αt+Mt−1

which is precisely Beta(αh + Mh, αt + Mt). This result illustrates a key property ofthe Beta distribution: If the prior is a Beta distribution, then the posterior distribution,i.e., the prior conditioned on the evidence, is also a Beta distribution.

An immediate consequence is that we can compute the probabilities over the nexttoss:

P (X [M + 1] = x1 | x[1], . . . , x[M ]) =αh + Mh

α + M

where α = αh + αt. In this case, our posterior Beta distribution tells us that we haveseen αh + Mh heads (imaginary and real) and αt + Mt tails.

It is interesting to examine the effect of the prior on the probability over the nextcoin toss. For example, the prior Beta(1, 1) is very different than Beta(10, 10): Al-though both predict that the probability of heads in the first toss is 0.5, the secondprior is more entrenched, and requires more observations to deviate from the predic-tion 0.5. To see this, suppose we observe 3 heads in 10 tosses. Using the first prior,our estimate is 3+1

10+2 = 13 ≈ 0.33. On the other hand, using the second prior, our

estimate is 3+1010+20 = 13

30 ≈ 0.43. However, as we obtain more data, the effect of theprior diminishes. if we obtain 1000 tosses of which 300 are heads, the first prior givesus an estimate of 300+1

1000+2 and the second an estimate of300+101000+20 , both of which are

very close to 0.3. Thus, we see that the Bayesian framework allows us to capture bothof the relevant distinctions: The distinction between the thumbtack and the dime canbe captured by the strength of the prior: for a dime, we might use αh = αt = 100,whereas for a thumbtack, we might use αh = αt = 1. The distinction between 10 and1000 samples is captured by the peakedness of our posterior, which increases with theamount of data.

10

2.1 Priors and Posteriors

We now turn to examine in more detail the Bayesian approach to dealing with unknownparameters. As before, we assume a general learning problem where we observe atraining set D that contains M i.i.d. samples of a set of random variable X from anunknown distribution P ∗(X ). We also assume that we have a parametric model P (ξ |θ) where we can choose parameters from a parameter space Θ.

Recall that the MLE approach attempts to find parameters θ̂ that are the parametersin Θ that are “best” given the data. The Bayesian approach, on the other hand, doesnot attempt to find such a point estimate. Instead, the underlying principle is that weshould keep track of our beliefs about θ’s values, and use these beliefs for reaching con-clusions. That is, we should quantify the subjective probability we assign to differentvalues of θ after we have seen the evidence. Note that, in representing such subjec-tive probabilities, we now treat θ as a random variable. Thus, the Bayesian approachrequires that we use probabilities to describe our initial uncertainty about the parame-ters θ, and then use probabilistic reasoning (i.e., Bayes rule) to take into account ourobservations.

To perform this task, we need to describe a joint distribution P (D, θ) over the dataand the parameters. We can easily write

P (D, θ) = P (D | θ)P (θ)

The first term is just the likelihood function we discussed above. The second termis the prior distribution over the possible values in Θ. The prior captures our initialPRIOR DISTRIBUTION

uncertainty about the parameters. It can also capture our previous experience beforestarting the experiment. For example, if we study coin tossing, we might have priorexperience that suggests that most coins are unbiased (or nearly unbiased).

Once we have specified the likelihood function and the prior, we can use the datato derive the posterior distribution over the parameters. Since we have specified a jointPOSTERIOR

DISTRIBUTION distribution over all the quantities in question, the posterior is immediately derived byBayes rule:

P (θ | D) =P (D | θ)P (θ)

P (D).

The term P (D) is the marginal likelihood of the dataMARGINAL LIKELIHOOD

P (D) =∫Θ

P (D | θ)P (θ)dθ

That is, the integration of the likelihood over all possible parameter assignments. Thisis the a priori probability of seeing this particular dataset given our prior beliefs.

As we saw, for some probabilistic models, the likelihood function can be compactlydescribed by using sufficient statistics. Can we also compactly describe the posteriordistribution? In general, this depends on the form of the prior. As we saw in thethumbtack example of Section 1.1, we can sometimes find priors for which we have adescription of the posterior.

11

As another example of the forms of priors and posteriors, let us examine the learn-ing problem of Example 1.4. Here we need to describe our uncertainty about the pa-rameters of a multinomial distribution. The parameter space Θ is the space of all non-negative vectors θ = 〈θ1, . . . , θK〉 such that ∑k θk = 1. As we saw in Example 1.4,the likelihood function in this model has the form:

L(D : θ) =∏k

θMk

k

Since the posterior is a product of the prior and the likelihood, it seems natural torequire that the prior also have a form similar to the likelihood.

One such family of priors are Dirichlet priors that generalize the Beta priors weDIRICHLET PRIORS

discussed above. A Dirichlet prior is specified by a set of hyperparametersα1, . . . , αK ,HYPERPARAMETERS

so thatθ ∼ Dirichlet (α1, . . . , αK) if P (θ) ∼

∏k

θαk−1k

It is easy to see that if we use a Dirichlet prior, then the posterior is also Dirichlet.

Proposition 2.2: If P (θ) is Dirichlet(α1, . . . , αK) then P (θ | D) is Dirichlet(α1 +M1, . . . , αK + MK), where Mk is the number of occurrences of xk.

Priors such as the Dirichlet priors are useful since the ensure that the posterior hasa nice compact description. Moreover, this description is using the same representationas the prior. This phenomenon is a general one, and one that we strive to achieve as itmakes our computation and representation much easier.

Definition 2.3: A family of priors P (θ : α) is conjugate to a particular model P (ξ | θ)CONJUGATE PRIORS

if for any possible dataset D of i.i.d. samples from P (ξ | θ), and any choice of legalhyperparameters α for the prior over θ, there are hyperparameters α′ that describe theposterior. That is,

P (θ : α′) ∝ P (D | θ)P (θ : α).

For example, Dirichlet priors are conjugate to the multinomial model. We note that thisdoes not preclude the possibility of other families that are also conjugate to the samemodel. See Exercise ?? for an example of such a prior for the multinomial model. Wecan find conjugate priors for other models as well. See Exercise ?? and Exercise ?? forthe development of conjugate priors for the Gaussian distribution.

This discussion shows some examples where we can easily update our beliefs aboutθ after observing a set of instances D. This update process results in a posterior thatcombines our prior knowledge and our observations. What can we do with the pos-terior? We can use the posterior to determine properties of the model at hand. Forexample, to assess our beliefs that a coin we experimented with is biased toward heads,we might compute the posterior probability that θ > t for some threshold t, say 0.6.

Another use of the posterior is to predict the probability of future examples. Sup-pose that we are about to sample a new instance ξ[M + 1]. Since we already have

12

observations over previous instances, our probability over a new example is

P (ξ[M + 1] | D) =∫

P (ξ[M + 1] | D, θ)P (θ | D)dθ

=∫

P (ξ[M + 1] | θ)P (θ | D)dθ

= IEP (θ|D)[[P (ξ[M + 1] | θ)]],

where, in the second step, we use the fact that instances are independent given θ. Thus,our prediction is the average over all parameters according to the posterior.

Let us examine prediction with the Dirichlet prior. We need to compute

P (x[M + 1] = xk | D) = IEP (θ|D)[[θk]].

To compute the prediction on a new data case, we need to compute the expectation ofparticular parameters with respect tor a Dirichlet distribution over θ.

Proposition 2.4: LetP (θ) be a Dirichlet distribution with hyperparametersα1, . . . , αk,then

E [θk] =αk∑k′ αk′

Recall that our posterior isDirichlet(α1+M1, . . . , αK +MK)whereM1, . . . , Mk

are the sufficient statistics from the data. Hence, the prediction with Dirichlet priors is

P (x[M + 1] = xk | D) =Mk + αk

M +∑

k′ αk′

This prediction is similar to prediction with the MLE parameters. The only differ-ence is that we added the hyperparameters to our counts when making the prediction.For this reason the Dirichlet hyperparameters are often called pseudo-counts. We canthink of these as the number of times we have seen the different outcomes in our priorexperience before conducting our current experiment.

The total of the pseudo-counts reflects how confident we are in our prior. To seethis, we defineM ′ =

∑α′

k to be the sample size of the pseudo-counts. The parameterM ′ is often called the equivalent sample size. Using M ′, we can rewrite the hyperpa-EQUIVALENT SAMPLE

SIZE rameters as αk = M ′θ′k, where θ′ = {θ′k : k = 1, . . . , K} is a distribution describingthe mean prediction of our prior. We can see that the prior prediction (before observingany data) is simply θ′. Moreover, we can rewrite the prediction given the posterior as:

P (x[M + 1] = xk | D) =M ′

M + M ′ θ′k +

M

M + M ′ ·Mk

M(3)

That is, the prediction is a weighted average (convex combination) of the prior meanand the MLE estimate. The combination weights are determined by the relative mag-nitude of M ′ — the confidence of the prior (or total weight of the pseudo-counts) —andM — the number of observed samples.

To gain some intuition for the interaction between these different factors, Figure 5shows the effect of the strength and means of the prior on our estimates. We can

13

M�=�#samples

P(X=H)

0

0.1

0.2

0.3

0.4

0.5

0.6

0 20 40 60 80 1000

0.1

0.2

0.3

0.4

0.5

0.6

0 20 40 60 80 100

M�=�#samples

P(X=H)

0

0.1

0.2

0.3

0.4

0.5

0.6

0 20 40 60 80 100

(a) (b)

Figure 5: The effect of the strength and means of the prior on our estimates. Ourempirical distribution is an idealized version of samples from a biased coin where thefrequency of heads is 0.2. The x axis represents the number of samples from thedistribution, and the y the expected probability of heads according to the Bayesianestimate. (a) shows the effect of varying the priormeans θ′h, θ′t, for a fixed prior strengthM ′. (b) shows the effect of varying the prior strength for a fixed prior mean θ′h = θ′t =0.5.

see that, as the amount of real data grows, our estimate converges to the empiricaldistribution, regardless of the starting point. The convergence time grows both with thedifference between the prior mean and the empirical mean, and with the strength of theprior.

Based on Eq. (3), we can see that the Bayesian prediction will converge to the MLEestimate in two situations

• When M → ∞. Intuitively, when we have very large training set the contri-bution of the prior is negligible, and the prediction will be dominated by thefrequency of outcomes in the data.

• When M ′ → 0. In this case, we are unsure about our prior. Note that the casewhere M ′ = 0 is not achievable: the normalization constant for the Dirichletprior grows to infinity when the hyperparameters are close to 0. Thus, the priorwith M ′ = 0 (that is, αk = 0 for all k) is not well defined. Nonetheless, wecan define its behavior by examining the limit whenM ′ approaches 0. The priorwithM ′ = 0 is often called an improper prior.

The difference between the Bayesian estimate and the MLE estimate arises whenM is not too large, andM ′ is not close to 0. In these situations, the Bayesian estimateis “biased” toward the prior probability θ′. In effect, the Bayesian estimate is thensmoother than the MLE estimate. Since we have few samples, we are quite unsureabout our estimate given the data. Moreover, we can see that an additional sample willchange the MLE estimate dramatically.

14

P(X

= H

|D)

M

0.1

0.2

0.3

0.4

0.5

0.6

0.7

5 10 15 20 25 30 35 40 45 50N

T

H

P(X

= H

|D)

M

0.1

0.2

0.3

0.4

0.5

0.6

0.7

5 10 15 20 25 30 35 40 45 50N

T

H

Figure 6: The effect of priors on smoothing our parameter estimates. The graph showsthe estimate of P (X = H |D) (y-axis) after seeing different number of samples (x-axis). The graph below the x-axis shows the particular sequence of tosses. The solidline corresponds to the MLE estimate, and the remaining ones to Bayesian estimateswith different strengths, and uniform prior means. The large-dash line corresponds toBeta(1, 1), the small-dash line to Beta(5, 5), and the dotted line to Beta(10, 10).

Example 2.5: Suppose we are trying to estimate the parameter associated with a coin,and we observe one head and one tail. Our MLE estimate of θH is 1/2 = 0.5. Now, ifthe next observation is a head, we will change our estimate to be 2/3 ≈ 0.66. On theother hand, if our next observation is a tail, we will change our estimate to 1/3 ≈ 0.33.In contrast, consider the Bayesian estimate with a Dirichlet prior with M ′ = 1 andθ′H = 0.5. With this estimator our original estimate is 1.5/3 = 0.5. If we observeanother head, we revise to 2.5/4 = 0.625, and if observe another tail, we revise to1.5/4 = 0.375. We see that the estimate changes by slightly less after the update. IfM ′ is larger, then the smoothing is more aggressive. For example, when M ′ = 5,our estimate is 4.5/8 = 0.5625 after observing a head, and 3.5/8 = 0.4375 afterobserving a tail. We can also see this effect visually in Figure 6, which shows ourchanging estimate for P (θH) as we observe a particular sequence of tosses.

This smoothing effect results in more robust estimates when we do not have enoughdata to reach definite conclusions. If we have good prior knowledge, we revert to it.Alternatively, if we do not have prior knowledge, we can use a uniform prior that willkeep our estimate from taking extreme values. In general, it is a bad idea to haveextreme estimates (ones where some of the parameters are close to 0) since these mightassign too small probability to new instances we later observe. In particular, as wealready discussed, probability estimates that are actually 0 are dangerous, as no amountof evidence can change it. Thus, if we are unsure about our estimates, it is better tobias them away from extreme estimates. The MLE estimate, on the other hand, oftenassigns probability 0 to values that were not observed in the training data.

15

0

0.01

0.02

0.03

0.04

0 0.2 0.4 0.6 0.8 1

0.017

0.035

P(D

|θ)P(θ)

θ

Figure 7: Example of the differences between maximal likelihood score and marginallikelihood for the sequence of coin tosses 〈H, T, T, H, H〉.

3 Marginal Likelihood

Consider a single binary random variable X , and assume we have a prior distributionDirichlet(αh, αt) over X . Consider a data set D that has Mh heads and Mt tails.Then, the maximum likelihood value givenD is

P (D | θ̂) =(

Mh

M

)Mh

·(

Mt

M

)Mt

Now, consider the Bayesian way of assigning probability to the data.

P (D | G) =∫

ΘGP (D | θG ,G)P (θG | G)dθG (4)

where P (D | θG ,G) is the likelihood of the data given the network 〈G, θG〉 and P (θG |G) is our prior distribution over different parameter values for the network G. We callthis term the marginal likelihood of the data given the structure, since we marginalizeMARGINAL LIKELIHOOD

out the unknown parameters.Here, we are not conditioning on the parameter. Instead, we need to compute the

probability P (X [1], . . . , X [M ]) of the data given our prior. One approach to comput-ing this term is to evaluate the integral Eq. (4). An alternative approach uses the chainrule

P (x[1], . . . , x[M ]) = P (x[1]) · P (x[2] | x[1]) · . . . · P (x[M ] | x[1], . . . , x[M − 1])

Recall that if we use a Dirichlet prior, then

P (x[m + 1] | x[1], . . . , x[m]) =Mm

h + αh

m + α

where Mmh is the number of heads in the first m examples. For example, if D =

〈H, T, T, H, H〉,

P (x[1], . . . , x[5]) =αh

α· αt

α + 1· αt + 1

α + 2· αh + 1

α + 3· αh + 2

α + 4

=[αh(αh + 1)(αh + 2)][αt(αt + 1)]

α · · · (α + 4)

16

Picking αh = αt = 1, so that α = αh + αt = 2, we get

[1 · 2 · 3] · [1 · 2]2 · 3 · 4 · 5 · 6 =

12720

= 0.017

(see Figure 7) which is significantly lower than the log-likelihood(35

)3

·(

25

)2

=1083125

≈ 0.035.

Thus, the log-likelihood ascribes a much higher probability to this sequence than doesthe marginal likelihood. The reason is that the log-likelihood is making an overlyoptimistic assessment, based on a parameter that was designed with full retrospectiveknowledge to be an optimal fit to the entire sequence.

In general, for a binomial distribution with a Beta prior, we have

P (x[1], . . . , x[M ]) =[αh · · · (αh + Mh − 1)][αt · · · (αt + Mt − 1)]

α · · · (α + M − 1)

Each of the terms in square brackets is a product of a sequence of numbers such asα · (α + 1) · · · (α + M − 1). If α is an integer, we can write this product as (α+M−1)!

(α−1)! .However, we do not necessarily know that α is an integer. It turns out that we canuse a generalization of the factorial function for this purpose. Recall that the Gammafunction is such that Γ(m) = (m − 1)! and Γ(x + 1) = x · Γ(x). Using the laterproperty, we can rewrite

α(α + 1) · · · (α + M − 1) =Γ(α + M)

Γ(α)

Hence,

P (x[1], . . . , x[M ]) =Γ(α)

Γ(α + M)· Γ(αh + Mh)

Γ(αh)· Γ(αt + Mt)

Γ(αt)

A similar formula holds for a multinomial distribution over the space x1, . . . , xk,with a Dirichlet prior with hyperparameters α1, . . . , αk:

P (x[1], . . . , x[M ]) =Γ(α)

Γ(α + M)·

k∏i=1

Γ(αi + M [xi])Γ(αi)

. (5)

Note that the final expression for the marginal likelihood is invariant to the orderwe selected in the expansion via the chain rule. In particular, any other order resultsin exactly the same final expression. This property is reassuring, because the i.i.d. as-sumption tells us that the specific order in which we get data cases is insignificant.Also note that the marginal likelihood can be computed directly from the same suffi-cient statistics used in the computation of the likelihood function — the counts of thedifferent values of the variable in the data.

17