Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log (...

61
149 Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006 Learning Bayesian Networks 23.02.2006

Transcript of Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log (...

Page 1: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

149Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Learning Bayesian Networks

23.02.2006

Page 2: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

150Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Learning Bayesian Networks

p S p x Ssh

i i ih

i

n

( | , ) ( | , , )x paθ θ==

∏1

localdistributionfunctions

joint distribution function

Sh is the hypothesis that structure S (minimal) encodes the joint distribution function

Page 3: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

151Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Example

θ θ θ θs = { , , }| |1 2 1 2 1

X1 X2S:

p x x S p x S p x x Ssh h h( , | , ) ( | , ) ( | , , )

( )|

|

1 2 1 1 2 1 2 1

1 2 11

θ θ θθ θ

=

= −

p x x S p x S p x x Ssh h h( , | , ) ( | , ) ( | , , )|

|

1 2 1 1 2 1 2 1

1 2 1

θ θ θθ θ

==

Page 4: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

152Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Updating or "Learning" Parameters(Sh is certain, θθθθs is uncertain)Let be a random sample from

D

p Sm

sh

= { , , }

( | , ).

x x

x1 K

θ

p D S p S p D S

p S p S

sh

sh

sh

sh

l sh

l

m

( | , ) ( | ) ( | , )

( | ) ( | , )

θ θ θ

θ θ

==

∏ x1

∫ ++ = sh

sh

smh

m dSDPSDxPSDxP θθθ ),|(),,|(),|( 11

Page 5: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

153Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Example

sample 1

sample 2

M

Θ2|1Θ1 Θ2|1

X1 X2

X1 X2

θ θ θ θs = { , , }| |1 2 1 2 1

X1 X2S:

Page 6: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

154Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Exact computation of p(θθθθs|D,Sh)

� No missing data, no hidden variables

� Independent parameters

� Local distribution functions from the exponential family, conjugate priors

p S p Ssh

ih

i

n

( | ) ( | )θ θ==

∏1

Page 7: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

155Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Parameter Independence

sample 1

sample 2

M

Θ2|1Θ1 Θ2|1

X1 X2

X1 X2

Page 8: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

156Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

No Missing Data

Each parameter can be updated independently….

sample 1

sample 2

M

Θ2|1Θ1 Θ2|1

X1 X2

X1 X2

P N N S ph N N( | , , ) ( ) ( )| | | |θ θ θ θ2 1 21 2 1 2 1 2 1 2 121 2 11∝ −

Page 9: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

157Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Updating Parameters

x1 : true false x2 : false true x3 : true true

X1 X2

Data (D)

x4 : true true x5 : false true x6 : true false

p D p( | ) ( )θ θ θ θ1 1 14

121∝ −( )

Page 10: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

158Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Updating Parameters

p D p( | ) ( )θ θ θ θ2 1 2 1 2 12

2 121| | | |( )∝ −x1 : true false

x2 : false true x3 : true true

X1 X2

Data (D)

x4 : true truex5 : false true x6 : true false

Page 11: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

159Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Updating Parameters

p D p( | ) ( )θ θ θ θ2 1 2 1 2 12

2 101| | | |( )∝ −x1 : true false

x2 : false truex3 : true true

X1 X2

Data (D)

x4 : true true x5 : false truex6 : true false

Page 12: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

160Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Conjugate Priors

Θ2|1Θ1 Θ2|1

X1 X2

X1 X2

Beta( 1 1θ α| )

M

Beta( 2|1 2|1θ α| )

Beta( 2| 1 2| 1θ α| )

Page 13: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

161Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Missing Data

x1 : ? true X1 X2

Data

p x p x x p x x p x x p x x( ) =θ θ θ2 1 2 2 1 1 2 1 2 2 1 1 2 1 2| | || ( | , ) ( | ) ( | , ) ( | )+

Beta Beta

Mixing coefficients

For real problems, exact calculations are intractable.

Page 14: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

162Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Approximate computation of p(θθθθs|D,Sh)� Monte Carlo: Gibbs sampling, importance sampling (e.g., Neal 93)

� Gaussian approximation (e.g., Kass et al. 88)

� MAP: ~θs

→)~()~(

21

),~|(),|(ss

tss Ah

smh

s eSDcpSSpθθθθ

θθ−−−∞→

Page 15: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

163Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Gibbs Sampling Geman and Geman (1984)

Basic idea:Given = { and ( ), estimate ( ( )) as follows:

Initialize somehow Repeat

- For = , ,

Sample each according to ( | \ )

Compute using current values of

Average of ( ) regardless of the initial .

X x x p X f X

X

i n

x p x X x

f X X

f X f X X

n

i i i

1

1

, , } E

( )

E( ( )),

K

K

••

Page 16: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

164Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

••

Initialize Repeat

- Sample missing data (given and producing

- Compute

- Sample from

Average of

θ

θθ

θ θ

θ θ

s

s l lc

s lc h

s s lc h

s lc h

s lh

D D

p D S

p D S

p D S p D S

)

( | , )

( | , )

( | , ) ( | , )

Using Gibbs Sampling to Approximate p(θθθθ|D,Sh)

Page 17: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

165Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Gaussian ApproximationTierney and Kadane (1986), Kass et al. (1988)

Basic idea: for large data sets, will often be ~Gaussianp D Ss

h( | , )θ

θ s

p D Ssh( | , )θ

Page 18: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

166Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Gaussian Approximationg p D p

g g

g g A A g

p D p p D p e

T

AT

( ) log ( | ) ( )

( ) ( ~):

( ) ( ~) ( ~) ( ~) ( ' ' ( ~))

( | ) ( ) ( | ~) ( ~)( ~ ) ( ~ )

θ θ θθ θ

θ θ θ θ θ θ θ

θ θ θ θθ θ θ θ

≈ + − − − = −

≈− − −

Expand about its maximum 12

12

Page 19: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

167Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Computational Considerations� Find �Gradient methods�Monte-Carlo methods�EM, if p(D|θθθθs,Sh) is in the exponential family

� Compute �Numerical methods (Meng and Rubin 91) �Likelihood ratio tests (e.g., Raftery 94) �Via inference, if variables discrete (Thiesson 95)

g''( ~)θ

Page 20: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

168Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Expectation-Maximization (EM) Algorithm (Dempster et al. (1977))� Initialize parameters� Expectation step: compute the expected sufficient statistics

� Maximization step: choose parameters so as to maximize their posterior probability given the expected sufficient statistics

� Iterate. Parameters will converge to a local MAP value

E N p x xs l sl

N

( | ) ( , | , )12 1 21

θ θ==∑ x

θθ α

θ α θ α2 121 21

21 21 2 1 2 1

11 1| :

( | )( | ) ( | )

=+ −

+ − + + −E N

E N E Ns

s s

Page 21: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

169Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Learning Structure(both Sh and θθθθs are uncertain)

?X1 X2

X1 X2

Sa:

Sb:

)|(),|(

)|(),|()|(

1

11

DSpDSxp

DSpDSxpDxphb

hbm

ha

hamm

+

++ +=

model averaging

Page 22: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

170Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Learning Structure(both Sh and θθθθs are uncertain)

p S D p S p D S

p S p D S p S d

h h h

hs

hs

hs

( | ) ( ) ( | )

( ) ( | , ) ( | )

= ∫ θ θ θ

Parameters updated as beforeParameters updated as before

marginallikelihood

Page 23: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

171Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Model Selection

� The number of possible structures for a given domain is more than exponential in the number of variables� Solution: Use only one or a handful of models� Necessary components: �Search method�Scoring method

Page 24: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

172Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

One Reasonable Score: Posterior Probability of a Structure

p S D p S p D S

p S p D S p S d

h h h

hs

hs

hs

( | ) ( ) ( | )

( ) ( | , ) ( | )

= ∫ θ θ θ

structureprior

parameterprior

likelihood

Page 25: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

173Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Relation to cross-validation(Dawid 84)

log ( | ) log ( | , , , )

log ( | ) log ( | , ) log ( | , , )

p D S p S

p S p S p S

hl l

h

l

m

h h h

=

= + + +

−=∑ x x x

x x x x x x

1 11

1 2 1 3 1 2

K

L

Obs! The last term in the sumis related to cross-validation

Page 26: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

174Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Predictive ScoreSpiegelhalter et al. (1993)

pred(S p y y y Shl l l l

h

l

m

) log ( | ,{ , }, ,{ , }, )= − −=∑ x x x1 1 1 1

1

K

Ydisease

X1

X2

Xn symptoms...

Page 27: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

175Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Exact computation of p(D|Sh)

� No missing data, no hidden variables

� Independent parameters

� Local distribution functions from the exponential family, conjugate priors� (Prior modularity)

p S p Ssh

ih

i

n

( | ) ( | )θ θ==

∏1

Page 28: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

176Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Prior Modularity

The parameter priors are modular.That is, these quantities for Xi depend only on the structure that is local to Xi (i.e., Pai) and not all of S.

p S p Sih

ih( | ) ( | )θ θ1 2=

If Xi has the same parents in S1 and S2, then

Page 29: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

177Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Example: All Local Distributions Multinomial

p Sx

hx

ki ij

ik

ij

ijk( | )| |

θ θ αpa pa

∝ −∏ 1Conjugate prior:

p x Sik

ij

ih

xik

ij( | , , )

|pa

paθ θ=

Local distribution functions:

Page 30: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

178Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Bayesian Dirichlet ScoreCooper and Herskovits (1991)

p D SN

Nh ij

ij ijj

q

i

nijk ijk

ijkk

ri i

( | )( )

( )

( )

( )=

++

== =∏∏ ∏

ΓΓ

ΓΓ

αα

αα11 1

N X x

r X

q X

N N

ijk i i ij

i i

i i

ij ijkk

r

ij ijkk

ri i

:

::

# cases where = and =

number of states of number of instances of parents of

ik Pa pa

α α= == =∑ ∑

1 1

Page 31: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

179Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

� Monte Carlo � Gaussian approximation

� Bayesian Information Criterion (BIC)log ( | ) log ( | , $ ) ( / ) log( ) ( )p D S p D S d m Oh h

s= − +θ 2 1

log ( | ) log ( | , ~ ) log ( ~ | )

( / ) log( ) ( / ) log | | ( )

p D S p D S p S

d A O m

h hs s

h= +

+ − + −

θ θ

π2 2 1 2 1

Approximations for p(D|Sh)

Page 32: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

180Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Bayesian Information Criterion (BIC) Schwarz (1978)

� No priors needed� BIC = - approximation of stochastic complexity (Rissanen 87)

log ( | ) log ( | , $ ) logp D S p D Sd

mh hs= −θ

2

Page 33: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

181Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Another ApproximationCheeseman & Stutz (1995)

� Accuracy (CH96): Gaussian > CS >> BIC� As efficient as BIC

log ( | ) log ( | )p D S p D Sh h≈ EM

p D SE N

E NEM

h ij

ij ijj

q

i

nijk ijk

ijkk

ri i

( | )( )

( ( ))

( ( ))

( )≈

++

== =∏∏ ∏

ΓΓ

ΓΓ

αα

αα11 1

Page 34: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

182Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Problems When There Are Many Possible Structures� Prior assessment� Search

Page 35: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

183Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Assumptions for ConstructingParameter Priors Geiger and Heckerman (1995)

Relatively small# of assessments

Parameter priorsfor all structuresin a given domain

Page 36: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

184Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Assumptions That Simplify Assessment of Priors� Parameter independence� Prior modularity� (Conjugate priors)� Marginal likelihood equivalence

Page 37: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

185Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Distribution Equivalence

Suppose the local likelihoods are restricted to some family F (e.g., discrete, linear regression).Two network structures for X are distribution equivalent wrt F if they encode the same distributions on X.

Page 38: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

186Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Independence and Distribution Equivalence

S1, S2 distribution eqv wrt some F ⇒ S1, S2 independence eqvbut the converse does not hold for all F. E.g.:

p x S

a b xi i

ji

h

i ijx

jj i

( | , , )

exp

pa

pa

θ =

+ +�

��

��

��

���

∈∑

1

1

X1

X2 X3

X1

X2 X3

and are not distribution equivalent

Page 39: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

187Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Assumption:Covered-Arc-Reversal EquivalenceGiven local likelihoods restricted to F, any two structures for X that differ only by a covered arc reversal are distribution equivalent.Examples:- unrestricted discrete- linear regression (e.g., Shachter and Kenley)

S1, S2 independence eqv⇒ S1, S2 distribution eqv wrt F

Page 40: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

188Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

θθθθx: Parameters for the Joint Likelihood

X1 X2 X3

X1 X3 X2

X2 X1 X3

X2 X3 X1

X3 X1 X2

X3 X2 X1

p Sch( | )θ x

p Sch( | , )x xθ

Page 41: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

189Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

θθθθx: Discrete Case

p S p x x Sch

n ch

x xn( | , ) ( , , | , ) , ,x x xθ θ θ= =1 1

K �

θ θx x x x xi

n

n i i1 1 11

, , | , ,� �=−

=∏

∂θ∂θ

θx

scx x x

r

x xi

n

i i

j in

j

i

=−

= + −

=

∏∏ [ ]| , ,[ ]

, ,1 1

1

1

1

1

1

Π

Page 42: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

190Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

θθθθx: Linear-Regression Casep S Wc

hn( | , ) ( | , )x N xxθ µ=

µ µi i ji jj

i

m b= +=

−∑1

1

W iW i

v v

v v

i it

i

i

i

i

i i

( )( )

+ =+

+ +

+

+

+

+

+ +

1 1

1 1

1

1

1

1

1 1

b b b

b

Page 43: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

191Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Discrete Case: Dirichlet Priors are Inevitable

p Sch

x xp x x S

x xn

n ch

n

( | ) , ,( , , | )

, ,

θ θαx = ⋅ −∏ 1

1

1

1K

K

K

parameter independence

given:likelihood equivalence

Page 44: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

192Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Discrete Case

⇓ parameter independenceparameter independencelikelihood equivalencelikelihood equivalence

X1 X2 X1 X2

f f f f f fy x1 1 21 2 1 2 1

1 1

2 2 12 12 12 12

2 21 1( ) ( ) ( )

( )( ) ( ) ( )

( )| | | | | | | |

θ θ θθ θ

θ θ θθ θ−

=−

Page 45: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

193Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Assessments for the BDeScore

is Dirichlet

p S ch( | )θ x

- equivalent sample size

-

αp X X Sn c

h( , , | )1 K

prior Bayesian networkfor {X1,...,Xn}

Page 46: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

194Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

BDe ScoreHeckerman, Geiger, and Chickering(1994)

p D SN

Nh ij

ij ijj

q

i

nijk ijk

ijkk

ri i

( | )( )

( )

( )

( )=

++

== =∏∏ ∏

ΓΓ

ΓΓ

αα

αα11 1

α ijk i ik

i ik

chp X x S= = =( , | )Pa pa

“Equivalent structures get equal scores.”

Page 47: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

195Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Linear-Regression Case:Normal-Wishart Prior is Inevitable

p S Wch

n W( | ) ( | , , , )θ θ µ α αµx xNw= 0 0

parameter independence

given:likelihood equivalence

Page 48: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

196Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Combine Domain Knowledge and Data

x1 : true false true . . . truex2 : false true true . . . false

xN : true true false . . . true

X1 X2 X3 . . . Xn

...

X1

X4X3

X2

Prior Network Equivalent Sample Size Data

α

X1

X4

X2

X3

Priors for all structures

Learned structure

Page 49: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

197Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Structure Priors

p S D p S D p S p D Sh h h h( | ) ( , ) ( ) ( | )∝ =

structure prior

Page 50: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

198Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Structure Prior

δ is the number of arcs that differ between

Sh and the prior network

p S h( ) ,∝ < <κ κδ 0 1

Page 51: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

199Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Search Methods

��� � � � � ��� � ���� � �� � � � � � � � � � � � � � � � � � �� � � �� � � � � � �� � � �� � �� � � � � � � �

� � � � � � ��� � � � � � � � ! � � � �" � � ��# $ %& � � � �� � � � '( )

* � � � � � �� � � � � � � � �

+, - � � � �/. %�0 132 � � � � � � � � )

+ 4� � �2 " � � � � �� � � � �

+ 5 � �� 2 & � � 6 � �� � � � � � � � � � � �

7� � � � � �� � ��8 9: - �; �=< � >� �; � 6� �� � 6 � � �� �

+ 7� � � �� � � � 5 � � � % $ ' '( )@? & � � � �� � � � % $ ' 'A )

& � � � 6�� �� ; �=< �� �� � � 6�� �� � � � �score( , ) ( , , )S D s X Di i i

i

n

==

∏ Pa1

Page 52: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

200Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

score( , ) log ( , , )

( , , ) log ( , , )

S D s X Pa D

w X Pa D s X D

hi i i

i

n

i ii

n

i ii

n

=

= + ∅

=

= =

∑∑ ∑

1

1 1

w X X D s X X D s X Di j i j i i i( , , ) log ( , , ) log ( , , )≡ − ∅

Special Case: Each Node Has at Most One Parent

Page 53: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

201Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Each Node Has at Most One Parent

The network with the highest probability is theone for which Σ w(Xi , Pai, D) is a maximum

• Scores without likelihood+prior equivalence:Maximum branchings (Edmonds)

• Scores with likelihood+prior equivalence:Maximum spanning tree

Page 54: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

202Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

General Case: Each Node Has at Most kParents

Finding best structure is NP-hard for k>1� Greedy (+/- restarts)� Best-first search� Simulating annealing� Other Monte-Carlo approaches

Page 55: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

203Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Local Search

scoreall possible

single changes

anychangesbetter?

performbest

change

yes

no

returnsaved structure

initializestructure

• prior structure• empty graph• max spanning tree• random graph

extension:multiple restarts

Page 56: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

204Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Simulated Annealing (Metropolis)

pick random changeand compute:

p=exp(∆∆∆∆score/T)

quit?

performchange

with probp

yes

returnsaved structure

initializestructure

• prior structure• empty graph• max spanning tree• random graph

no

Page 57: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

205Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Evaluation Methodology

goldstandardnetwork

learnednetwork(s)

priornetwork

datarandomsample

noise (ηηηη)

Two measures of utility of learned network:- Cross Entropy (gold standard network, learned network)- Structural difference

score +search

Page 58: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

206Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

R

AA

A

A

RR

R

D

D

DD

D

D

R

Gold Standard

Prior Network

Learned Network

Data

Page 59: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

207Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

D

Gold Standard

Prior Network (no arcs)

Learned Network

Data (10,000 cases)

Search over equivalence classesSpirtes and Meek (1995)

Page 60: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

208Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Learning with Hidden VariablesSpecial case of learning with missing dataE.g.: AutoClass (Cheeseman)

H

X1 X2 Xn...

observed

unobserved

Page 61: Three Concepts: Probability © Henry Tirri, Petri Myllymäki ... · log ( | ) log ( | , , , ) log ( | ) log ( | , ) log ( | , , ) p D S p S p S p S p S h l l h l m h = = + + +-= x

209Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006

Identifying Hidden Variables

� Prior knowledge (e.g., AutoClass)� Dependency cliques