A New Scoring Function for Bayesian Network Structure Learning Extended to Arbitrary Discrete Variables

Rachel Hodos¹, David Sontag²
¹Computational Biology Program, NYU   ²Computer Science Department, NYU




1. Introduction

Learning Task: Given n observations of m variables, learn the Bayesian network structure that generated the data.

Focus of this work: Devise a score that makes structure learning easier, not harder, as more data become available, and that works for discrete variables with any number of states.

[Figure: Input: a data table D with one row per observation and one column per variable (X1, ..., X5), entries being discrete states such as 0, 1, 2. Output: a directed acyclic graph G over X1, ..., X5.]
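To make the learning task concrete, here is a minimal sketch in Python of one natural representation of the input and output; the data rows are the ones shown in the figure above, while the particular graph G is a hypothetical placeholder, not a structure learned on the poster.

```python
import numpy as np

# Input D: one row per observation, one column per variable X1..X5;
# entries are discrete states (0, 1, or 2 in this example).
D = np.array([
    [0, 2, 1, 0, 2],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 2],
])

# Output G: a DAG encoded as a map from each variable to its parents.
# This particular structure is illustrative only.
G = {
    "X1": [],
    "X2": ["X1"],
    "X3": ["X1", "X2"],
    "X4": ["X3"],
    "X5": ["X3"],
}
```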

2. Bayesian Networks (BNs) and Structure Learning

•  A BN is a graph G that represents a probability distribution, with one node per variable and one conditional probability distribution (CPD) P(X_i \mid \mathrm{Pa}(X_i)) per node, so that the joint factorizes as (checked numerically in the sketch after this list)

   P(X_1, X_2, \ldots, X_m) = \prod_{i=1}^{m} P(X_i \mid \mathrm{Pa}(X_i))

•  A simple example (Rain → Sprinkler, Rain → Grass Wet, Sprinkler → Grass Wet):

   P(R):  Rain                 T: 0.3    F: 0.7

   P(S | R):  Sprinkler
       R = T:                  T: 0.01   F: 0.99
       R = F:                  T: 0.2    F: 0.8

   P(G | S, R):  Grass Wet
       S = T, R = T:           T: 0.99   F: 0.01
       S = T, R = F:           T: 0.9    F: 0.1
       S = F, R = T:           T: 0.8    F: 0.2
       S = F, R = F:           T: 0      F: 1

•  Structure learning as discrete optimization:

   G^* = \arg\max_G S(G \mid D)

   where S is a generic score function

•  But what score to use? Maximum likelihood estimation, S = P(D \mid G), is maximized by the complete graph, which is useless!

•  Hence the score needs to encourage sparsity
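As a sanity check on the factorization, here is a minimal sketch in plain Python that hard-codes the CPTs above and computes the joint P(R, S, G) as P(R) · P(S | R) · P(G | S, R):

```python
# CPTs from the example above; True/False stand for T/F.
P_R = {True: 0.3, False: 0.7}
P_S_given_R = {True: {True: 0.01, False: 0.99},   # P(S | R=T)
               False: {True: 0.2, False: 0.8}}    # P(S | R=F)
P_G_given_SR = {(True, True): {True: 0.99, False: 0.01},
                (True, False): {True: 0.9, False: 0.1},
                (False, True): {True: 0.8, False: 0.2},
                (False, False): {True: 0.0, False: 1.0}}

def joint(r, s, g):
    """P(R=r, S=s, G=g) = P(R=r) * P(S=s | R=r) * P(G=g | S=s, R=r)."""
    return P_R[r] * P_S_given_R[r][s] * P_G_given_SR[(s, r)][g]

# The joint probabilities over all 8 states sum to 1.
total = sum(joint(r, s, g) for r in (True, False)
            for s in (True, False) for g in (True, False))
print(total)  # 1.0 (up to floating point)
```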

3. A New Score: SparsityBoost�

•  PROBLEM: Existing complexity penalties are data agnostic, causing the score to be more difficult to optimize with more data�

•  IDEA: Add a data-dependent term that boosts sparsity.�•  HOW: Search for evidence that an edge should not be present,

and boost score of any graph that does not contain that edge. �

Data-agnostic complexity penalty, |G| = # of parameters of G�Large for strong evidence of independence, small otherwise�Ωij is a set of conditioning sets, e.g. all subsets of size ≤ 2, excluding Xi and Xj�

Want strongest evidence for each edge, so take max over conditioning sets�Independence should hold for all values of conditioning set�
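A minimal sketch, in Python, of how these pieces combine; the log-likelihood, parameter count, conditioning-set family Ω, and per-(conditioning set, value) evidence scores sb are assumed precomputed, and all names here are ours rather than from [2].

```python
import math
from itertools import combinations

def boost(omega_ij, sb_ij):
    """For a candidate missing edge (i, j): max over conditioning sets C,
    min over values c of C, of the evidence score sb_ij[(C, c)]."""
    return max(min(sb_ij[(C, c)] for c in values) for C, values in omega_ij)

def score_bic_sb(loglik, num_params, n, m, edges, omega, sb):
    """S_BIC+SB(G|D) = log P(D|G) - (log n / 2)|G| + boosts for absent edges."""
    s = loglik - (math.log(n) / 2.0) * num_params   # BIC part
    for i, j in combinations(range(m), 2):          # all variable pairs
        if frozenset((i, j)) not in edges:          # boost only missing edges
            s += boost(omega[(i, j)], sb[(i, j)])
    return s
```

Here `edges` is the graph's skeleton as a set of frozensets, `omega[(i, j)]` is a list of (conditioning set, values) pairs, and `sb[(i, j)]` maps each (conditioning set, value) to its evidence score from Section 4.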


4. Bayesian Independence Testing

•  \mathrm{sb}(X_i, X_j \mid D, C = c) answers: "Conditioning on C = c, how strongly do the data show that X_i is independent of X_j?"

•  Use conditional mutual information, MI (see the sketch below):

   \mathrm{MI}(P(X_i, X_j \mid c)) = \sum_{x_i, x_j} P(x_i, x_j \mid c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\,P(x_j \mid c)}

•  Assuming a uniform prior over joint distributions, derive the posterior p(\mathrm{MI} \mid D) [1]
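For intuition, a minimal Python sketch of the plug-in (empirical) conditional mutual information computed from counts within one slice C = c; this is the quantity whose posterior the test reasons about, not the test itself.

```python
import numpy as np

def empirical_mi(counts):
    """Plug-in MI from a 2-D contingency table counts[xi, xj]
    (counts of Xi=xi, Xj=xj among the samples with C=c)."""
    p = counts / counts.sum()                 # joint P(xi, xj | c)
    pi = p.sum(axis=1, keepdims=True)         # marginal P(xi | c)
    pj = p.sum(axis=0, keepdims=True)         # marginal P(xj | c)
    nz = p > 0                                # skip log(0) terms
    return float((p[nz] * np.log(p[nz] / (pi * pj)[nz])).sum())

# Example: a strongly dependent 2x2 table vs. an exactly independent one.
print(empirical_mi(np.array([[40.0, 10.0], [10.0, 40.0]])))  # ~0.19
print(empirical_mi(np.array([[25.0, 25.0], [25.0, 25.0]])))  # 0.0
```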

[Figure: Prior over I (n = 0) and posteriors P(I | D) for n = 400, 800, and 1200, where I denotes mutual information (horizontal axis: I from 0 to 0.1; vertical axis: density from 0 to 80). One row of panels shows P(MI | D) with D drawn from an independent distribution, the other with D drawn from a dependent distribution; n increases across each row.]

•  Define sb as the negative log posterior probability that the mutual information exceeds a threshold \eta (a Monte Carlo illustration follows):

   \mathrm{sb}(X_i, X_j \mid D, C = c) = -\log \int_{\eta}^{\infty} p\big(\mathrm{MI}(P(X_i, X_j \mid C = c)) \mid D\big)\, d\mathrm{MI}

   so sb is large when nearly all posterior mass lies below the threshold \eta, i.e. when the data strongly support independence.
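[1] derives the posterior over MI analytically; purely as an illustrative stand-in, here is a Monte Carlo sketch in Python. With a uniform prior over joint distributions, the posterior over the joint is Dirichlet(1 + counts), so we can sample joints, compute each sample's MI, and estimate Pr(MI ≥ η | D). The estimator and all names are ours, not from [1] or [2].

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_of_joint(p):
    """MI of a joint distribution given as a 2-D array p[xi, xj]."""
    pi = p.sum(axis=1, keepdims=True)
    pj = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pi * pj)[nz])).sum())

def sb_monte_carlo(counts, eta, num_samples=10000):
    """sb ~= -log Pr(MI >= eta | D): sample joint distributions from the
    Dirichlet posterior (uniform prior + observed counts) and count how
    often their MI clears the threshold eta."""
    alpha = counts.flatten() + 1.0                      # Dirichlet(1 + counts)
    samples = rng.dirichlet(alpha, size=num_samples)
    mis = np.array([mi_of_joint(s.reshape(counts.shape)) for s in samples])
    frac = max((mis >= eta).mean(), 1.0 / num_samples)  # avoid log(0)
    return -np.log(frac)

# Counts consistent with independence yield a large boost; dependent counts ~0.
print(sb_monte_carlo(np.array([[100.0, 100.0], [100.0, 100.0]]), eta=0.05))
print(sb_monte_carlo(np.array([[160.0, 40.0], [40.0, 160.0]]), eta=0.05))
```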

5. Results on Synthetic Data

[Figure: Two panels comparing BIC, BIC+SB1, and BIC+SB2. "Error": Structural Hamming Distance (0–40) vs. number of samples (0–4000). "Runtime": average runtime in seconds (0–3000) vs. number of samples (0–8000).]

•  Start with a known network structure (the 'Alarm' network [3])
•  Generate synthetic data from it (only binary data shown)
•  Find the globally optimal structure with respect to each score [4]
•  Each point shown is the average of ten independent experiments
•  Both accuracy (Structural Hamming Distance; see the sketch below) and runtime improve
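Structural Hamming Distance is the edge-level edit distance between the learned and true graphs; here is a minimal Python sketch under one common convention (count added, missing, and wrongly oriented edges), which may differ in detail from the variant used on the poster.

```python
def shd(true_edges, learned_edges):
    """SHD between two DAGs given as sets of directed edges (parent, child).
    An edge present in only one skeleton counts 1; an edge present in both
    skeletons but with opposite orientation also counts 1."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    true_skel = {frozenset(e) for e in true_edges}
    learned_skel = {frozenset(e) for e in learned_edges}
    dist = len(true_skel ^ learned_skel)       # additions + deletions
    for e in true_skel & learned_skel:         # shared edges: check direction
        a, b = tuple(e)
        if ((a, b) in true_edges) != ((a, b) in learned_edges):
            dist += 1                          # reversal
    return dist

# Example: true graph 1->2, 1->3; learned graph 2->1, 1->3, 3->4.
print(shd({(1, 2), (1, 3)}, {(2, 1), (1, 3), (3, 4)}))  # 2
```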

6. References

[1] Hutter, M. and Zaffalon, M. Distribution of mutual information. Computational Statistics and Data Analysis, 48(3):633–657, March 2005.

[2] Brenner, E. and Sontag, D. SparsityBoost: A new scoring function for learning Bayesian network structure. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), July 2013.

[3] Beinlich, I. A., Suermondt, H. J., Chavez, R. M., and Cooper, G. F. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pages 247–256. Springer-Verlag, 1989.

[4] Cussens, J. Bayesian network learning with cutting planes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 153–160. AUAI Press, 2011.