
Transcript of: A New Scoring Function for Bayesian Network Structure Learning Extended to Arbitrary Discrete Variables

Page 1: A New Scoring Function for Bayesian Network Structure Learning Extended to Arbitrary Discrete Variables

S_BIC+SB(G | D) = log P(D | G) − (log n / 2)|G| + Σ_{(i,j) : E_ij ∉ G} max_{C ∈ Ω_ij} min_{c ∈ val(C)} sb(X_i, X_j | D, C = c)

The first two terms are the Bayesian Information Criterion (BIC); the last term is SparsityBoost (SB).
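Read as code, the score is a BIC term plus one boost term per absent edge. The following is a minimal illustrative Python sketch of that decomposition, not the authors' implementation; every name here (bic_plus_sb_score, omega, sb, ...) is hypothetical rather than taken from [2]:

```python
import math

def bic_plus_sb_score(log_lik, n, num_params, absent_edges, omega, sb):
    """Illustrative sketch of S_BIC+SB(G | D).

    log_lik      -- log P(D | G)
    n            -- number of samples
    num_params   -- |G|, the number of parameters of G
    absent_edges -- pairs (i, j) whose edge E_ij is NOT in G
    omega        -- omega[(i, j)]: list of (C, values-of-C) pairs
    sb           -- sb(i, j, C, c): independence boost (Section 4)
    """
    # BIC: log-likelihood minus the data-agnostic complexity penalty
    score = log_lik - (math.log(n) / 2.0) * num_params
    # SparsityBoost: for each absent edge, take the max over
    # conditioning sets of the min over their values
    for (i, j) in absent_edges:
        score += max(min(sb(i, j, C, c) for c in vals)
                     for (C, vals) in omega[(i, j)])
    return score
```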

P(S|R):
Rain | Sprinkler=T | Sprinkler=F
F    | 0.2         | 0.8
T    | 0.01        | 0.99

•  A BN is a graph G that represents a probability distribution, with one node per variable and one conditional probability distribution (CPD) per node.

•  Structure learning as discrete optimization:

A New Scoring Function for Bayesian Network Structure Learning Extended to Arbitrary Discrete Variables

Rachel Hodos (1), David Sontag (2). 1. Computational Biology Program, NYU; 2. Computer Science Dept., NYU

Learning Task: Given n observations of m variables, learn the Bayesian network structure that generated the data.

1. Introduction

Focus of this work: Devise a score that makes structure learning easier with more data, and that works for discrete variables with any number of states.

Input: Data, D (one row per observation, one column per variable):

X1 X2 X3 X4 X5
 0  2  1  0  2
 1  1  0  0  0
 1  0  1  1  2

Output: Graph, G (a directed graph over X1, ..., X5)

2. Bayesian Networks (BNs) and Structure Learning

A simple example: the Rain / Sprinkler / Grass Wet network, with one CPD P(X_i | Pa(X_i)) per node:

P(R):
Rain=T | Rain=F
0.3    | 0.7

P(G|S,R):
S | R | Grass Wet=T | Grass Wet=F
T | T | 0.99        | 0.01
T | F | 0.9         | 0.1
F | T | 0.8         | 0.2
F | F | 0           | 1

•  But what score to use?

•  Maximum likelihood estimation, S = P(D|G), gives the complete graph, which is useless!

•  Hence we need to encourage sparsity.

G* = argmax_G S(G | D),  where S is a generic score function.

The BN factorizes the joint distribution:

P(X_1, X_2, ..., X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i))
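The factorization can be checked directly on the Rain / Sprinkler / Grass Wet CPDs from Section 2. A minimal Python sketch (variable names are illustrative; True/False stand for the T/F values in the tables):

```python
# CPDs from the poster's example. P_G_given_SR is keyed by (S, R),
# matching the column order of the P(G|S,R) table.
P_R = {True: 0.3, False: 0.7}
P_S_given_R = {True:  {True: 0.01, False: 0.99},   # P(S | R=T)
               False: {True: 0.2,  False: 0.8}}    # P(S | R=F)
P_G_given_SR = {(True, True):   {True: 0.99, False: 0.01},
                (True, False):  {True: 0.9,  False: 0.1},
                (False, True):  {True: 0.8,  False: 0.2},
                (False, False): {True: 0.0,  False: 1.0}}

def joint(r, s, g):
    """P(R=r, S=s, G=g) as the product of the three CPDs."""
    return P_R[r] * P_S_given_R[r][s] * P_G_given_SR[(s, r)][g]
```

Because each CPD is normalized, summing joint(r, s, g) over all eight assignments gives 1, as the factorization requires.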

3. A New Score: SparsityBoost

•  PROBLEM: Existing complexity penalties are data-agnostic, making the score harder to optimize as the amount of data grows.

•  IDEA: Add a data-dependent term that boosts sparsity.

•  HOW: Search for evidence that an edge should not be present, and boost the score of any graph that does not contain that edge.

Annotations for the score terms:
•  (log n / 2)|G|: data-agnostic complexity penalty; |G| = number of parameters of G
•  sb(X_i, X_j | D, C = c): large for strong evidence of independence, small otherwise
•  Ω_ij: a set of conditioning sets, e.g. all subsets of size ≤ 2, excluding X_i and X_j
•  We want the strongest evidence for each edge, so take the max over conditioning sets
•  Independence should hold for all values of the conditioning set, so take the min over its values
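The max/min structure over Ω_ij can be sketched as follows, assuming Ω_ij is all subsets of size ≤ 2 excluding X_i and X_j, as stated above. Function names here are hypothetical:

```python
from itertools import combinations, product

def omega_ij(variables, i, j, max_size=2):
    """All conditioning sets of size <= max_size, excluding X_i and X_j."""
    others = [v for v in variables if v not in (i, j)]
    sets = []
    for k in range(max_size + 1):
        sets.extend(combinations(others, k))
    return sets

def edge_boost(variables, vals, i, j, sb, max_size=2):
    """max over conditioning sets C of the min over assignments c."""
    best = float("-inf")
    for C in omega_ij(variables, i, j, max_size):
        # every joint assignment c to the variables in C
        # (for C = (), product() yields the single empty assignment)
        worst = min(sb(i, j, C, c)
                    for c in product(*(vals[v] for v in C)))
        best = max(best, worst)
    return best
```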

(Figure: two example graph structures over nodes 1-5.)

4. Bayesian Independence Testing

sb(X_i, X_j | D, C = c) = "Conditioning on C = c, how strongly do the data show that X_i is independent of X_j?"

•  Use conditional mutual information, MI.

•  Assuming a uniform prior over joint distributions, derive the posterior p(MI | D).

MI(P(X_i, X_j | c)) = Σ_{x_i, x_j} P(x_i, x_j | c) log [ P(x_i, x_j | c) / (P(x_i | c) P(x_j | c)) ]
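The MI formula translates directly to code. A minimal sketch, assuming the conditional joint P(X_i, X_j | c) is given as a normalized table (the function name is illustrative):

```python
import math

def conditional_mi(joint_c):
    """MI(P(X_i, X_j | c)) for one value c of the conditioning set.

    joint_c[(x_i, x_j)] = P(x_i, x_j | c), assumed normalized.
    """
    # marginals P(x_i | c) and P(x_j | c)
    pi, pj = {}, {}
    for (xi, xj), p in joint_c.items():
        pi[xi] = pi.get(xi, 0.0) + p
        pj[xj] = pj.get(xj, 0.0) + p
    mi = 0.0
    for (xi, xj), p in joint_c.items():
        if p > 0:  # 0 log 0 = 0 by convention
            mi += p * math.log(p / (pi[xi] * pj[xj]))
    return mi
```

For an independent joint the result is 0; for a perfectly correlated binary joint it is log 2.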

(Figure: prior and posteriors over I (mutual information). Panels show P(I) for n = 0 (prior) and P(I | D) for n = 400, 800, and 1200, overlaying P(MI | D) for D drawn from an independent distribution and for D drawn from a dependent distribution. As n increases, the two posteriors separate.)

sb(X_i, X_j | D, C = c) = − log ∫_η p( MI(P(X_i, X_j | C = c)) | D ) dMI,  with threshold η

i.e., minus the log posterior probability that the conditional mutual information exceeds the threshold η.
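The poster (following [1]) works with the posterior over MI analytically. Purely as an illustration of the quantity inside the log, here is a Monte Carlo stand-in: with a uniform prior over the joint, the posterior is Dirichlet(1 + counts), so sampling joints and checking how often their MI exceeds η estimates P(MI > η | D). This is not the authors' method, and all names are hypothetical:

```python
import math, random

def mi_of(p, rows, cols):
    """MI of a joint table p[(a, b)] over rows x cols."""
    pi = {a: sum(p[(a, b)] for b in cols) for a in rows}
    pj = {b: sum(p[(a, b)] for a in rows) for b in cols}
    return sum(p[(a, b)] * math.log(p[(a, b)] / (pi[a] * pj[b]))
               for a in rows for b in cols if p[(a, b)] > 0)

def sb_monte_carlo(counts, rows, cols, eta, samples=2000, rng=random):
    """Estimate sb = -log P(MI > eta | D) by sampling the Dirichlet
    posterior (uniform prior -> pseudo-count 1 in every cell)."""
    exceed = 0
    for _ in range(samples):
        # Dirichlet(1 + counts) draw via normalized Gamma variates
        g = {(a, b): rng.gammavariate(1 + counts.get((a, b), 0), 1.0)
             for a in rows for b in cols}
        z = sum(g.values())
        p = {k: v / z for k, v in g.items()}
        if mi_of(p, rows, cols) > eta:
            exceed += 1
    frac = max(exceed, 1) / samples          # avoid log(0)
    return -math.log(frac)
```

Strongly dependent counts give a small boost (the posterior mass sits above η), while independent-looking counts give a large one, matching the annotation in Section 3.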

6. References

[1] Hutter, M., Zaffalon, M. Distribution of Mutual Information. Computational Statistics and Data Analysis, Vol. 48, No. 3, March 2005, pages 633-657.

[2] Brenner, E., Sontag, D. SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, July 2013.

[3] Beinlich, I. A., Suermondt, H. J., Chavez, R. M., & Cooper, G. F. 1989. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. Pages 247-256 of: Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine. Springer-Verlag.

[4] Cussens, J. Bayesian network learning with cutting planes. In Fabio G. Cozman and Avi Pfeffer, editors, Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 153-160, Barcelona, 2011. AUAI Press.

5. Results on Synthetic Data

(Figure, Error: Structural Hamming Distance vs. number of samples (0 to 4000) for BIC, BIC+SB1, and BIC+SB2.)

(Figure, Runtime: average runtime in seconds vs. number of samples (0 to 8000) for BIC, BIC+SB1, and BIC+SB2.)

•  Start with a known network structure (the 'Alarm' network)
•  Generate synthetic data (only binary data shown)
•  Find the globally optimal structure with respect to the score
•  Each point shown is the average of ten independent experiments
•  Both accuracy and runtime improved