
The proposed score:

$$S_{\mathrm{BIC+SB}}(G \mid D) \;=\; \underbrace{\log P(D \mid G) \;-\; \frac{\log n}{2}\,|G|}_{\text{Bayesian Information Criterion (BIC)}} \;+\; \underbrace{\sum_{E_{ij} \notin G}\; \max_{C \in \Omega_{ij}}\; \min_{c \in \mathrm{val}(C)} sb(X_i, X_j \mid D, C = c)}_{\text{SparsityBoost (SB)}}$$

P(S|R), the Sprinkler CPD:

| Rain | S = T | S = F |
|---|---|---|
| T | 0.01 | 0.99 |
| F | 0.2 | 0.8 |

• A BN is a graph G that represents a probability distribution, with one node per variable and one conditional probability distribution (CPD) per node.

• Structure learning is posed as discrete optimization over graphs.

A New Scoring Function for Bayesian Network Structure Learning Extended to Arbitrary Discrete Variables

Rachel Hodos¹, David Sontag² (¹Computational Biology Program, NYU; ²Computer Science Dept., NYU)

Learning Task: Given n observations of m variables, learn the Bayesian network structure which generated the data.

1. Introduction

Focus of this work: Devise a score that makes structure learning easier with more data, and that works for discrete variables with any number of states.

Input: Data, D

| X1 | X2 | X3 | X4 | X5 |
|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 2 |
| 1 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 2 |

Output: Graph, G (a DAG over X1, …, X5)

2. Bayesian Networks (BNs) and Structure Learning

Each node has a CPD $P(X_i \mid \mathrm{Pa}(X_i))$.

P(R), the Rain prior:

| Rain = T | Rain = F |
|---|---|
| 0.3 | 0.7 |

P(G|S,R), the Grass Wet CPD:

| S | R | G = T | G = F |
|---|---|---|---|
| T | T | 0.99 | 0.01 |
| T | F | 0.9 | 0.1 |
| F | T | 0.8 | 0.2 |
| F | F | 0 | 1 |

A simple example: the sprinkler network (nodes Rain, Sprinkler, Grass Wet).

• But what score to use?
• Maximum likelihood, S = P(D|G), is maximized by the complete graph, which is useless!
• Hence we need to encourage sparsity.

• Structure learning as discrete optimization, with a generic score function S:

$$G^* = \arg\max_G S(G \mid D)$$

• The joint distribution factorizes over the graph:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$$
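The factorization can be made concrete with the sprinkler CPDs from this section; a minimal sketch (the variable and function names are illustrative, not from the poster):

```python
# The sprinkler network: Rain -> Sprinkler, Rain -> Grass Wet,
# Sprinkler -> Grass Wet. CPD values are taken from the tables above.

P_R = {True: 0.3, False: 0.7}
P_S_given_R = {True: {True: 0.01, False: 0.99},   # P(S | R=T)
               False: {True: 0.2, False: 0.8}}    # P(S | R=F)
P_G_given_SR = {(True, True): {True: 0.99, False: 0.01},
                (True, False): {True: 0.9, False: 0.1},
                (False, True): {True: 0.8, False: 0.2},
                (False, False): {True: 0.0, False: 1.0}}

def joint(r, s, g):
    """P(R=r, S=s, G=g) = P(r) * P(s | r) * P(g | s, r)."""
    return P_R[r] * P_S_given_R[r][s] * P_G_given_SR[(s, r)][g]

# Sanity check: probabilities over all 8 assignments sum to 1.
total = sum(joint(r, s, g) for r in (True, False)
            for s in (True, False) for g in (True, False))
print(round(total, 10))  # 1.0
```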


3. A New Score: SparsityBoost

• PROBLEM: Existing complexity penalties are data agnostic, so the score becomes more difficult to optimize with more data.
• IDEA: Add a data-dependent term that boosts sparsity.
• HOW: Search for evidence that an edge should not be present, and boost the score of any graph that does not contain that edge.

Reading the score term by term:
• Data-agnostic complexity penalty: |G| = number of parameters of G.
• sb term: large for strong evidence of independence, small otherwise.
• Ω_ij: a set of conditioning sets, e.g. all subsets of size ≤ 2, excluding X_i and X_j.
• max over C: we want the strongest evidence for each edge.
• min over c: independence should hold for all values of the conditioning set.
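As a sketch of how the |G| penalty is computed for discrete BNs: a node with k states whose parents take q joint configurations contributes q(k - 1) free parameters. The helper below is illustrative (not the authors' code), with an assumed representation of the graph as dictionaries:

```python
import math

def num_parameters(states_of, parents_of):
    """|G|: each node contributes q * (k - 1) free parameters, where k is
    its number of states and q the number of parent configurations."""
    total = 0
    for node, k in states_of.items():
        q = 1
        for p in parents_of.get(node, []):
            q *= states_of[p]
        total += q * (k - 1)
    return total

# Sprinkler network, all nodes binary: R has 1 parameter, S has 2, G has 4.
states_of = {"R": 2, "S": 2, "G": 2}
parents_of = {"S": ["R"], "G": ["S", "R"]}
print(num_parameters(states_of, parents_of))  # 7

# The data-agnostic BIC penalty for an (illustrative) n = 1000 observations:
n = 1000
penalty = (math.log(n) / 2) * num_parameters(states_of, parents_of)
```

Note that the penalty grows with n, which is exactly the data-agnostic behavior the SparsityBoost term is designed to counteract.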


4. Bayesian Independence Testing

sb(X_i, X_j | D, C = c) = "Conditioning on C = c, how strongly do the data show that X_i is independent of X_j?"

• Use conditional mutual information (MI):

$$MI(P(X_i, X_j \mid c)) = \sum_{x_i, x_j} P(x_i, x_j \mid c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}$$

• Assuming a uniform prior over joint distributions, derive the posterior p(MI | D).
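The MI above can be estimated from counts; a minimal plug-in sketch, assuming the samples have already been restricted to C = c (function name and data format are illustrative):

```python
import math
from collections import Counter

def empirical_mi(pairs):
    """Plug-in MI estimate from a list of (xi, xj) samples:
    sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) p(y)) )."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly coupled binary data: MI = log 2 (in nats).
coupled = [(0, 0), (1, 1)] * 50
print(round(empirical_mi(coupled), 4))  # 0.6931
```

The poster's Bayesian test goes further: instead of this single point estimate, it uses the full posterior p(MI | D).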

(Figure: the prior p(MI) at n = 0 and posteriors p(MI | D) at n = 400, 800, and 1200, shown both for D drawn from an independent distribution and for D drawn from a dependent distribution. Each panel plots p(I | D) against I (mutual information); the posteriors sharpen as n increases.)

$$sb(X_i, X_j \mid D, C = c) = -\log \int_{\eta}^{\infty} p\big(MI(P(X_i, X_j \mid C = c)) \mid D\big)\, dMI$$

where η is a threshold on the mutual information.
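One way to approximate the integral above is Monte Carlo: under a uniform prior, the posterior over the joint P(X_i, X_j | c) given a count table is Dirichlet(counts + 1), so sampling joints and computing each sample's MI estimates the tail mass beyond η. This is a sketch only, for binary X_i, X_j; the cited work [1] characterizes the distribution of MI analytically rather than by sampling, and the counts and η below are illustrative:

```python
import math
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

def mi_2x2(p):
    """MI of a 2x2 joint p = [p00, p01, p10, p11] (in nats)."""
    px = [p[0] + p[1], p[2] + p[3]]
    py = [p[0] + p[2], p[1] + p[3]]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            pij = p[2 * i + j]
            if pij > 0:
                mi += pij * math.log(pij / (px[i] * py[j]))
    return mi

def sb_estimate(counts, eta=0.01, num_samples=5000, seed=0):
    """-log P(MI > eta | D), estimated by sampling from the posterior
    Dirichlet(counts + 1) implied by the uniform prior."""
    rng = random.Random(seed)
    alphas = [c + 1 for c in counts]
    tail = sum(mi_2x2(sample_dirichlet(alphas, rng)) > eta
               for _ in range(num_samples)) / num_samples
    return -math.log(tail) if tail > 0 else float("inf")

# Counts suggesting dependence give little boost; counts consistent with
# independence push the posterior mass below eta, so the boost is large.
boost_dep = sb_estimate([450, 50, 50, 450])
boost_ind = sb_estimate([250, 250, 250, 250])
print(boost_dep < boost_ind)  # True
```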

6. References

[1] Hutter, M. and Zaffalon, M. Distribution of Mutual Information. Computational Statistics and Data Analysis, 48(3):633–657, March 2005.

[2] Brenner, E. and Sontag, D. SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), July 2013.

[3] Beinlich, I. A., Suermondt, H. J., Chavez, R. M., and Cooper, G. F. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pages 247–256. Springer-Verlag, 1989.

[4] Cussens, J. Bayesian Network Learning with Cutting Planes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 153–160, Barcelona, 2011. AUAI Press.

5. Results on Synthetic Data

(Figure: average runtime in seconds, and Structural Hamming Distance, each plotted against the number of samples, comparing BIC, BIC+SB1, and BIC+SB2.)

• Start with a known network structure (the 'Alarm' network [3]).
• Generate synthetic data (only binary data shown).
• Find the globally optimal structure with respect to the score.
• Each point shown is the average of ten independent experiments.
• Both accuracy and runtime are improved.
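The accuracy metric in the plot, Structural Hamming Distance, compares edge sets of the learned and true graphs. A simple illustrative variant (assumed here, counting a reversed edge as one error):

```python
def shd(true_edges, learned_edges):
    """Structural Hamming Distance between two DAGs given as sets of
    directed (u, v) edges: missing, extra, and reversed edges each
    count as one error (a reversal is counted once, not twice)."""
    sym = set(true_edges) ^ set(learned_edges)  # edges in exactly one graph
    dist = 0
    counted = set()
    for e in sym:
        if e in counted:
            continue
        rev = (e[1], e[0])
        if rev in sym:
            counted.add(rev)  # e and rev together form one reversal
        dist += 1
        counted.add(e)
    return dist

true_g = {("R", "S"), ("R", "G"), ("S", "G")}
learned = {("S", "R"), ("R", "G")}  # one reversed edge, one missing edge
print(shd(true_g, learned))  # 2
```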
