
  • The proposed score:

    S_{\mathrm{BIC+SB}}(G \mid D) = \log P(D \mid G) - \frac{\log n}{2}\,|G| + \sum_{E_{ij} \notin G} \max_{C \in \Omega_{ij}} \min_{c \in \mathrm{val}(C)} \mathrm{sb}(X_i, X_j \mid D, C = c)

    The first two terms are the Bayesian Information Criterion (BIC); the sum over absent edges is the SparsityBoost (SB) term.

    P(S|R):
    Rain | Sprinkler = T | Sprinkler = F
    F    | 0.2           | 0.8
    T    | 0.01          | 0.99

    •  A BN is a graph G that represents a probability distribution, with one node per variable and one conditional probability distribution (CPD) per node

    •  Structure learning as discrete optimization:

    A New Scoring Function for Bayesian Network Structure Learning Extended to Arbitrary Discrete Variables

    Rachel Hodos1, David Sontag2. 1: Computational Biology Program, NYU; 2: Computer Science Dept., NYU

    Learning Task: Given n observations of m variables, learn the Bayesian network structure that generated the data.

    1. Introduction

    Focus of this work: Devise a score that makes structure learning easier with more data, and that works for discrete variables with any number of states.

    Input: Data, D (n rows, one per observation; m columns, one per variable), e.g.:

    X1 X2 X3 X4 X5
     0  2  1  0  2
     1  1  0  0  0
     1  0  1  1  2

    Output: Graph, G

    [Figure: a DAG over nodes X1, ..., X5]

    2. Bayesian Networks (BNs) and Structure Learning

    Each node X_i carries a CPD P(X_i \mid \mathrm{Pa}(X_i)), and the graph defines the joint distribution

    P(X_1, X_2, \ldots, X_m) = \prod_{i=1}^{m} P(X_i \mid \mathrm{Pa}(X_i))

    A simple example:

    [Figure: Rain → Sprinkler; Rain and Sprinkler → Grass Wet]

    P(R):
    Rain = T | Rain = F
    0.3      | 0.7

    P(G|S,R):
    S | R | Grass Wet = T | Grass Wet = F
    T | T | 0.99          | 0.01
    T | F | 0.9           | 0.1
    F | T | 0.8           | 0.2
    F | F | 0             | 1

    •  But what score to use?

    •  Maximum likelihood estimation, S = P(D|G), gives the complete graph, which is useless!

    •  Hence we need to encourage sparsity

    G^* = \arg\max_G S(G \mid D), where S is a generic score function
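As an illustration of the factorization, a minimal sketch in Python using the example CPT values from the poster (variable and function names are illustrative):

```python
# Sprinkler example: the joint P(R, S, G) factorizes into the three CPDs.
P_R = {True: 0.3, False: 0.7}                      # P(Rain = r)
P_S = {(True, True): 0.01, (False, True): 0.99,    # P(Sprinkler = s | Rain = r), key (s, r)
       (True, False): 0.2, (False, False): 0.8}
P_G = {                                            # P(GrassWet = g | S = s, R = r), key (g, s, r)
    (True, True, True): 0.99, (False, True, True): 0.01,
    (True, True, False): 0.9, (False, True, False): 0.1,
    (True, False, True): 0.8, (False, False, True): 0.2,
    (True, False, False): 0.0, (False, False, False): 1.0,
}

def joint(r, s, g):
    """P(R=r, S=s, G=g) = P(R=r) * P(S=s | R=r) * P(G=g | S=s, R=r)."""
    return P_R[r] * P_S[(s, r)] * P_G[(g, s, r)]

# The eight joint probabilities sum to 1, as required of a distribution.
total = sum(joint(r, s, g)
            for r in (True, False) for s in (True, False) for g in (True, False))
```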

    3. A New Score: SparsityBoost [2]

    •  PROBLEM: Existing complexity penalties are data-agnostic, which makes the score harder to optimize as the amount of data grows

    •  IDEA: Add a data-dependent term that boosts sparsity.

    •  HOW: Search for evidence that an edge should not be present, and boost the score of any graph that does not contain that edge.

    Reading the score term by term:
    •  |G| is the number of parameters of G, a data-agnostic complexity penalty
    •  sb(X_i, X_j | D, C = c) is large given strong evidence of independence, small otherwise
    •  Ω_ij is a set of conditioning sets, e.g. all subsets of size ≤ 2 excluding X_i and X_j
    •  We want the strongest evidence for each edge, so we take the max over conditioning sets
    •  Independence should hold for all values of the conditioning set, hence the min over c ∈ val(C)

    [Figure: two example graphs over nodes 1–5]
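The pieces of the score can be combined as in this sketch; the function name, the toy numbers, and the `boosts` data layout are all illustrative, with the sb terms assumed precomputed (see section 4):

```python
# Sketch of how the terms of S_{BIC+SB} combine for a candidate graph G.
import math

def bic_sb_score(loglik, n, num_params, boosts):
    """loglik = log P(D|G); n = number of samples; num_params = |G|;
    boosts maps each absent edge (i, j) to {C: [sb values, one per c in val(C)]}."""
    score = loglik - (math.log(n) / 2.0) * num_params   # BIC part
    for cond_sets in boosts.values():
        # Strongest evidence per absent edge: max over conditioning sets C,
        # but independence must hold for every value c, hence the inner min.
        score += max(min(vals) for vals in cond_sets.values())
    return score

# One absent edge (0, 1) with two candidate conditioning sets in Omega_01:
# the empty set and {X2}, the latter evaluated at two values of X2.
toy_boosts = {(0, 1): {(): [1.2], (2,): [0.8, 3.5]}}
s = bic_sb_score(loglik=-150.0, n=100, num_params=7, boosts=toy_boosts)
# max(min(1.2), min(0.8, 3.5)) = 1.2 is added on top of the BIC score.
```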

    4. Bayesian Independence Testing

    sb(X_i, X_j | D, C = c) = "Conditioning on C = c, how strongly do the data show that X_i is independent of X_j?"

    •  Use conditional mutual information, MI

    •  Assuming a uniform prior over joint distributions, derive the posterior p(MI | D) [1]

    MI(P(X_i, X_j \mid c)) = \sum_{x_i, x_j} P(x_i, x_j \mid c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}
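As a sketch of the quantity under test, here is the plug-in estimate of MI; to condition on C = c, apply it to the rows of D where C = c. The poster reasons about the posterior over this quantity rather than the point estimate, and the names below are illustrative:

```python
# Empirical (plug-in) mutual information between two discrete sequences.
from collections import Counter
from math import log

def empirical_mi(xs, ys):
    """MI(X; Y) = sum_{x,y} P(x,y) log [ P(x,y) / (P(x) P(y)) ] with
    empirical probabilities estimated from paired samples xs, ys."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        # P(x,y)/(P(x)P(y)) = (c/n) / ((px[x]/n)(py[y]/n)) = c*n / (px[x]*py[y])
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi

xs = [0, 1] * 50
identical = empirical_mi(xs, xs)                     # MI = log 2, maximal dependence
independent = empirical_mi([0, 0, 1, 1], [0, 1, 0, 1])  # MI = 0, perfectly independent
```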

    [Figure: histograms of the distribution over I (mutual information), x-axis from 0 to 0.1. Top row: P(MI | D) for D drawn from an independent distribution; bottom row: P(MI | D) for D drawn from a dependent distribution. Panels: prior over I (n = 0) and posteriors with n = 400, 800, 1200; y-axis P(I) or P(I | D). As n increases, the posteriors concentrate.]

    sb(X_i, X_j \mid D, C = c) = -\log \int_{\eta}^{\infty} p\big(MI(P(X_i, X_j \mid C = c)) \mid D\big)\, dMI

    with threshold η.
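The paper derives the posterior p(MI | D) analytically [1]; as a rough sketch, one can instead sample from the Dirichlet posterior implied by the uniform prior and estimate the tail mass above η by Monte Carlo. All names, the value of η, and the sample count below are illustrative:

```python
# Monte Carlo sketch of the boost term: under a uniform (Dirichlet(1,...,1))
# prior on the joint P(Xi, Xj | C = c), the posterior given observed cell
# counts is Dirichlet(counts + 1). Sampling joints from it yields samples of
# MI, and sb = -log P(MI > eta | D) is the negative log posterior tail mass.
import math
import random

def mi_of_joint(p, k1, k2):
    """MI of a k1 x k2 joint distribution given as a flat list (row-major)."""
    row = [sum(p[i * k2 + j] for j in range(k2)) for i in range(k1)]
    col = [sum(p[i * k2 + j] for i in range(k1)) for j in range(k2)]
    mi = 0.0
    for i in range(k1):
        for j in range(k2):
            q = p[i * k2 + j]
            if q > 0:
                mi += q * math.log(q / (row[i] * col[j]))
    return mi

def sb_boost(counts, k1, k2, eta=0.01, num_samples=2000, seed=0):
    """Estimate sb = -log P(MI > eta | D) by posterior sampling."""
    rng = random.Random(seed)
    tail = 0
    for _ in range(num_samples):
        # Dirichlet(counts + 1) sample via normalized Gamma draws.
        g = [rng.gammavariate(c + 1.0, 1.0) for c in counts]
        z = sum(g)
        if mi_of_joint([x / z for x in g], k1, k2) > eta:
            tail += 1
    tail = max(tail, 1)  # avoid log(0); caps the boost at log(num_samples)
    return -math.log(tail / num_samples)

boost_dep = sb_boost([50, 0, 0, 50], 2, 2)    # clearly dependent 2x2 counts
boost_ind = sb_boost([25, 25, 25, 25], 2, 2)  # independence-consistent counts
# boost_ind > boost_dep: stronger evidence of independence earns a larger boost.
```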

    6. References

    [1] Hutter, M., Zaffalon, M. Distribution of Mutual Information. Computational Statistics and Data Analysis, 48(3):633–657, March 2005.

    [2] Brenner, E., Sontag, D. SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), July 2013.

    [3] Beinlich, I. A., Suermondt, H. J., Chavez, R. M., Cooper, G. F. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. Pages 247–256 of: Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine. Springer-Verlag, 1989.

    [4] Cussens, J. Bayesian Network Learning with Cutting Planes. In Fabio G. Cozman and Avi Pfeffer, editors, Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 153–160. AUAI Press, Barcelona, 2011.

    5. Results on Synthetic Data

    [Figure (Runtime): average runtime (sec) vs. number of samples for BIC, BIC+SB1, and BIC+SB2]

    [Figure (Error): Structural Hamming Distance vs. number of samples for BIC, BIC+SB1, and BIC+SB2]

    •  Start with a known network structure (the 'Alarm' network [3])
    •  Generate synthetic data (only binary data shown)
    •  Find the globally optimal structure with respect to each score [4]
    •  Each point shown is the average of ten independent experiments
    •  Both accuracy and runtime improved
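The error metric in the plots, Structural Hamming Distance, can be computed as in this sketch. This is one common variant (missing, extra, and reversed edges each count as one operation); the poster's exact convention may differ:

```python
def shd(true_edges, learned_edges):
    """Structural Hamming Distance between two DAGs given as sets of
    directed edges (u, v)."""
    t, l = set(true_edges), set(learned_edges)
    d, seen = 0, set()
    for (a, b) in t | l:
        pair = frozenset((a, b))
        if pair in seen:
            continue                      # each unordered pair counted once
        seen.add(pair)
        in_t = (a, b) in t or (b, a) in t
        in_l = (a, b) in l or (b, a) in l
        if in_t != in_l:
            d += 1                        # edge missing from one graph
        elif ((a, b) in t) != ((a, b) in l):
            d += 1                        # present in both but reversed
    return d

print(shd({(0, 1), (1, 2)}, {(0, 1), (2, 1)}))  # one reversed edge → 1
```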