
The proposed score:

$$S_{\mathrm{BIC+SB}}(G \mid D) \;=\; \underbrace{\log P(D \mid G) \;-\; \frac{\log n}{2}\,|G|}_{\text{Bayesian Information Criterion (BIC)}} \;+\; \underbrace{\sum_{E_{ij} \notin G}\; \max_{C \in \Omega_{ij}}\; \min_{c \in \mathrm{val}(C)} sb(X_i, X_j \mid D, C = c)}_{\text{SparsityBoost (SB)}}$$

P(S|R), the Sprinkler CPD:

| Rain | S = T | S = F |
|---|---|---|
| T | 0.01 | 0.99 |
| F | 0.2 | 0.8 |

• A BN is a graph G that represents a probability distribution, with one node per variable and one conditional probability distribution (CPD) per node.

• Structure learning is posed as discrete optimization over graphs.

A New Scoring Function for Bayesian Network Structure Learning Extended to Arbitrary Discrete Variables

Rachel Hodos¹, David Sontag² (¹Computational Biology Program, NYU; ²Computer Science Dept., NYU)

Learning Task: Given n observations of m variables, learn the Bayesian network structure which generated the data.

1. Introduction

Focus of this work: Devise a score that makes structure learning easier with more data, and that works for discrete variables with any number of states.

Input: Data, D

| X1 | X2 | X3 | X4 | X5 |
|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 2 |
| 1 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 2 |

Output: Graph, G (a DAG over X1, …, X5)

2. Bayesian Networks (BNs) and Structure Learning

Each node has a CPD $P(X_i \mid \mathrm{Pa}(X_i))$.

P(R), the Rain prior:

| Rain = T | Rain = F |
|---|---|
| 0.3 | 0.7 |

P(G|S,R), the Grass Wet CPD:

| S | R | G = T | G = F |
|---|---|---|---|
| T | T | 0.99 | 0.01 |
| T | F | 0.9 | 0.1 |
| F | T | 0.8 | 0.2 |
| F | F | 0 | 1 |

A simple example: the sprinkler network (nodes Rain, Sprinkler, Grass Wet).

• But what score to use?
• Maximum likelihood, S = P(D|G), is maximized by the complete graph, which is useless!
• Hence we need to encourage sparsity.

• Structure learning as discrete optimization, with a generic score function S:

$$G^* = \arg\max_G S(G \mid D)$$

• The joint distribution factorizes over the graph:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$$
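The factorization can be made concrete with the sprinkler CPDs from this section; a minimal sketch (the variable and function names are illustrative, not from the poster):

```python
# The sprinkler network: Rain -> Sprinkler, Rain -> Grass Wet,
# Sprinkler -> Grass Wet. CPD values are taken from the tables above.

P_R = {True: 0.3, False: 0.7}
P_S_given_R = {True: {True: 0.01, False: 0.99},   # P(S | R=T)
               False: {True: 0.2, False: 0.8}}    # P(S | R=F)
P_G_given_SR = {(True, True): {True: 0.99, False: 0.01},
                (True, False): {True: 0.9, False: 0.1},
                (False, True): {True: 0.8, False: 0.2},
                (False, False): {True: 0.0, False: 1.0}}

def joint(r, s, g):
    """P(R=r, S=s, G=g) = P(r) * P(s | r) * P(g | s, r)."""
    return P_R[r] * P_S_given_R[r][s] * P_G_given_SR[(s, r)][g]

# Sanity check: probabilities over all 8 assignments sum to 1.
total = sum(joint(r, s, g) for r in (True, False)
            for s in (True, False) for g in (True, False))
print(round(total, 10))  # 1.0
```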


3. A New Score: SparsityBoost

• PROBLEM: Existing complexity penalties are data agnostic, so the score becomes more difficult to optimize with more data.
• IDEA: Add a data-dependent term that boosts sparsity.
• HOW: Search for evidence that an edge should not be present, and boost the score of any graph that does not contain that edge.

Reading the score term by term:
• Data-agnostic complexity penalty: |G| = number of parameters of G.
• sb term: large for strong evidence of independence, small otherwise.
• Ω_ij: a set of conditioning sets, e.g. all subsets of size ≤ 2, excluding X_i and X_j.
• max over C: we want the strongest evidence for each edge.
• min over c: independence should hold for all values of the conditioning set.
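As a sketch of how the |G| penalty is computed for discrete BNs: a node with k states whose parents take q joint configurations contributes q(k - 1) free parameters. The helper below is illustrative (not the authors' code), with an assumed representation of the graph as dictionaries:

```python
import math

def num_parameters(states_of, parents_of):
    """|G|: each node contributes q * (k - 1) free parameters, where k is
    its number of states and q the number of parent configurations."""
    total = 0
    for node, k in states_of.items():
        q = 1
        for p in parents_of.get(node, []):
            q *= states_of[p]
        total += q * (k - 1)
    return total

# Sprinkler network, all nodes binary: R has 1 parameter, S has 2, G has 4.
states_of = {"R": 2, "S": 2, "G": 2}
parents_of = {"S": ["R"], "G": ["S", "R"]}
print(num_parameters(states_of, parents_of))  # 7

# The data-agnostic BIC penalty for an (illustrative) n = 1000 observations:
n = 1000
penalty = (math.log(n) / 2) * num_parameters(states_of, parents_of)
```

Note that the penalty grows with n, which is exactly the data-agnostic behavior the SparsityBoost term is designed to counteract.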


4. Bayesian Independence Testing

sb(X_i, X_j | D, C = c) = "Conditioning on C = c, how strongly do the data show that X_i is independent of X_j?"

• Use conditional mutual information (MI):

$$MI(P(X_i, X_j \mid c)) = \sum_{x_i, x_j} P(x_i, x_j \mid c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}$$

• Assuming a uniform prior over joint distributions, derive the posterior p(MI | D).
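The MI above can be estimated from counts; a minimal plug-in sketch, assuming the samples have already been restricted to C = c (function name and data format are illustrative):

```python
import math
from collections import Counter

def empirical_mi(pairs):
    """Plug-in MI estimate from a list of (xi, xj) samples:
    sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) p(y)) )."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly coupled binary data: MI = log 2 (in nats).
coupled = [(0, 0), (1, 1)] * 50
print(round(empirical_mi(coupled), 4))  # 0.6931
```

The poster's Bayesian test goes further: instead of this single point estimate, it uses the full posterior p(MI | D).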

(Figure: the prior p(MI) at n = 0 and posteriors p(MI | D) at n = 400, 800, and 1200, shown both for D drawn from an independent distribution and for D drawn from a dependent distribution. Each panel plots p(I | D) against I (mutual information); the posteriors sharpen as n increases.)

$$sb(X_i, X_j \mid D, C = c) = -\log \int_{\eta}^{\infty} p\big(MI(P(X_i, X_j \mid C = c)) \mid D\big)\, dMI$$

where η is a threshold on the mutual information.
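One way to approximate the integral above is Monte Carlo: under a uniform prior, the posterior over the joint P(X_i, X_j | c) given a count table is Dirichlet(counts + 1), so sampling joints and computing each sample's MI estimates the tail mass beyond η. This is a sketch only, for binary X_i, X_j; the cited work [1] characterizes the distribution of MI analytically rather than by sampling, and the counts and η below are illustrative:

```python
import math
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

def mi_2x2(p):
    """MI of a 2x2 joint p = [p00, p01, p10, p11] (in nats)."""
    px = [p[0] + p[1], p[2] + p[3]]
    py = [p[0] + p[2], p[1] + p[3]]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            pij = p[2 * i + j]
            if pij > 0:
                mi += pij * math.log(pij / (px[i] * py[j]))
    return mi

def sb_estimate(counts, eta=0.01, num_samples=5000, seed=0):
    """-log P(MI > eta | D), estimated by sampling from the posterior
    Dirichlet(counts + 1) implied by the uniform prior."""
    rng = random.Random(seed)
    alphas = [c + 1 for c in counts]
    tail = sum(mi_2x2(sample_dirichlet(alphas, rng)) > eta
               for _ in range(num_samples)) / num_samples
    return -math.log(tail) if tail > 0 else float("inf")

# Counts suggesting dependence give little boost; counts consistent with
# independence push the posterior mass below eta, so the boost is large.
boost_dep = sb_estimate([450, 50, 50, 450])
boost_ind = sb_estimate([250, 250, 250, 250])
print(boost_dep < boost_ind)  # True
```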

6. References

[1] Hutter, M. and Zaffalon, M. Distribution of Mutual Information. Computational Statistics and Data Analysis, 48(3):633–657, March 2005.

[2] Brenner, E. and Sontag, D. SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), July 2013.

[3] Beinlich, I. A., Suermondt, H. J., Chavez, R. M., and Cooper, G. F. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pages 247–256. Springer-Verlag, 1989.

[4] Cussens, J. Bayesian Network Learning with Cutting Planes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 153–160, Barcelona, 2011. AUAI Press.

5. Results on Synthetic Data

(Figure: average runtime in seconds, and Structural Hamming Distance, each plotted against the number of samples, comparing BIC, BIC+SB1, and BIC+SB2.)

• Start with a known network structure (the 'Alarm' network [3]).
• Generate synthetic data (only binary data shown).
• Find the globally optimal structure with respect to the score.
• Each point shown is the average of ten independent experiments.
• Both accuracy and runtime are improved.
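The accuracy metric in the plot, Structural Hamming Distance, compares edge sets of the learned and true graphs. A simple illustrative variant (assumed here, counting a reversed edge as one error):

```python
def shd(true_edges, learned_edges):
    """Structural Hamming Distance between two DAGs given as sets of
    directed (u, v) edges: missing, extra, and reversed edges each
    count as one error (a reversal is counted once, not twice)."""
    sym = set(true_edges) ^ set(learned_edges)  # edges in exactly one graph
    dist = 0
    counted = set()
    for e in sym:
        if e in counted:
            continue
        rev = (e[1], e[0])
        if rev in sym:
            counted.add(rev)  # e and rev together form one reversal
        dist += 1
        counted.add(e)
    return dist

true_g = {("R", "S"), ("R", "G"), ("S", "G")}
learned = {("S", "R"), ("R", "G")}  # one reversed edge, one missing edge
print(shd(true_g, learned))  # 2
```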
