Ch9: Exact Inference: Variable Elimination
Shimi Salant, Barak Sternberg
Part 1
Reminder, introduction (1/3)
We saw two ways to represent (finite discrete) distributions via graphical data
structures:
Bayesian Network (G, P)
G is a DAG
P factorizes over G means:
P(X_1, …, X_n) = Π_i P(X_i | Pa(X_i))
i.e. for each X_i ∈ X: an edge from each parent in Pa(X_i) to X_i
Markov Network (H, Φ)
H is an undirected graph
Φ factorizes over H means:
P_Φ(X_1, …, X_n) = (1/Z) ∗ φ_1(D_1) ∗ … ∗ φ_k(D_k)
where each D_i ⊆ X is a clique in H
Example (the student network):
P(D, I, G, S, L) = P(D) ∗ P(I) ∗ P(G | D, I) ∗ P(S | I) ∗ P(L | G)
P_Φ(D, I, G, S, L) = (1/Z) ∗ φ_1(D, I, G) ∗ φ_2(I, S) ∗ φ_3(G, L)
Reminder, introduction (2/3)
We reviewed a concept of separation in graphs,
and saw we can query the graph itself in order to get independence assertions,
all of which apply in the represented distribution.
Bayesian Network (G, P):
I(G) = { (X ⊥ Y | Z) : d-sep_G(X ; Y | Z) } ⊆ I(P)
Markov Network (H, Φ):
I(H) = { (X ⊥ Y | Z) : sep_H(X ; Y | Z) } ⊆ I(P_Φ)
* we also mentioned that these are solid representations in the sense that there are
"infinitely many more" P's for which I(G) = I(P) than those for which I(G) ⊊ I(P).
Reminder, introduction (3/3)
We will now use these graphical data structures in order to:
answer probabilistic queries such as P(Y | E = e) = ?
show that properties of the graphs determine upper bounds for the
computational cost of answering a query, i.e. these properties provide a way
to gauge/reduce costs.
show algorithms that take these properties into consideration.
Definition of the inference task (1/2)
Some general context: types of tasks we may wish to carry out with PGMs:
inference given graph and factors/CPDs,
find probabilities, e.g. P(Y | E = e) = ?
learning given graph and data,
find factors/CPDs (namely: their parameters)
structure learning given variables and data,
find graph structure and factors/CPDs
* extra characteristic of learning tasks:
data might be partially observed, i.e. we're not given values for all variables.
Definition of the inference task (2/2)
The exact inference task is defined as:
Given a fully parameterized BN or MN over variables X,
Y ⊆ X,
E ⊆ X, E ∩ Y = ∅ and e ∈ Val(E):
Compute P(Y | E = e)
Y are the query variables,
E are the evidence/observed variables, e is the evidence itself.
E can be empty, in which case we're after P(Y).
* note that we're after a distribution,
size of answer is |Val(Y)| which - to begin with - is exponential in |Y|.
Hardness of inference task (1/12)
We first consider only BNs: P_B(Y | E = e) = ?
For any specific valuation ξ ∈ Val(X), we can compute the joint P_B(X = ξ):
[slide tables: the CPDs P(D), P(I), P(G | D, I), P(S | I), P(L | G) of the student network]
P(D, I, G, S, L) = P(D) ∗ P(I) ∗ P(G | D, I) ∗ P(S | I) ∗ P(L | G)
Hardness of inference task
(the slides repeat the CPD tables while accumulating the product one entry at a
time: P(d, i, g, s, l) = P(d) ∗ P(i) ∗ P(g | d, i) ∗ P(s | i) ∗ P(l | g), a
product of five table look-ups, ending with the probability of that single
full assignment.)
Hardness of inference task (2/12)
Task: compute P(Y | E = e).
Naive solution: sum the joint.
Denote W = X − Y − E. For each y ∈ Val(Y) compute:
Σ_{w ∈ Val(W)} P(y, w, e) = P(y, e).
Then, compute
Σ_{y ∈ Val(Y)} P(y, e) = P(e).
Then, for each y ∈ Val(Y) compute
P(y, e) / P(e) = P(y | e).
This would entail |Val(W)| = O(|Val(X)|) work for each y ∈ Val(Y).
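The naive approach above can be sketched directly: hold the joint as a table over full assignments and marginalize by brute-force summation. The toy network and numbers below are illustrative, not the slides' tables.

```python
def naive_query(joint, var_names, query_vars, evidence):
    """P(query_vars | evidence) by summing the full joint: O(|Val(X)|) work."""
    # Sum out W = X - Y - E: accumulate P(y, e) for every valuation y.
    p_y_e = {}
    for assignment, p in joint.items():
        point = dict(zip(var_names, assignment))
        if any(point[v] != val for v, val in evidence.items()):
            continue  # inconsistent with the evidence e
        y = tuple(point[v] for v in query_vars)
        p_y_e[y] = p_y_e.get(y, 0.0) + p
    p_e = sum(p_y_e.values())                      # P(e) = sum_y P(y, e)
    return {y: p / p_e for y, p in p_y_e.items()}  # P(y | e) = P(y, e) / P(e)

# Toy joint over two binary variables A, B (chosen to sum to 1):
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
print(naive_query(joint, ["A", "B"], ["A"], {"B": 1}))  # {(0,): 1/3, (1,): 2/3}
```

The point of the hardness discussion that follows is that the `joint` table itself is exponential in the number of variables, so this is only feasible for tiny networks.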
Hardness of inference task (3/12)
Task: compute P(Y | E = e).
Any solution: we show it's NP-hard.
Reminder:
3-SAT is in NP:
i.e. there is a polynomial-time verifier algorithm that for a given problem
instance φ and an assignment σ checks if σ proves that φ is
satisfiable.
3-SAT is NP-hard:
i.e. every other problem in NP can be reduced to 3-SAT.
The 3-SAT decision problem:
Given a formula over binary variables q_1, …, q_n: φ = C_1 ∧ … ∧ C_m
where every C_i = ℓ_{i,1} ∨ ℓ_{i,2} ∨ ℓ_{i,3}, each ℓ_{i,j} = ±q_k, is a disjunction of 3 literals
[example: φ = (q_1 ∨ ¬q_2 ∨ ¬q_3) ∧ (¬q_1 ∨ q_2 ∨ q_3)]
- decide whether there is a satisfying assignment to φ.
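The polynomial-time verifier mentioned above can be sketched in a few lines; the clause encoding (nonzero ints, `k` for q_k and `-k` for ¬q_k) is an assumption of this sketch, not the slides' notation.

```python
# A formula is a list of clauses, each clause a list of 3 nonzero ints.
def verify(formula, sigma):
    """Polynomial-time check: does assignment sigma (var index -> bool) satisfy formula?"""
    return all(
        any(sigma[abs(lit)] == (lit > 0) for lit in clause)  # some literal true
        for clause in formula                                 # in every clause
    )

# phi = (q1 v -q2 v -q3) ^ (-q1 v q2 v q3), an illustrative instance:
phi = [[1, -2, -3], [-1, 2, 3]]
print(verify(phi, {1: True, 2: True, 3: False}))   # True: satisfying assignment
print(verify(phi, {1: True, 2: False, 3: False}))  # False: second clause unmet
```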
Hardness of inference task (4/12)
Consider the following tasks, each a special case of the one before it:
(1) P_B(Y | E = e) = ? (E can be empty)
(2) P_B(X = x | E = e) = ? (E can be empty, X is a single variable)
(3) P_B(X = x) = ?
(4) P_B(X = x) > 0 ? (a decision problem)
If (4) is NP-hard, the original inference task (1) certainly is.
The BN-Positive problem:
Given a BN (G, P), a variable X ∈ X and a value x ∈ Val(X),
decide whether P_B(X = x) > 0.
Hardness of inference task (5/12)
BN-Positive is NP-hard:
We next show a reduction from 3-SAT to BN-Positive.
(also: BN-Positive is in NP:
given (a candidate proof) values ξ for all variables in X where
X = x in ξ, we can compute P_B(X = ξ) and check for positivity.
i.e. BN-Positive is NP-Complete.)
The BN-Positive problem:
Given a BN (G, P), a variable X ∈ X and a value x ∈ Val(X),
decide whether P_B(X = x) > 0.
Hardness of inference task (6/12)
Given a 3-SAT formula φ = C_1 ∧ … ∧ C_m over binary variables {q_1, …, q_n},
we construct a BN: (in polynomial time)
for each formula variable q_i: a binary variable Q_i with
P(Q_i = 1) = 0.5.
for each formula clause C_i: a deterministic binary variable C_i with
P(C_i = 1 | its literals' variables) = 1 iff C_i is satisfied by the values of those variables,
a layer of deterministic binary AND variables {A_1, …, A_{m−2}, X} s.t.
P(X = 1) = 1 iff all C_i are valued 1.
Hardness of inference task (7/12)
φ is not satisfiable
⇒ no assignment to {Q_i} can yield X = 1, hence P(X = 1) = 0.
φ is satisfiable
⇒ an assignment to {Q_i} can yield X = 1, hence P(X = 1) > 0.
i.e. if we can answer whether P_B(X = x) > 0, we can solve 3-SAT.
⇒ BN-Positive, problem (4), is NP-hard.
⇒ P(Y | E = e) = ?, our initial inference problem (1), is NP-hard for BNs.
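The reduction's correctness can be checked on tiny instances by enumeration: each assignment to the Q_i has probability (1/2)^n, the deterministic clause and AND layers contribute factors of 1, so P(X = 1) is exactly the fraction of satisfying assignments. This is a brute-force sketch for illustration (exponential, so only for tiny n), with hypothetical formula instances:

```python
from itertools import product

def p_x_equals_1(formula, n):
    """P(X = 1) in the reduction's BN = (#satisfying assignments) / 2^n."""
    satisfied = lambda sigma: all(
        any(sigma[abs(l)] == (l > 0) for l in clause) for clause in formula
    )
    count = sum(
        satisfied(dict(zip(range(1, n + 1), bits)))
        for bits in product([False, True], repeat=n)
    )
    return count / 2 ** n  # deterministic C_i / AND layer contribute factor 1

phi_sat   = [[1, -2, -3], [-1, 2, 3]]   # satisfiable (illustrative)
phi_unsat = [[1, 1, 1], [-1, -1, -1]]   # q1 AND not-q1: unsatisfiable
print(p_x_equals_1(phi_sat, 3) > 0)     # True  -> phi_sat is satisfiable
print(p_x_equals_1(phi_unsat, 1) > 0)   # False -> phi_unsat is not
```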
Hardness of inference task (8/12)
For MNs: we can easily translate a BN to an MN (hence inference in MNs is NP-hard too):
Given a BN (G, P) where P(X_1, …, X_n) = Π_i P(X_i | Pa(X_i)),
we can construct an MN (H, Φ) modeling the same distribution by:
using the CPDs as factors: φ_i(X_i, Pa(X_i)) = P(X_i | Pa(X_i)), with Z = 1.
translating directed edges to undirected ones (we also need to add edges between all
parents of a node; the cost of the addition is polynomial).
P(D, I, G, S, L) = P(D) ∗ P(I) ∗ P(G | D, I) ∗ P(S | I) ∗ P(L | G)
P_Φ(D, I, G, S, L) = (1/Z) ∗ φ_D(D) ∗ φ_I(I) ∗ φ_G(G, D, I) ∗ φ_S(S, I) ∗ φ_L(L, G)
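The graph-level part of this translation (drop directions, connect all co-parents, i.e. "moralization") can be sketched as:

```python
def moralize(parents):
    """parents: dict node -> list of parents. Returns the MN's undirected edges."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((child, p)))  # undirect each BN edge
        for i, p in enumerate(pa):            # "marry" every pair of parents
            for q in pa[i + 1:]:
                edges.add(frozenset((p, q)))
    return edges

# The student network: G has parents D, I; S has parent I; L has parent G.
bn = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}
mn_edges = moralize(bn)
print(frozenset(("D", "I")) in mn_edges)  # True: D, I married as co-parents of G
```

The added D−I edge is exactly why φ_G above needs scope (G, D, I): a factor's scope must be a clique in H.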
Hardness of inference task (9/12)
* extra note:
BN-Positive is a decision problem, i.e. of the form "are there any solutions"
(in the BN above: P(X = 1) > 0? i.e. is there an event in the event space - all
possible valuations to all variables - in which X = 1 and the event has a positive
probability).
The general inference task is of the form "how many solutions are there"
(in the BN above: knowing how many such positive-probability X = 1 events there are, we
would divide this amount by the size of the event space to obtain the
probability P(X = 1)).
It turns out the inference task belongs to an even harder class of problems,
#P-hard, which consists of such counting problems.
Hardness of inference task (10/12)
Approximate inference: obtain an approximation ρ(Y | e) for P(Y | e).
We consider measuring the accuracy of ρ(y | e), for a single valuation y, rather than that of ρ(Y | e).
def: ρ(y | e) has absolute error ε if
|P(y | e) − ρ(y | e)| ≤ ε.
might be too coarse of a criterion, e.g. ε = 10^−3 for P(y | e) ≈ 10^−8.
def: ρ(y | e) has relative error ε if
ρ(y | e)(1 + ε)^−1 ≤ P(y | e) ≤ ρ(y | e)(1 + ε).
e.g. for ε = 4: the true answer P(y | e) is between .2 ρ(y | e) and 5 ρ(y | e).
Hardness of inference task (11/12)
If we can get such a relative error approximation ρ, then ρ(y | e) > 0 ⇔ P(y | e) > 0,
so we can solve BN-Positive, which is NP-hard ⇒ relative error inference is NP-hard.
Hardness of inference task (12/12)
For absolute error inference, the situation is more nuanced:
For the case where no evidence is present, i.e. |P(y) − ρ(y)| ≤ ε,
there exists a randomized polynomial-time algorithm.
For the case where evidence is present, i.e. |P(y | e) − ρ(y | e)| ≤ ε,
with ε ∈ (0, 0.5) - absolute error inference is NP-hard.
Namely - any useful absolute error inference is NP-hard when evidence is
present.
End of part 1
VE algorithm
is O((n + m) ∗ N_max), where n is the number of variables, m is the number of initial
factors in factor set Φ and N_max is the size (number of entries) of the largest
intermediate factor ψ created throughout the run.
def 9.5: the induced graph
for factor set Φ (over X) and an ordering ≺ (over eliminated variables Z),
we define the induced graph I_{Φ,≺} as an undirected graph over X
with edges X_i − X_j where X_i and X_j appear together in the scope of some
intermediate factor ψ when Sum-Product-VE(Φ, Z, ≺) is executed.
theorem 9.6: For Φ, ≺ and their induced graph I_{Φ,≺}:
The scope of every ψ is a clique in I_{Φ,≺}.
Every maximal clique in I_{Φ,≺} is the scope of a certain ψ.
1st statement means: the size of the largest clique in I_{Φ,≺} bounds N_max.
2nd statement means: the bound is tight, it is encountered.
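By theorem 9.6, the induced graph and the largest intermediate scope can be computed without running VE at all, by simulating elimination on the graph alone: eliminating Z connects all of Z's current neighbors (the scope of ψ_Z). A minimal sketch, with an illustrative 4-cycle graph:

```python
def induced_graph(edges, order):
    """Simulate elimination on the graph; return I_{Phi,<} and the max scope size."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    induced = {frozenset(e) for e in edges}
    max_scope = 1
    for z in order:
        nbrs = adj.get(z, set())
        max_scope = max(max_scope, len(nbrs) + 1)  # scope of psi_Z = {Z} + nbrs
        for u in nbrs:                              # fill-edges: connect neighbors
            for v in nbrs:
                if u != v:
                    induced.add(frozenset((u, v)))
                    adj[u].add(v)
        for u in nbrs:
            adj[u].discard(z)                       # remove Z from the graph
        adj.pop(z, None)
    return induced, max_scope

# 4-cycle A-B-C-D: eliminating A first adds the fill-edge B-D.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
ig, w = induced_graph(edges, ["A", "B", "C", "D"])
print(frozenset(("B", "D")) in ig, w)  # True 3
```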
Part 3
Finding elimination orderings (1/12)
We reached the problem of finding an optimal ordering for VE.
Considering a network (K, Φ), we saw each ordering ≺ yields an
induced graph I_{Φ,≺} whose largest clique size is an exponent in the run-time bound.
We now consider the ordering problem in purely graphical terms.
We can observe the induced graph I_{K,≺} of the network's graph K (instead of its
factor set Φ), since the induced graph depends only on how variables appear
together in factors, i.e. depends only on the structure of the network's graph.
Finding elimination orderings
(example: the slides run Sum-Product-VE on a small network with four pairwise
factors, eliminating one variable per slide. Each step shows the intermediate
factor ψ_i built from the current factors mentioning the variable, its size,
and the edge(s) it contributes to the induced graph: the first elimination adds
an edge that was already present in the original graph, a later elimination
creates a size-3 intermediate factor and adds a genuine fill-edge that was not
present in the original graph, and the run ends with the desired marginal φ∗.)
Finding elimination orderings (2/12)
def: the induced-width w(K, ≺) of (K, ≺) is the size of the largest clique in I_{K,≺} minus 1.
(the -1 is for having zero width for an I_{K,≺} which is a tree)
def: the tree-width of K is its minimal induced-width w∗(K) = min_≺ w(K, ≺)
(i.e. the size of the smallest largest-clique we can hope for in an induced
graph of K, minus 1)
The ordering problem is now:
find ≺∗ = argmin_≺ w(K, ≺)
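For tiny graphs the ordering problem can be solved by brute force over all orderings; this is a sketch for intuition only (exponential in n, consistent with the NP-hardness shown next):

```python
from itertools import permutations

def width_of(edges, order):
    """Induced width w(K, <) of one elimination ordering: largest clique - 1."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    w = 0
    for z in order:
        nbrs = adj.pop(z, set())
        w = max(w, len(nbrs))            # clique {z} + nbrs has size len(nbrs)+1
        for u in nbrs:
            adj[u] |= nbrs - {u}         # connect all of z's neighbors
            adj[u].discard(z)
    return w

def best_ordering(edges, variables):
    """argmin over all orderings - brute force, illustration only."""
    return min(permutations(variables), key=lambda o: width_of(edges, o))

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]  # 4-cycle: tree-width 2
print(width_of(edges, best_ordering(edges, "ABCD")))       # 2
```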
Finding elimination orderings (3/12)
(1) ≺∗ = argmin_≺ w(K, ≺) = ? (find optimal ordering)
(2) w∗ = min_≺ w(K, ≺) = ? (find min. induced width, i.e. tree-width)
(3) w∗ ≤ k, k ∈ ℕ ? (decide whether tree-width ≤ k)
A theorem from graph theory: (3) is NP-complete. ⇒ (2) is NP-hard ⇒ (1) is NP-hard
and we have now shown that finding an optimal ordering is NP-hard.
We will now translate the ordering problem to a different graphical problem,
thereby yielding an ordering algorithm for a certain type of graphs.
Finding elimination orderings (4/12)
def: A graph (directed or undirected)
is chordal if every loop of length
greater than 3 has a chord.
Such a graph is
"triangulated": all
polygons are divided into
triangles.
Finding elimination orderings (5/12)
We next show that the class of induced graphs is the class of chordal graphs.
* remark: this is also the class of graphs for which there are perfect
I-maps, i.e. for a graph H in this class, I(H) = I(P) for a distribution P that
factorizes over the graph.
Finding elimination orderings (6/12)
I_{Φ,≺} is induced ⇒ I_{Φ,≺} is chordal:
Assume I_{Φ,≺} is not chordal: it has a chordless loop X_1 − X_2 − ⋯ − X_k − X_1, k ≥ 4.
Assume WLOG X_1 was eliminated first.
After line 3 of Sum-Product-Eliminate-Var(Φ, X_1),
no more edges X_1 − X_j are added to I_{Φ,≺}.
Since I_{Φ,≺} has the edges X_1 − X_2, X_1 − X_k: when line 3 runs:
X_2 appears in some factor with X_1,
X_k appears in some factor with X_1
⇒ X_2, X_k will appear in the ψ of line 3 ⇒ I_{Φ,≺} has the edge X_2 − X_k, a chord.
Finding elimination orderings (7/12)
K is chordal ⇒ K is induced:
def: For an undirected graph K over X, a tree T is a clique tree for K if:
every node C ⊆ X in the tree is a clique in the graph.
every maximal clique in the graph is a node in the tree.
for each edge C_1 − C_2 in T: C_1 ∩ C_2 separates the nodes in K: those
appearing on C_1's side of the tree from those appearing on C_2's side.
Finding elimination orderings (8/12)
Finding elimination orderings (9/12)
Not every graph has a clique tree, but:
Every chordal graph has a clique tree.
A property of clique trees:
Every leaf clique C in the tree contains a variable present only in C.
Finding elimination orderings (10/12)
K is chordal ⇒ K is induced:
We show that if K is chordal then K = I_{K,≺} for some ordering ≺.
Claim: all n variables of a chordal graph K over n variables can be eliminated
without introducing fill-edges. Induction on n:
K is chordal ⇒ it has a clique tree T.
T is a clique tree ⇒ it has a leaf node C and in that leaf node a variable
X ∈ C present only in C.
X can be eliminated from K without introducing fill-edges:
X present only in C ⇒ all of X's neighbors are in C.
C is a clique in K ⇒ all neighbors are already connected to each other.
Removing X from K, we are left with a chordal graph of n − 1 variables.
Finding elimination orderings
(example: the slides show a chordal graph over eight variables, its clique
tree, and an ordering ≺ obtained by repeatedly picking a variable present only
in a leaf clique; eliminating along ≺ adds no fill-edges, so the induced graph
I_{K,≺} coincides with the graph K itself.)
Finding elimination orderings (11/12)
For a chordal graph, we have an algorithm that does not produce fill-edges:
Max-Cardinality produces an ordering consistent with always choosing a variable
that is present only in a leaf clique of the (current) graph's clique-tree.
Meaning: Max-Cardinality solves the ordering problem for chordal graphs.
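A minimal sketch of Max-Cardinality search (simplified from the book's pseudocode; tie-breaking here is arbitrary): repeatedly mark the unmarked node with the most already-marked neighbors, then eliminate in reverse marking order. On a chordal graph this ordering introduces no fill-edges.

```python
def max_cardinality_ordering(adj):
    """adj: dict node -> set of neighbors. Returns an elimination ordering."""
    marked, order = set(), []
    while len(order) < len(adj):
        # Pick the unmarked node with the most marked neighbors (ties: first).
        x = max((v for v in adj if v not in marked),
                key=lambda v: len(adj[v] & marked))
        marked.add(x)
        order.append(x)
    return list(reversed(order))  # eliminate in reverse marking order

# Chordal example (illustrative): a triangle A-B-C plus a pendant node D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(max_cardinality_ordering(adj))
```

Eliminating in the returned order, each variable's remaining neighbors already form a clique, so no fill-edges are added; this is exactly the "leaf clique first" behavior described above.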
Finding elimination orderings (12/12)
The general ordering problem therefore reduces to a well-known problem from
graph-theory:
Minimal-Triangulation:
Given a graph K, find a chordal graph K' containing K,
such that K''s largest clique is as small as possible.
There are different graph-theoretic algorithms addressing
Minimal-Triangulation, offering different levels of performance guarantees.
Greedy search for elimination ordering (1/3)
We saw that the inference problem | = � = ? is NP-hard.
Moreover, even finding an optimal ordering for variable elimination
for running Sum-Product-VE(Φ, , ≺) is NP-hard.
A practical approach is therefore to search the ordering space for a sub-optimal
yet as-good-as-possible ordering.
We can search this space greedily, each time choosing a variable whose
elimination will result in least cost (for some definition of cost) given current
state of induced graph. We do not have formal complexity guarantees for this
approach, yet it works surprisingly well in practice.
Greedy search for elimination ordering (2/3)
* note: the graph G referred to below is the current
induced graph.
Possible evaluation metrics (for a candidate variable X in the current graph G):
min-neigh(X, G) = number of neighbors of X
min-neigh-weight(X, G) = product of cardinalities of the neighbors of X
min-fill(X, G) = number of added fill-edges if X is eliminated
min-fill-weight(X, G) = sum of weights of the fill-edges, where the
weight of a fill-edge is the product of the
cardinalities of the edge's nodes
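The greedy search with the min-fill metric can be sketched as follows (deterministic variant; the stochastic variant discussed next would sample the candidate set):

```python
def fill_count(adj, x):
    """Number of fill-edges eliminating x would add: non-adjacent neighbor pairs."""
    nbrs = list(adj[x])
    return sum(1 for i, u in enumerate(nbrs) for v in nbrs[i + 1:]
               if v not in adj[u])

def greedy_min_fill(adj):
    """Greedy ordering: repeatedly eliminate the variable of minimal fill cost."""
    adj = {v: set(n) for v, n in adj.items()}  # work on a copy
    order = []
    while adj:
        x = min(adj, key=lambda v: fill_count(adj, v))
        order.append(x)
        for u in adj[x]:                        # add the fill-edges
            adj[u] |= adj[x] - {u}
            adj[u].discard(x)
        del adj[x]
    return order

# Illustrative 4-cycle A-B-C-D: every first choice costs exactly one fill-edge.
adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
print(greedy_min_fill(adj))
```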
Greedy search for elimination ordering (3/3)
None of these metrics is universally better than the others.
A variant of the greedy approach is to introduce stochasticity into it, by not
always deterministically selecting the X that minimizes cost(X, G).
For example: upon each elimination round, select argmin_X cost(X, G) out of a
random half of the remaining variables. This serves to introduce exploration into
the ordering search (rather than have it be fully exploitative).
Note that the greedy algorithm runs in polynomial time and it can compute the
number of operations the VE algorithm itself will execute. A suggested practice
for large networks where such pre-computation time is negligible is to execute
the greedy algorithm multiple times and use the best ordering obtained.
*extra material
Conditioning (1/6)
Consider MNs: P_Φ(X) = (1/Z_Φ) ∗ φ_1(D_1) ∗ … ∗ φ_k(D_k),
i.e. Z_Φ ∗ P_Φ(X) = φ_1(D_1) ∗ … ∗ φ_k(D_k); denote this unnormalized measure by P̃_Φ(X).
Sum-Product-VE calculated P̃_Φ(Y):
Sum-Product-VE(Φ, Z, ≺)
(Z = X − Y) returned φ∗,
such that: φ∗(y) = Σ_{z ∈ Val(Z)} Π_{φ ∈ Φ} φ(y, z) = Σ_{z ∈ Val(Z)} φ_1(y, z) ∗ … ∗ φ_k(y, z) = Σ_{z ∈ Val(Z)} P̃_Φ(y, z) = P̃_Φ(y)
... and if P_Φ(Y) is needed - renormalize.
Conditioning (2/6)
We used the ability to perform a sum-product calculation for getting
conditional probabilities:
Cond-Prob-VE calculated P̃_Φ(Y, e) and P̃_Φ(e):
It returned φ∗ s.t. (now Z = X − Y − E)
φ∗(y) = Σ_{z ∈ Val(Z)} Π_{φ ∈ Φ[E = e]} φ(y, z) = Σ_z φ_1[E = e](y, z) ∗ … ∗ φ_k[E = e](y, z) = Σ_z φ_1(y, z, e) ∗ … ∗ φ_k(y, z, e) = Σ_z P̃_Φ(y, z, e) = P̃_Φ(y, e)
and a scalar: α = Σ_{y ∈ Val(Y)} P̃_Φ(y, e) = P̃_Φ(e)
... and if P_Φ(Y | e) is needed: P̃_Φ(Y, e) / P̃_Φ(e) = P_Φ(Y | e)
Conditioning (3/6)
Calculating conditional probabilities P(∙ | U = u) can reduce the sizes of
encountered intermediate factors ψ - since evidence variables can render other
variables independent.
Conditioning (4/6)
Example: same network as before, with evidence I = i (using Cond-Prob-VE).
(the slides first reduce every factor mentioning the evidence variable,
Φ ← Φ[I = i], and then step through the elimination. Because I is removed
from every scope, the intermediate factors are smaller than in the
unconditioned run - one ψ even reduces to a scalar dependent only on the
evidence value i - and the run ends with φ∗ = P̃_Φ(y, i).)
Conditioning (5/6)
So, computing P̃_Φ(Y, u) can introduce smaller intermediate factors compared
to the computation of P̃_Φ(Y).
We can do the following (where we build upon Cond-Prob-VE - our ability to
calculate P̃_Φ(Y, u) and P̃_Φ(u)):
For some U ⊆ X, for every u ∈ Val(U), compute
P̃_Φ(Y, u) and P̃_Φ(u)
then compute
Σ_{u ∈ Val(U)} P̃_Φ(Y, u) = Z_Φ ∗ P_Φ(Y) and Σ_{u ∈ Val(U)} P̃_Φ(u) = Z_Φ
and have:
Z_Φ ∗ P_Φ(Y) / Z_Φ = P_Φ(Y)
Computing P_Φ(Y) as such is called conditioning.
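The scheme above can be sketched with factors as plain Python functions over an assignment dict; the restricted sums per u are done by brute force here (a real implementation would run VE per u), and the scheme is simplified to a single binary query variable. The factor tables are illustrative.

```python
from itertools import product

def unnorm(factors, point):
    """tilde-P(point): product of all factors at a full assignment."""
    p = 1.0
    for phi in factors:
        p *= phi(point)
    return p

def conditioning(factors, var_names, query, cond_vars, vals=(0, 1)):
    """P_Phi(query) via one restricted sum per u in Val(U), then renormalize."""
    rest = [v for v in var_names if v != query and v not in cond_vars]
    p_y, z_phi = {v: 0.0 for v in vals}, 0.0
    for u in product(vals, repeat=len(cond_vars)):  # one "VE run" per u
        for y in vals:
            for w in product(vals, repeat=len(rest)):
                point = dict(zip(cond_vars, u))
                point[query] = y
                point.update(zip(rest, w))
                p = unnorm(factors, point)  # tilde-P(y, w, u)
                p_y[y] += p                 # accumulates Z * P(y)
                z_phi += p                  # accumulates Z_Phi
    return {y: p_y[y] / z_phi for y in vals}

# Two pairwise factors over a chain A-B-C (numbers are illustrative):
phi1 = lambda pt: [[2.0, 1.0], [1.0, 2.0]][pt["A"]][pt["B"]]
phi2 = lambda pt: [[3.0, 1.0], [1.0, 2.0]][pt["B"]][pt["C"]]
print(conditioning([phi1, phi2], ["A", "B", "C"], "A", ["B"]))  # {0: 11/21, 1: 10/21}
```

Conditioning on U = {B} here splits the chain, so each restricted sum only ever touches factors with B fixed, mirroring the smaller-intermediate-factor effect described above.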
Conditioning (6/6)
Conditioning does no less work than the ordinary VE we've seen so far; it
generally does more, by running VE many times. However, the maximal
intermediate factor ever encountered over the VE runs can be smaller, which is
a necessity if we cannot hold very large intermediate factors in memory.