Ch9: Exact Inference: Variable Elimination
Shimi Salant, Barak Sternberg
Part 1
Reminder, introduction (1/3)
We saw two ways to represent (finite discrete) distributions via graphical data
structures:
Bayesian Network (G, P)
G is a DAG
P factorizes over G means:
P(X_1, …, X_n) = Π_i P(X_i | Pa(X_i))
i.e. for each X_i ∈ X: an edge from each parent in Pa(X_i) to X_i
Markov Network (H, Φ)
H is an undirected graph
Φ factorizes over H means:
P_Φ(X_1, …, X_n) = (1/Z) ∗ φ_1(D_1) ∗ … ∗ φ_k(D_k)
where each D_i ⊆ X is a clique in H
Example (the student network):
P(D, I, G, S, L) = P(D) ∗ P(I) ∗ P(G | D, I) ∗ P(S | I) ∗ P(L | G)
P_Φ(D, I, G, S, L) = (1/Z) ∗ φ_1(D, I, G) ∗ φ_2(I, S) ∗ φ_3(G, L)
Reminder, introduction (2/3)
We reviewed a concept of separation in graphs,
and saw we can query the graph itself in order to get independence assertions,
all of which apply in the represented distribution.
Bayesian Network (G, P):
I(G) = { (X ⊥ Y | Z) : d-sep_G(X ; Y | Z) } ⊆ I(P)
Markov Network (H, Φ):
I(H) = { (X ⊥ Y | Z) : sep_H(X ; Y | Z) } ⊆ I(P_Φ)
* we also mentioned that these are solid representations in the sense that there are
"infinitely many more" P's for which I(G) = I(P) than those for which I(G) ⊊ I(P).
Reminder, introduction (3/3)
We will now use these graphical data structures in order to:
answer probabilistic queries such as P(Y | E = e) = ?
show that properties of the graphs determine upper bounds for the
computational cost of answering a query, i.e. these properties provide a way
to gauge/reduce costs.
show algorithms that take these properties into consideration.
Definition of the inference task (1/2)
Some general context: types of tasks we may wish to carry out with PGMs:
inference given graph and factors/CPDs,
find probabilities, e.g. P(Y | E = e) = ?
learning given graph and data,
find factors/CPDs (namely: their parameters)
structure learning given variables and data,
find graph structure and factors/CPDs
* extra characteristic of learning tasks:
data might be partially observed, i.e. we're not given values for all variables.
Definition of the inference task (2/2)
The exact inference task is defined as:
Given a fully parameterized BN or MN over variables X,
Y ⊆ X,
E ⊆ X, E ∩ Y = ∅ and e ∈ Val(E):
Compute P(Y | E = e)
Y are the query variables,
E are the evidence/observed variables, e is the evidence itself.
E can be empty, in which case we're after P(Y).
* note that we're after a distribution,
size of answer is |Val(Y)| which - to begin with - is exponential in |Y|.
Hardness of inference task (1/12)
We first consider only BNs: P_B(Y | E = e) = ?
For any specific valuation ξ ∈ Val(X), we can compute the joint P_B(X = ξ):
[slide tables: the CPDs P(D), P(I), P(G | D, I), P(S | I), P(L | G) of the student network]
P(D, I, G, S, L) = P(D) ∗ P(I) ∗ P(G | D, I) ∗ P(S | I) ∗ P(L | G)
Hardness of inference task
(the slides repeat the CPD tables while accumulating the product one entry at a
time: P(d, i, g, s, l) = P(d) ∗ P(i) ∗ P(g | d, i) ∗ P(s | i) ∗ P(l | g), a
product of five table look-ups, ending with the probability of that single
full assignment.)
Hardness of inference task (2/12)
Task: compute P(Y | E = e).
Naive solution: sum the joint.
Denote W = X − Y − E. For each y ∈ Val(Y) compute:
Σ_{w ∈ Val(W)} P(y, w, e) = P(y, e).
Then, compute
Σ_{y ∈ Val(Y)} P(y, e) = P(e).
Then, for each y ∈ Val(Y) compute
P(y, e) / P(e) = P(y | e).
This would entail |Val(W)| = O(|Val(X)|) work for each y ∈ Val(Y).
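The naive approach above can be sketched directly: hold the joint as a table over full assignments and marginalize by brute-force summation. The toy network and numbers below are illustrative, not the slides' tables.

```python
def naive_query(joint, var_names, query_vars, evidence):
    """P(query_vars | evidence) by summing the full joint: O(|Val(X)|) work."""
    # Sum out W = X - Y - E: accumulate P(y, e) for every valuation y.
    p_y_e = {}
    for assignment, p in joint.items():
        point = dict(zip(var_names, assignment))
        if any(point[v] != val for v, val in evidence.items()):
            continue  # inconsistent with the evidence e
        y = tuple(point[v] for v in query_vars)
        p_y_e[y] = p_y_e.get(y, 0.0) + p
    p_e = sum(p_y_e.values())                      # P(e) = sum_y P(y, e)
    return {y: p / p_e for y, p in p_y_e.items()}  # P(y | e) = P(y, e) / P(e)

# Toy joint over two binary variables A, B (chosen to sum to 1):
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
print(naive_query(joint, ["A", "B"], ["A"], {"B": 1}))  # {(0,): 1/3, (1,): 2/3}
```

The point of the hardness discussion that follows is that the `joint` table itself is exponential in the number of variables, so this is only feasible for tiny networks.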
Hardness of inference task (3/12)
Task: compute P(Y | E = e).
Any solution: we show it's NP-hard.
Reminder:
3-SAT is in NP:
i.e. there is a polynomial-time verifier algorithm that for a given problem
instance φ and an assignment σ checks if σ proves that φ is
satisfiable.
3-SAT is NP-hard:
i.e. every other problem in NP can be reduced to 3-SAT.
The 3-SAT decision problem:
Given a formula over binary variables q_1, …, q_n: φ = C_1 ∧ … ∧ C_m
where every C_i = ℓ_{i,1} ∨ ℓ_{i,2} ∨ ℓ_{i,3}, each ℓ_{i,j} = ±q_k, is a disjunction of 3 literals
[example: φ = (q_1 ∨ ¬q_2 ∨ ¬q_3) ∧ (¬q_1 ∨ q_2 ∨ q_3)]
- decide whether there is a satisfying assignment to φ.
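The polynomial-time verifier mentioned above can be sketched in a few lines; the clause encoding (nonzero ints, `k` for q_k and `-k` for ¬q_k) is an assumption of this sketch, not the slides' notation.

```python
# A formula is a list of clauses, each clause a list of 3 nonzero ints.
def verify(formula, sigma):
    """Polynomial-time check: does assignment sigma (var index -> bool) satisfy formula?"""
    return all(
        any(sigma[abs(lit)] == (lit > 0) for lit in clause)  # some literal true
        for clause in formula                                 # in every clause
    )

# phi = (q1 v -q2 v -q3) ^ (-q1 v q2 v q3), an illustrative instance:
phi = [[1, -2, -3], [-1, 2, 3]]
print(verify(phi, {1: True, 2: True, 3: False}))   # True: satisfying assignment
print(verify(phi, {1: True, 2: False, 3: False}))  # False: second clause unmet
```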
Hardness of inference task (4/12)
Consider the following tasks, each a special case of the one before it:
(1) P_B(Y | E = e) = ? (E can be empty)
(2) P_B(X = x | E = e) = ? (E can be empty, X is a single variable)
(3) P_B(X = x) = ?
(4) P_B(X = x) > 0 ? (a decision problem)
If (4) is NP-hard, the original inference task (1) certainly is.
The BN-Positive problem:
Given a BN (G, P), a variable X ∈ X and a value x ∈ Val(X),
decide whether P_B(X = x) > 0.
Hardness of inference task (5/12)
BN-Positive is NP-hard:
We next show a reduction from 3-SAT to BN-Positive.
(also: BN-Positive is in NP:
given (a candidate proof) values ξ for all variables in X where
X = x in ξ, we can compute P_B(X = ξ) and check for positivity.
i.e. BN-Positive is NP-Complete.)
The BN-Positive problem:
Given a BN (G, P), a variable X ∈ X and a value x ∈ Val(X),
decide whether P_B(X = x) > 0.
Hardness of inference task (6/12)
Given a 3-SAT formula φ = C_1 ∧ … ∧ C_m over binary variables {q_1, …, q_n},
we construct a BN: (in polynomial time)
for each formula variable q_i: a binary variable Q_i with
P(Q_i = 1) = 0.5.
for each formula clause C_i: a deterministic binary variable C_i with
P(C_i = 1 | its literals' variables) = 1 iff C_i is satisfied by the values of those variables,
a layer of deterministic binary AND variables {A_1, …, A_{m−2}, X} s.t.
P(X = 1) = 1 iff all C_i are valued 1.
Hardness of inference task (7/12)
φ is not satisfiable
⇒ no assignment to {Q_i} can yield X = 1, hence P(X = 1) = 0.
φ is satisfiable
⇒ an assignment to {Q_i} can yield X = 1, hence P(X = 1) > 0.
i.e. if we can answer whether P_B(X = x) > 0, we can solve 3-SAT.
⇒ BN-Positive, problem (4), is NP-hard.
⇒ P(Y | E = e) = ?, our initial inference problem (1), is NP-hard for BNs.
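The reduction's correctness can be checked on tiny instances by enumeration: each assignment to the Q_i has probability (1/2)^n, the deterministic clause and AND layers contribute factors of 1, so P(X = 1) is exactly the fraction of satisfying assignments. This is a brute-force sketch for illustration (exponential, so only for tiny n), with hypothetical formula instances:

```python
from itertools import product

def p_x_equals_1(formula, n):
    """P(X = 1) in the reduction's BN = (#satisfying assignments) / 2^n."""
    satisfied = lambda sigma: all(
        any(sigma[abs(l)] == (l > 0) for l in clause) for clause in formula
    )
    count = sum(
        satisfied(dict(zip(range(1, n + 1), bits)))
        for bits in product([False, True], repeat=n)
    )
    return count / 2 ** n  # deterministic C_i / AND layer contribute factor 1

phi_sat   = [[1, -2, -3], [-1, 2, 3]]   # satisfiable (illustrative)
phi_unsat = [[1, 1, 1], [-1, -1, -1]]   # q1 AND not-q1: unsatisfiable
print(p_x_equals_1(phi_sat, 3) > 0)     # True  -> phi_sat is satisfiable
print(p_x_equals_1(phi_unsat, 1) > 0)   # False -> phi_unsat is not
```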
Hardness of inference task (8/12)
For MNs: we can easily translate a BN to an MN (hence inference in MNs is NP-hard too):
Given a BN (G, P) where P(X_1, …, X_n) = Π_i P(X_i | Pa(X_i)),
we can construct an MN (H, Φ) modeling the same distribution by:
using the CPDs as factors: φ_i(X_i, Pa(X_i)) = P(X_i | Pa(X_i)), with Z = 1.
translating directed edges to undirected ones (we also need to add edges between all
parents of a node; the cost of the addition is polynomial).
P(D, I, G, S, L) = P(D) ∗ P(I) ∗ P(G | D, I) ∗ P(S | I) ∗ P(L | G)
P_Φ(D, I, G, S, L) = (1/Z) ∗ φ_D(D) ∗ φ_I(I) ∗ φ_G(G, D, I) ∗ φ_S(S, I) ∗ φ_L(L, G)
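The graph-level part of this translation (drop directions, connect all co-parents, i.e. "moralization") can be sketched as:

```python
def moralize(parents):
    """parents: dict node -> list of parents. Returns the MN's undirected edges."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((child, p)))  # undirect each BN edge
        for i, p in enumerate(pa):            # "marry" every pair of parents
            for q in pa[i + 1:]:
                edges.add(frozenset((p, q)))
    return edges

# The student network: G has parents D, I; S has parent I; L has parent G.
bn = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}
mn_edges = moralize(bn)
print(frozenset(("D", "I")) in mn_edges)  # True: D, I married as co-parents of G
```

The added D−I edge is exactly why φ_G above needs scope (G, D, I): a factor's scope must be a clique in H.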
Hardness of inference task (9/12)
* extra note:
BN-Positive is a decision problem, i.e. of the form "are there any solutions"
(in the BN above: P(X = 1) > 0? i.e. is there an event in the event space - all
possible valuations to all variables - in which X = 1 and the event has a positive
probability).
The general inference task is of the form "how many solutions are there"
(in the BN above: knowing how many such positive-probability X = 1 events there are, we
would divide this amount by the size of the event space to obtain the
probability P(X = 1)).
It turns out the inference task belongs to an even harder class of problems,
#P-hard, which consists of such counting problems.
Hardness of inference task (10/12)
Approximate inference: obtain an approximation ρ(Y | e) for P(Y | e).
We consider measuring the accuracy of ρ(y | e), for a single valuation y, rather than that of ρ(Y | e).
def: ρ(y | e) has absolute error ε if
|P(y | e) − ρ(y | e)| ≤ ε.
might be too coarse of a criterion, e.g. ε = 10^−3 for P(y | e) ≈ 10^−8.
def: ρ(y | e) has relative error ε if
ρ(y | e)(1 + ε)^−1 ≤ P(y | e) ≤ ρ(y | e)(1 + ε).
e.g. for ε = 4: the true answer P(y | e) is between .2 ρ(y | e) and 5 ρ(y | e).
Hardness of inference task (11/12)
If we can get such a relative error approximation ρ, then ρ(y | e) > 0 ⇔ P(y | e) > 0,
so we can solve BN-Positive, which is NP-hard ⇒ relative error inference is NP-hard.
Hardness of inference task (12/12)
For absolute error inference, the situation is more nuanced:
For the case where no evidence is present, i.e. |P(y) − ρ(y)| ≤ ε,
there exists a randomized polynomial-time algorithm.
For the case where evidence is present, i.e. |P(y | e) − ρ(y | e)| ≤ ε,
with ε ∈ (0, 0.5) - absolute error inference is NP-hard.
Namely - any useful absolute error inference is NP-hard when evidence is
present.
End of part 1
VE algorithm
is O((n + m) ∗ N_max), where n is the number of variables, m is the number of initial
factors in factor set Φ and N_max is the size (number of entries) of the largest
intermediate factor ψ created throughout the run.
def 9.5: the induced graph
for factor set Φ (over X) and an ordering ≺ (over eliminated variables Z),
we define the induced graph I_{Φ,≺} as an undirected graph over X
with edges X_i − X_j where X_i and X_j appear together in the scope of some
intermediate factor ψ when Sum-Product-VE(Φ, Z, ≺) is executed.
theorem 9.6: For Φ, ≺ and their induced graph I_{Φ,≺}:
The scope of every ψ is a clique in I_{Φ,≺}.
Every maximal clique in I_{Φ,≺} is the scope of a certain ψ.
1st statement means: the size of the largest clique in I_{Φ,≺} bounds N_max.
2nd statement means: the bound is tight, it is encountered.
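By theorem 9.6, the induced graph and the largest intermediate scope can be computed without running VE at all, by simulating elimination on the graph alone: eliminating Z connects all of Z's current neighbors (the scope of ψ_Z). A minimal sketch, with an illustrative 4-cycle graph:

```python
def induced_graph(edges, order):
    """Simulate elimination on the graph; return I_{Phi,<} and the max scope size."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    induced = {frozenset(e) for e in edges}
    max_scope = 1
    for z in order:
        nbrs = adj.get(z, set())
        max_scope = max(max_scope, len(nbrs) + 1)  # scope of psi_Z = {Z} + nbrs
        for u in nbrs:                              # fill-edges: connect neighbors
            for v in nbrs:
                if u != v:
                    induced.add(frozenset((u, v)))
                    adj[u].add(v)
        for u in nbrs:
            adj[u].discard(z)                       # remove Z from the graph
        adj.pop(z, None)
    return induced, max_scope

# 4-cycle A-B-C-D: eliminating A first adds the fill-edge B-D.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
ig, w = induced_graph(edges, ["A", "B", "C", "D"])
print(frozenset(("B", "D")) in ig, w)  # True 3
```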
Part 3
Finding elimination orderings (1/12)
We reached the problem of finding an optimal ordering for VE.
Considering a network (K, Φ), we saw each ordering ≺ yields an
induced graph I_{Φ,≺} whose largest clique size is an exponent in the run-time bound.
We now consider the ordering problem in purely graphical terms.
We can observe the induced graph I_{K,≺} of the network's graph K (instead of its
factor set Φ), since the induced graph depends only on how variables appear
together in factors, i.e. depends only on the structure of the network's graph.
Finding elimination orderings
(example: the slides run Sum-Product-VE on a small network with four pairwise
factors, eliminating one variable per slide. Each step shows the intermediate
factor ψ_i built from the current factors mentioning the variable, its size,
and the edge(s) it contributes to the induced graph: the first elimination adds
an edge that was already present in the original graph, a later elimination
creates a size-3 intermediate factor and adds a genuine fill-edge that was not
present in the original graph, and the run ends with the desired marginal φ∗.)
Finding elimination orderings (2/12)
def: the induced-width w(K, ≺) of (K, ≺) is the size of the largest clique in I_{K,≺} minus 1.
(the -1 is for having zero width for an I_{K,≺} which is a tree)
def: the tree-width of K is its minimal induced-width w∗(K) = min_≺ w(K, ≺)
(i.e. the size of the smallest largest-clique we can hope for in an induced
graph of K, minus 1)
The ordering problem is now:
find ≺∗ = argmin_≺ w(K, ≺)
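For tiny graphs the ordering problem can be solved by brute force over all orderings; this is a sketch for intuition only (exponential in n, consistent with the NP-hardness shown next):

```python
from itertools import permutations

def width_of(edges, order):
    """Induced width w(K, <) of one elimination ordering: largest clique - 1."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    w = 0
    for z in order:
        nbrs = adj.pop(z, set())
        w = max(w, len(nbrs))            # clique {z} + nbrs has size len(nbrs)+1
        for u in nbrs:
            adj[u] |= nbrs - {u}         # connect all of z's neighbors
            adj[u].discard(z)
    return w

def best_ordering(edges, variables):
    """argmin over all orderings - brute force, illustration only."""
    return min(permutations(variables), key=lambda o: width_of(edges, o))

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]  # 4-cycle: tree-width 2
print(width_of(edges, best_ordering(edges, "ABCD")))       # 2
```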
Finding elimination orderings (3/12)
(1) ≺∗ = argmin_≺ w(K, ≺) = ? (find optimal ordering)
(2) w∗ = min_≺ w(K, ≺) = ? (find min. induced width, i.e. tree-width)
(3) w∗ ≤ k, k ∈ ℕ ? (decide whether tree-width ≤ k)
A theorem from graph theory: (3) is NP-complete. ⇒ (2) is NP-hard ⇒ (1) is NP-hard
and we have now shown that finding an optimal ordering is NP-hard.
We will now translate the ordering problem to a different graphical problem,
thereby yielding an ordering algorithm for a certain type of graphs.
Finding elimination orderings (4/12)
def: A graph (directed or undirected)
is chordal if every loop of length
greater than 3 has a chord.
Such a graph is
"triangulated": all
polygons are divided into
triangles.
Finding elimination orderings (5/12)
We next show that the class of induced graphs is the class of chordal graphs.
* remark: this is also the class of graphs for which there are perfect
I-maps, i.e. for a graph H in this class, I(H) = I(P) for a distribution P that
factorizes over the graph.
Finding elimination orderings (6/12)
I_{Φ,≺} is induced ⇒ I_{Φ,≺} is chordal:
Assume I_{Φ,≺} is not chordal: it has a chordless loop X_1 − X_2 − ⋯ − X_k − X_1, k ≥ 4.
Assume WLOG X_1 was eliminated first.
After line 3 of Sum-Product-Eliminate-Var(Φ, X_1),
no more edges X_1 − X_j are added to I_{Φ,≺}.
Since I_{Φ,≺} has the edges X_1 − X_2, X_1 − X_k: when line 3 runs:
X_2 appears in some factor with X_1,
X_k appears in some factor with X_1
⇒ X_2, X_k will appear in the ψ of line 3 ⇒ I_{Φ,≺} has the edge X_2 − X_k, a chord.
Finding elimination orderings (7/12)
K is chordal ⇒ K is induced:
def: For an undirected graph K over X, a tree T is a clique tree for K if:
every node C ⊆ X in the tree is a clique in the graph.
every maximal clique in the graph is a node in the tree.
for each edge C_1 − C_2 in T: C_1 ∩ C_2 separates the nodes in K: those
appearing on C_1's side of the tree from those appearing on C_2's side.
Finding elimination orderings (8/12)
Finding elimination orderings (9/12)
Not every graph has a clique tree, but:
Every chordal graph has a clique tree.
A property of clique trees:
Every leaf clique C in the tree contains a variable present only in C.
Finding elimination orderings (10/12)
K is chordal ⇒ K is induced:
We show that if K is chordal then K = I_{K,≺} for some ordering ≺.
Claim: all n variables of a chordal graph K over n variables can be eliminated
without introducing fill-edges. Induction on n:
K is chordal ⇒ it has a clique tree T.
T is a clique tree ⇒ it has a leaf node C and in that leaf node a variable
X ∈ C present only in C.
X can be eliminated from K without introducing fill-edges:
X present only in C ⇒ all of X's neighbors are in C.
C is a clique in K ⇒ all neighbors are already connected to each other.
Removing X from K, we are left with a chordal graph of n − 1 variables.
Finding elimination orderings
(example: the slides show a chordal graph over eight variables, its clique
tree, and an ordering ≺ obtained by repeatedly picking a variable present only
in a leaf clique; eliminating along ≺ adds no fill-edges, so the induced graph
I_{K,≺} coincides with the graph K itself.)
Finding elimination orderings (11/12)
For a chordal graph, we have an algorithm that does not produce fill-edges:
Max-Cardinality produces an ordering consistent with always choosing a variable
that is present only in a leaf clique of the (current) graph's clique-tree.
Meaning: Max-Cardinality solves the ordering problem for chordal graphs.
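A minimal sketch of Max-Cardinality search (simplified from the book's pseudocode; tie-breaking here is arbitrary): repeatedly mark the unmarked node with the most already-marked neighbors, then eliminate in reverse marking order. On a chordal graph this ordering introduces no fill-edges.

```python
def max_cardinality_ordering(adj):
    """adj: dict node -> set of neighbors. Returns an elimination ordering."""
    marked, order = set(), []
    while len(order) < len(adj):
        # Pick the unmarked node with the most marked neighbors (ties: first).
        x = max((v for v in adj if v not in marked),
                key=lambda v: len(adj[v] & marked))
        marked.add(x)
        order.append(x)
    return list(reversed(order))  # eliminate in reverse marking order

# Chordal example (illustrative): a triangle A-B-C plus a pendant node D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(max_cardinality_ordering(adj))
```

Eliminating in the returned order, each variable's remaining neighbors already form a clique, so no fill-edges are added; this is exactly the "leaf clique first" behavior described above.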
Finding elimination orderings (12/12)
The general ordering problem therefore reduces to a well-known problem from
graph-theory:
Minimal-Triangulation:
Given a graph K, find a chordal graph K' containing K,
such that K''s largest clique is as small as possible.
There are different graph-theoretic algorithms addressing
Minimal-Triangulation, offering different levels of performance guarantees.
Greedy search for elimination ordering (1/3)
We saw that the inference problem | = � = ? is NP-hard.
Moreover, even finding an optimal ordering for variable elimination
for running Sum-Product-VE(Φ, , ≺) is NP-hard.
A practical approach is therefore to search the ordering space for a sub-optimal
yet as-good-as-possible ordering.
We can search this space greedily, each time choosing a variable whose
elimination will result in least cost (for some definition of cost) given current
state of induced graph. We do not have formal complexity guarantees for this
approach, yet it works surprisingly well in practice.
Greedy search for elimination ordering (2/3)
* note: the graph G referred to below is the current
induced graph.
Possible evaluation metrics (for a candidate variable X in the current graph G):
min-neigh(X, G) = number of neighbors of X
min-neigh-weight(X, G) = product of cardinalities of the neighbors of X
min-fill(X, G) = number of added fill-edges if X is eliminated
min-fill-weight(X, G) = sum of weights of the fill-edges, where the
weight of a fill-edge is the product of the
cardinalities of the edge's nodes
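The greedy search with the min-fill metric can be sketched as follows (deterministic variant; the stochastic variant discussed next would sample the candidate set):

```python
def fill_count(adj, x):
    """Number of fill-edges eliminating x would add: non-adjacent neighbor pairs."""
    nbrs = list(adj[x])
    return sum(1 for i, u in enumerate(nbrs) for v in nbrs[i + 1:]
               if v not in adj[u])

def greedy_min_fill(adj):
    """Greedy ordering: repeatedly eliminate the variable of minimal fill cost."""
    adj = {v: set(n) for v, n in adj.items()}  # work on a copy
    order = []
    while adj:
        x = min(adj, key=lambda v: fill_count(adj, v))
        order.append(x)
        for u in adj[x]:                        # add the fill-edges
            adj[u] |= adj[x] - {u}
            adj[u].discard(x)
        del adj[x]
    return order

# Illustrative 4-cycle A-B-C-D: every first choice costs exactly one fill-edge.
adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
print(greedy_min_fill(adj))
```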
Greedy search for elimination ordering (3/3)
None of these metrics is universally better than the others.
A variant of the greedy approach is to introduce stochasticity into it, by not
always deterministically selecting the X that minimizes cost(X, G).
For example: upon each elimination round, select argmin_X cost(X, G) out of a
random half of the remaining variables. This serves to introduce exploration into
the ordering search (rather than have it be fully exploitative).
Note that the greedy algorithm runs in polynomial time and it can compute the
number of operations the VE algorithm itself will execute. A suggested practice
for large networks where such pre-computation time is negligible is to execute
the greedy algorithm multiple times and use the best ordering obtained.
*extra material
Conditioning (1/6)
Consider MNs: P_Φ(X) = (1/Z_Φ) ∗ φ_1(D_1) ∗ … ∗ φ_k(D_k),
i.e. Z_Φ ∗ P_Φ(X) = φ_1(D_1) ∗ … ∗ φ_k(D_k); denote this unnormalized measure by P̃_Φ(X).
Sum-Product-VE calculated P̃_Φ(Y):
Sum-Product-VE(Φ, Z, ≺)
(Z = X − Y) returned φ∗,
such that: φ∗(y) = Σ_{z ∈ Val(Z)} Π_{φ ∈ Φ} φ(y, z) = Σ_{z ∈ Val(Z)} φ_1(y, z) ∗ … ∗ φ_k(y, z) = Σ_{z ∈ Val(Z)} P̃_Φ(y, z) = P̃_Φ(y)
... and if P_Φ(Y) is needed - renormalize.
Conditioning (2/6)
We used the ability to perform a sum-product calculation for getting
conditional probabilities:
Cond-Prob-VE calculated P̃_Φ(Y, e) and P̃_Φ(e):
It returned φ∗ s.t. (now Z = X − Y − E)
φ∗(y) = Σ_{z ∈ Val(Z)} Π_{φ ∈ Φ[E = e]} φ(y, z) = Σ_z φ_1[E = e](y, z) ∗ … ∗ φ_k[E = e](y, z) = Σ_z φ_1(y, z, e) ∗ … ∗ φ_k(y, z, e) = Σ_z P̃_Φ(y, z, e) = P̃_Φ(y, e)
and a scalar: α = Σ_{y ∈ Val(Y)} P̃_Φ(y, e) = P̃_Φ(e)
... and if P_Φ(Y | e) is needed: P̃_Φ(Y, e) / P̃_Φ(e) = P_Φ(Y | e)
Conditioning (3/6)
Calculating conditional probabilities P(∙ | U = u) can reduce the sizes of
encountered intermediate factors ψ - since evidence variables can render other
variables independent.
Conditioning (4/6)
Example: same network as before, with evidence I = i (using Cond-Prob-VE).
(the slides first reduce every factor mentioning the evidence variable,
Φ ← Φ[I = i], and then step through the elimination. Because I is removed
from every scope, the intermediate factors are smaller than in the
unconditioned run - one ψ even reduces to a scalar dependent only on the
evidence value i - and the run ends with φ∗ = P̃_Φ(y, i).)
Conditioning (5/6)
So, computing P̃_Φ(Y, u) can introduce smaller intermediate factors compared
to the computation of P̃_Φ(Y).
We can do the following (where we build upon Cond-Prob-VE - our ability to
calculate P̃_Φ(Y, u) and P̃_Φ(u)):
For some U ⊆ X, for every u ∈ Val(U), compute
P̃_Φ(Y, u) and P̃_Φ(u)
then compute
Σ_{u ∈ Val(U)} P̃_Φ(Y, u) = Z_Φ ∗ P_Φ(Y) and Σ_{u ∈ Val(U)} P̃_Φ(u) = Z_Φ
and have:
Z_Φ ∗ P_Φ(Y) / Z_Φ = P_Φ(Y)
Computing P_Φ(Y) as such is called conditioning.
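The scheme above can be sketched with factors as plain Python functions over an assignment dict; the restricted sums per u are done by brute force here (a real implementation would run VE per u), and the scheme is simplified to a single binary query variable. The factor tables are illustrative.

```python
from itertools import product

def unnorm(factors, point):
    """tilde-P(point): product of all factors at a full assignment."""
    p = 1.0
    for phi in factors:
        p *= phi(point)
    return p

def conditioning(factors, var_names, query, cond_vars, vals=(0, 1)):
    """P_Phi(query) via one restricted sum per u in Val(U), then renormalize."""
    rest = [v for v in var_names if v != query and v not in cond_vars]
    p_y, z_phi = {v: 0.0 for v in vals}, 0.0
    for u in product(vals, repeat=len(cond_vars)):  # one "VE run" per u
        for y in vals:
            for w in product(vals, repeat=len(rest)):
                point = dict(zip(cond_vars, u))
                point[query] = y
                point.update(zip(rest, w))
                p = unnorm(factors, point)  # tilde-P(y, w, u)
                p_y[y] += p                 # accumulates Z * P(y)
                z_phi += p                  # accumulates Z_Phi
    return {y: p_y[y] / z_phi for y in vals}

# Two pairwise factors over a chain A-B-C (numbers are illustrative):
phi1 = lambda pt: [[2.0, 1.0], [1.0, 2.0]][pt["A"]][pt["B"]]
phi2 = lambda pt: [[3.0, 1.0], [1.0, 2.0]][pt["B"]][pt["C"]]
print(conditioning([phi1, phi2], ["A", "B", "C"], "A", ["B"]))  # {0: 11/21, 1: 10/21}
```

Conditioning on U = {B} here splits the chain, so each restricted sum only ever touches factors with B fixed, mirroring the smaller-intermediate-factor effect described above.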
Conditioning (6/6)
Conditioning does no less work than the ordinary VE we've seen so far; it
generally does more, by running VE many times. However, the maximal
intermediate factor ever encountered over the VE runs can be smaller, which is
a necessity if we cannot hold very large intermediate factors in memory.