Part I: TWO MESSAGE PASSING ALGORITHMS


Transcript of Part I: TWO MESSAGE PASSING ALGORITHMS

  • 1

    Lecture 8 – Apr 20, 2011, CSE 515, Statistical Methods, Spring 2011

    Instructor: Su-In Lee, University of Washington, Seattle

    Message Passing Algorithms for Exact Inference & Parameter Learning

    Readings: K&F 10.3, 10.4, 17.1, 17.2

    TWO MESSAGE PASSING ALGORITHMS

    Part I

    2

  • 2

    Sum-Product Message Passing Algorithm

    Claim: for each clique Ci, πi[Ci] = P(Ci). This follows from variable elimination, treating Ci as the root clique.

    To compute P(X): find the belief π of a clique that contains X and sum out the other RVs. If X appears in multiple cliques, they must agree on P(X).

    3

    [Figure: clique tree for the student network over C, D, I, G, S, L, J, H. Cliques: 1:{C,D}, 2:{G,I,D}, 3:{G,S,I}, 4:{G,H,J}, 5:{G,J,S,L}; sepsets: {D}, {G,I}, {G,S}, {G,J}. Messages δ are passed in both directions over each sepset, e.g. δ1→2[D], δ2→1[D], δ2→3[G,I], δ3→2[G,I].]

    Initial clique potentials:

    π1^0[C,D] = P(C) P(D|C)
    π2^0[G,I,D] = P(G|I,D)
    π3^0[G,S,I] = P(I) P(S|I)
    π4^0[G,H,J] = P(H|G,J)
    π5^0[G,J,S,L] = P(L|G) P(J|L,S)

    Example message:

    δ2→3[G,I] = ∑D π2^0[G,I,D] δ1→2[D]

    Belief:

    π3[G,S,I] = π3^0[G,S,I] δ2→3[G,I] δ5→3[G,S]

    Clique Tree Calibration
    A clique tree with potentials πi[Ci] is said to be calibrated if for all neighboring cliques Ci and Cj:

    ∑Ci−Si,j πi[Ci] = ∑Cj−Si,j πj[Cj] = µi,j(Si,j)

    The quantity µi,j(Si,j) is called the "sepset belief." For example, for the neighboring cliques i:{C,D} and j:{G,I,D} with sepset {D}:

    µi,j(D) = ∑C πi[C,D] = ∑G,I πj[G,I,D]

    Key advantage of the clique tree inference algorithm: it computes the marginal distributions of all variables P(X1),…,P(Xn) using only twice the computation of a single upward pass in the same tree.

    4
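The calibration condition above is easy to check numerically. The following minimal Python sketch (an editor's illustration, not part of the lecture) runs sum-product message passing on the two-clique tree {A,B} – {B,C} for a chain A → B → C, then verifies that both cliques agree on the sepset belief µ1,2(B). The CPD numbers and the dictionary-based factor representation are invented for the example.

```python
import itertools

# Hypothetical CPDs for the chain A -> B -> C (binary variables; numbers invented).
P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key = (b, a)
P_C_given_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # key = (c, b)
vals = [0, 1]

# Initial clique potentials (family preservation: each CPD assigned to one clique):
#   pi1_0[A,B] = P(A) P(B|A)   for clique C1 = {A,B}
#   pi2_0[B,C] = P(C|B)        for clique C2 = {B,C}
pi1_0 = {(a, b): P_A[a] * P_B_given_A[(b, a)] for a, b in itertools.product(vals, vals)}
pi2_0 = {(b, c): P_C_given_B[(c, b)] for b, c in itertools.product(vals, vals)}

# Sum-product messages over the sepset {B}.
delta_12 = {b: sum(pi1_0[(a, b)] for a in vals) for b in vals}   # delta_{1->2}(B)
delta_21 = {b: sum(pi2_0[(b, c)] for c in vals) for b in vals}   # delta_{2->1}(B)

# Final beliefs: initial potential times all incoming messages.
pi1 = {(a, b): p * delta_21[b] for (a, b), p in pi1_0.items()}
pi2 = {(b, c): p * delta_12[b] for (b, c), p in pi2_0.items()}

# Calibration check: both cliques must agree on the sepset belief mu_{1,2}(B).
mu_from_1 = {b: sum(pi1[(a, b)] for a in vals) for b in vals}
mu_from_2 = {b: sum(pi2[(b, c)] for c in vals) for b in vals}
for b in vals:
    assert abs(mu_from_1[b] - mu_from_2[b]) < 1e-12

# Because the CPDs define a Bayesian network, the calibrated beliefs are marginals.
print("mu_{1,2}(B) =", mu_from_1)   # equals P(B)
print("pi1[A,B]    =", pi1)         # equals P(A,B)
```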

  • 3

    Calibrated Clique Tree as a Distribution
    At convergence of the clique tree algorithm, we have that:

    PΦ(X) = ∏Ci∈T πi[Ci] / ∏(Ci↔Cj)∈T µi,j(Si,j)

    Proof: for the sepset beliefs,

    µi,j(Si,j) = ∑Ci−Si,j πi[Ci] = ∑Ci−Si,j πi^0[Ci] ∏k∈Ni δk→i
               = δj→i(Si,j) ∑Ci−Si,j πi^0[Ci] ∏k∈Ni−{j} δk→i
               = δj→i(Si,j) δi→j(Si,j)

    where the last step uses the definition δi→j(Si,j) = ∑Ci−Si,j πi^0[Ci] ∏k∈Ni−{j} δk→i. Therefore

    ∏Ci∈T πi[Ci] / ∏(Ci↔Cj)∈T µi,j(Si,j)
    = ∏Ci∈T (πi^0[Ci] ∏k∈Ni δk→i) / ∏(Ci↔Cj)∈T δi→j(Si,j) δj→i(Si,j)
    = ∏Ci∈T πi^0[Ci]
    = ∏φ∈Φ φ = PΦ(X)

    since every message δk→i appears exactly once in the numerator and once in the denominator.

    Clique tree invariant: the clique beliefs πi and the sepset beliefs µi,j provide a re-parameterization of the joint distribution, one that directly reveals the marginal distributions.

    5

    Distribution of Calibrated Tree
    For a calibrated tree:

    P(C|B) = P(B,C) / P(B) = π2[B,C] / µ1,2(B)

    since the calibrated beliefs are marginals. The joint distribution can thus be written as

    P(A,B,C) = P(A,B) P(C|B) = π1[A,B] π2[B,C] / µ1,2(B)

    [Figure: Bayesian network A → B → C and its clique tree {A,B} – {B,C} with sepset B.]

    Clique tree invariant: PΦ(X) = ∏Ci∈T πi[Ci] / ∏(Ci↔Cj)∈T µi,j(Si,j)

    6
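As a sanity check on the invariant, here is a small Python sketch (editor's illustration with invented CPD numbers) for the A → B → C example above: it builds the calibrated beliefs π1[A,B], π2[B,C], and µ1,2(B) directly as marginals and confirms that π1·π2/µ1,2 reproduces the joint P(A,B,C) for every assignment.

```python
import itertools

# Hypothetical CPDs for A -> B -> C (binary; numbers invented for illustration).
P_A = {0: 0.6, 1: 0.4}
P_B_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # (b, a) -> P(b|a)
P_C_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}   # (c, b) -> P(c|b)
vals = [0, 1]

# Calibrated beliefs for the clique tree {A,B} -- {B,C}: these are the true marginals.
pi1 = {(a, b): P_A[a] * P_B_A[(b, a)] for a, b in itertools.product(vals, vals)}     # P(A,B)
P_B = {b: sum(pi1[(a, b)] for a in vals) for b in vals}                               # P(B)
pi2 = {(b, c): P_B[b] * P_C_B[(c, b)] for b, c in itertools.product(vals, vals)}      # P(B,C)
mu12 = dict(P_B)                                                                      # sepset belief

# Clique tree invariant: P(A,B,C) = pi1[A,B] * pi2[B,C] / mu12(B).
for a, b, c in itertools.product(vals, vals, vals):
    joint = P_A[a] * P_B_A[(b, a)] * P_C_B[(c, b)]          # chain-rule joint
    reparam = pi1[(a, b)] * pi2[(b, c)] / mu12[b]
    assert abs(joint - reparam) < 1e-12
print("Invariant P(A,B,C) = pi1 * pi2 / mu holds for all assignments.")
```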

  • 4

    An alternative approach for message passing in clique trees?

    7

    Message Passing: Belief Propagation
    Recall the clique tree calibration algorithm. Upon calibration, the final potential (belief) at clique i is:

    πi = πi^0 ∏k∈Ni δk→i

    A message from i to j sums out the non-sepset variables from the product of the initial potential and all incoming messages except for the one from j to i:

    δi→j = ∑Ci−Si,j πi^0 ∏k∈Ni−{j} δk→i

    This can also be viewed as multiplying all messages together and then dividing by the message from j to i:

    δi→j(Si,j) = ∑Ci−Si,j (πi^0 ∏k∈Ni δk→i) / δj→i = (∑Ci−Si,j πi) / δj→i = µi,j(Si,j) / δj→i

    where µi,j(Si,j) = ∑Ci−Si,j πi is the "sepset belief." This observation forms the basis of an alternative way of computing messages.

    8

  • 5

    Message Passing: Belief Propagation

    [Figure: Bayesian network X1 → X2 → X3 → X4 and its clique tree C1:{X1,X2} – C2:{X2,X3} – C3:{X3,X4}, with sepsets X2 and X3.]

    Sum-product message passing, with root C2:
    C1-to-C2 message: δ1→2(X2) = ∑X1 π1^0[X1,X2] = ∑X1 P(X1) P(X2|X1)
    C2-to-C1 message: δ2→1(X2) = ∑X3 π2^0[X2,X3] δ3→2(X3)

    Alternatively, compute the belief
    π2[X2,X3] = δ1→2(X2) δ3→2(X3) π2^0[X2,X3]
    and then divide out the reverse message:
    δ2→1(X2) = (∑X3 π2[X2,X3]) / δ1→2(X2) = ∑X3 δ3→2(X3) π2^0[X2,X3]
    where ∑X3 π2[X2,X3] is the sepset belief µ1,2(X2).

    Thus, the two approaches are equivalent.

    9
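The equivalence claimed on this slide can be checked numerically. Below is a small Python sketch (editor's illustration; the CPD numbers are invented) for the chain X1 → X2 → X3 → X4: it computes δ2→1 once in sum-product form and once by dividing the sepset belief µ1,2(X2) by the reverse message δ1→2, and asserts that the two agree.

```python
import itertools

# Hypothetical CPDs for the chain X1 -> X2 -> X3 -> X4 (binary; numbers invented).
P1 = {0: 0.5, 1: 0.5}
P2_1 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}    # (x2, x1) -> P(x2|x1)
P3_2 = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.1, (1, 1): 0.9}    # (x3, x2) -> P(x3|x2)
P4_3 = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.25, (1, 1): 0.75}  # (x4, x3) -> P(x4|x3)
V = [0, 1]

# Initial potentials for cliques C1={X1,X2}, C2={X2,X3}, C3={X3,X4}.
pi1_0 = {(x1, x2): P1[x1] * P2_1[(x2, x1)] for x1, x2 in itertools.product(V, V)}
pi2_0 = {(x2, x3): P3_2[(x3, x2)] for x2, x3 in itertools.product(V, V)}
pi3_0 = {(x3, x4): P4_3[(x4, x3)] for x3, x4 in itertools.product(V, V)}

# Upward messages toward the root C2.
d_12 = {x2: sum(pi1_0[(x1, x2)] for x1 in V) for x2 in V}      # delta_{1->2}(X2)
d_32 = {x3: sum(pi3_0[(x3, x4)] for x4 in V) for x3 in V}      # delta_{3->2}(X3)

# Sum-product form of the C2-to-C1 message.
d_21_sp = {x2: sum(pi2_0[(x2, x3)] * d_32[x3] for x3 in V) for x2 in V}

# Belief-propagation form: build the belief at C2, sum onto the sepset to get
# mu_{1,2}(X2), then divide out the reverse message delta_{1->2}.
pi2 = {(x2, x3): pi2_0[(x2, x3)] * d_12[x2] * d_32[x3] for x2, x3 in itertools.product(V, V)}
mu12 = {x2: sum(pi2[(x2, x3)] for x3 in V) for x2 in V}        # sepset belief mu_{1,2}(X2)
d_21_bp = {x2: mu12[x2] / d_12[x2] for x2 in V}

for x2 in V:
    assert abs(d_21_sp[x2] - d_21_bp[x2]) < 1e-12              # the two forms agree
print("delta_{2->1}:", d_21_sp)
```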

    Message Passing: Belief Propagation
    Based on the observation above, we obtain a different message passing scheme, belief propagation:
    Each clique Ci maintains its fully updated belief πi: the product of its initial clique potential πi^0 and the messages δk→i from its neighbors.
    Each sepset also maintains its belief µi,j: the product of the messages in both directions, µi,j = δi→j δj→i.
    The entire message passing process is executed in an equivalent way in terms of the clique and sepset beliefs, the πi's and µi,j's.

    Basic idea (µi,j = δi→j δj→i):
    Each clique Ci initializes its belief πi as πi^0 (= ∏φ) and then updates it by multiplying in the message updates received from its neighbors.
    Store at each sepset Si,j the previous sepset belief µi,j, regardless of the direction of the message passed.
    When passing a message from Ci to Cj, compute the new sepset belief σi,j = ∑Ci−Si,j πi and divide it by the previous µi,j.
    Update the clique belief πj by multiplying it with σi,j / µi,j.

    [Figure: cliques Ci and Cj connected through the sepset Si,j.]

    This is called belief update or belief propagation.

    10

  • 6

    Message Passing: Belief Propagation
    Initialize the clique tree:
    For each clique Ci set πi ← ∏φ: α(φ)=i φ
    For each edge Ci–Cj set µi,j ← 1

    While uninformed cliques exist:
    Select an edge Ci–Cj and send a message from Ci to Cj:
    Marginalize the clique over the sepset: σi→j ← ∑Ci−Si,j πi
    Update the belief at Cj: πj ← πj · σi→j / µi,j
    Update the sepset belief at Ci–Cj: µi,j ← σi→j

    Is this equivalent to the sum-product message passing algorithm? Yes; showing it is a simple algebraic manipulation, left as PS#2.

    11
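Here is a minimal Python sketch of the belief-update procedure above (an editor's illustration, not the lecture's code), run on a two-clique tree for a hypothetical chain X1 → X2 → X3. After one message in each direction the tree is calibrated, and the clique-tree invariant discussed on the next slide (∏ πi / ∏ µi,j equals the joint) is verified at the end.

```python
import itertools

V = [0, 1]  # all variables are binary in this toy example

def factor(vars_, fn):
    """Tabulate fn over all assignments to vars_ (a tuple of variable names)."""
    return {asg: fn(dict(zip(vars_, asg))) for asg in itertools.product(V, repeat=len(vars_))}

def marginalize(vars_, table, keep):
    """Sum a table over the variables not in keep; returns a table over keep."""
    out = {asg: 0.0 for asg in itertools.product(V, repeat=len(keep))}
    for asg, val in table.items():
        a = dict(zip(vars_, asg))
        out[tuple(a[v] for v in keep)] += val
    return out

# Invented CPDs for X1 -> X2 -> X3.
P1   = {0: 0.5, 1: 0.5}
P2_1 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}   # (x2, x1) -> P(x2|x1)
P3_2 = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.1, (1, 1): 0.9}   # (x3, x2) -> P(x3|x2)

# Clique tree C1 = {X1,X2} -- C2 = {X2,X3} with sepset S = (X2,).
scopes = {1: ("X1", "X2"), 2: ("X2", "X3")}
pi = {1: factor(scopes[1], lambda a: P1[a["X1"]] * P2_1[(a["X2"], a["X1"])]),
      2: factor(scopes[2], lambda a: P3_2[(a["X3"], a["X2"])])}
sep = ("X2",)
mu = {asg: 1.0 for asg in itertools.product(V, repeat=1)}      # sepset belief, initialized to 1

def send(i, j):
    """Belief-update message from clique i to clique j over the sepset."""
    global mu
    sigma = marginalize(scopes[i], pi[i], sep)                 # sigma_{i->j}
    for asg in pi[j]:                                          # pi_j <- pi_j * sigma / mu
        a = dict(zip(scopes[j], asg))
        key = tuple(a[v] for v in sep)
        pi[j][asg] *= sigma[key] / mu[key]
    mu = sigma                                                 # mu_{i,j} <- sigma_{i->j}

send(1, 2)   # C1 -> C2
send(2, 1)   # C2 -> C1: the tree is now calibrated

# Clique tree invariant: pi1 * pi2 / mu equals the joint P(X1,X2,X3).
for x1, x2, x3 in itertools.product(V, V, V):
    joint = P1[x1] * P2_1[(x2, x1)] * P3_2[(x3, x2)]
    assert abs(pi[1][(x1, x2)] * pi[2][(x2, x3)] / mu[(x2,)] - joint) < 1e-12

print("pi_1[X1,X2] =", pi[1])   # equals P(X1,X2)
print("pi_2[X2,X3] =", pi[2])   # equals P(X2,X3)
print("mu(X2)      =", mu)      # equals P(X2)
```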

    Clique Tree Invariant
    Belief propagation can be viewed as reparameterizing the joint distribution.

    Upon calibration we showed:

    PΦ(X) = ∏Ci∈T πi[Ci] / ∏(Ci↔Cj)∈T µi,j(Si,j)

    How can we prove that this holds throughout belief propagation?

    Initially the invariant holds, since

    ∏Ci∈T πi[Ci] / ∏(Ci↔Cj)∈T µi,j(Si,j) = ∏φ∈Φ φ / 1 = PΦ(X)

    At each update step the invariant is also maintained: a message from Ci to Cj only changes πj and µi,j, so most terms remain unchanged. We only need to show that for the new beliefs π′j and µ′i,j:

    π′j / µ′i,j = πj / µi,j

    But this is exactly the message passing step: π′j = πj σi→j / µi,j and µ′i,j = σi→j.

    Belief propagation reparameterizes P at each step.

    12

  • 7

    Answering Queries
    Posterior distribution queries on a variable X:
    Sum out the irrelevant variables from any clique containing X.

    Posterior distribution queries on a family X, Pa(X):
    The family preservation property implies that X and Pa(X) appear together in some clique.
    Sum out the irrelevant variables from a clique containing X, Pa(X).

    Introducing evidence Z=z, compute the posterior of X:
    If X appears in a clique with Z: since the clique tree is calibrated, multiply the belief of a clique that contains X and Z by the indicator function 1{Z=z} and sum out the irrelevant variables.
    If X does not share a clique with Z: introduce the indicator function 1{Z=z} into some clique containing Z, propagate messages along the path to a clique containing X, and then sum out the irrelevant variables from the clique containing X.

    13

    PΦ(X) = ∏φ∈Φ φ        PΦ(X, Z=z) = ∏φ∈Φ φ · 1{Z=z}
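As a small illustration of the evidence case (editor's sketch; the belief table is invented), the snippet below takes a calibrated clique belief π[X,Z], multiplies in the indicator 1{Z=z} by keeping only the consistent entries, sums out, and renormalizes to obtain P(X | Z=z).

```python
# Calibrated clique belief pi[X=x, Z=z] = P(X=x, Z=z); numbers invented for illustration.
pi_XZ = {(0, 0): 0.30, (0, 1): 0.10,
         (1, 0): 0.15, (1, 1): 0.45}

def posterior_given_evidence(belief, z_obs):
    # Multiply by the indicator 1{Z=z_obs}: keep only the consistent entries.
    reduced = {x: p for (x, z), p in belief.items() if z == z_obs}
    total = sum(reduced.values())                         # = P(Z=z_obs)
    return {x: p / total for x, p in reduced.items()}     # renormalize -> P(X | Z=z_obs)

print(posterior_given_evidence(pi_XZ, z_obs=1))   # {0: 0.18..., 1: 0.81...}
```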

    So far, we haven’t really discussed how to construct clique trees…

    14

  • 8

    Constructing Clique Trees
    Two basic approaches:
    1. Based on variable elimination
    2. Based on direct graph manipulation

    Using variable elimination:
    The execution of a variable elimination algorithm can be associated with a cluster graph:
    Create a cluster Ci for each factor used during a VE run.
    Create an edge between Ci and Cj when a factor generated by Ci is used directly by Cj (or vice versa).
    We showed that this cluster graph is a tree satisfying the running intersection property, and thus it is a legal clique tree.

    15

    Direct Graph Manipulation
    Goal: construct a tree that is family preserving and obeys the running intersection property.
    The induced graph IF,α is necessarily a chordal graph.
    The converse also holds: any chordal graph can be used as the basis for inference, since any chordal graph can be associated with a clique tree (Theorem 4.12).

    Reminder: the induced graph IF,α over factors F and ordering α is the union of all of the graphs resulting from the different steps of the variable elimination algorithm: Xi and Xj are connected if they appear together in some factor generated during the VE run with α as the ordering.

    16

  • 9

    Constructing Clique Trees
    The induced graph IF,α is necessarily a chordal graph, and any chordal graph can be associated with a clique tree (Theorem 4.12). This suggests a three-step construction.

    Step I: Triangulate the graph to construct a chordal graph H.
    Construct a chordal graph that subsumes the existing graph H0.
    It is NP-hard to find a minimum triangulation, i.e. one where the largest clique in the resulting chordal graph has minimum size. Exact algorithms are too expensive, and one typically resorts to heuristic algorithms, e.g. node elimination techniques (see K&F 9.4.3.2).

    Step II: Find the maximal cliques in H and make each one a node in the clique tree.
    Finding maximal cliques in general is NP-hard, but here we can begin with a family of sets, each member of which is guaranteed to be a clique, and then use a greedy algorithm that adds nodes to a clique until it no longer induces a fully connected subgraph. (A code sketch of Steps I and II follows this slide.)

    Step III: Construct a tree over the clique nodes.
    Use a maximum spanning tree algorithm on the undirected graph whose nodes are the cliques selected above and whose edge weights are |Ci∩Cj|. We can show that the resulting tree obeys the running intersection property and is therefore a valid clique tree. (A spanning-tree sketch appears after the example slide below.)

    17
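The following Python sketch (an editor's illustration of Steps I and II, not code from the course) triangulates a graph by greedy node elimination with a min-fill heuristic and collects the maximal cliques of the resulting chordal graph. The input adjacency list is the moralized graph implied by the CPDs listed earlier in the lecture (P(C), P(D|C), P(G|I,D), P(I), P(S|I), P(L|G), P(J|L,S), P(H|G,J)); min-fill is only a heuristic, so the triangulation it finds need not be minimal.

```python
import itertools

def eliminate(adj):
    """Greedy node-elimination (min-fill) triangulation sketch.

    adj maps each node to the set of its neighbors (undirected graph).
    Returns (fill_edges, maximal_cliques) of the chordal graph obtained by
    adding the fill edges.  Finding a minimum triangulation is NP-hard;
    min-fill is only a heuristic.
    """
    adj = {v: set(n) for v, n in adj.items()}          # work on a copy
    remaining, fill, cliques = set(adj), set(), set()
    while remaining:
        def cost(v):                                   # fill edges needed to eliminate v
            nbrs = adj[v] & remaining
            return sum(1 for a, b in itertools.combinations(nbrs, 2) if b not in adj[a])
        v = min(remaining, key=cost)
        nbrs = adj[v] & remaining
        for a, b in itertools.combinations(nbrs, 2):   # connect v's remaining neighbors
            if b not in adj[a]:
                adj[a].add(b); adj[b].add(a)
                fill.add(frozenset((a, b)))
        cliques.add(frozenset(nbrs | {v}))             # a clique of the filled graph
        remaining.remove(v)
    maximal = [c for c in cliques if not any(c < d for d in cliques)]
    return fill, maximal

# Moralized graph implied by the CPDs listed earlier in the lecture.
edges = [("C", "D"), ("D", "G"), ("I", "G"), ("I", "D"), ("I", "S"), ("G", "L"),
         ("L", "J"), ("S", "J"), ("L", "S"), ("G", "H"), ("J", "H"), ("G", "J")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

fill, cliques = eliminate(adj)
print("fill-in edges:  ", [tuple(e) for e in fill])
print("maximal cliques:", [tuple(sorted(c)) for c in cliques])
```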

    Example
    [Figure: the student network over C, D, I, G, S, L, J, H; its moralized graph; and one possible triangulation (chordal graph).]
    [Figure: cluster graph whose nodes are the maximal cliques of the triangulated graph, {C,D}, {G,I,D}, {G,S,I}, {G,S,L}, {L,S,J}, {G,H}, with edge weights |Ci∩Cj| (weights 1 and 2 in this example). A maximum spanning tree of this weighted cluster graph gives the clique tree.]

    18
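And here is a sketch of Step III (editor's illustration): Kruskal's algorithm computing a maximum spanning tree over the cluster graph above, using the cliques as transcribed from the slide and edge weights |Ci ∩ Cj|. The edges of the resulting tree are the clique-tree edges.

```python
import itertools

# Clique nodes as listed in the example cluster graph above.
cliques = [{"C", "D"}, {"G", "I", "D"}, {"G", "S", "I"},
           {"G", "S", "L"}, {"L", "S", "J"}, {"G", "H"}]

# Cluster-graph edges between overlapping cliques, weighted by |Ci & Cj|.
edges = [(len(ci & cj), i, j)
         for (i, ci), (j, cj) in itertools.combinations(enumerate(cliques), 2)
         if ci & cj]

# Kruskal's algorithm, taking the heaviest edges first -> maximum spanning tree.
parent = list(range(len(cliques)))
def find(x):                       # union-find with path compression
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

tree = []
for w, i, j in sorted(edges, reverse=True):
    ri, rj = find(i), find(j)
    if ri != rj:                   # the edge connects two components: keep it
        parent[ri] = rj
        tree.append((sorted(cliques[i]), sorted(cliques[j]), w))

for ci, cj, w in tree:
    print(ci, "--", cj, "| sepset size", w)
```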

  • 10

    PARAMETER LEARNING

    Part II

    19

    Learning Introduction
    So far, we assumed that the networks were given. Where do the networks come from?
    Knowledge engineering, with the aid of experts
    Learning: automated construction of networks, learned from examples or instances

    20

  • 11

    Learning Introduction

    Input: a dataset of instances D = {d[1],...,d[m]}
    Output: a Bayesian network

    Measures of success
    How close is the learned network to the original distribution?
    Use distance measures between distributions.
    This is often hard because we do not have the true underlying distribution; instead, evaluate performance by how well the network predicts new unseen examples ("test data"), e.g. classification accuracy.
    How close is the structure of the network to the true one?
    Use a distance metric between structures.
    This is hard because we do not know the true structure; instead, ask whether the independencies learned hold in the test data.

    21

    Prior Knowledge
    Prespecified structure: learn only the CPDs.
    Prespecified variables: learn the network structure and the CPDs.
    Hidden variables: learn the hidden variables, the structure, and the CPDs.
    Complete vs. incomplete data: missing data, unobserved variables.

    22

  • 12

    Learning Bayesian Networks

    Four types of problems will be covered

    23

    [Figure: learning as an inducer. Data and prior information are fed to an Inducer, which outputs a Bayesian network over X1, X2, Y together with its CPDs, e.g. P(Y|X1,X2):

    X1    X2    | y0    y1
    x1^0  x2^0  | 1     0
    x1^0  x2^1  | 0.2   0.8
    x1^1  x2^0  | 0.1   0.9
    x1^1  x2^1  | 0.02  0.98 ]

    I. Known Structure, Complete Data
    Goal: parameter estimation.
    The data does not contain missing values.

    [Figure: the Inducer takes the initial network structure over X1, X2, Y and a complete dataset of instances

    X1    X2    Y
    x1^0  x2^1  y^0
    x1^1  x2^0  y^0
    x1^0  x2^1  y^1
    x1^0  x2^0  y^0
    x1^1  x2^1  y^1
    x1^0  x2^1  y^1
    x1^1  x2^0  y^0

    and outputs the CPD parameters, e.g. the table P(Y|X1,X2) shown above.]

    24

  • 13

    II. Unknown Structure, Complete Data
    Goal: structure learning & parameter estimation.
    The data does not contain missing values.

    [Figure: same setup as above (complete input data, Inducer), but now the Inducer must learn both the network structure over X1, X2, Y and its CPDs.]

    25

    III. Known Structure, Incomplete Data
    Goal: parameter estimation.
    The data contains missing values (e.g. Naïve Bayes).

    [Figure: the Inducer takes the initial network structure over X1, X2, Y and an incomplete dataset of instances ("?" marks a missing value)

    X1    X2    Y
    ?     x2^1  y^0
    x1^1  ?     y^0
    ?     x2^1  ?
    x1^0  x2^0  y^0
    ?     x2^1  y^1
    x1^0  x2^1  ?
    x1^1  ?     y^0

    and outputs the CPD parameters, e.g. the table P(Y|X1,X2) shown above.]

    26

  • 14

    IV. Unknown Structure, Incomplete Data
    Goal: structure learning & parameter estimation.
    The data contains missing values.

    [Figure: same incomplete dataset as above; the Inducer must learn both the network structure over X1, X2, Y and its CPDs.]

    27

    Parameter Estimation
    Input:
    The network structure
    A choice of parametric family for each CPD P(Xi|Pa(Xi))

    Goal: learn the CPD parameters.

    Two main approaches: maximum likelihood estimation and Bayesian approaches.

    28

  • 15

    Biased Coin Toss Example
    The coin can land in two positions: head or tail.

    Estimation task
    Given toss examples x[1],...,x[m], estimate P(X=h) = θ and P(X=t) = 1−θ.
    We write P(H) and P(T) for P(X=h) and P(X=t), respectively.

    Assumption: i.i.d. samples
    Tosses are controlled by an (unknown) parameter θ.
    Tosses are sampled from the same distribution.
    Tosses are independent of each other.

    29


    Biased Coin Toss Example
    Goal: find θ ∈ [0,1] that predicts the data well.
    "Predicts the data well" = the likelihood of the data given θ:

    L(D:θ) = P(D|θ) = ∏i=1..m P(x[i] | x[1],...,x[i−1], θ) = ∏i=1..m P(x[i] | θ)

    where the last equality follows from the i.i.d. assumption.

    Example: the probability of the sequence H,T,T,H,H is

    L(H,T,T,H,H : θ) = P(H|θ) P(T|θ) P(T|θ) P(H|θ) P(H|θ) = θ^3 (1−θ)^2

    [Figure: plot of the likelihood L(D:θ) as a function of θ ∈ [0,1].]

    30

  • 16

    Maximum Likelihood Estimator
    The MLE is the parameter θ that maximizes L(D:θ).
    In our example, θ = 0.6 maximizes the likelihood of the sequence H,T,T,H,H.

    [Figure: plot of L(D:θ) versus θ, with its maximum at θ = 0.6.]

    31

    Maximum Likelihood Estimator
    General case
    Observations: MH heads and MT tails.
    Find the θ maximizing the likelihood

    L(MH, MT : θ) = θ^MH (1−θ)^MT

    This is equivalent to maximizing the log-likelihood

    l(MH, MT : θ) = MH log θ + MT log(1−θ)

    Differentiating the log-likelihood, setting MH/θ − MT/(1−θ) = 0, and solving for θ, we get the maximum likelihood parameter:

    θMLE = MH / (MH + MT)

    32
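A quick numerical check of the coin example (editor's sketch, plain Python): evaluate L(D:θ) = θ^MH (1−θ)^MT on a grid for the sequence H,T,T,H,H and compare the grid maximizer with the closed-form MLE MH/(MH+MT).

```python
tosses = ["H", "T", "T", "H", "H"]
M_H, M_T = tosses.count("H"), tosses.count("T")            # 3 heads, 2 tails

# Likelihood L(D:theta) = theta^M_H * (1-theta)^M_T on a grid of theta values.
thetas = [i / 1000 for i in range(1001)]
likelihood = [t**M_H * (1 - t)**M_T for t in thetas]

best = max(range(len(thetas)), key=lambda i: likelihood[i])
print("grid argmax of L(D:theta):", thetas[best])            # 0.6
print("closed-form MLE M_H/(M_H+M_T):", M_H / (M_H + M_T))   # 0.6
```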

  • 17

    Acknowledgement

    These lecture notes are based on slides by Prof. Eran Segal.

    CSE 515 – Statistical Methods – Spring 2011 33