
Transcript
Page 1: Joseph K. Bradley

Carnegie Mellon

Joseph K. Bradley

Sample Complexity of CRF Parameter Learning

April 9, 2012

Joint work with Carlos Guestrin

CMU Machine Learning Lunch talk on work appearing in AISTATS 2012

Page 2: Joseph K. Bradley

Markov Random Fields (MRFs)

2

Goal: Model distribution P(X) over random variables X

X1: deadline?

X2: bags under eyes?

X3: sick?

X4: losing hair?

X5: overeating?

E.g., P( deadline | bags under eyes, losing hair )

Page 3: Joseph K. Bradley

Markov Random Fields (MRFs)

3

X1: deadline?

X2: bags under eyes?

X3: sick?

X4: losing hair?

X5: overeating?

factor

Goal: Model distribution P(X) over random variables X

Page 4: Joseph K. Bradley

Log-linear MRFs

4

Parameters Features

Our goal: Given structure Φ and data, learn parameters θ.

Binary X:

Real X:
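The binary-X and real-X equations were displayed as images on the slide. A standard sketch of the log-linear form they refer to (notation assumed here, not copied from the slide):

    P_\theta(X) \;=\; \frac{1}{Z(\theta)} \exp\Big( \sum_j \theta_j\, \phi_j(X_{C_j}) \Big),
    \qquad
    Z(\theta) \;=\; \sum_{x} \exp\Big( \sum_j \theta_j\, \phi_j(x_{C_j}) \Big)

where each feature φ_j depends only on a small subset X_{C_j} of the variables (one feature per factor); for real-valued X the sum defining Z(θ) becomes an integral.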

Page 5: Joseph K. Bradley

Parameter Learning: MLE

Traditional learning: max-likelihood estimation (MLE)

Minimize objective:

5

Given data: n i.i.d. samples from

Loss

L2 regularization is more common. Our analysis applies to L1 & L2.

Gold standard: MLE is (optimally) statistically efficient.

Regularization
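The minimized objective was shown as an equation with a loss term and a regularization term. A sketch of its standard form, with notation assumed rather than taken from the slide:

    \min_\theta \;\; -\frac{1}{n} \sum_{i=1}^{n} \log P_\theta\big(x^{(i)}\big) \;+\; \lambda\, \|\theta\|

The first term is the loss (average negative log-likelihood over the n i.i.d. samples) and the second the regularization, with ‖θ‖ either the L1 or L2 norm, matching the note that the analysis applies to both.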

Page 6: Joseph K. Bradley

Parameter Learning: MLE

Traditional learning: max-likelihood estimation (MLE)

Minimize objective:

6

Given data: n i.i.d. samples from

Algorithm: iterate:

Compute gradient. Step along gradient.

Page 7: Joseph K. Bradley

Parameter Learning: MLE

Traditional learning: max-likelihood estimation (MLE)

7

Algorithm: iterate:

Compute gradient. Step along gradient.

Page 8: Joseph K. Bradley

Parameter Learning: MLE

Traditional learning: max-likelihood estimation (MLE)

8

Algorithm: iterate:

Compute gradient. Step along gradient.

Requires inference. Provably hard for general MRFs.

Inference makes learning hard.

Can we learn without intractable inference?
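To make the "inference makes learning hard" point concrete, here is the standard gradient of the (unregularized) MLE loss under the log-linear parameterization sketched earlier (a standard identity, not copied from the slide):

    \nabla_\theta \Big[ -\frac{1}{n} \sum_{i=1}^{n} \log P_\theta\big(x^{(i)}\big) \Big]
    \;=\; \mathbb{E}_{X \sim P_\theta}\big[\phi(X)\big] \;-\; \frac{1}{n} \sum_{i=1}^{n} \phi\big(x^{(i)}\big)

The model expectation E_{P_θ}[φ(X)] is exactly the quantity that requires inference, which is provably hard for general MRFs.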

Page 9: Joseph K. Bradley

Conditional Random Fields (CRFs)

9

X1: deadline?

X2: bags under eyes?

X3: sick?

X4: losing hair?

X5: overeating?

MRFs → CRFs (Lafferty et al., 2001)

E1: weather

E2: full moon

E3: Steelers game

Inference exponential in |X|, not |E|.
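A sketch of the conditional model behind this slide (the standard CRF form; notation assumed):

    P_\theta(X \mid E) \;=\; \frac{1}{Z(\theta, E)} \exp\Big( \sum_j \theta_j\, \phi_j(X_{C_j}, E) \Big),
    \qquad
    Z(\theta, E) \;=\; \sum_{x} \exp\Big( \sum_j \theta_j\, \phi_j(x_{C_j}, E) \Big)

The sum defining Z(θ, E) runs only over assignments to X, with E fixed, which is why inference is exponential in |X| rather than |E|; the next slide notes the catch that Z depends on E.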

Page 10: Joseph K. Bradley

Conditional Random Fields (CRFs)

10

MRFs → CRFs (Lafferty et al., 2001)

But Z depends on E! Inference exponential in |X|, not |E|.

Compute Z(e) for every training example!

Objective:

Inference makes learning even harder for CRFs.

Can we learn without intractable inference?

Page 11: Joseph K. Bradley

Outline

Parameter learning
Before: No PAC learning results for general MRFs or CRFs

11

Sample complexity results
PAC learning via pseudolikelihood for general MRFs and CRFs

Empirical analysis of bounds
Tightness & dependence on model

Structured composite likelihood
Lowering sample complexity

Page 12: Joseph K. Bradley

Related Work

Ravikumar et al. (2010): PAC bounds for regression Yi~X with Ising factors

Our theory is largely derived from this work.

Liang and Jordan (2008): Asymptotic bounds for pseudolikelihood, composite likelihood

Our finite sample bounds are of the same order.

Learning with approximate inference: no PAC-style bounds for general MRFs or CRFs. Cf. Hinton (2002), Koller & Friedman (2009), Wainwright (2006).

12

Page 13: Joseph K. Bradley

Outline

Parameter learning
Before: No PAC learning results for general MRFs or CRFs

Sample complexity results
PAC learning via pseudolikelihood for general MRFs and CRFs

Empirical analysis of bounds
Tightness & dependence on model

Structured composite likelihood
Lowering sample complexity

13

Page 14: Joseph K. Bradley

Avoiding Intractable Inference

14

MLE loss:

Hard to compute. So replace it!

Page 15: Joseph K. Bradley

Pseudolikelihood (MPLE)

15

MLE loss:

Pseudolikelihood (MPLE) loss:

Intuition: Approximate distribution as product of local conditionals.

X1: deadline?

X2: bags under eyes?

X3: sick?

X4: losing hair?

X5: overeating?

(Besag, 1975)
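The MPLE loss was displayed as an equation; a sketch of the standard pseudolikelihood objective it refers to (Besag, 1975), with notation assumed:

    \ell_{\mathrm{MPLE}}(\theta) \;=\; -\frac{1}{n} \sum_{m=1}^{n} \sum_{i} \log P_\theta\big( x_i^{(m)} \,\big|\, x_{-i}^{(m)} \big)

Each local conditional P_θ(X_i | X_{-i}) is normalized over the single variable X_i only, so no global partition function appears.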

Page 16: Joseph K. Bradley

Pseudolikelihood (MPLE)

16

MLE loss:

Pseudolikelihood (MPLE) loss:

Intuition: Approximate distribution as product of local conditionals.

X1: deadline?

X2: bags under eyes?

X3: sick?

X4: losing hair?

X5: overeating?

(Besag, 1975)

Page 17: Joseph K. Bradley

Pseudolikelihood (MPLE)

17

MLE loss:

Pseudolikelihood (MPLE) loss:

Intuition: Approximate distribution as product of local conditionals.

X1: deadline?

X2: bags under eyes?

X3: sick?

X4: losing hair?

X5: overeating?

(Besag, 1975)

Page 18: Joseph K. Bradley

Pseudolikelihood (MPLE)

18

MLE loss:

Pseudolikelihood (MPLE) loss:

No intractable inference required!

Previous work:
•Pro: Consistent estimator
•Con: Less statistically efficient than MLE
•Con: No PAC bounds

(Besag, 1975)
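As a concrete illustration of the "no intractable inference" point, below is a minimal sketch of the MPLE objective for a pairwise binary MRF whose only features are edge products x_i·x_j. The model, variable names, and helper structure are assumptions chosen for illustration, not code from the talk.

    import math

    # Minimal sketch (assumed setup, not the talk's code): average negative
    # log-pseudolikelihood for a pairwise binary MRF over {0,1} variables whose
    # only features are products x_i * x_j on edges, weighted by theta[(i, j)].
    def neg_pseudolikelihood(theta, samples, num_vars):
        # Collect each variable's neighbors and the corresponding edge weights.
        neighbors = {v: [] for v in range(num_vars)}
        for (i, j), w in theta.items():
            neighbors[i].append((j, w))
            neighbors[j].append((i, w))

        total = 0.0
        for x in samples:                  # x: sequence of 0/1 values, length num_vars
            for j in range(num_vars):
                # The conditional of one binary variable given its neighbors is logistic:
                # P(x_j = 1 | x_-j) = sigmoid(sum_k theta_jk * x_k).
                score = sum(w * x[k] for k, w in neighbors[j])
                log_p1 = score - math.log1p(math.exp(score))   # log sigmoid(score)
                log_p0 = -math.log1p(math.exp(score))          # log(1 - sigmoid(score))
                total += log_p1 if x[j] == 1 else log_p0
        return -total / len(samples)

    # Example usage on a 3-variable chain X1 - X2 - X3 (0-indexed):
    # theta = {(0, 1): 0.5, (1, 2): -0.3}
    # loss = neg_pseudolikelihood(theta, [(1, 1, 0), (0, 1, 1)], num_vars=3)

Only per-variable normalization appears (the log1p terms), so evaluating or differentiating this objective never touches the global Z(θ).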

Page 19: Joseph K. Bradley

Outline

19

Page 20: Joseph K. Bradley

Sample Complexity: MLE

20

Theorem: Given n i.i.d. samples from Pθ*(X), MLE using L1 or L2 regularization achieves avg. per-parameter error ε with probability ≥ 1-δ if:

r: # parameters (length of θ)

Λmin: min eigenvalue of Hessian of loss at θ*
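Written out, the quantity defined above is the minimum eigenvalue of the loss Hessian at the true parameters (a direct transcription of the verbal definition; the paper's exact statement may add constants or restrictions):

    \Lambda_{\min} \;=\; \lambda_{\min}\!\big( \nabla^2 \ell(\theta^*) \big)

where ℓ is the (unregularized) MLE loss and λ_min(·) denotes the minimum eigenvalue; a larger Λmin corresponds to a better-conditioned loss and, per the theorem, a smaller required sample size.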

Page 21: Joseph K. Bradley

Sample Complexity: MPLE

21

r = length of θ

ε = avg. per-parameter error

δ = probability of failure

For MLE: Λmin = min eigval of Hessian of loss at θ*.

Same form as for MLE:

For MPLE: Λmin = min_i [ min eigval of Hessian of loss component i at θ* ].
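A compact way to write the two definitions side by side (transcribing the verbal definitions above; ℓ_i denotes the i-th local conditional loss component of MPLE):

    \Lambda_{\min}^{\mathrm{MLE}} = \lambda_{\min}\big( \nabla^2 \ell_{\mathrm{MLE}}(\theta^*) \big),
    \qquad
    \Lambda_{\min}^{\mathrm{MPLE}} = \min_i \; \lambda_{\min}\big( \nabla^2 \ell_i(\theta^*) \big)

So a single poorly conditioned local component can drive the MPLE constant down, a point revisited later when comparing MPLE with MCLE.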

Page 22: Joseph K. Bradley

Joint vs. Disjoint Optimization

22

X1: deadline?

Pseudolikelihood (MPLE) loss:

Intuition: Approximate distribution as product of local conditionals.

Page 23: Joseph K. Bradley

Joint vs. Disjoint Optimization

23

X1: deadline?

Joint: MPLE

Disjoint: Regress Xi~X-i. Average parameter estimates.

Page 24: Joseph K. Bradley

Joint vs. Disjoint Optimization

24

Joint MPLE:

Disjoint MPLE:

Sample complexity bounds

Con: worse bound. Pro: data parallel.
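A minimal sketch of the disjoint variant described above: each local regression X_i ~ X_-i is fit independently (hence data parallel), and parameters estimated by more than one regression are averaged. The fit_logistic helper and the pairwise binary setup are assumptions for illustration, not the talk's code.

    from collections import defaultdict

    # Disjoint MPLE sketch (assumed setup): fit each conditional P(X_i | X_-i) as its
    # own logistic regression over X_i's neighbors, then average the two estimates
    # that every edge weight receives (one from each endpoint's regression).
    def disjoint_mple(fit_logistic, samples, edges, num_vars):
        estimates = defaultdict(list)
        for i in range(num_vars):
            nbrs = []
            for (a, b) in edges:
                if a == i:
                    nbrs.append(b)
                elif b == i:
                    nbrs.append(a)
            targets = [x[i] for x in samples]                   # regress X_i ...
            features = [[x[j] for j in nbrs] for x in samples]  # ... on its neighbors
            weights = fit_logistic(targets, features)           # one weight per neighbor
            for j, w_ij in zip(nbrs, weights):
                estimates[tuple(sorted((i, j)))].append(w_ij)
        # Average the per-edge estimates (each edge is estimated by both endpoints).
        return {edge: sum(vals) / len(vals) for edge, vals in estimates.items()}

Each loop iteration is independent, which is the "data parallel" advantage; the averaging of separately fitted estimates is what the worse bound refers to.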

Page 25: Joseph K. Bradley

Bounds for Log Loss

25

We have seen MLE & MPLE sample complexity:

where

Theorem: If parameter estimation error ε is small, then log loss converges quadratically in ε; else log loss converges linearly in ε.

(Matches rates from Liang and Jordan, 2008.)
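A hedged symbolic reading of the theorem (constants, the precise small-ε threshold, and the exact loss definition are in the paper; notation assumed): writing θ̂ for the learned parameters,

    \text{small } \varepsilon:\quad
    \mathbb{E}\big[-\log P_{\hat\theta}(X)\big] - \mathbb{E}\big[-\log P_{\theta^*}(X)\big] \;=\; O(\varepsilon^2),
    \qquad
    \text{otherwise:}\quad O(\varepsilon)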

Page 26: Joseph K. Bradley

Outline

26

Page 27: Joseph K. Bradley

Synthetic CRFs

27

[Figure: synthetic CRF structures (chains, stars, grids) over variables X1, X2, ..., with two factor types, random and associative, each parameterized by a factor strength.]
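For concreteness, one common way to instantiate the two factor types with factor strength s (an assumption for illustration; the talk's exact parameterization was shown in the figure and may differ):

    \text{random:}\quad \log \phi_{ij}(x_i, x_j) = \theta_{ij}\, x_i x_j,
    \;\; \theta_{ij} \sim \mathrm{Uniform}[-s, s]
    \qquad
    \text{associative:}\quad \log \phi_{ij}(x_i, x_j) = s \cdot \mathbf{1}[x_i = x_j]

Associative factors reward neighboring variables for agreeing; random factors can either attract or repel.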

Page 28: Joseph K. Bradley

Tightness of Bounds

28

Parameter estimation error ≤ f(sample size)

Log loss ≤ f(parameter estimation error)

Setup: chain, |X|=4, random factors. Estimators compared: MLE, MPLE, MPLE-disjoint.

Page 29: Joseph K. Bradley

Tightness of Bounds

29

[Plots: L1 parameter error and L1 parameter error bound vs. training set size, for MLE, MPLE, and MPLE-disjoint. Chain, |X|=4, random factors.]

Log loss ≤ f(parameter estimation error)

Page 30: Joseph K. Bradley

Tightness of Bounds

30

[Plots: L1 parameter error and its bound vs. training set size, and log (base e) loss and its bound (given params) vs. training set size, for MLE, MPLE, and MPLE-disjoint. Chain, |X|=4, random factors.]

Page 31: Joseph K. Bradley

Tightness of Bounds

31

Parameter estimation error ≤ f(sample size) (looser)

Log loss ≤ f(parameter estimation error) (tighter)

Page 32: Joseph K. Bradley

Predictive Power of Bounds

32

Parameter estimation error ≤ f(sample size) (looser)

Is the bound still useful (predictive)?

Examine dependence on Λmin, r.

Page 33: Joseph K. Bradley

Predictive Power of Bounds

33

[Plots: L1 parameter error and L1 parameter error bound vs. 1/Λmin for MLE, at r = 5, 11, 23 (similar results for MPLE). Chains, random factors, 10,000 training examples.]

Actual error vs. bound:
•Different constants
•Similar behavior
•Nearly independent of r

Page 34: Joseph K. Bradley

Recall: Λmin

34

For MLE: Λmin = min eigval of Hessian of the loss at θ*.

Sample complexity:

For MPLE: Λmin = min_i [ min eigval of Hessian of loss component i at θ* ].

How do Λmin(MLE) and Λmin(MPLE) vary for different models?

Page 35: Joseph K. Bradley

Λmin ratio: MLE/MPLE: chains

35

[Plots: Λmin ratio (MLE/MPLE) vs. factor strength (fixed |Y|=8) and vs. model size |Y| (fixed factor strength), for random factors and for associative factors, on chains.]

Page 36: Joseph K. Bradley

Λmin ratio: MLE/MPLE: stars

36

[Plots: Λmin ratio (MLE/MPLE) vs. factor strength (fixed |Y|=8) and vs. model size |Y| (fixed factor strength), for random factors and for associative factors, on stars.]

Page 37: Joseph K. Bradley

Outline

Parameter learning
Before: No PAC learning results for general MRFs or CRFs

Sample complexity results
PAC learning via pseudolikelihood for general MRFs and CRFs

Empirical analysis of bounds
Tightness & dependence on model

Structured composite likelihood
Lowering sample complexity

37

Page 38: Joseph K. Bradley

Grid Example

38

MLE: Estimate P(Y) all at once

Page 39: Joseph K. Bradley

Grid Example

39

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y-i) separately

Page 40: Joseph K. Bradley

Grid Example

40

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y-i) separately


Something in between? Estimate a larger component, but keep inference tractable.

Composite Likelihood (MCLE):

Estimate P(YAi | Y-Ai) separately, where YAi ⊆ Y.

(Lindsay, 1988)
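The MCLE objective being referred to, in its standard form (Lindsay, 1988), with notation assumed:

    \ell_{\mathrm{MCLE}}(\theta) \;=\; -\frac{1}{n} \sum_{m=1}^{n} \sum_{i} \log P_\theta\big( y_{A_i}^{(m)} \,\big|\, y_{-A_i}^{(m)} \big)

MPLE is the special case where each component YAi is a single variable, and MLE is the case of one component containing all of Y; components in between trade statistical efficiency against the cost of inference within each YAi.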

Page 41: Joseph K. Bradley

Grid Example

41

Grid with weak horizontal factors and strong vertical factors.

Composite Likelihood (MCLE):

Estimate P(YAi | Y-Ai) separately, where YAi ⊆ Y.

Choosing MCLE components YAi:
•Larger is better.
•Keep inference tractable.
•Choose using model structure.
Good choice: vertical combs.

Page 42: Joseph K. Bradley

Λmin ratio: MLE vs. MPLE and MCLE: grids

42

[Plots: Λmin ratio vs. grid width (fixed factor strength) and vs. factor strength (fixed |Y|=8), comparing combs (MCLE) with MPLE, for random factors and for associative factors.]

Page 43: Joseph K. Bradley

Structured MCLE on a Grid

43

[Plots: log loss ratio (other/MLE) vs. grid size |X|, and training time (sec) vs. grid size |X|, for combs (MCLE), MPLE, and MLE. Grid with associative factors (fixed strength), 10,000 training samples, Gibbs sampling for inference.]

Combs (MCLE) lower sample complexity... without increasing computation!

Page 44: Joseph K. Bradley

Averaging MCLE Estimators

44

MLE & MPLE sample complexity:

Λmin(MLE) = min eigval of Hessian of the loss at θ*.
Λmin(MPLE) = min_i [ min eigval of Hessian of loss component i at θ* ].

MCLE sample complexity:

ρmin = min_j [ sum, over components Ai which estimate θj, of the min eigval of the Hessian of component Ai's loss at θ* ].
Mmax = max_j [ number of components Ai which estimate θj ].
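Transcribing the two verbal definitions above into symbols (ℓ_{A_i} denotes the loss component for MCLE component A_i; the paper's exact statement may restrict to the relevant coordinates):

    \rho_{\min} \;=\; \min_j \sum_{A_i \,\text{estimating}\, \theta_j} \lambda_{\min}\big( \nabla^2 \ell_{A_i}(\theta^*) \big),
    \qquad
    M_{\max} \;=\; \max_j \big|\{ A_i : A_i \text{ estimates } \theta_j \}\big|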

Page 45: Joseph K. Bradley

Averaging MCLE Estimators

45

MLE & MPLE sample complexity:

MCLE sample complexity:

ρmin = min_j [ sum, over components Ai which estimate θj, of the min eigval of the Hessian of component Ai's loss at θ* ].
Mmax = max_j [ number of components Ai which estimate θj ].

[Figure: example MCLE component structures on a four-variable model, indicating which parameters are estimated by both components and which by one component; the three examples have Mmax = 2, 2, and 3.]

Page 46: Joseph K. Bradley

Averaging MCLE Estimators

46

MLE & MPLE sample complexity:

MCLE sample complexity:

For MPLE, a single bad estimator P(Xi|X-i) can give a bad bound.

For MCLE, the effect of a bad estimator P(XAi|X-Ai) can be averaged out by other good estimators.

Page 47: Joseph K. Bradley

Structured MCLE on a Grid

47

[Plot: Λmin vs. grid width for MLE, MPLE, Combs-both, Comb-vert, and Comb-horiz. Grid with strong vertical (associative) factors.]

Page 48: Joseph K. Bradley

Summary: MLE vs. MPLE/MCLE

Relative performance of estimators:
•Increasing model diameter has little effect.
•MPLE/MCLE get worse with increasing: factor strength, node degree, grid width.

48

Structured MCLE partly solves these problems.

Choose MCLE structure according to factor strength, node degree, and grid structure. Same computational cost as MPLE.

Page 49: Joseph K. Bradley

Summary
•PAC learning via MPLE & MCLE for general MRFs and CRFs.
•Empirical analysis: bounds are predictive of empirical behavior; strong factors and high-degree nodes hurt MPLE.
•Structured MCLE: can have lower sample complexity than MPLE but the same computational complexity.

49

Future work:
•Choosing MCLE structure on natural graphs.
•Parallel learning: improving statistical efficiency of disjoint optimization via limited communication.
•Comparing with MLE using approximate inference.

Thank you!

Page 50: Joseph K. Bradley

Canonical Parametrization

Abbeel et al. (2006): the only previous method for PAC-learning high-treewidth discrete MRFs. PAC bounds for low-degree factor graphs over discrete X.

Main idea:
•Re-write P(X) as a ratio of many small factors P( XCi | X-Ci ). (Fine print: each factor is instantiated 2|Ci| times using a reference assignment.)
•Estimate each small factor P( XCi | X-Ci ) from data.
•Plug the factors into the big expression for P(X).

50

Theorem: If the canonical parametrization uses the factorization of P(X), it is equivalent to MPLE with disjoint optimization. Computing MPLE directly is faster. Our analysis covers their learning method.