
Transcript of "Inference in Sparse Graphs with Pairwise Measurements and Side Information" (Dylan Foster, Daniel Reichman, and Karthik Sridharan)


Some Concrete Results

• If G is connected, Error = O(p|V|).
• For grids of size c × n, optimal Error = Θ(p²n) until c ≤ 2, where O(pn) is optimal.
• √n × √n grid: we recover O(p²n), but like so: [figure]

Tree decompositions
For a graph G = (V, E), let 𝒲 = {W_1, . . . , W_N} be a collection of subsets of V and let T = (𝒲, F) be a tree graph over 𝒲. T is a tree decomposition if:

1. ⋃_{W ∈ 𝒲} W = V.
2. For each uv ∈ E, some W ∈ 𝒲 has u, v ∈ W.
3. For W_1, W_2, W_3 ∈ 𝒲, if W_2 is on the path from W_1 to W_3, then W_1 ∩ W_3 ⊆ W_2.
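Aside (not from the poster): the three conditions are easy to check programmatically. Below is a minimal Python sketch with a made-up helper name and a toy path-graph example; condition 3 is verified via the equivalent running-intersection property (the bags containing any fixed vertex form a connected subtree of T).

def is_tree_decomposition(V, E, bags, tree_edges):
    """Check conditions 1-3 for a candidate tree decomposition.

    V: iterable of vertices; E: iterable of (u, v) edges of G;
    bags: list of vertex sets W_1, ..., W_N (indexed 0..N-1);
    tree_edges: edges of the tree T over bag indices (T assumed to be a tree).
    """
    # 1. The bags cover every vertex of G.
    if set().union(*bags) != set(V):
        return False
    # 2. Every edge uv of G lies inside some bag.
    if any(not any({u, v} <= bag for bag in bags) for (u, v) in E):
        return False
    # 3. Running intersection: for each vertex, the bags containing it
    #    must form a connected subtree of T (equivalent to condition 3).
    adj = {i: set() for i in range(len(bags))}
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for x in set(V):
        containing = {i for i, bag in enumerate(bags) if x in bag}
        start = next(iter(containing))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in adj[i] & containing:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != containing:
            return False
    return True

# Toy example: path 1-2-3-4 with bags {1,2}, {2,3}, {3,4} arranged in a path.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 4)]
bags = [{1, 2}, {2, 3}, {3, 4}]
tree_edges = [(0, 1), (1, 2)]
print(is_tree_decomposition(V, E, bags, tree_edges))  # True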

Main theorem: Recovery from tree decomposition
Suppose we have:
• G′ = (V, E′) with E′ ⊆ E.
• A tree decomposition T = (𝒲, F) for G′ with constant width and overlap.
• (G′(W)) ≥ ∆ for each W ∈ 𝒲.

Then there is an efficient Ŷ s.t.

    Error ≤ Õ(p^⌈∆/2⌉ n).

Tree algorithm: proof sketch

Statistical learning reduction: take Ŷ to be the empirical risk minimizer

    Ŷ = argmin_{Y′ ∈ F(X)} Σ_{v∈V} 1{Y′_v ≠ Z_v}.

Rate for ERM:

    Σ_{v∈V} 1{Ŷ_v ≠ Y_v} ≤ Õ(log|F(X)| / ε²)  w.h.p. over Z.

We have |F(X)| ≈ O((1/p)^{pn}), so E(Ŷ) ≤ Õ(pn).
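As an illustration only (this is not the poster's efficient tree algorithm), the ERM step itself is a one-liner in Python once a candidate set standing in for F(X) is given explicitly; the function name and toy labelings are made up.

def erm_over_candidates(candidates, Z):
    """candidates: list of labelings (dict v -> ±1), an explicit stand-in for F(X);
    Z: dict v -> ±1 of noisy vertex labels. Returns the candidate that disagrees
    with Z on the fewest vertices (the empirical risk minimizer)."""
    return min(candidates, key=lambda Yp: sum(Yp[v] != Z[v] for v in Z))

# Toy usage on three candidate labelings of vertices {0, 1, 2}:
candidates = [{0: 1, 1: 1, 2: 1}, {0: 1, 1: -1, 2: 1}, {0: -1, 1: -1, 2: -1}]
Z = {0: 1, 1: -1, 2: 1}
print(erm_over_candidates(candidates, Z))  # {0: 1, 1: -1, 2: 1}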

First result

Theorem: Optimal recovery for trees
When G is a tree:
• Efficient algorithm Ŷ with Hamming error Error ≤ Õ(pn) w.h.p.
• Lower bound of Ω(pn).
• ⇒ Error = Õ(pn) for all connected G (run the tree algorithm on a spanning tree)!

Key ideas:
• The edge MLE is not enough; squeeze more information out of the Zs.
• Adopt a statistical learning viewpoint.


Contributions
• Characterize optimal recovery rates for trees.
• Lift the result to general graphs via tree decomposition.
• Non-trivial recovery rates for all connected graphs, including sparse graphs where recovery without side information is impossible.
• All rates are finite-sample and hold with high probability.
• All achieved efficiently.

Key challenge

Huge body of work on solving / approximating MAP, MLE, etc., but how to establish tight bounds on statistical performance?

Goal: Error = O(h(p)n), with h(p) → 0 as p → 0.


Model
Introduced in [Globerson-Roughgarden-Sontag-Yildirim '15].

• Fixed graph G = (V, E), with |V| = n, |E| = m.
• Ground truth labels Y ∈ {±1}^V.
• Observe noisy edge labels X ∈ {±1}^E:

    X_uv = Y_u Y_v with prob. 1 − p,   X_uv = −Y_u Y_v with prob. p.

• Observe noisy vertex labels Z ∈ {±1}^V:

    Z_u = Y_u with prob. 1 − q,   Z_u = −Y_u with prob. q.
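For concreteness, here is a small Python sketch (the function name and toy graph are mine, not from the poster) that samples X and Z from the model above given a ground-truth labeling Y.

import random

def sample_model(V, E, Y, p, q, rng=random):
    """Sample noisy edge labels X and noisy vertex labels Z.

    Y: dict v -> ±1 of ground-truth labels. Each X_uv equals Y_u * Y_v and is
    flipped independently with probability p; each Z_u equals Y_u and is
    flipped independently with probability q."""
    X = {(u, v): Y[u] * Y[v] * (-1 if rng.random() < p else 1) for (u, v) in E}
    Z = {u: Y[u] * (-1 if rng.random() < q else 1) for u in V}
    return X, Z

# Toy usage: a path on 4 vertices with all-(+1) ground truth.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3)]
Y = {v: 1 for v in V}
X, Z = sample_model(V, E, Y, p=0.1, q=0.4)
print(X, Z)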

Inference in Sparse Graphs with Pairwise Measurements and Side Information

Dylan Foster, Daniel Reichman, and Karthik Sridharan


[email protected], [email protected], [email protected]

Model

Goal: Obtain a small Hamming error:

    E(Ŷ) ≜ Σ_{v∈V} 1{Ŷ_v(X, Z) ≠ Y_v}.

aka partial recovery.
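In code, the Hamming error of an estimate against the ground truth is the following short helper (the name is made up, for illustration).

def hamming_error(Y_hat, Y):
    """E(Y_hat): number of vertices where the estimate Y_hat (dict v -> ±1)
    disagrees with the ground-truth labels Y."""
    return sum(Y_hat[v] != Y[v] for v in Y)

# e.g. hamming_error({0: 1, 1: -1}, {0: 1, 1: 1}) == 1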

Interested in:
• Finite-sample behavior of E(Ŷ) as a function of p:
  • Want α(G) such that E(Ŷ) = O(p^α(G) n).
  • Treat q = 1/2 − ε as constant; the Zs are very noisy.
  • Interested in the p = ω(1/n) regime.
• α(G) for deterministic classes of graphs.
• Sparse regime: ∆(G) = O(1).


Graph inference
Basic problem: Recover latent node variables using noisy measurements on the edges of a graph G = (V, E).

• Community detection

• Inference for structured prediction (e.g. image segmentation)

• Alignment/registration/synchronization, correlation clustering, genome assembly, ... many more!

Censored block model: [Abbe et al. '14], [Saade et al. '15], [Globerson, Yildirim, Roughgarden, Sontag '15], [Chen et al. '15], [Joachims/Hopcroft '05].

Motivation

Our question: How do recovery prospects change with the addition of side information?

Tree algorithm: proof sketch

How to take advantage of Chernoff?

    Σ_{uv∈E} 1{X_uv ≠ Y_u Y_v} ≤ 2pn + O(1)  w.h.p.

Define the hypothesis class:

    F(X) ≜ { Y′ ∈ {±1}^V : Σ_{uv∈E} 1{X_uv ≠ Y′_u Y′_v} ≤ 2pn + O(1) }.

Then we have Y ∈ F(X) w.h.p., and |F(X)| ≈ O((1/p)^{pn}).

Strategy: Learn Y from F(X) using Z!
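Purely for illustration (exhaustive enumeration is exponential, whereas the poster's algorithm is efficient), here is a brute-force Python sketch of F(X) on a tiny path graph, using the edge-disagreement threshold above; the function name and example are made up.

from itertools import product

def hypothesis_class(V, E, X, threshold):
    """Brute-force stand-in for F(X): all labelings Y' in {±1}^V whose number
    of disagreeing edges, i.e. edges uv with X_uv != Y'_u * Y'_v, is at most
    the threshold (roughly 2pn in the proof sketch)."""
    F = []
    for assignment in product([-1, 1], repeat=len(V)):
        Yp = dict(zip(V, assignment))
        disagreements = sum(X[(u, v)] != Yp[u] * Yp[v] for (u, v) in E)
        if disagreements <= threshold:
            F.append(Yp)
    return F

# Toy usage: 4-vertex path, noiseless edge observations (p = 0), threshold 1.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3)]
Y = {v: 1 for v in V}
X = {(u, v): Y[u] * Y[v] for (u, v) in E}
F = hypothesis_class(V, E, X, threshold=1)
print(len(F))  # 8 labelings have at most one disagreeing edge; Y is among them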



Further examples: Hypergrids, lattices, Newman-Watts — see paper for more.
