Page 1
Part 3: Latent representations and unsupervised learning

Dale Schuurmans

University of Alberta

Page 2

Supervised versus unsupervised learning

Prominent training principles

Discriminative (typical for supervised learning): [diagrams of graphical models relating x, y, and features φ]

Generative (typical for unsupervised learning): [diagrams of graphical models relating x, y, and latent features φ]

Page 3

Unsupervised representation learning

Consider generative training

[Figure: generative graphical models relating x, y, and latent features φ]

Page 4

Unsupervised representation learning

Examples

• dimensionality reduction (PCA, exponential family PCA)
• sparse coding
• independent component analysis
• deep learning
• ...

Usually involves learning both

a latent representation for data and a data reconstruction model

Context could be: unsupervised, semi-supervised, or supervised

Page 5

Challenge

Optimal feature discovery appears to be generally intractable

Have to jointly train
• latent representation
• data reconstruction model

Usually resort to alternating minimization

(sole exception: PCA)

Page 6

First consider unsupervised feature discovery

Page 7

Unsupervised feature discovery

Single layer case = matrix factorization

$X \approx B\,\Phi$

X (n × t): original data; B (n × m): learned dictionary; Φ (m × t): new representation

t = # training examples, n = # original features, m = # new features

Choose B and Φ to minimize data reconstruction loss

$L(B\Phi; X) = \sum_{i=1}^{t} L(B\Phi_{:i}; X_{:i})$

Seek desired structure in the latent feature representation:
• Φ low rank: dimensionality reduction
• Φ sparse: sparse coding
• Φ rows independent: independent component analysis
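To make the factorization objective concrete, here is a minimal numpy sketch (not from the slides; it assumes the simplest case, squared loss) that evaluates the column-wise reconstruction loss with the shapes in the dimension key above.

```python
import numpy as np

def reconstruction_loss(B, Phi, X):
    """Squared reconstruction loss L(B Phi; X) = sum_i ||B Phi[:, i] - X[:, i]||^2."""
    R = B @ Phi - X                  # n x t residual matrix
    return float(np.sum(R ** 2))     # column-wise sum of squared errors

# X is n x t (data), B is n x m (dictionary), Phi is m x t (new representation)
rng = np.random.default_rng(0)
n, m, t = 5, 3, 10
B, Phi = rng.standard_normal((n, m)), rng.standard_normal((m, t))
X = B @ Phi + 0.01 * rng.standard_normal((n, t))
print(reconstruction_loss(B, Phi, X))
```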

Page 8

Generalized matrix factorization

Assume the reconstruction loss $L(\hat{x}; x)$ is convex in its first argument

Bregman divergence

$L(\hat{x}; x) = D_F(\hat{x} \,\|\, x) = D_{F^*}(f(x) \,\|\, f(\hat{x}))$

($F$ a strictly convex potential with transfer $f = \nabla F$)

Tries to make $\hat{x} \approx x$

Matching loss

$L(\hat{x}; x) = D_F(\hat{x} \,\|\, f^{-1}(x)) = D_{F^*}(x \,\|\, f(\hat{x}))$

Tries to make $f(\hat{x}) \approx x$

(A nonlinear predictor, but the loss is still convex in $\hat{x}$)

[Figure: reconstruction model mapping latent features φ through dictionary B to data x]

Regular exponential family

$L(\hat{x}; x) = -\log p_B(x \mid \phi) = D_F(\hat{x} \,\|\, f^{-1}(x)) - F^*(x) - \text{const}$

Page 9

Training problem

$\min_{B\in\mathbb{R}^{n\times m}} \;\min_{\Phi\in\mathbb{R}^{m\times t}}\; L(B\Phi; X)$

How to impose desired structure on Φ?

Page 10

Training problem

$\min_{B\in\mathbb{R}^{n\times m}} \;\min_{\Phi\in\mathbb{R}^{m\times t}}\; L(B\Phi; X)$

How to impose desired structure on Φ?

Dimensionality reduction

Fix # features m < min(n, t)
• But only known to be tractable if $L(\hat X; X) = \|X - \hat X\|_F^2$ (PCA)
• No known efficient algorithm for other standard losses

Problem: the rank(Φ) = m constraint is too hard

Page 11

Training problem

minB∈Bm2

minΦ∈Rm×t

L(BΦ;X )+α‖Φ‖2,1

How to impose desired structure on Φ?

Relaxed dimensionality reduction (subspace learning)

Add rank reducing regularizer

$\|\Phi\|_{2,1} = \sum_{j=1}^{m} \|\Phi_{j:}\|_2$

Favors null rows in Φ

But need to add constraint to B

$B_{:j} \in \mathcal{B}_2 = \{\, b : \|b\|_2 \le 1 \,\}$

(Otherwise can make Φ small just by making B large)

Page 12

Training problem

$\min_{B\in\mathcal{B}_q^m} \;\min_{\Phi\in\mathbb{R}^{m\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|_{1,1}$

How to impose desired structure on Φ?

Sparse coding

Use sparsity inducing regularizer

$\|\Phi\|_{1,1} = \sum_{j=1}^{m}\sum_{i=1}^{t} |\Phi_{ji}|$

Favors sparse entries in Φ

Need to add constraint to B

$B_{:j} \in \mathcal{B}_q = \{\, b : \|b\|_q \le 1 \,\}$

(Otherwise can make Φ small just by making B large)
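For concreteness, a small sketch (my own, not from the slides) of the two regularizers used on pages 11 and 12, along with the column constraint on B enforced by rescaling:

```python
import numpy as np

def norm_21(Phi):
    # ||Phi||_{2,1} = sum_j ||Phi[j, :]||_2  (favors all-zero rows)
    return float(np.linalg.norm(Phi, axis=1).sum())

def norm_11(Phi):
    # ||Phi||_{1,1} = sum of absolute entries (favors sparse entries)
    return float(np.abs(Phi).sum())

def project_columns(B, q=2):
    # Enforce ||B[:, j]||_q <= 1 by rescaling any column that is too long
    norms = np.linalg.norm(B, ord=q, axis=0)
    return B / np.maximum(norms, 1.0)
```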

Page 13

Training problem

$\min_{B\in\mathbb{R}^{n\times m}} \;\min_{\Phi\in\mathbb{R}^{m\times t}}\; L(B\Phi; X) + \alpha D(\Phi)$

How to impose desired structure on Φ?

Independent components analysis

Usually enforces BΦ = X as a constraint
• but interpolation is generally a bad idea
• instead, just minimize the reconstruction loss plus a dependence measure D(Φ) as a regularizer

Difficulty

Formulating a reasonable convex dependence penalty

Page 14

Training problem

Consider subspace learning and sparse coding

$\min_{B\in\mathcal{B}^m} \;\min_{\Phi\in\mathbb{R}^{m\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|$

Choice of ‖Φ‖ and B determines type of representation recovered

Page 15

Training problem

Consider subspace learning and sparse coding

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|$

Choice of ‖Φ‖ and B determines type of representation recovered

Problem: still have the rank constraint imposed by # new features m

Idea: just relax m → ∞
• rely on a sparsity-inducing norm ‖Φ‖ to select features

Page 16

Training problem

Consider subspace learning and sparse coding

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|$

Still have a problem

Optimization problem is not jointly convex in B and Φ

Page 17

Training problem

Consider subspace learning and sparse coding

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|$

Still have a problem

Optimization problem is not jointly convex in B and Φ

Idea 1: Alternate!
• convex in B given Φ
• convex in Φ given B
(a minimal sketch of this alternation follows below)

Could use any other form of local training
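The following is an entirely illustrative instance of Idea 1 (my own sketch, assuming squared loss with the ‖·‖_{1,1} regularizer and unit-ℓ2 columns of B): it alternates an ISTA step for Φ with a projected-gradient step for B, and, as the slides note, only reaches a local solution.

```python
import numpy as np

def soft_threshold(A, tau):
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def alternate_sparse_coding(X, m, alpha, outer=20, inner=50, seed=0):
    """Local alternating minimization for
         min 0.5*||X - B Phi||_F^2 + alpha*||Phi||_{1,1}   s.t.  ||B[:, j]||_2 <= 1.
    Each half-step solves a convex subproblem (ISTA for Phi, projected gradient for B)."""
    n, t = X.shape
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((n, m))
    B /= np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1e-12)
    Phi = np.zeros((m, t))
    for _ in range(outer):
        L = np.linalg.norm(B, 2) ** 2 + 1e-12          # Lipschitz constant for the Phi step
        for _ in range(inner):
            Phi = soft_threshold(Phi - B.T @ (B @ Phi - X) / L, alpha / L)
        Lb = np.linalg.norm(Phi, 2) ** 2 + 1e-12       # Lipschitz constant for the B step
        for _ in range(inner):
            B = B - (B @ Phi - X) @ Phi.T / Lb
            B /= np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1.0)  # project columns
    return B, Phi
```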

Page 18

Training problem

Consider subspace learning and sparse coding

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|$

Still have a problem

Optimization problem is not jointly convex in B and Φ

Idea 2: Boost!
• Implicitly fix B to a universal dictionary
• Keep a row-wise sparse Φ
• Incrementally select a column of B ("weak learning problem")
• Update the sparse Φ

Can prove convergence under broad conditions

Page 19

Training problem

Consider subspace learning and sparse coding

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|$

Still have a problem?

Optimization problem is not jointly convex in B and Φ

Idea 3: Solve!
• Can easily solve for the globally optimal joint B and Φ
• But requires a significant reformulation

Page 20

A useful observation

Page 21

Equivalent reformulation

Theorem

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|_{p,1} \;=\; \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha\,|\!\|\hat X\|\!|$

• $|\!\|\cdot\|\!|$ is an induced matrix norm on $\hat X$ determined by $\mathcal{B}$ and $\|\cdot\|_{p,1}$

Important fact

Norms are always convex

Computational strategy

1. Solve for the optimal response matrix $\hat X$ first (a convex minimization)
2. Then recover the optimal B and Φ from $\hat X$

Page 22

Example: subspace learning

$\min_{B\in\mathcal{B}_2^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|_{2,1} \;=\; \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha\|\hat X\|_{tr}$

Recovery
• Let $U\Sigma V' = \mathrm{svd}(\hat X)$
• Set $B = U$ and $\Phi = \Sigma V'$

Preserves optimality
• $\|B_{:j}\|_2 = 1$, hence $B \in \mathcal{B}_2^n$
• $\|\Phi\|_{2,1} = \|\Sigma V'\|_{2,1} = \sum_j \sigma_j \|V_{:j}\|_2 = \sum_j \sigma_j = \|\hat X\|_{tr}$

Thus $L(\hat X; X) + \alpha\|\hat X\|_{tr} = L(B\Phi; X) + \alpha\|\Phi\|_{2,1}$
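Under squared loss this relaxed problem even has a closed-form global solution (singular value thresholding); a minimal sketch of that special case, followed by the B, Φ recovery just described (the squared-loss assumption is mine, not part of the general statement):

```python
import numpy as np

def subspace_learning_global(X, alpha):
    """Global solution of  min_Z 0.5*||Z - X||_F^2 + alpha*||Z||_tr  via singular value
    thresholding, then the recovery B = U, Phi = Sigma V' from the slide."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_thr = np.maximum(s - alpha, 0.0)          # shrink the singular values
    Z = (U * s_thr) @ Vt                        # optimal response matrix
    keep = s_thr > 0
    B = U[:, keep]                              # unit-norm columns, so B is feasible
    Phi = s_thr[keep, None] * Vt[keep, :]       # Phi = Sigma V'
    return Z, B, Phi
```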

Page 23

Example: sparse coding

$\min_{B\in\mathcal{B}_q^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|_{1,1} \;=\; \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha\|\hat X'\|_{q,1}$

Recovery

$B = \left[\tfrac{1}{\|\hat X_{:1}\|_q}\hat X_{:1}, \;\ldots,\; \tfrac{1}{\|\hat X_{:t}\|_q}\hat X_{:t}\right]$ (rescaled columns)

$\Phi = \mathrm{diag}\big(\|\hat X_{:1}\|_q, \ldots, \|\hat X_{:t}\|_q\big)$ (diagonal matrix)

Preserves optimality
• $\|B_{:j}\|_q = 1$, hence $B \in \mathcal{B}_q^t$
• $\|\Phi\|_{1,1} = \sum_j \|\hat X_{:j}\|_q = \|\hat X'\|_{q,1}$

Thus $L(\hat X; X) + \alpha\|\hat X'\|_{q,1} = L(B\Phi; X) + \alpha\|\Phi\|_{1,1}$
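A small numeric check of this recovery (my own sketch): rescale each column of the optimal response matrix to unit ℓ_q norm and put the norms on the diagonal of Φ, so that BΦ reproduces it exactly.

```python
import numpy as np

def sparse_coding_recovery(Z, q=2):
    """Recover B (unit l_q columns) and diagonal Phi from an optimal response matrix Z,
    following the construction above; assumes every column of Z is nonzero."""
    col_norms = np.linalg.norm(Z, ord=q, axis=0)
    B = Z / col_norms                 # rescaled columns, ||B[:, j]||_q = 1
    Phi = np.diag(col_norms)          # Phi_jj = ||Z[:, j]||_q
    assert np.allclose(B @ Phi, Z)    # B Phi reproduces Z exactly
    return B, Phi
```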

Page 24

Example: sparse coding

Outcome

Sparse coding with ‖·‖_{1,1} regularization = vector quantization
• drops some examples
• memorizes the remaining examples

Optimal solution is not overcomplete

Could not make these observations using local solvers

Page 25

Simple extensions

• Missing observations in X
• Robustness to outliers in X

$\min_{S\in\mathbb{R}^{n\times t}} \;\min_{\hat X\in\mathbb{R}^{n\times t}}\; L\big((\hat X + S)_\Omega;\, X_\Omega\big) + \alpha\,|\!\|\hat X\|\!| + \beta\|S\|_{1,1}$

Ω = observed indices in X, S = speckled outlier noise

(jointly convex in $\hat X$ and S)
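To make the extended objective concrete, a sketch (mine; it assumes squared loss on the observed entries and the trace norm as the induced norm, as in the subspace-learning case) that simply evaluates the objective for given X̂ and S:

```python
import numpy as np

def robust_masked_objective(Xhat, S, X, mask, alpha, beta):
    # L((Xhat + S)_Omega ; X_Omega) + alpha*||Xhat||_tr + beta*||S||_{1,1}
    # with squared loss on the observed entries (mask is a boolean n x t array).
    R = (Xhat + S - X)[mask]
    loss = float(np.sum(R ** 2))
    trace_norm = float(np.linalg.svd(Xhat, compute_uv=False).sum())
    return loss + alpha * trace_norm + beta * float(np.abs(S).sum())
```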

Pages 26-28

Explaining the useful result

Theorem

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|_{p,1} \;=\; \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha\,|\!\|\hat X\|\!|$

for an induced matrix norm $|\!\|\hat X\|\!| = \|\hat X'\|^*_{(\mathcal{B},p^*)}$

A dual norm

$\|\hat X'\|^*_{(\mathcal{B},p^*)} = \max_{\|\Lambda'\|_{(\mathcal{B},p^*)}\le 1} \mathrm{tr}(\Lambda'\hat X)$

(standard definition of a dual norm)

of a vector-norm induced matrix norm

$\|\Lambda'\|_{(\mathcal{B},p^*)} = \max_{b\in\mathcal{B}} \|\Lambda' b\|_{p^*}$

(easy to prove this yields a norm on matrices)

Pages 29-40

Proof outline

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|_{p,1}$

$= \min_{\hat X\in\mathbb{R}^{n\times t}} \;\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi:\, B\Phi=\hat X}\; L(\hat X; X) + \alpha\|\Phi\|_{p,1}$

$= \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha \min_{B\in\mathcal{B}^\infty} \min_{\Phi:\, B\Phi=\hat X} \|\Phi\|_{p,1}$

$= \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha \min_{B\in\mathcal{B}^\infty} \max_{\Lambda:\, \|B'\Lambda\|_{p^*,\infty}\le 1} \mathrm{tr}(\Lambda'\hat X)$

$= \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha \max_{\Lambda:\, \|B'\Lambda\|_{p^*,\infty}\le 1\ \forall B\in\mathcal{B}^\infty} \mathrm{tr}(\Lambda'\hat X)$

$= \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha \max_{\Lambda:\, \|b'\Lambda\|_{p^*}\le 1\ \forall b\in\mathcal{B}} \mathrm{tr}(\Lambda'\hat X)$

$= \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha \max_{\Lambda:\, \|\Lambda'\|_{(\mathcal{B},p^*)}\le 1} \mathrm{tr}(\Lambda'\hat X)$

$= \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha\|\hat X'\|^*_{(\mathcal{B},p^*)}$

The third equality uses the fact that, for any $B\in\mathcal{B}^\infty$ that spans the columns of $\hat X$,

$\min_{\Phi:\, B\Phi=\hat X} \|\Phi\|_{p,1} = \min_{\Phi} \max_{\Lambda} \max_{\|V\|_{p^*,\infty}\le 1} \mathrm{tr}(V'\Phi) + \mathrm{tr}(\Lambda'(\hat X - B\Phi))$

$= \max_{\|V\|_{p^*,\infty}\le 1} \max_{\Lambda} \min_{\Phi} \mathrm{tr}(\Lambda'\hat X) + \mathrm{tr}(\Phi'(V - B'\Lambda))$

$= \max_{\|V\|_{p^*,\infty}\le 1} \max_{\Lambda:\, B'\Lambda=V} \mathrm{tr}(\Lambda'\hat X)$

$= \max_{\Lambda:\, \|B'\Lambda\|_{p^*,\infty}\le 1} \mathrm{tr}(\Lambda'\hat X)$

done

Page 41

Closed form induced norms

Theorem

$\min_{B\in\mathcal{B}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L(B\Phi; X) + \alpha\|\Phi\|_{p,1} \;=\; \min_{\hat X\in\mathbb{R}^{n\times t}} L(\hat X; X) + \alpha\|\hat X'\|^*_{(\mathcal{B},p^*)}$

Special cases

$\mathcal{B}_2,\ \|\Phi\|_{2,1} \;\mapsto\; \|\hat X'\|^*_{(\mathcal{B}_2,2)} = \|\hat X\|_{tr}$ (subspace learning)

$\mathcal{B}_q,\ \|\Phi\|_{1,1} \;\mapsto\; \|\hat X'\|^*_{(\mathcal{B}_q,\infty)} = \|\hat X'\|_{q,1}$ (sparse coding)

$\mathcal{B}_1,\ \|\Phi\|_{p,1} \;\mapsto\; \|\hat X'\|^*_{(\mathcal{B}_1,p^*)} = \|\hat X\|_{p,1}$
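Each of these closed forms is cheap to evaluate; a short sketch (mine, not from the slides) of the three norms appearing on the right-hand sides:

```python
import numpy as np

def trace_norm(X):
    return float(np.linalg.svd(X, compute_uv=False).sum())     # ||X||_tr

def col_q_norm_sum(X, q):
    return float(np.linalg.norm(X, ord=q, axis=0).sum())       # ||X'||_{q,1}: sum of column q-norms

def row_p_norm_sum(X, p):
    return float(np.linalg.norm(X, ord=p, axis=1).sum())       # ||X||_{p,1}: sum of row p-norms
```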

Page 42

Some simple experiments

Page 43

Experimental results

Alternate: repeatedly optimize over B, Φ successively
Global: recover the global joint minimizer over B, Φ

Page 44

Experimental results: Sparse coding

Objective value achieved

data set     COIL    WBC     BCI     Ionos   G241N
Alternate    1.314   4.918   0.898   1.612   1.312
Global       0.207   0.659   0.306   0.330   0.207

(values ×10^-2; squared loss, q = 2, α = 10^-5)

Page 45

Experimental results: Sparse coding

Run time (seconds)

data set     COIL    WBC     BCI     Ionos   G241N
Alternate    1.95    10.54   0.88    1.71    2.37
Global       0.06    0.01    0.01    0.01    0.09

(squared loss, q = 2, α = 10^-5)

Page 46

Experimental results: Subspace learning

Objective value achieved

data set     COIL    WBC     BCI     Ionos   G241N
Alternate    1.314   4.957   0.903   1.632   1.313
Global       0.072   0.072   0.092   0.079   0.205

(values ×10^-2; squared loss, α = 10^-5)

Page 47

Experimental results: Subspace learning

Run time (seconds)

data set     COIL    WBC     BCI     Ionos   G241N
Alternate    2.40    9.31    1.12    0.47    2.43
Global       2.18    0.06    0.19    0.06    2.11

(squared loss, α = 10^-5)

Page 48

Catch

Every norm is convex. But not every induced matrix norm is tractable:

$\|X\|_2 = \sigma_{\max}(X)$

$\|X\|_1 = \max_j \sum_i |X_{ij}|$

$\|X\|_\infty = \max_i \sum_j |X_{ij}|$

$\|X\|_p$ is NP-hard to approximate for $p \neq 1, 2, \infty$

Question: are there any other useful induced matrix norms that are tractable?

Yes!
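For reference, the three tractable operator norms listed above can be computed in a few lines (a sketch of mine, not from the slides):

```python
import numpy as np

def tractable_induced_norms(X):
    spec = np.linalg.norm(X, 2)           # ||X||_2: largest singular value
    one = np.abs(X).sum(axis=0).max()     # ||X||_1: maximum column absolute sum
    inf = np.abs(X).sum(axis=1).max()     # ||X||_inf: maximum row absolute sum
    return float(spec), float(one), float(inf)
```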

Page 49

Semi-supervised feature discovery

Page 50

Semi-supervised feature discovery

[Figure: stacked factorization in which the labeled and unlabeled inputs $[X_l, X_u]$ and the labeled outputs $Y_l$ are jointly reconstructed as $[B; W][\Phi_l, \Phi_u]$, an $(n+k)\times(t_l+t_u)$ system; a companion diagram shows latent features φ generating x through B and y through W]

$t_l$ = # labeled examples, $t_u$ = # unlabeled examples, $t = t_l + t_u$
n = # original features, k = # output dimensions

Learn
• $\Phi = [\Phi_l, \Phi_u]$: data representation
• B: input reconstruction model, $f(B\Phi) \approx X$
• W: output reconstruction model, $h(W\Phi_l) \approx Y_l$

Page 51

Semi-supervised feature discovery

Let

$Z = \begin{bmatrix} X_l & X_u \\ Y_l & \emptyset \end{bmatrix}, \qquad U = \begin{bmatrix} B \\ W \end{bmatrix}$

Formulation

$\min_{B\in\mathcal{B}^\infty} \;\min_{W\in\mathcal{W}^\infty} \;\min_{\Phi\in\mathbb{R}^{\infty\times t}}\; L_u(B\Phi; X) + \beta L_s(W\Phi_l; Y_l) + \alpha\|\Phi\|_{p,1} \;=\; \min_{\hat Z\in\mathbb{R}^{(n+k)\times t}} L(\hat Z; Z) + \alpha\|\hat Z'\|^*_{(\mathcal{U},p^*)}$

Note: imposing separate constraints on B and W

Questions
• Is the induced norm $\|\hat Z'\|^*_{(\mathcal{U},p^*)}$ efficiently computable?
• Can the optimal B, W, Φ be recovered from the optimal $\hat Z$?

Page 52

Example: sparse coding formulation

Regularizer: ‖Φ‖1,1

Constraints: $\mathcal{B}_{q_1} = \{b : \|b\|_{q_1} \le 1\}$, $\mathcal{W}_{q_2} = \{w : \|w\|_{q_2} \le \gamma\}$, $\mathcal{U}_{q_1 q_2} = \mathcal{B}_{q_1} \times \mathcal{W}_{q_2}$

Theorem

$\|\hat Z'\|^*_{(\mathcal{U}_{q_1 q_2},\infty)} = \sum_j \max\!\big(\|\hat Z^X_{:j}\|_{q_1},\ \tfrac{1}{\gamma}\|\hat Z^Y_{:j}\|_{q_2}\big)$

efficiently computable

Recovery

$\Phi_{jj} = \max\!\big(\|\hat Z^X_{:j}\|_{q_1},\ \tfrac{1}{\gamma}\|\hat Z^Y_{:j}\|_{q_2}\big)$ (diagonal matrix)

$U = \hat Z\,\Phi^{-1}$

Preserves optimality

But still reduces to a form of vector quantization
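A sketch (my own) of this closed-form norm and the diagonal recovery; ZX and ZY denote the input and output blocks of the optimal response matrix, and q1, q2, gamma follow the constraint sets above.

```python
import numpy as np

def semisup_sparse_norm(ZX, ZY, q1, q2, gamma):
    # ||Z'||*_(U_{q1 q2}, inf) = sum_j max( ||ZX[:, j]||_{q1}, (1/gamma) * ||ZY[:, j]||_{q2} )
    colX = np.linalg.norm(ZX, ord=q1, axis=0)
    colY = np.linalg.norm(ZY, ord=q2, axis=0) / gamma
    d = np.maximum(colX, colY)
    return float(d.sum()), d

def recover_U_Phi(ZX, ZY, q1, q2, gamma):
    # Phi is diagonal with the per-column maxima; U = Z Phi^{-1} (assumes nonzero columns)
    _, d = semisup_sparse_norm(ZX, ZY, q1, q2, gamma)
    Z = np.vstack([ZX, ZY])
    U = Z / d                    # divide column j by Phi_jj
    Phi = np.diag(d)
    return U, Phi
```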

Page 53

Example: subspace learning formulation

Regularizer: ‖Φ‖2,1

Constraints: $\mathcal{B}_2 = \{b : \|b\|_2 \le 1\}$, $\mathcal{W}_2 = \{w : \|w\|_2 \le \gamma\}$, $\mathcal{U}_2^2 = \mathcal{B}_2 \times \mathcal{W}_2$

Theorem

$\|\hat Z'\|^*_{(\mathcal{U}_2^2,2)} = \max_{\rho\ge 0} \|D_\rho^{-1}\hat Z\|_{tr}$ where $D_\rho = \begin{bmatrix} \sqrt{1+\gamma\rho}\,I & 0 \\ 0 & \sqrt{\tfrac{1+\gamma\rho}{\rho}}\,I \end{bmatrix}$

efficiently computable: the objective is quasi-concave in ρ
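The one-dimensional maximization over ρ is easy to carry out; below is a simple grid-search sketch (mine; a real implementation could exploit quasi-concavity with, say, a golden-section search).

```python
import numpy as np

def semisup_subspace_norm(Z, n, gamma, rhos=None):
    """Evaluate max_{rho >= 0} ||D_rho^{-1} Z||_tr by a coarse grid search over rho.
    Z stacks an n-row input block on top of the output block; gamma is the W-ball radius."""
    if rhos is None:
        rhos = np.logspace(-6, 6, 200)
    best = -np.inf
    for rho in rhos:
        top = 1.0 / np.sqrt(1.0 + gamma * rho)      # D_rho^{-1} scaling of the input rows
        bot = np.sqrt(rho / (1.0 + gamma * rho))    # D_rho^{-1} scaling of the output rows
        scaled = np.vstack([top * Z[:n, :], bot * Z[n:, :]])
        best = max(best, np.linalg.svd(scaled, compute_uv=False).sum())
    return float(best)
```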

Page 54

Example: subspace learning formulation

Lemma: dual norm

$\|\Lambda'\|^2_{(\mathcal{U}_2^2,2)} = \max_{h:\, \|h^X\|_2=1,\ \|h^Y\|_2=\gamma} h'\Lambda\Lambda' h$

$= \max_{H:\, H\succeq 0,\ \mathrm{tr}(H I_X)=1,\ \mathrm{tr}(H I_Y)=\gamma} \mathrm{tr}(H\Lambda\Lambda')$

$= \min_{\lambda\ge 0,\ \nu\ge 0:\ \Lambda\Lambda' \preceq \lambda I_X + \nu I_Y} \lambda + \gamma\nu$

$= \min_{\lambda\ge 0,\ \nu\ge 0:\ \|D_{\nu/\lambda}\Lambda\|^2_{sp} \le \lambda+\gamma\nu} \lambda + \gamma\nu$

$= \min_{\lambda\ge 0,\ \nu\ge 0} \|D_{\nu/\lambda}\Lambda\|^2_{sp}$

$= \min_{\rho\ge 0} \|D_\rho\Lambda\|^2_{sp}$

Page 55

Example: subspace learning formulation

Can easily derive target norm from dual norm

$\|\hat Z'\|^*_{(\mathcal{U}_2^2,2)} = \max_{\|\Lambda'\|_{(\mathcal{U}_2^2,2)}\le 1} \mathrm{tr}(\Lambda'\hat Z)$

$= \max_{\rho\ge 0}\ \max_{\Lambda:\, \|D_\rho\Lambda\|_{sp}\le 1} \mathrm{tr}(\Lambda'\hat Z)$

$= \max_{\rho\ge 0}\ \max_{\Lambda:\, \|\Lambda\|_{sp}\le 1} \mathrm{tr}(\Lambda' D_\rho^{-1}\hat Z)$

$= \max_{\rho\ge 0} \|D_\rho^{-1}\hat Z\|_{tr}$

(proves the theorem)

Page 56

Example: subspace learning formulation

Computational strategy

Solve in the dual, since $\|\Lambda'\|_{(\mathcal{U}_2^2,2)}$ can be computed efficiently via a partitioned power-method iteration

$\min_\Lambda\; L^\star(\Lambda; Z) + \alpha^\star \|\Lambda'\|_{(\mathcal{U}_2^2,2)}$

Given Λ
• Recover $\hat Z^X$ and $\hat Z^Y_l$ by solving

$\min_{\hat Z^X,\ \hat Z^Y_l}\; L_u(\hat Z^X; X) + L_s(\hat Z^Y_l; Y_l) - \mathrm{tr}(\hat Z^{X\prime}\Lambda^X) - \mathrm{tr}(\hat Z^{Y\prime}_l \Lambda^Y_l)$

• Recover $\hat Z^Y_u$ by minimizing $\|\hat Z'\|_{(\mathcal{U}_2^2,2)}$ (keeping $\hat Z^X$, $\hat Z^Y_l$ fixed)

Page 57

Example: subspace learning formulation

Recovery

Given the optimal $\hat Z$, recover U and Φ iteratively by repeating:

• $(\Phi^{(\ell)}, \Lambda^{(\ell)}) \in \arg\min_\Phi \max_\Lambda\; \|\Phi\|_{2,1} + \mathrm{tr}(\Lambda'(\hat Z - U^{(\ell)}\Phi))$

• $u^{(\ell+1)} \in \arg\max_{u\in\,\mathcal{U}_2^2} \|u'\Lambda^{(\ell)}\|_2$

• $U^{(\ell+1)} = [U^{(\ell)},\ u^{(\ell+1)}]$

Converges to the optimal U and Φ

• $U^{(\ell)}\Phi^{(\ell)} = \hat Z$ for all $\ell$

• $\|\Phi^{(\ell)}\|_{2,1} \to \|\hat Z'\|^*_{(\mathcal{U}_2^2,2)}$

Page 58

Some simple experiments

Page 59

Experimental results: Subspace learning

Staged: first locally optimize B, Φ, then optimize W
Alternate: repeatedly optimize over B, W, Φ successively
Global: recover the joint global minimizer over B, Φ, W

Page 60

Experimental results: Subspace learning

Objective value achieved

data set     COIL    WBC     BCI     Ionos   G241N
Staged       1.384   1.321   0.799   0.769   1.381
Alternate    0.076   0.122   0.609   0.081   0.076
Global       0.070   0.113   0.069   0.078   0.070

(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)

Page 61

Experimental results: Subspace learning

Run time (seconds)

data set     COIL    WBC     BCI     Ionos   G241N
Staged       272     73      45      28      290
Alternate    2352    324     227     112     2648
Global       106     8       25      61      94

(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)

Page 62

Experimental results: Subspace learning

Transductive generalization error

data set                COIL    WBC     BCI     Ionos   G241N
Staged                  0.476   0.200   0.452   0.335   0.484
Alternate               0.464   0.388   0.440   0.457   0.478
Global                  0.388   0.134   0.380   0.243   0.380
(Lee et al. 2009)       0.414   0.168   0.436   0.350   0.452
(Goldberg et al. 2010)  0.484   0.288   0.540   0.338   0.524

(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)

Page 63

Conclusion

Global training can be more efficient than local training

Alternation is inherently slow to converge

Global training simplifies practical application

• no under-training
• only need to guard against over-fitting
• can use standard regularization techniques