Part 3: Latent representations and unsupervised learning (Dale Schuurmans, University of Alberta)

  • Part 3: Latent representations and unsupervised learning

    Dale Schuurmans

    University of Alberta

  • Supervised versus unsupervised learning

    Prominent training principles

    Discriminative (typical for supervised learning)

    Generative (typical for unsupervised learning)

    [Figure: graphical models over inputs x, outputs y, and features φ contrasting the discriminative and generative training principles]

  • Unsupervised representation learning

    Consider generative training

    [Figure: generative graphical model over inputs x, outputs y, and latent features φ]

  • Unsupervised representation learning

    Examples
    • dimensionality reduction (PCA, exponential family PCA)
    • sparse coding
    • independent component analysis
    • deep learning ...

    Usually involves learning both a latent representation for data and a data reconstruction model

    Context could be: unsupervised, semi-supervised, or supervised

  • Challenge

    Optimal feature discovery appears to be generally intractable

    Have to jointly train
    • latent representation
    • data reconstruction model

    Usually resort to alternating minimization
    (sole exception: PCA)

  • First consider unsupervised feature discovery

  • Unsupervised feature discovery

    Single layer case = matrix factorization

    X ≈ B Φ
    (n×t)  (n×m)(m×t)
    original data ≈ learned dictionary × new representation

    t = # training examples
    n = # original features
    m = # new features

    Choose B and Φ to minimize data reconstruction loss

    L(BΦ; X) = ∑_{i=1}^t L(BΦ_{:i}; X_{:i})

    Seek desired structure in latent feature representation
    Φ low rank : dimensionality reduction
    Φ sparse : sparse coding
    Φ rows independent : independent component analysis
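    For concreteness, a minimal numpy sketch of this objective under squared loss (the sizes n, m, t and the random data are placeholders):

        import numpy as np

        def reconstruction_loss(B, Phi, X):
            """Sum of per-example squared losses L(B Phi_:i ; X_:i) = 0.5 ||B Phi - X||_F^2."""
            R = B @ Phi - X                      # n x t residual
            return 0.5 * np.sum(R * R)

        n, m, t = 20, 5, 100                     # original features, new features, examples
        rng = np.random.default_rng(0)
        X = rng.standard_normal((n, t))          # original data
        B = rng.standard_normal((n, m))          # learned dictionary (here just an initial guess)
        Phi = rng.standard_normal((m, t))        # new representation (initial guess)
        print(reconstruction_loss(B, Phi, X))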

  • Generalized matrix factorization

    Assume reconstruction loss L(x̂; x) is convex in its first argument

    Bregman divergence
    L(x̂; x) = D_F(x̂ ‖ x) = D_{F*}(f(x) ‖ f(x̂))
    (F strictly convex potential with transfer f = ∇F)
    Tries to make x̂ ≈ x

    Matching loss
    L(x̂; x) = D_F(x̂ ‖ f^{-1}(x)) = D_{F*}(x ‖ f(x̂))
    Tries to make f(x̂) ≈ x
    (A nonlinear predictor, but loss still convex in x̂)

    Regular exponential family
    L(x̂; x) = −log p_B(x | φ) = D_F(x̂ ‖ f^{-1}(x)) − F*(x) − const
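    A small sketch of the Bregman divergence D_F(a ‖ b) = F(a) − F(b) − ∇F(b)′(a − b) for two standard potentials (the choice of potentials here is illustrative, not prescribed by the slides):

        import numpy as np

        def bregman(F, grad_F, a, b):
            """Bregman divergence D_F(a || b) = F(a) - F(b) - <grad F(b), a - b>."""
            return F(a) - F(b) - grad_F(b) @ (a - b)

        # Potential 1: F(x) = 0.5 ||x||^2  ->  D_F is the squared Euclidean distance
        F_sq = lambda x: 0.5 * x @ x
        f_sq = lambda x: x

        # Potential 2: F(x) = sum_i x_i log x_i (negative entropy, x > 0)
        #              ->  D_F is the generalized KL divergence
        F_ent = lambda x: np.sum(x * np.log(x))
        f_ent = lambda x: np.log(x) + 1.0

        x_hat = np.array([0.2, 0.5, 0.3])
        x     = np.array([0.3, 0.4, 0.3])
        print(bregman(F_sq, f_sq, x_hat, x))     # 0.5 * ||x_hat - x||^2
        print(bregman(F_ent, f_ent, x_hat, x))   # generalized KL(x_hat || x)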

  • Training problem

    min_{B∈R^{n×m}} min_{Φ∈R^{m×t}} L(BΦ; X)

    How to impose desired structure on Φ?

    Dimensionality reduction

    Fix # features m < min(n, t)
    • But only known to be tractable if L(X̂; X) = ‖X̂ − X‖_F^2 (PCA)
    • No known efficient algorithm for other standard losses

    Problem
    rank(Φ) = m constraint is too hard
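    In the tractable squared-loss case the global optimum is a truncated SVD; a minimal sketch (data assumed already centered):

        import numpy as np

        def pca_factorization(X, m):
            """Rank-m factorization X ~= B Phi minimizing squared loss (PCA / truncated SVD)."""
            U, s, Vt = np.linalg.svd(X, full_matrices=False)
            B = U[:, :m]                 # n x m dictionary with orthonormal columns
            Phi = s[:m, None] * Vt[:m]   # m x t representation, Phi = Sigma V'
            return B, Phi

        rng = np.random.default_rng(0)
        X = rng.standard_normal((20, 100))
        B, Phi = pca_factorization(X, m=5)
        print(np.linalg.norm(X - B @ Phi, 'fro'))   # best rank-5 reconstruction error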

  • Training problem

    min_{B∈B_2^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖_{2,1}

    How to impose desired structure on Φ?

    Relaxed dimensionality reduction (subspace learning)

    Add rank reducing regularizer
    ‖Φ‖_{2,1} = ∑_{j=1}^m ‖Φ_{j:}‖_2
    Favors null rows in Φ

    But need to add constraint to B
    B_{:j} ∈ B_2 = {b : ‖b‖_2 ≤ 1}
    (Otherwise can make Φ small just by making B large)
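    The regularizer and the column constraint are easy to evaluate directly; a short sketch:

        import numpy as np

        def norm_2_1(Phi):
            """||Phi||_{2,1} = sum over rows j of ||Phi_{j:}||_2 (favors all-zero rows)."""
            return np.sum(np.linalg.norm(Phi, axis=1))

        def project_columns_l2(B):
            """Rescale any column with ||B_{:j}||_2 > 1 back onto the unit 2-ball."""
            norms = np.maximum(np.linalg.norm(B, axis=0), 1.0)
            return B / norms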

  • Training problem

    min_{B∈B_q^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖_{1,1}

    How to impose desired structure on Φ?

    Sparse coding

    Use sparsity inducing regularizer
    ‖Φ‖_{1,1} = ∑_{j=1}^m ∑_{i=1}^t |Φ_{ji}|
    Favors sparse entries in Φ

    Need to add constraint to B
    B_{:j} ∈ B_q = {b : ‖b‖_q ≤ 1}
    (Otherwise can make Φ small just by making B large)
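    Similarly, the entrywise regularizer and the q-norm column constraint are simple to compute (q is a free parameter):

        import numpy as np

        def norm_1_1(Phi):
            """||Phi||_{1,1} = sum of absolute values of all entries (favors sparse Phi)."""
            return np.sum(np.abs(Phi))

        def rescale_columns_lq(B, q=2.0):
            """Rescale columns so that ||B_{:j}||_q <= 1 (simple rescaling onto the q-ball)."""
            norms = np.maximum(np.linalg.norm(B, ord=q, axis=0), 1.0)
            return B / norms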

  • Training problem

    min_{B∈R^{n×m}} min_{Φ∈R^{m×t}} L(BΦ; X) + α D(Φ)

    How to impose desired structure on Φ?

    Independent components analysis

    Usually enforces BΦ = X as a constraint
    • but interpolation is generally a bad idea
    • instead just minimize reconstruction loss plus a dependence measure D(Φ) as a regularizer

    Difficulty
    Formulating a reasonable convex dependence penalty

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖

    Choice of ‖Φ‖ and B determines type of representation recovered

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Choice of ‖Φ‖ and B determines type of representation recovered

    Problem
    Still have rank constraint imposed by # new features m

    Idea
    Just relax m → ∞
    • Rely on sparsity inducing norm ‖Φ‖ to select features

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Still have a problem
    Optimization problem is not jointly convex in B and Φ

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Still have a problem
    Optimization problem is not jointly convex in B and Φ

    Idea 1: Alternate!
    • convex in B given Φ
    • convex in Φ given B
    Could use any other form of local training (a minimal alternating sketch follows)
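    A minimal alternating (local) scheme for squared loss with the ‖Φ‖_{1,1} regularizer and unit 2-ball columns; the proximal-gradient updates and step sizes are one reasonable choice, not the specific method from the talk:

        import numpy as np

        def soft_threshold(A, tau):
            """Proximal operator of tau * ||.||_{1,1} (entrywise soft thresholding)."""
            return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

        def alternate(X, m, alpha, iters=200, seed=0):
            """Alternating minimization of 0.5||B Phi - X||_F^2 + alpha ||Phi||_{1,1}
            with columns of B kept in the unit 2-ball; converges only to a local solution."""
            rng = np.random.default_rng(seed)
            n, t = X.shape
            B = rng.standard_normal((n, m))
            B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
            Phi = np.zeros((m, t))
            for _ in range(iters):
                # Phi-step: one proximal-gradient step on the convex subproblem in Phi
                step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
                Phi = soft_threshold(Phi - step * B.T @ (B @ Phi - X), step * alpha)
                # B-step: projected gradient step on the convex subproblem in B
                step = 1.0 / (np.linalg.norm(Phi, 2) ** 2 + 1e-12)
                B = B - step * (B @ Phi - X) @ Phi.T
                B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
            return B, Phi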

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Still have a problem
    Optimization problem is not jointly convex in B and Φ

    Idea 2: Boost!
    • Implicitly fix B to universal dictionary
    • Keep row-wise sparse Φ
    • Incrementally select column in B ("weak learning problem")
    • Update sparse Φ
    Can prove convergence under broad conditions

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Still have a problem?
    Optimization problem is not jointly convex in B and Φ

    Idea 3: Solve!
    • Can easily solve for globally optimal joint B and Φ
    • But requires a significant reformulation

  • A useful observation

  • Equivalent reformulation

    Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α|‖X̂‖|

    • |‖·‖| is an induced matrix norm on X̂ determined by B and ‖·‖_{p,1}

    Important fact
    Norms are always convex

    Computational strategy
    1. Solve for optimal response matrix X̂ first (convex minimization)
    2. Then recover optimal B and Φ from X̂

  • Example: subspace learning

    min_{B∈B_2^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{2,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂‖_tr

    Recovery
    • Let UΣV′ = svd(X̂)
    • Set B = U and Φ = ΣV′

    Preserves optimality
    • ‖B_{:j}‖_2 = 1 hence B ∈ B_2^n
    • ‖Φ‖_{2,1} = ‖ΣV′‖_{2,1} = ∑_j σ_j ‖V_{:j}‖_2 = ∑_j σ_j = ‖X̂‖_tr

    Thus
    L(X̂; X) + α‖X̂‖_tr = L(BΦ; X) + α‖Φ‖_{2,1}
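    For squared loss the convex step is singular value soft-thresholding, and the recovery above is a single SVD; a minimal sketch:

        import numpy as np

        def subspace_learning_global(X, alpha):
            """Squared-loss case: min_Xhat 0.5||Xhat - X||_F^2 + alpha ||Xhat||_tr is solved
            by soft-thresholding the singular values; then recover B = U, Phi = Sigma V'."""
            U, s, Vt = np.linalg.svd(X, full_matrices=False)
            s_thr = np.maximum(s - alpha, 0.0)          # shrink singular values
            keep = s_thr > 0
            B = U[:, keep]                              # unit-norm columns, so B is in B_2
            Phi = s_thr[keep, None] * Vt[keep]          # Phi = Sigma V'
            Xhat = B @ Phi
            # check: ||Phi||_{2,1} equals the trace norm of Xhat
            assert np.isclose(np.sum(np.linalg.norm(Phi, axis=1)), np.sum(s_thr))
            return B, Phi, Xhat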

  • Example: sparse coding

    min_{B∈B_q^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{1,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖_{q,1}

    Recovery
    B = [ X̂_{:1}/‖X̂_{:1}‖_q , ... , X̂_{:t}/‖X̂_{:t}‖_q ]   (rescaled columns)
    Φ = diag( ‖X̂_{:1}‖_q , ... , ‖X̂_{:t}‖_q )              (diagonal matrix)

    Preserves optimality
    • ‖B_{:j}‖_q = 1 hence B ∈ B_q^t
    • ‖Φ‖_{1,1} = ∑_j ‖X̂_{:j}‖_q = ‖X̂′‖_{q,1}

    Thus
    L(X̂; X) + α‖X̂′‖_{q,1} = L(BΦ; X) + α‖Φ‖_{1,1}
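    Given an optimal X̂, this recovery is just a column rescaling; a short sketch (q = 2 as an assumed choice):

        import numpy as np

        def recover_sparse_coding(Xhat, q=2.0):
            """Recover B (rescaled columns of Xhat) and diagonal Phi from an optimal Xhat."""
            col_norms = np.linalg.norm(Xhat, ord=q, axis=0)
            safe = np.where(col_norms > 0, col_norms, 1.0)   # all-zero columns = dropped examples
            B = Xhat / safe                                   # each nonzero column has unit q-norm
            Phi = np.diag(col_norms)                          # diagonal new representation
            assert np.allclose(B @ Phi, Xhat)                 # B Phi reproduces Xhat
            assert np.isclose(np.sum(np.abs(Phi)), np.sum(col_norms))   # ||Phi||_{1,1} = ||Xhat'||_{q,1}
            return B, Phi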

  • Example: sparse coding

    Outcome
    Sparse coding with ‖·‖_{1,1} regularization = vector quantization
    • drops some examples
    • memorizes remaining examples

    Optimal solution is not overcomplete

    Could not make these observations using local solvers

  • Simple extensions

    • Missing observations in X
    • Robustness to outliers in X

    min_{S∈R^{n×t}} min_{X̂∈R^{n×t}} L( (X̂ + S)_Ω ; X_Ω ) + α|‖X̂‖| + β‖S‖_{1,1}

    Ω = observed indices in X
    S = speckled outlier noise
    (jointly convex in X̂ and S)
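    Written out for squared loss on the observed entries and the trace-norm choice of |‖·‖|, the extended objective is straightforward to evaluate (the mask and parameter names are placeholders):

        import numpy as np

        def robust_objective(Xhat, S, X, mask, alpha, beta):
            """L((Xhat + S)_Omega ; X_Omega) + alpha ||Xhat||_tr + beta ||S||_{1,1},
            with squared loss restricted to the observed entries (mask = Omega)."""
            resid = (Xhat + S - X) * mask                      # compare only observed entries
            loss = 0.5 * np.sum(resid ** 2)
            trace_norm = np.sum(np.linalg.svd(Xhat, compute_uv=False))
            return loss + alpha * trace_norm + beta * np.sum(np.abs(S))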

  • Explaining the useful result

    Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α|‖X̂‖|

    for an induced matrix norm |‖X̂‖| = ‖X̂′‖*_{(B,p*)}

    A dual norm
    ‖X̂′‖*_{(B,p*)} = max_{‖Λ′‖_{(B,p*)}≤1} tr(Λ′X̂)
    (standard definition of a dual norm)

    of a vector-norm induced matrix norm
    ‖Λ′‖_{(B,p*)} = max_{b∈B} ‖Λ′b‖_{p*}
    (easy to prove this yields a norm on matrices)
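    For the Euclidean ball B = B_2 and p* = 2 the induced norm is just the spectral norm, which gives a quick numeric sanity check (the random sampling only provides a lower bound on the max):

        import numpy as np

        rng = np.random.default_rng(0)
        Lam = rng.standard_normal((6, 4))

        # induced norm for the Euclidean ball: max_{||b||_2 <= 1} ||Lam' b||_2 = sigma_max(Lam)
        spectral = np.linalg.norm(Lam, 2)

        # crude check by sampling unit vectors b
        bs = rng.standard_normal((6, 100000))
        bs /= np.linalg.norm(bs, axis=0)
        sampled = np.max(np.linalg.norm(Lam.T @ bs, axis=0))
        print(spectral, sampled)        # sampled <= spectral, and close with many samples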

  • Proof outline

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}

    = min_{X̂∈R^{n×t}} min_{B∈B^∞} min_{Φ:BΦ=X̂} L(X̂; X) + α‖Φ‖_{p,1}

    = min_{X̂∈R^{n×t}} L(X̂; X) + α min_{B∈B^∞} min_{Φ:BΦ=X̂} ‖Φ‖_{p,1}

    For any B ∈ B^∞ that spans the columns of X̂

    min_{Φ:BΦ=X̂} ‖Φ‖_{p,1} = min_Φ max_Λ max_{‖V‖_{p*,∞}≤1} tr(V′Φ) + tr(Λ′(X̂ − BΦ))
    = max_{‖V‖_{p*,∞}≤1} max_Λ min_Φ tr(Λ′X̂) + tr(Φ′(V − B′Λ))
    = max_{‖V‖_{p*,∞}≤1} max_{Λ:B′Λ=V} tr(Λ′X̂)
    = max_{Λ:‖B′Λ‖_{p*,∞}≤1} tr(Λ′X̂)

    Therefore

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}

    = min_{X̂∈R^{n×t}} L(X̂; X) + α min_{B∈B^∞} max_{Λ:‖B′Λ‖_{p*,∞}≤1} tr(Λ′X̂)

    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖B′Λ‖_{p*,∞}≤1, ∀B∈B^∞} tr(Λ′X̂)

    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖b′Λ‖_{p*}≤1, ∀b∈B} tr(Λ′X̂)

    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖Λ′‖_{(B,p*)}≤1} tr(Λ′X̂)

    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

    done

  • Closed form induced norms

    Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

    Special cases

    B_2, ‖Φ‖_{2,1} ↦ ‖X̂′‖*_{(B_2,2)} = ‖X̂‖_tr        (subspace learning)
    B_q, ‖Φ‖_{1,1} ↦ ‖X̂′‖*_{(B_q,∞)} = ‖X̂′‖_{q,1}     (sparse coding)
    B_1, ‖Φ‖_{p,1} ↦ ‖X̂′‖*_{(B_1,p*)} = ‖X̂‖_{p,1}
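    Each of these special cases is cheap to compute; a short sketch (q and p are free parameters):

        import numpy as np

        def trace_norm(Xhat):
            """||Xhat||_tr = sum of singular values  (B_2, ||Phi||_{2,1} case)."""
            return np.sum(np.linalg.svd(Xhat, compute_uv=False))

        def q1_norm_of_transpose(Xhat, q):
            """||Xhat'||_{q,1} = sum over columns j of ||Xhat_{:j}||_q  (B_q, ||Phi||_{1,1} case)."""
            return np.sum(np.linalg.norm(Xhat, ord=q, axis=0))

        def p1_norm(Xhat, p):
            """||Xhat||_{p,1} = sum over rows i of ||Xhat_{i:}||_p  (B_1, ||Phi||_{p,1} case)."""
            return np.sum(np.linalg.norm(Xhat, ord=p, axis=1))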

  • Some simple experiments

  • Experimental results

    Alternate : repeatedly optimize over B, Φ successively
    Global : recover global joint minimizer over B, Φ

  • Experimental results: Sparse coding

    Objective value achieved (×10^−2)

    Method      COIL    WBC     BCI     Ionos   G241N
    Alternate   1.314   4.918   0.898   1.612   1.312
    Global      0.207   0.659   0.306   0.330   0.207

    (squared loss, q = 2, α = 10^−5)

  • Experimental results: Sparse coding

    Run time (seconds)

    Method      COIL    WBC     BCI    Ionos   G241N
    Alternate   1.95    10.54   0.88   1.71    2.37
    Global      0.06    0.01    0.01   0.01    0.09

    (squared loss, q = 2, α = 10^−5)

  • Experimental results: Subspace learning

    Objective value achieved (×10^−2)

    Method      COIL    WBC     BCI     Ionos   G241N
    Alternate   1.314   4.957   0.903   1.632   1.313
    Global      0.072   0.072   0.092   0.079   0.205

    (squared loss, α = 10^−5)

  • Experimental results: Subspace learning

    Run time (seconds)

    Method      COIL   WBC    BCI    Ionos   G241N
    Alternate   2.40   9.31   1.12   0.47    2.43
    Global      2.18   0.06   0.19   0.06    2.11

    (squared loss, α = 10^−5)

  • Catch

    Every norm is convex
    But not every induced matrix norm is tractable

    ‖X‖_2 = σ_max(X)
    ‖X‖_1 = max_j ∑_i |X_ij|
    ‖X‖_∞ = max_i ∑_j |X_ij|
    ‖X‖_p NP-hard to approximate for p ≠ 1, 2, ∞

    Question
    Any other useful induced matrix norms that are tractable?

    Yes!
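    The three tractable induced p-norms are available directly in numpy:

        import numpy as np

        X = np.array([[1.0, -2.0],
                      [3.0,  4.0]])
        print(np.linalg.norm(X, 2))       # ||X||_2   = sigma_max(X)
        print(np.linalg.norm(X, 1))       # ||X||_1   = max_j sum_i |X_ij|  (max column sum)
        print(np.linalg.norm(X, np.inf))  # ||X||_inf = max_i sum_j |X_ij|  (max row sum)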

  • Semi-supervised feature discovery

  • Semi-supervised feature discovery

    [ Xl Xu ; Yl ∅ ] ≈ [ B ; W ] [ Φl Φu ]
    ((n+k)×(tl+tu) target: labeled inputs Xl with outputs Yl, plus unlabeled inputs Xu)

    tl = # labeled examples        n = # original features
    tu = # unlabeled examples      k = # output dimensions
    t = tl + tu

    Learn
    Φ = [Φl , Φu]   data representation
    B               input reconstruction model    f(BΦ) ≈ X
    W               output reconstruction model   h(WΦl) ≈ Yl

  • Semi-supervised feature discovery

    Let
    Z = [ Xl Xu ; Yl ∅ ]        U = [ B ; W ]

    Formulation

    min_{B∈B^∞} min_{W∈W^∞} min_{Φ∈R^{∞×t}} L_u(BΦ; X) + β L_s(WΦl; Yl) + α‖Φ‖_{p,1}
    = min_{Ẑ∈R^{(n+k)×t}} L̃(Ẑ; Z) + α‖Ẑ′‖*_{(U,p*)}

    Note
    Imposing separate constraints on B and W

    Questions
    • Is the induced norm ‖Ẑ′‖*_{(U,p*)} efficiently computable?
    • Can optimal B, W, Φ be recovered from optimal Ẑ?

  • Example: sparse coding formulation

    Regularizer:  ‖Φ‖_{1,1}
    Constraints:  B_{q1} = {b : ‖b‖_{q1} ≤ 1}
                  W_{q2} = {w : ‖w‖_{q2} ≤ γ}
                  U_{q1q2} = B × W

    Theorem
    ‖Ẑ′‖*_{(U_{q1q2},∞)} = ∑_j max( ‖Ẑ^X_{:j}‖_{q1} , (1/γ)‖Ẑ^Y_{:j}‖_{q2} )
    efficiently computable

    Recovery
    Φ_{jj} = max( ‖Ẑ^X_{:j}‖_{q1} , (1/γ)‖Ẑ^Y_{:j}‖_{q2} )   (diagonal matrix)
    U = ẐΦ^{-1}

    Preserves optimality

    But still reduces to a form of vector quantization
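    Both the norm and the recovery are simple column-wise computations; a short sketch (n is the number of input rows of Ẑ, and q1, q2, γ are free parameters):

        import numpy as np

        def semi_sup_sparse_norm(Zhat, n, q1, q2, gamma):
            """||Zhat'||*_{(U_{q1 q2}, inf)} = sum_j max(||Zhat^X_:j||_{q1}, (1/gamma)||Zhat^Y_:j||_{q2})."""
            ZX, ZY = Zhat[:n], Zhat[n:]
            col = np.maximum(np.linalg.norm(ZX, ord=q1, axis=0),
                             np.linalg.norm(ZY, ord=q2, axis=0) / gamma)
            return np.sum(col), col

        def recover_semi_sup_sparse(Zhat, n, q1, q2, gamma):
            """Phi is diagonal with Phi_jj equal to the column-wise max above, and U = Zhat Phi^{-1}."""
            _, col = semi_sup_sparse_norm(Zhat, n, q1, q2, gamma)
            safe = np.where(col > 0, col, 1.0)
            Phi = np.diag(col)
            U = Zhat / safe
            return U, Phi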

  • Example: subspace learning formulation

    Regularizer:  ‖Φ‖_{2,1}
    Constraints:  B_2 = {b : ‖b‖_2 ≤ 1}
                  W_2 = {w : ‖w‖_2 ≤ γ}
                  U_{22} = B × W

    Theorem
    ‖Ẑ′‖*_{(U_{22},2)} = max_{ρ≥0} ‖D_ρ^{-1} Ẑ‖_tr   where   D_ρ = [ √(1+γρ) I , 0 ; 0 , √((1+γρ)/ρ) I ]
    efficiently computable: quasi-concave in ρ
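    Since the objective is quasi-concave in ρ, a 1-D search suffices to evaluate the norm; a sketch using a scalar optimizer over log ρ (the optimizer and search bounds are a convenience choice, not the method from the talk):

        import numpy as np
        from scipy.optimize import minimize_scalar

        def semi_sup_subspace_norm(Zhat, n, gamma):
            """max_{rho >= 0} ||D_rho^{-1} Zhat||_tr with
            D_rho = diag( sqrt(1 + gamma*rho) I_n , sqrt((1 + gamma*rho)/rho) I_k )."""
            k = Zhat.shape[0] - n

            def neg_obj(log_rho):
                rho = np.exp(log_rho)
                scale = np.concatenate([np.full(n, 1.0 / np.sqrt(1.0 + gamma * rho)),
                                        np.full(k, np.sqrt(rho / (1.0 + gamma * rho)))])
                return -np.sum(np.linalg.svd(scale[:, None] * Zhat, compute_uv=False))

            res = minimize_scalar(neg_obj, bounds=(-10.0, 10.0), method='bounded')
            return -res.fun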

  • Example: subspace learning formulation

    Lemma: dual norm

    ‖Λ′‖²_{(U_{22},2)} = max_{h : ‖h^X‖_2=1, ‖h^Y‖_2=γ} h′ΛΛ′h
    = max_{H⪰0, tr(HI_X)=1, tr(HI_Y)=γ} tr(HΛΛ′)
    = min_{λ≥0, ν≥0 : ΛΛ′ ⪯ λI_X + νI_Y} λ + γν
    = min_{λ≥0, ν≥0 : ‖D_{ν/λ}Λ‖²_sp ≤ λ + γν} λ + γν
    = min_{λ≥0, ν≥0} ‖D_{ν/λ}Λ‖²_sp
    = min_{ρ≥0} ‖D_ρΛ‖²_sp

  • Example: subspace learning formulation

    Can easily derive target norm from dual norm

    ‖Ẑ′‖*_{(U_{22},2)} = max_{‖Λ′‖_{(U_{22},2)}≤1} tr(Λ′Ẑ)
    = max_{ρ≥0} max_{Λ:‖D_ρΛ‖_sp≤1} tr(Λ′Ẑ)
    = max_{ρ≥0} max_{Λ̃:‖Λ̃‖_sp≤1} tr(Λ̃′D_ρ^{-1}Ẑ)
    = max_{ρ≥0} ‖D_ρ^{-1}Ẑ‖_tr

    (proves theorem)

  • Example: subspace learning formulation

    Computational strategy

    Solve in dual, since ‖Λ′‖_{(U_{22},2)} can be computed efficiently via partitioned power method iteration

    min_Λ L̃⋆(Λ; Z) + α⋆‖Λ′‖_{(U_{22},2)}

    Given Λ̂
    • Recover Ẑ^X and Ẑ^{Yl} by solving
      min_{Ẑ^X, Ẑ^Y} L_u(Ẑ^X; X) + L_s(Ẑ^{Yl}; Yl) − tr(Ẑ^{X′}Λ̂^X) − tr(Ẑ^{Yl′}Λ̂^{Yl})
    • Recover Ẑ^{Yu} by minimizing ‖Ẑ′‖_{(U_{22},2)} (keeping Ẑ^X, Ẑ^{Yl} fixed)

  • Example: subspace learning formulation

    Recovery

    Given optimal Ẑ, recover U and Φ iteratively by repeating:

    • (Φ^(ℓ), Λ^(ℓ)) ∈ argmin_Φ max_Λ ‖Φ‖_{2,1} + tr(Λ′(Ẑ − U^(ℓ)Φ))
    • u^(ℓ+1) ∈ argmax_{u∈U_{22}} ‖u′Λ^(ℓ)‖_2
    • U^(ℓ+1) = [U^(ℓ), u^(ℓ+1)]

    Converges to optimal U and Φ
    • U^(ℓ)Φ^(ℓ) = Ẑ for all ℓ
    • ‖Φ^(ℓ)‖_{2,1} → ‖Ẑ′‖*_{(U_{22},2)}

  • Some simple experiments

  • Experimental results: Subspace learning

    Staged : first locally optimize B, Φ, then optimize W
    Alternate : repeatedly optimize over B, W, Φ successively
    Global : recover joint global minimizer over B, Φ, W

  • Experimental results: Subspace learning

    Objective value achieved

    Method      COIL    WBC     BCI     Ionos   G241N
    Staged      1.384   1.321   0.799   0.769   1.381
    Alternate   0.076   0.122   0.609   0.081   0.076
    Global      0.070   0.113   0.069   0.078   0.070

    (1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

  • Experimental results: Subspace learning

    Run time (seconds)

    Method      COIL   WBC   BCI   Ionos   G241N
    Staged       272    73    45     28     290
    Alternate   2352   324   227    112    2648
    Global       106     8    25     61      94

    (1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

  • Experimental results: Subspace learning

    Transductive generalization error

    Method                  COIL    WBC     BCI     Ionos   G241N
    Staged                  0.476   0.200   0.452   0.335   0.484
    Alternate               0.464   0.388   0.440   0.457   0.478
    Global                  0.388   0.134   0.380   0.243   0.380
    (Lee et al. 2009)       0.414   0.168   0.436   0.350   0.452
    (Goldberg et al. 2010)  0.484   0.288   0.540   0.338   0.524

    (1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

  • Conclusion

    Global training can be more efficient than local training

    Alternation is inherently slow to converge

    Global training simplifies practical application

    • no under-training
    • only need to guard against over-fitting
    • can use standard regularization techniques

    Outline: Introduction · Unsupervised feature discovery · A useful observation · Experiments (unsupervised) · Semi-supervised feature discovery · Experiments (semi-supervised) · Conclusion