
• Part 3: Latent representations and unsupervised learning

Dale Schuurmans

University of Alberta

• Supervised versus unsupervised learning

Prominent training principles

Discriminative (typical for supervised)
[diagram: graphical model over input x, label y, and features φ(x)]

Generative (typical for unsupervised)
[diagram: graphical model over input x, label y, and features φ(x)]

• Unsupervised representation learning

Consider generative training
[diagram: generative model over x, y, and latent features φ]

• Unsupervised representation learning

Examples

• dimensionality reduction (PCA, exponential family PCA)
• sparse coding
• independent component analysis
• deep learning
...

Usually involves learning both
• a latent representation for the data
• a data reconstruction model

Context could be: unsupervised, semi-supervised, or supervised

• Challenge

Optimal feature discovery appears to be generally intractable

Have to jointly train
• latent representation
• data reconstruction model

Usually resort to alternating minimization

(sole exception: PCA)

• First consider unsupervised feature discovery

• Unsupervised feature discovery

Single layer case = matrix factorization

X ≈ B Φ
[diagram: data matrix X factored into a dictionary B and a new representation Φ]

X: original data, n × t
B: learned dictionary, n × m
Φ: new representation, m × t

t = # training examples
n = # original features
m = # new features

Choose B and Φ to minimize data reconstruction loss

L(BΦ; X) = ∑_{i=1}^{t} L(BΦ_{:i}; X_{:i})

Seek desired structure in the latent feature representation
• Φ low rank : dimensionality reduction
• Φ sparse : sparse coding
• Φ rows independent : independent component analysis
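To make the column-wise loss above concrete, here is a minimal NumPy sketch (not from the slides); it assumes a squared per-column loss, but any convex column loss can be substituted:

```python
import numpy as np

def reconstruction_loss(B, Phi, X,
                        col_loss=lambda x_hat, x: 0.5 * np.sum((x_hat - x) ** 2)):
    """L(B Phi; X) = sum_{i=1}^t L(B Phi_{:i}; X_{:i}): a sum of per-column losses.
    The default column loss is squared error (an illustrative assumption)."""
    X_hat = B @ Phi
    return sum(col_loss(X_hat[:, i], X[:, i]) for i in range(X.shape[1]))

# Tiny example: n = 3 original features, m = 2 new features, t = 4 examples.
rng = np.random.default_rng(0)
B, Phi = rng.standard_normal((3, 2)), rng.standard_normal((2, 4))
X = rng.standard_normal((3, 4))
print(reconstruction_loss(B, Phi, X))
```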

• Generalized matrix factorization

Assume the reconstruction loss L(x̂; x) is convex in its first argument

Bregman divergence
L(x̂; x) = D_F(x̂ ‖ x) = D_{F*}(f(x) ‖ f(x̂))
(F a strictly convex potential with transfer f = ∇F)
Tries to make x̂ ≈ x

Matching loss
L(x̂; x) = D_F(x̂ ‖ f⁻¹(x)) = D_{F*}(x ‖ f(x̂))
Tries to make f(x̂) ≈ x
(A nonlinear predictor, but the loss is still convex in x̂)

[diagram: reconstruction model B mapping latent features φ to data x]

Regular exponential family
L(x̂; x) = − log p_B(x | φ) = D_F(x̂ ‖ f⁻¹(x)) − F*(x) − const
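As a concrete reading of the Bregman-divergence definition, here is a minimal NumPy sketch (my own illustration, not from the slides); the two potentials used, the squared norm and the negative entropy, are assumptions chosen because they recover familiar losses:

```python
import numpy as np

def bregman(F, grad_F, x_hat, x):
    """Bregman divergence D_F(x_hat || x) = F(x_hat) - F(x) - <grad F(x), x_hat - x>."""
    return F(x_hat) - F(x) - grad_F(x) @ (x_hat - x)

# Squared potential F(z) = 0.5*||z||^2 with transfer f(z) = z: D_F is 0.5*||x_hat - x||^2.
F_sq, grad_sq = lambda z: 0.5 * z @ z, lambda z: z

# Negative-entropy potential F(z) = sum z*log(z) with f(z) = log(z) + 1:
# D_F is the unnormalized KL divergence (requires strictly positive vectors).
F_ent, grad_ent = lambda z: np.sum(z * np.log(z)), lambda z: np.log(z) + 1.0

x_hat, x = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
print(bregman(F_sq, grad_sq, x_hat, x))    # 0.5*||x_hat - x||^2
print(bregman(F_ent, grad_ent, x_hat, x))  # sum x_hat*log(x_hat/x) - sum(x_hat) + sum(x)
```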

• Training problem

min_{B∈ℝ^{n×m}} min_{Φ∈ℝ^{m×t}} L(BΦ; X)

How to impose desired structure on Φ?

Dimensionality reduction
Fix # features m < min(n, t)
• But only known to be tractable if L(X̂; X) = ‖X̂ − X‖²_F (PCA)
• No known efficient algorithm for other standard losses

Problem
The rank(Φ) = m constraint is too hard

• Training problem

min_{B∈B₂^m} min_{Φ∈ℝ^{m×t}} L(BΦ; X) + α‖Φ‖_{2,1}

How to impose desired structure on Φ?

Relaxed dimensionality reduction (subspace learning)
‖Φ‖_{2,1} = ∑_{j=1}^{m} ‖Φ_{j:}‖₂
Favors null rows in Φ

But need to add a constraint to B
B_{:j} ∈ B₂ = {b : ‖b‖₂ ≤ 1}
(Otherwise Φ can be made small just by making B large)

• Training problem

min_{B∈B_q^m} min_{Φ∈ℝ^{m×t}} L(BΦ; X) + α‖Φ‖_{1,1}

How to impose desired structure on Φ?

Sparse coding
Use a sparsity inducing regularizer
‖Φ‖_{1,1} = ∑_{j=1}^{m} ∑_{i=1}^{t} |Φ_{ji}|
Favors sparse entries in Φ

Need to add a constraint to B
B_{:j} ∈ B_q = {b : ‖b‖_q ≤ 1}
(Otherwise Φ can be made small just by making B large)
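The two regularizers differ only in where they place the ℓ1 penalty; a minimal NumPy sketch (illustrative, not from the slides) using the definitions above:

```python
import numpy as np

def norm_21(Phi):
    """||Phi||_{2,1}: sum of the l2 norms of the rows; favors entire null rows
    (unused latent features), the relaxed dimensionality-reduction penalty."""
    return np.sum(np.linalg.norm(Phi, axis=1))

def norm_11(Phi):
    """||Phi||_{1,1}: sum of absolute values of all entries; favors entrywise
    sparsity, the sparse-coding penalty."""
    return np.sum(np.abs(Phi))

Phi = np.array([[1.0, -2.0, 0.0],
                [0.0,  0.0, 0.0],   # a null row: this latent feature is switched off
                [3.0,  0.0, 1.0]])
print(norm_21(Phi), norm_11(Phi))
```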

• Training problem

min_{B∈ℝ^{n×m}} min_{Φ∈ℝ^{m×t}} L(BΦ; X) + α D(Φ)

How to impose desired structure on Φ?

Independent components analysis
Usually enforces BΦ = X as a constraint
• but interpolation is generally a bad idea
• instead just minimize reconstruction loss plus a dependence measure D(Φ) as a regularizer

Difficulty
Formulating a reasonable convex dependence penalty

• Training problem

Consider subspace learning and sparse coding

min_{B∈B^m} min_{Φ∈ℝ^{m×t}} L(BΦ; X) + α‖Φ‖

Choice of ‖Φ‖ and B determines the type of representation recovered

Problem
Still have the rank constraint imposed by # new features m

Idea
Just relax m → ∞
• Rely on the sparsity inducing norm ‖Φ‖ to select features

min_{B∈B^∞} min_{Φ∈ℝ^{∞×t}} L(BΦ; X) + α‖Φ‖

Still have a problem
The optimization problem is not jointly convex in B and Φ

Idea 1: Alternate!
• convex in B given Φ
• convex in Φ given B
Could use any other form of local training (a rough local sketch follows below)
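A rough local-training sketch of Idea 1 for the sparse-coding instance (squared loss, ‖Φ‖_{1,1}, unit ℓ2 columns in B). The ISTA inner loop and the projected least-squares B-step are my own illustrative choices, not the slides' procedure; like any alternating scheme it only finds a local solution:

```python
import numpy as np

def soft_threshold(A, tau):
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def alternate(X, m, alpha=1e-2, outer=50, inner=50, seed=0):
    """Alternating minimization for
        min_B min_Phi 0.5*||B Phi - X||_F^2 + alpha*||Phi||_{1,1},  ||B_{:j}||_2 <= 1.
    Each subproblem is convex in isolation, but the joint problem is not."""
    rng = np.random.default_rng(seed)
    n, t = X.shape
    B = rng.standard_normal((n, m))
    B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    Phi = np.zeros((m, t))
    for _ in range(outer):
        # Phi-step: proximal gradient (ISTA) on the convex subproblem in Phi.
        step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
        for _ in range(inner):
            Phi = soft_threshold(Phi - step * B.T @ (B @ Phi - X), step * alpha)
        # B-step: heuristic unconstrained least squares, then projection of each
        # column onto the unit l2 ball (not the exact constrained minimizer).
        B = X @ np.linalg.pinv(Phi)
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    return B, Phi
```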

Idea 2: Boost!
• Implicitly fix B to a universal dictionary
• Keep Φ row-wise sparse
• Incrementally select a column of B ("weak learning problem")
• Update the sparse Φ
Can prove convergence under broad conditions

Idea 3: Solve!
• Can easily solve for the globally optimal joint B and Φ
• But requires a significant reformulation

• A useful observation

• Equivalent reformulation

Theorem
min_{B∈B^∞} min_{Φ∈ℝ^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α|‖X̂‖|

• |‖·‖| is an induced matrix norm on X̂ determined by B and ‖·‖_{p,1}

Important fact
Norms are always convex

Computational strategy
1. Solve for the optimal response matrix X̂ first (convex minimization)
2. Then recover optimal B and Φ from X̂

• Example: subspace learning

min_{B∈B₂^∞} min_{Φ∈ℝ^{∞×t}} L(BΦ; X) + α‖Φ‖_{2,1}
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α‖X̂‖_tr

Recovery
• Let UΣV′ = svd(X̂)
• Set B = U and Φ = ΣV′

Preserves optimality
• ‖B_{:j}‖₂ = 1, hence B ∈ B₂^n
• ‖Φ‖_{2,1} = ‖ΣV′‖_{2,1} = ∑_j σ_j ‖V_{:j}‖₂ = ∑_j σ_j = ‖X̂‖_tr

Thus
L(X̂; X) + α‖X̂‖_tr = L(BΦ; X) + α‖Φ‖_{2,1}
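A minimal NumPy sketch of this recovery step (assuming the optimal X̂ is already available from the convex problem); it also checks the identity ‖Φ‖_{2,1} = ‖X̂‖_tr numerically:

```python
import numpy as np

def recover_subspace(X_hat):
    """Recover (B, Phi) from an optimal X_hat of the trace-norm problem:
    B = U has unit-norm columns and Phi = Sigma V', so B Phi = X_hat."""
    U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
    return U, np.diag(s) @ Vt

X_hat = np.random.default_rng(0).standard_normal((5, 8))
B, Phi = recover_subspace(X_hat)
assert np.allclose(B @ Phi, X_hat)
# ||Phi||_{2,1} equals the trace norm (sum of singular values) of X_hat.
print(np.sum(np.linalg.norm(Phi, axis=1)),
      np.sum(np.linalg.svd(X_hat, compute_uv=False)))
```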

• Example: sparse coding

min_{B∈B_q^∞} min_{Φ∈ℝ^{∞×t}} L(BΦ; X) + α‖Φ‖_{1,1}
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α‖X̂′‖_{q,1}

Recovery
B = [ X̂_{:1}/‖X̂_{:1}‖_q, ..., X̂_{:t}/‖X̂_{:t}‖_q ]   (rescaled columns)
Φ = diag(‖X̂_{:1}‖_q, ..., ‖X̂_{:t}‖_q)   (diagonal matrix)

Preserves optimality
• ‖B_{:j}‖_q = 1, hence B ∈ B_q^t
• ‖Φ‖_{1,1} = ∑_j ‖X̂_{:j}‖_q = ‖X̂′‖_{q,1}

Thus
L(X̂; X) + α‖X̂′‖_{q,1} = L(BΦ; X) + α‖Φ‖_{1,1}
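The corresponding recovery for sparse coding, as a minimal NumPy sketch (again assuming an optimal X̂ is given); it checks ‖Φ‖_{1,1} = ‖X̂′‖_{q,1}:

```python
import numpy as np

def recover_sparse_coding(X_hat, q=2):
    """Recover (B, Phi) for the ||.||_{1,1} problem: B holds the columns of X_hat
    rescaled to unit q-norm, Phi is diagonal with the column q-norms.
    Assumes no column of X_hat is identically zero (drop such columns otherwise)."""
    col_norms = np.linalg.norm(X_hat, ord=q, axis=0)
    return X_hat / col_norms, np.diag(col_norms)

X_hat = np.random.default_rng(1).standard_normal((4, 6))
B, Phi = recover_sparse_coding(X_hat)
assert np.allclose(B @ Phi, X_hat)
# ||Phi||_{1,1} equals the sum of column q-norms of X_hat, i.e. ||X_hat'||_{q,1}.
print(np.sum(np.abs(Phi)), np.sum(np.linalg.norm(X_hat, axis=0)))
```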

• Example: sparse coding

Outcome
Sparse coding with ‖·‖_{1,1} regularization = vector quantization
• drops some examples
• memorizes the remaining examples

The optimal solution is not overcomplete

Could not make these observations using local solvers

• Simple extensions

• Missing observations in X
• Robustness to outliers in X

min_{S∈ℝ^{n×t}} min_{X̂∈ℝ^{n×t}} L( (X̂ + S)_Ω ; X_Ω ) + α|‖X̂‖| + β‖S‖_{1,1}

Ω = observed indices in X
S = speckled outlier noise

(jointly convex in X̂ and S)
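A small sketch of evaluating this robust objective (not a solver); it assumes a squared reconstruction loss on the observed entries and uses the trace norm as the induced norm |‖·‖|, both illustrative choices:

```python
import numpy as np

def robust_objective(X_hat, S, X, Omega, alpha, beta, matrix_norm):
    """L((X_hat + S)_Omega; X_Omega) + alpha*|||X_hat||| + beta*||S||_{1,1},
    with a squared loss restricted to the boolean observation mask Omega."""
    resid = (X_hat + S - X) * Omega
    return 0.5 * np.sum(resid ** 2) + alpha * matrix_norm(X_hat) + beta * np.sum(np.abs(S))

trace_norm = lambda A: np.sum(np.linalg.svd(A, compute_uv=False))

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 8))
Omega = rng.random((5, 8)) < 0.8   # roughly 80% of entries observed
print(robust_objective(np.zeros_like(X), np.zeros_like(X), X, Omega, 1.0, 1.0, trace_norm))
```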

• Explaining the useful result

Theorem
min_{B∈B^∞} min_{Φ∈ℝ^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α|‖X̂‖|

for an induced matrix norm |‖X̂‖| = ‖X̂′‖*_{(B,p*)}

A dual norm
‖X̂′‖*_{(B,p*)} = max_{‖Λ′‖_{(B,p*)} ≤ 1} tr(Λ′X̂)
(standard definition of a dual norm)

of a vector-norm induced matrix norm
‖Λ′‖_{(B,p*)} = max_{b∈B} ‖Λ′b‖_{p*}
(easy to prove this yields a norm on matrices)
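A small numerical illustration of the vector-norm induced matrix norm (my own check, not from the slides): for B the unit ℓ2 ball and p* = 2, max_{b∈B} ‖Λ′b‖₂ is the spectral norm of Λ, and random sampling of the ball gives a lower bound that approaches it:

```python
import numpy as np

rng = np.random.default_rng(2)
Lam = rng.standard_normal((5, 7))

# Closed form for B = unit l2 ball, p* = 2: max_{||b||_2 <= 1} ||Lam' b||_2 = sigma_max(Lam).
induced = np.linalg.norm(Lam, 2)

# Crude Monte-Carlo lower bound on the same maximum, as a sanity check of the definition.
bs = rng.standard_normal((5, 100_000))
bs /= np.linalg.norm(bs, axis=0)
sampled = np.max(np.linalg.norm(Lam.T @ bs, axis=0))
print(induced, sampled)   # sampled <= induced, and close in this small dimension
```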

• Proof outline

min_{B∈B^∞} min_{Φ∈ℝ^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
= min_{X̂∈ℝ^{n×t}} min_{B∈B^∞} min_{Φ: BΦ=X̂} L(X̂; X) + α‖Φ‖_{p,1}
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α min_{B∈B^∞} min_{Φ: BΦ=X̂} ‖Φ‖_{p,1}

For any B ∈ B^∞ that spans the columns of X̂

min_{Φ: BΦ=X̂} ‖Φ‖_{p,1}
= min_Φ max_Λ max_{‖V‖_{p*,∞} ≤ 1} tr(V′Φ) + tr(Λ′(X̂ − BΦ))
= max_{‖V‖_{p*,∞} ≤ 1} max_Λ min_Φ tr(Λ′X̂) + tr(Φ′(V − B′Λ))
= max_{‖V‖_{p*,∞} ≤ 1} max_{Λ: B′Λ=V} tr(Λ′X̂)
= max_{Λ: ‖B′Λ‖_{p*,∞} ≤ 1} tr(Λ′X̂)

Substituting back:

min_{B∈B^∞} min_{Φ∈ℝ^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α min_{B∈B^∞} max_{Λ: ‖B′Λ‖_{p*,∞} ≤ 1} tr(Λ′X̂)
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α max_{Λ: ‖B′Λ‖_{p*,∞} ≤ 1 ∀B∈B^∞} tr(Λ′X̂)
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α max_{Λ: ‖b′Λ‖_{p*} ≤ 1 ∀b∈B} tr(Λ′X̂)
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α max_{Λ: ‖Λ′‖_{(B,p*)} ≤ 1} tr(Λ′X̂)
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

done

• Closed form induced norms

Theorem
min_{B∈B^∞} min_{Φ∈ℝ^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
= min_{X̂∈ℝ^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

Special cases
B₂, ‖Φ‖_{2,1} ↦ ‖X̂′‖*_{(B₂,2)} = ‖X̂‖_tr   (subspace learning)
B_q, ‖Φ‖_{1,1} ↦ ‖X̂′‖*_{(B_q,∞)} = ‖X̂′‖_{q,1}   (sparse coding)
B₁, ‖Φ‖_{p,1} ↦ ‖X̂′‖*_{(B₁,p*)} = ‖X̂‖_{p,1}
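The three special cases can be computed directly; a minimal NumPy sketch (illustrative, not from the slides) using the row/column-norm conventions defined earlier:

```python
import numpy as np

def trace_norm(X_hat):
    """B_2 with ||Phi||_{2,1}: induced norm = sum of singular values of X_hat."""
    return np.sum(np.linalg.svd(X_hat, compute_uv=False))

def q1_norm_of_transpose(X_hat, q=2):
    """B_q with ||Phi||_{1,1}: induced norm = ||X_hat'||_{q,1}, the sum of column q-norms."""
    return np.sum(np.linalg.norm(X_hat, ord=q, axis=0))

def p1_norm(X_hat, p=2):
    """B_1 with ||Phi||_{p,1}: induced norm = ||X_hat||_{p,1}, the sum of row p-norms."""
    return np.sum(np.linalg.norm(X_hat, ord=p, axis=1))
```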

• Some simple experiments

• Experimental results

Alternate: repeatedly optimize over B, Φ successively
Global: recover the global joint minimizer over B, Φ

• Experimental results: Sparse coding

Objective value achieved

            COIL    WBC    BCI    Ionos   G241N
Alternate   1.314   4.918  0.898  1.612   1.312
Global      0.207   0.659  0.306  0.330   0.207

All values ×10⁻²   (squared loss, q = 2, α = 10⁻⁵)

• Experimental results: Sparse coding

Run time (seconds)

            COIL    WBC    BCI    Ionos   G241N
Alternate   1.95    10.54  0.88   1.71    2.37
Global      0.06    0.01   0.01   0.01    0.09

(squared loss, q = 2, α = 10⁻⁵)

• Experimental results: Subspace learning

Objective value achieved

            COIL    WBC    BCI    Ionos   G241N
Alternate   1.314   4.957  0.903  1.632   1.313
Global      0.072   0.072  0.092  0.079   0.205

All values ×10⁻²   (squared loss, α = 10⁻⁵)

• Experimental results: Subspace learning

Run time (seconds)

            COIL    WBC    BCI    Ionos   G241N
Alternate   2.40    9.31   1.12   0.47    2.43
Global      2.18    0.06   0.19   0.06    2.11

(squared loss, α = 10⁻⁵)

• Catch

Every norm is convex
But not every induced matrix norm is tractable

‖X‖₂ = σ_max(X)
‖X‖₁ = max_j ∑_i |X_ij|
‖X‖_∞ = max_i ∑_j |X_ij|
‖X‖_p is NP-hard to approximate for p ≠ 1, 2, ∞

Question
Any other useful induced matrix norms that are tractable?

Yes!

• Semi-supervised feature discovery

• Semi-supervised feature discovery

[diagram: labeled and unlabeled inputs X_l, X_u and labeled outputs Y_l stacked into an (n+k) × (t_l + t_u) matrix, approximated by the block factorization [B; W] [Φ_l, Φ_u]]

t_l = # labeled examples
t_u = # unlabeled examples
t = t_l + t_u
n = # original features
k = # output dimensions

[diagram: reconstruction models B and W mapping latent features φ to x and y]

Learn
• Φ = [Φ_l, Φ_u]: data representation
• B: input reconstruction model, f(BΦ) ≈ X
• W: output reconstruction model, h(WΦ_l) ≈ Y_l

• Semi-supervised feature discovery

Let
Z = [X_l  X_u; Y_l  ∅],    U = [B; W]

Formulation
min_{B∈B^∞} min_{W∈W^∞} min_{Φ∈ℝ^{∞×t}} L_u(BΦ; X) + β L_s(WΦ_l; Y_l) + α‖Φ‖_{p,1}
= min_{Ẑ∈ℝ^{(n+k)×t}} L̃(Ẑ; Z) + α‖Ẑ′‖*_{(U,p*)}

Note
Imposing separate constraints on B and W

Questions
• Is the induced norm ‖Ẑ′‖*_{(U,p*)} efficiently computable?
• Can optimal B, W, Φ be recovered from an optimal Ẑ?

• Example: sparse coding formulation

Regularizer: ‖Φ‖_{1,1}
Constraints: B_{q₁} = {b : ‖b‖_{q₁} ≤ 1}
             W_{q₂} = {w : ‖w‖_{q₂} ≤ γ}
             U_{q₁q₂} = B × W

Theorem
‖Ẑ′‖*_{(U_{q₁q₂},∞)} = ∑_j max( ‖Ẑ^X_{:j}‖_{q₁}, (1/γ)‖Ẑ^Y_{:j}‖_{q₂} )
efficiently computable

Recovery
Φ_{jj} = max( ‖Ẑ^X_{:j}‖_{q₁}, (1/γ)‖Ẑ^Y_{:j}‖_{q₂} )   (diagonal matrix)
U = ẐΦ⁻¹

Preserves optimality

But still reduces to a form of vector quantization
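A minimal NumPy sketch of this induced norm and its recovery (illustrative; the split into an input block Ẑ^X of n rows and an output block Ẑ^Y of k rows follows the definition above, and zero columns are assumed away):

```python
import numpy as np

def induced_norm_semisup(Z_hat, n, gamma, q1=2, q2=2):
    """sum_j max(||Z_hat^X_{:j}||_{q1}, (1/gamma)*||Z_hat^Y_{:j}||_{q2}),
    where the first n rows of Z_hat form the input block and the rest the output block."""
    ZX, ZY = Z_hat[:n, :], Z_hat[n:, :]
    scores = np.maximum(np.linalg.norm(ZX, ord=q1, axis=0),
                        np.linalg.norm(ZY, ord=q2, axis=0) / gamma)
    return np.sum(scores), scores

def recover_semisup(Z_hat, n, gamma, q1=2, q2=2):
    """Recovery sketch: Phi is diagonal with the per-column scores and U = Z_hat Phi^{-1}."""
    _, scores = induced_norm_semisup(Z_hat, n, gamma, q1, q2)
    return Z_hat / scores, np.diag(scores)   # (U, Phi); assumes no zero score
```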

• Example: subspace learning formulation

Regularizer: ‖Φ‖_{2,1}
Constraints: B₂ = {b : ‖b‖₂ ≤ 1}
             W₂ = {w : ‖w‖₂ ≤ γ}
             U₂² = B × W

Theorem
‖Ẑ′‖*_{(U₂²,2)} = max_{ρ≥0} ‖D_ρ⁻¹ Ẑ‖_tr   where D_ρ = [ √(1+γρ) I, 0; 0, √((1+γρ)/ρ) I ]

efficiently computable: quasi-concave in ρ
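Since the objective is quasi-concave in ρ, the norm can be evaluated with a one-dimensional search; a minimal sketch assuming SciPy, with an arbitrary bounded search range over log ρ (an illustrative choice, not the slides' algorithm):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def norm_U22(Z_hat, n, gamma):
    """||Z_hat'||*_{(U_2^2, 2)} = max_{rho >= 0} ||D_rho^{-1} Z_hat||_tr, evaluated by a
    1-D search over log(rho); D_rho^{-1} rescales the input and output blocks."""
    ZX, ZY = Z_hat[:n, :], Z_hat[n:, :]
    def neg_obj(log_rho):
        rho = np.exp(log_rho)
        scaled = np.vstack([ZX / np.sqrt(1.0 + gamma * rho),
                            ZY * np.sqrt(rho / (1.0 + gamma * rho))])
        return -np.sum(np.linalg.svd(scaled, compute_uv=False))
    res = minimize_scalar(neg_obj, bounds=(-10.0, 10.0), method="bounded")
    return -res.fun
```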

• Example: subspace learning formulation

Lemma: dual norm

‖Λ′‖²_{(U₂²,2)} = max_{h: ‖h^X‖₂=1, ‖h^Y‖₂=γ} h′ΛΛ′h
= max_{H⪰0, tr(HI^X)=1, tr(HI^Y)=γ} tr(HΛΛ′)
= min_{λ≥0, ν≥0: ΛΛ′ ⪯ λI^X + νI^Y} λ + γν
= min_{λ≥0, ν≥0: ‖D_{ν/λ}Λ‖²_sp ≤ λ + γν} λ + γν
= min_{λ≥0, ν≥0} ‖D_{ν/λ}Λ‖²_sp
= min_{ρ≥0} ‖D_ρΛ‖²_sp

• Example: subspace learning formulation

Can easily derive the target norm from the dual norm

‖Ẑ′‖*_{(U₂²,2)} = max_{‖Λ′‖_{(U₂²,2)} ≤ 1} tr(Λ′Ẑ)
= max_{ρ≥0} max_{Λ: ‖D_ρΛ‖_sp ≤ 1} tr(Λ′Ẑ)
= max_{ρ≥0} max_{Λ̃: ‖Λ̃‖_sp ≤ 1} tr(Λ̃′ D_ρ⁻¹ Ẑ)
= max_{ρ≥0} ‖D_ρ⁻¹ Ẑ‖_tr

(proves the theorem)

• Example: subspace learning formulation

Computational strategy

Solve in the dual, since ‖Λ′‖_{(U₂²,2)} can be computed efficiently via a partitioned power method iteration

min_Λ L̃*(Λ; Z) + α* ‖Λ′‖_{(U₂²,2)}

Given Λ̂
• Recover Ẑ^X and Ẑ^{Y_l} by solving
  min_{Ẑ^X, Ẑ^Y} L_u(Ẑ^X; X) + L_s(Ẑ^{Y_l}; Y_l) − tr(Ẑ^X′ Λ̂^X) − tr(Ẑ^{Y_l}′ Λ̂^{Y_l})
• Recover Ẑ^{Y_u} by minimizing ‖Ẑ′‖_{(U₂²,2)} (keeping Ẑ^X, Ẑ^{Y_l} fixed)

• Example: subspace learning formulation

Recovery

Given the optimal Ẑ, recover U and Φ iteratively by repeating:
• (Φ⁽ℓ⁾, Λ⁽ℓ⁾) ∈ argmin_Φ max_Λ ‖Φ‖_{2,1} + tr(Λ′(Ẑ − U⁽ℓ⁾Φ))
• u⁽ℓ⁺¹⁾ ∈ argmax_{u∈U₂²} ‖u′Λ⁽ℓ⁾‖₂
• U⁽ℓ⁺¹⁾ = [U⁽ℓ⁾, u⁽ℓ⁺¹⁾]

Converges to optimal U and Φ
• U⁽ℓ⁾Φ⁽ℓ⁾ = Ẑ for all ℓ
• ‖Φ⁽ℓ⁾‖_{2,1} → ‖Ẑ′‖*_{(U₂²,2)}

• Some simple experiments

• Experimental results: Subspace learning

Staged: first locally optimize B, Φ, then optimize W
Alternate: repeatedly optimize over B, W, Φ successively
Global: recover the joint global minimizer over B, Φ, W

• Experimental results: Subspace learning

Objective value achieved

            COIL    WBC    BCI    Ionos   G241N
Staged      1.384   1.321  0.799  0.769   1.381
Alternate   0.076   0.122  0.609  0.081   0.076
Global      0.070   0.113  0.069  0.078   0.070

(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)

• Experimental results: Subspace learning

Run time (seconds)

            COIL    WBC    BCI    Ionos   G241N
Staged      272     73     45     28      290
Alternate   2352    324    227    112     2648
Global      106     8      25     61      94

(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)

• Experimental results: Subspace learning

Transductive generalization error

                         COIL    WBC    BCI    Ionos   G241N
Staged                   0.476   0.200  0.452  0.335   0.484
Alternate                0.464   0.388  0.440  0.457   0.478
Global                   0.388   0.134  0.380  0.243   0.380
(Lee et al. 2009)        0.414   0.168  0.436  0.350   0.452
(Goldberg et al. 2010)   0.484   0.288  0.540  0.338   0.524

(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)

• Conclusion

Global training can be more efficient than local training

Alternation is inherently slow to converge

Global training simplifies practical application

• no under-training
• only need to guard against over-fitting
• can use standard regularization techniques

Outline
Introduction
Unsupervised feature discovery
A useful observation
Experiments (unsupervised)
Semi-supervised feature discovery
Experiments (semi-supervised)
Conclusion