Part 3: Latent representations and unsupervised learning (Dale Schuurmans, University of Alberta)

  • Part 3: Latent representations and unsupervised learning

    Dale Schuurmans

    University of Alberta

  • Supervised versus unsupervised learning

    Prominent training principles

    Discriminative (typical for supervised learning)

    Generative (typical for unsupervised learning)

    [Figure: graphical models over inputs x, outputs y, and features φ contrasting the discriminative and generative training principles]

  • Unsupervised representation learning

    Consider generative training

    [Figure: generative graphical model over inputs x, outputs y, and latent features φ]

  • Unsupervised representation learning

    Examples
    • dimensionality reduction (PCA, exponential family PCA)
    • sparse coding
    • independent component analysis
    • deep learning ...

    Usually involves learning both a latent representation for data and a data reconstruction model

    Context could be: unsupervised, semi-supervised, or supervised

  • Challenge

    Optimal feature discovery appears to be generally intractable

    Have to jointly train
    • latent representation
    • data reconstruction model

    Usually resort to alternating minimization
    (sole exception: PCA)

  • First consider unsupervised feature discovery

  • Unsupervised feature discovery

    Single layer case = matrix factorization

    X ≈ B Φ
    (n×t)  (n×m)(m×t)
    original data ≈ learned dictionary × new representation

    t = # training examples
    n = # original features
    m = # new features

    Choose B and Φ to minimize data reconstruction loss

    L(BΦ; X) = ∑_{i=1}^t L(BΦ_{:i}; X_{:i})

    Seek desired structure in latent feature representation
    Φ low rank : dimensionality reduction
    Φ sparse : sparse coding
    Φ rows independent : independent component analysis
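    For concreteness, a minimal numpy sketch of this objective under squared loss (the sizes n, m, t and the random data are placeholders):

        import numpy as np

        def reconstruction_loss(B, Phi, X):
            """Sum of per-example squared losses L(B Phi_:i ; X_:i) = 0.5 ||B Phi - X||_F^2."""
            R = B @ Phi - X                      # n x t residual
            return 0.5 * np.sum(R * R)

        n, m, t = 20, 5, 100                     # original features, new features, examples
        rng = np.random.default_rng(0)
        X = rng.standard_normal((n, t))          # original data
        B = rng.standard_normal((n, m))          # learned dictionary (here just an initial guess)
        Phi = rng.standard_normal((m, t))        # new representation (initial guess)
        print(reconstruction_loss(B, Phi, X))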

  • Generalized matrix factorization

    Assume reconstruction loss L(x̂; x) is convex in its first argument

    Bregman divergence
    L(x̂; x) = D_F(x̂ ‖ x) = D_{F*}(f(x) ‖ f(x̂))
    (F strictly convex potential with transfer f = ∇F)
    Tries to make x̂ ≈ x

    Matching loss
    L(x̂; x) = D_F(x̂ ‖ f^{-1}(x)) = D_{F*}(x ‖ f(x̂))
    Tries to make f(x̂) ≈ x
    (A nonlinear predictor, but loss still convex in x̂)

    Regular exponential family
    L(x̂; x) = −log p_B(x | φ) = D_F(x̂ ‖ f^{-1}(x)) − F*(x) − const
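    A small sketch of the Bregman divergence D_F(a ‖ b) = F(a) − F(b) − ∇F(b)′(a − b) for two standard potentials (the choice of potentials here is illustrative, not prescribed by the slides):

        import numpy as np

        def bregman(F, grad_F, a, b):
            """Bregman divergence D_F(a || b) = F(a) - F(b) - <grad F(b), a - b>."""
            return F(a) - F(b) - grad_F(b) @ (a - b)

        # Potential 1: F(x) = 0.5 ||x||^2  ->  D_F is the squared Euclidean distance
        F_sq = lambda x: 0.5 * x @ x
        f_sq = lambda x: x

        # Potential 2: F(x) = sum_i x_i log x_i (negative entropy, x > 0)
        #              ->  D_F is the generalized KL divergence
        F_ent = lambda x: np.sum(x * np.log(x))
        f_ent = lambda x: np.log(x) + 1.0

        x_hat = np.array([0.2, 0.5, 0.3])
        x     = np.array([0.3, 0.4, 0.3])
        print(bregman(F_sq, f_sq, x_hat, x))     # 0.5 * ||x_hat - x||^2
        print(bregman(F_ent, f_ent, x_hat, x))   # generalized KL(x_hat || x)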

  • Training problem

    min_{B∈R^{n×m}} min_{Φ∈R^{m×t}} L(BΦ; X)

    How to impose desired structure on Φ?

    Dimensionality reduction

    Fix # features m < min(n, t)
    • But only known to be tractable if L(X̂; X) = ‖X̂ − X‖_F^2 (PCA)
    • No known efficient algorithm for other standard losses

    Problem
    rank(Φ) = m constraint is too hard
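    In the tractable squared-loss case the global optimum is a truncated SVD; a minimal sketch (data assumed already centered):

        import numpy as np

        def pca_factorization(X, m):
            """Rank-m factorization X ~= B Phi minimizing squared loss (PCA / truncated SVD)."""
            U, s, Vt = np.linalg.svd(X, full_matrices=False)
            B = U[:, :m]                 # n x m dictionary with orthonormal columns
            Phi = s[:m, None] * Vt[:m]   # m x t representation, Phi = Sigma V'
            return B, Phi

        rng = np.random.default_rng(0)
        X = rng.standard_normal((20, 100))
        B, Phi = pca_factorization(X, m=5)
        print(np.linalg.norm(X - B @ Phi, 'fro'))   # best rank-5 reconstruction error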

  • Training problem

    min_{B∈B_2^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖_{2,1}

    How to impose desired structure on Φ?

    Relaxed dimensionality reduction (subspace learning)

    Add rank reducing regularizer
    ‖Φ‖_{2,1} = ∑_{j=1}^m ‖Φ_{j:}‖_2
    Favors null rows in Φ

    But need to add constraint to B
    B_{:j} ∈ B_2 = {b : ‖b‖_2 ≤ 1}
    (Otherwise can make Φ small just by making B large)
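    The regularizer and the column constraint are easy to evaluate directly; a short sketch:

        import numpy as np

        def norm_2_1(Phi):
            """||Phi||_{2,1} = sum over rows j of ||Phi_{j:}||_2 (favors all-zero rows)."""
            return np.sum(np.linalg.norm(Phi, axis=1))

        def project_columns_l2(B):
            """Rescale any column with ||B_{:j}||_2 > 1 back onto the unit 2-ball."""
            norms = np.maximum(np.linalg.norm(B, axis=0), 1.0)
            return B / norms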

  • Training problem

    min_{B∈B_q^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖_{1,1}

    How to impose desired structure on Φ?

    Sparse coding

    Use sparsity inducing regularizer
    ‖Φ‖_{1,1} = ∑_{j=1}^m ∑_{i=1}^t |Φ_{ji}|
    Favors sparse entries in Φ

    Need to add constraint to B
    B_{:j} ∈ B_q = {b : ‖b‖_q ≤ 1}
    (Otherwise can make Φ small just by making B large)
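    Similarly, the entrywise regularizer and the q-norm column constraint are simple to compute (q is a free parameter):

        import numpy as np

        def norm_1_1(Phi):
            """||Phi||_{1,1} = sum of absolute values of all entries (favors sparse Phi)."""
            return np.sum(np.abs(Phi))

        def rescale_columns_lq(B, q=2.0):
            """Rescale columns so that ||B_{:j}||_q <= 1 (simple rescaling onto the q-ball)."""
            norms = np.maximum(np.linalg.norm(B, ord=q, axis=0), 1.0)
            return B / norms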

  • Training problem

    min_{B∈R^{n×m}} min_{Φ∈R^{m×t}} L(BΦ; X) + α D(Φ)

    How to impose desired structure on Φ?

    Independent components analysis

    Usually enforces BΦ = X as a constraint
    • but interpolation is generally a bad idea
    • instead just minimize reconstruction loss plus a dependence measure D(Φ) as a regularizer

    Difficulty
    Formulating a reasonable convex dependence penalty

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖

    Choice of ‖Φ‖ and B determines type of representation recovered

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Choice of ‖Φ‖ and B determines type of representation recovered

    Problem
    Still have rank constraint imposed by # new features m

    Idea
    Just relax m → ∞
    • Rely on sparsity inducing norm ‖Φ‖ to select features

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Still have a problem
    Optimization problem is not jointly convex in B and Φ

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Still have a problem
    Optimization problem is not jointly convex in B and Φ

    Idea 1: Alternate!
    • convex in B given Φ
    • convex in Φ given B
    Could use any other form of local training (a minimal alternating sketch follows)
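    A minimal alternating (local) scheme for squared loss with the ‖Φ‖_{1,1} regularizer and unit 2-ball columns; the proximal-gradient updates and step sizes are one reasonable choice, not the specific method from the talk:

        import numpy as np

        def soft_threshold(A, tau):
            """Proximal operator of tau * ||.||_{1,1} (entrywise soft thresholding)."""
            return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

        def alternate(X, m, alpha, iters=200, seed=0):
            """Alternating minimization of 0.5||B Phi - X||_F^2 + alpha ||Phi||_{1,1}
            with columns of B kept in the unit 2-ball; converges only to a local solution."""
            rng = np.random.default_rng(seed)
            n, t = X.shape
            B = rng.standard_normal((n, m))
            B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
            Phi = np.zeros((m, t))
            for _ in range(iters):
                # Phi-step: one proximal-gradient step on the convex subproblem in Phi
                step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
                Phi = soft_threshold(Phi - step * B.T @ (B @ Phi - X), step * alpha)
                # B-step: projected gradient step on the convex subproblem in B
                step = 1.0 / (np.linalg.norm(Phi, 2) ** 2 + 1e-12)
                B = B - step * (B @ Phi - X) @ Phi.T
                B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
            return B, Phi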

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Still have a problem
    Optimization problem is not jointly convex in B and Φ

    Idea 2: Boost!
    • Implicitly fix B to universal dictionary
    • Keep row-wise sparse Φ
    • Incrementally select column in B ("weak learning problem")
    • Update sparse Φ
    Can prove convergence under broad conditions

  • Training problem

    Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

    Still have a problem?
    Optimization problem is not jointly convex in B and Φ

    Idea 3: Solve!
    • Can easily solve for globally optimal joint B and Φ
    • But requires a significant reformulation

  • A useful observation

  • Equivalent reformulation

    Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α|‖X̂‖|

    • |‖·‖| is an induced matrix norm on X̂ determined by B and ‖·‖_{p,1}

    Important fact
    Norms are always convex

    Computational strategy
    1. Solve for optimal response matrix X̂ first (convex minimization)
    2. Then recover optimal B and Φ from X̂

  • Example: subspace learning

    min_{B∈B_2^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{2,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂‖_tr

    Recovery
    • Let UΣV′ = svd(X̂)
    • Set B = U and Φ = ΣV′

    Preserves optimality
    • ‖B_{:j}‖_2 = 1 hence B ∈ B_2^n
    • ‖Φ‖_{2,1} = ‖ΣV′‖_{2,1} = ∑_j σ_j ‖V_{:j}‖_2 = ∑_j σ_j = ‖X̂‖_tr

    Thus
    L(X̂; X) + α‖X̂‖_tr = L(BΦ; X) + α‖Φ‖_{2,1}
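    For squared loss the convex step is singular value soft-thresholding, and the recovery above is a single SVD; a minimal sketch:

        import numpy as np

        def subspace_learning_global(X, alpha):
            """Squared-loss case: min_Xhat 0.5||Xhat - X||_F^2 + alpha ||Xhat||_tr is solved
            by soft-thresholding the singular values; then recover B = U, Phi = Sigma V'."""
            U, s, Vt = np.linalg.svd(X, full_matrices=False)
            s_thr = np.maximum(s - alpha, 0.0)          # shrink singular values
            keep = s_thr > 0
            B = U[:, keep]                              # unit-norm columns, so B is in B_2
            Phi = s_thr[keep, None] * Vt[keep]          # Phi = Sigma V'
            Xhat = B @ Phi
            # check: ||Phi||_{2,1} equals the trace norm of Xhat
            assert np.isclose(np.sum(np.linalg.norm(Phi, axis=1)), np.sum(s_thr))
            return B, Phi, Xhat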

  • Example: sparse coding

    min_{B∈B_q^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{1,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖_{q,1}

    Recovery
    B = [ X̂_{:1}/‖X̂_{:1}‖_q , ... , X̂_{:t}/‖X̂_{:t}‖_q ]   (rescaled columns)
    Φ = diag( ‖X̂_{:1}‖_q , ... , ‖X̂_{:t}‖_q )              (diagonal matrix)

    Preserves optimality
    • ‖B_{:j}‖_q = 1 hence B ∈ B_q^t
    • ‖Φ‖_{1,1} = ∑_j ‖X̂_{:j}‖_q = ‖X̂′‖_{q,1}

    Thus
    L(X̂; X) + α‖X̂′‖_{q,1} = L(BΦ; X) + α‖Φ‖_{1,1}
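    Given an optimal X̂, this recovery is just a column rescaling; a short sketch (q = 2 as an assumed choice):

        import numpy as np

        def recover_sparse_coding(Xhat, q=2.0):
            """Recover B (rescaled columns of Xhat) and diagonal Phi from an optimal Xhat."""
            col_norms = np.linalg.norm(Xhat, ord=q, axis=0)
            safe = np.where(col_norms > 0, col_norms, 1.0)   # all-zero columns = dropped examples
            B = Xhat / safe                                   # each nonzero column has unit q-norm
            Phi = np.diag(col_norms)                          # diagonal new representation
            assert np.allclose(B @ Phi, Xhat)                 # B Phi reproduces Xhat
            assert np.isclose(np.sum(np.abs(Phi)), np.sum(col_norms))   # ||Phi||_{1,1} = ||Xhat'||_{q,1}
            return B, Phi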

  • Example: sparse coding

    Outcome
    Sparse coding with ‖·‖_{1,1} regularization = vector quantization
    • drops some examples
    • memorizes remaining examples

    Optimal solution is not overcomplete

    Could not make these observations using local solvers

  • Simple extensions

    • Missing observations in X
    • Robustness to outliers in X

    min_{S∈R^{n×t}} min_{X̂∈R^{n×t}} L( (X̂ + S)_Ω ; X_Ω ) + α|‖X̂‖| + β‖S‖_{1,1}

    Ω = observed indices in X
    S = speckled outlier noise
    (jointly convex in X̂ and S)
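    Written out for squared loss on the observed entries and the trace-norm choice of |‖·‖|, the extended objective is straightforward to evaluate (the mask and parameter names are placeholders):

        import numpy as np

        def robust_objective(Xhat, S, X, mask, alpha, beta):
            """L((Xhat + S)_Omega ; X_Omega) + alpha ||Xhat||_tr + beta ||S||_{1,1},
            with squared loss restricted to the observed entries (mask = Omega)."""
            resid = (Xhat + S - X) * mask                      # compare only observed entries
            loss = 0.5 * np.sum(resid ** 2)
            trace_norm = np.sum(np.linalg.svd(Xhat, compute_uv=False))
            return loss + alpha * trace_norm + beta * np.sum(np.abs(S))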

  • Explaining the useful result

    Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α|‖X̂‖|

    for an induced matrix norm |‖X̂‖| = ‖X̂′‖*_{(B,p*)}

    A dual norm
    ‖X̂′‖*_{(B,p*)} = max_{‖Λ′‖_{(B,p*)}≤1} tr(Λ′X̂)
    (standard definition of a dual norm)

    of a vector-norm induced matrix norm
    ‖Λ′‖_{(B,p*)} = max_{b∈B} ‖Λ′b‖_{p*}
    (easy to prove this yields a norm on matrices)
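    For the Euclidean ball B = B_2 and p* = 2 the induced norm is just the spectral norm, which gives a quick numeric sanity check (the random sampling only provides a lower bound on the max):

        import numpy as np

        rng = np.random.default_rng(0)
        Lam = rng.standard_normal((6, 4))

        # induced norm for the Euclidean ball: max_{||b||_2 <= 1} ||Lam' b||_2 = sigma_max(Lam)
        spectral = np.linalg.norm(Lam, 2)

        # crude check by sampling unit vectors b
        bs = rng.standard_normal((6, 100000))
        bs /= np.linalg.norm(bs, axis=0)
        sampled = np.max(np.linalg.norm(Lam.T @ bs, axis=0))
        print(spectral, sampled)        # sampled <= spectral, and close with many samples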

  • Proof outline

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}

    = min_{X̂∈R^{n×t}} min_{B∈B^∞} min_{Φ:BΦ=X̂} L(X̂; X) + α‖Φ‖_{p,1}

    = min_{X̂∈R^{n×t}} L(X̂; X) + α min_{B∈B^∞} min_{Φ:BΦ=X̂} ‖Φ‖_{p,1}

    For any B ∈ B^∞ that spans the columns of X̂

    min_{Φ:BΦ=X̂} ‖Φ‖_{p,1} = min_Φ max_Λ max_{‖V‖_{p*,∞}≤1} tr(V′Φ) + tr(Λ′(X̂ − BΦ))
    = max_{‖V‖_{p*,∞}≤1} max_Λ min_Φ tr(Λ′X̂) + tr(Φ′(V − B′Λ))
    = max_{‖V‖_{p*,∞}≤1} max_{Λ:B′Λ=V} tr(Λ′X̂)
    = max_{Λ:‖B′Λ‖_{p*,∞}≤1} tr(Λ′X̂)

    Therefore

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}

    = min_{X̂∈R^{n×t}} L(X̂; X) + α min_{B∈B^∞} max_{Λ:‖B′Λ‖_{p*,∞}≤1} tr(Λ′X̂)

    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖B′Λ‖_{p*,∞}≤1, ∀B∈B^∞} tr(Λ′X̂)

    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖b′Λ‖_{p*}≤1, ∀b∈B} tr(Λ′X̂)

    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖Λ′‖_{(B,p*)}≤1} tr(Λ′X̂)

    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

    done

  • Closed form induced norms

    Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

    Special cases

    B_2, ‖Φ‖_{2,1} ↦ ‖X̂′‖*_{(B_2,2)} = ‖X̂‖_tr        (subspace learning)
    B_q, ‖Φ‖_{1,1} ↦ ‖X̂′‖*_{(B_q,∞)} = ‖X̂′‖_{q,1}     (sparse coding)
    B_1, ‖Φ‖_{p,1} ↦ ‖X̂′‖*_{(B_1,p*)} = ‖X̂‖_{p,1}
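    Each of these special cases is cheap to compute; a short sketch (q and p are free parameters):

        import numpy as np

        def trace_norm(Xhat):
            """||Xhat||_tr = sum of singular values  (B_2, ||Phi||_{2,1} case)."""
            return np.sum(np.linalg.svd(Xhat, compute_uv=False))

        def q1_norm_of_transpose(Xhat, q):
            """||Xhat'||_{q,1} = sum over columns j of ||Xhat_{:j}||_q  (B_q, ||Phi||_{1,1} case)."""
            return np.sum(np.linalg.norm(Xhat, ord=q, axis=0))

        def p1_norm(Xhat, p):
            """||Xhat||_{p,1} = sum over rows i of ||Xhat_{i:}||_p  (B_1, ||Phi||_{p,1} case)."""
            return np.sum(np.linalg.norm(Xhat, ord=p, axis=1))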

  • Some simple experiments

  • Experimental results

    Alternate : repeatedly optimize over B, Φ successively
    Global : recover global joint minimizer over B, Φ

  • Experimental results: Sparse coding

    Objective value achieved (×10^−2)

    Method      COIL    WBC     BCI     Ionos   G241N
    Alternate   1.314   4.918   0.898   1.612   1.312
    Global      0.207   0.659   0.306   0.330   0.207

    (squared loss, q = 2, α = 10^−5)

  • Experimental results: Sparse coding

    Run time (seconds)

    Method      COIL    WBC     BCI    Ionos   G241N
    Alternate   1.95    10.54   0.88   1.71    2.37
    Global      0.06    0.01    0.01   0.01    0.09

    (squared loss, q = 2, α = 10^−5)

  • Experimental results: Subspace learning

    Objective value achieved (×10^−2)

    Method      COIL    WBC     BCI     Ionos   G241N
    Alternate   1.314   4.957   0.903   1.632   1.313
    Global      0.072   0.072   0.092   0.079   0.205

    (squared loss, α = 10^−5)

  • Experimental results: Subspace learning

    Run time (seconds)

    Method      COIL   WBC    BCI    Ionos   G241N
    Alternate   2.40   9.31   1.12   0.47    2.43
    Global      2.18   0.06   0.19   0.06    2.11

    (squared loss, α = 10^−5)

  • Catch

    Every norm is convex
    But not every induced matrix norm is tractable

    ‖X‖_2 = σ_max(X)
    ‖X‖_1 = max_j ∑_i |X_ij|
    ‖X‖_∞ = max_i ∑_j |X_ij|
    ‖X‖_p NP-hard to approximate for p ≠ 1, 2, ∞

    Question
    Any other useful induced matrix norms that are tractable?

    Yes!
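    The three tractable induced p-norms are available directly in numpy:

        import numpy as np

        X = np.array([[1.0, -2.0],
                      [3.0,  4.0]])
        print(np.linalg.norm(X, 2))       # ||X||_2   = sigma_max(X)
        print(np.linalg.norm(X, 1))       # ||X||_1   = max_j sum_i |X_ij|  (max column sum)
        print(np.linalg.norm(X, np.inf))  # ||X||_inf = max_i sum_j |X_ij|  (max row sum)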

  • Semi-supervised feature discovery

  • Semi-supervised feature discovery

    [ Xl Xu ; Yl ∅ ] ≈ [ B ; W ] [ Φl Φu ]
    ((n+k)×(tl+tu) target: labeled inputs Xl with outputs Yl, plus unlabeled inputs Xu)

    tl = # labeled examples        n = # original features
    tu = # unlabeled examples      k = # output dimensions
    t = tl + tu

    Learn
    Φ = [Φl , Φu]   data representation
    B               input reconstruction model    f(BΦ) ≈ X
    W               output reconstruction model   h(WΦl) ≈ Yl

  • Semi-supervised feature discovery

    Let
    Z = [ Xl Xu ; Yl ∅ ]        U = [ B ; W ]

    Formulation

    min_{B∈B^∞} min_{W∈W^∞} min_{Φ∈R^{∞×t}} L_u(BΦ; X) + β L_s(WΦl; Yl) + α‖Φ‖_{p,1}
    = min_{Ẑ∈R^{(n+k)×t}} L̃(Ẑ; Z) + α‖Ẑ′‖*_{(U,p*)}

    Note
    Imposing separate constraints on B and W

    Questions
    • Is the induced norm ‖Ẑ′‖*_{(U,p*)} efficiently computable?
    • Can optimal B, W, Φ be recovered from optimal Ẑ?

  • Example: sparse coding formulation

    Regularizer:  ‖Φ‖_{1,1}
    Constraints:  B_{q1} = {b : ‖b‖_{q1} ≤ 1}
                  W_{q2} = {w : ‖w‖_{q2} ≤ γ}
                  U_{q1q2} = B × W

    Theorem
    ‖Ẑ′‖*_{(U_{q1q2},∞)} = ∑_j max( ‖Ẑ^X_{:j}‖_{q1} , (1/γ)‖Ẑ^Y_{:j}‖_{q2} )
    efficiently computable

    Recovery
    Φ_{jj} = max( ‖Ẑ^X_{:j}‖_{q1} , (1/γ)‖Ẑ^Y_{:j}‖_{q2} )   (diagonal matrix)
    U = ẐΦ^{-1}

    Preserves optimality

    But still reduces to a form of vector quantization
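    Both the norm and the recovery are simple column-wise computations; a short sketch (n is the number of input rows of Ẑ, and q1, q2, γ are free parameters):

        import numpy as np

        def semi_sup_sparse_norm(Zhat, n, q1, q2, gamma):
            """||Zhat'||*_{(U_{q1 q2}, inf)} = sum_j max(||Zhat^X_:j||_{q1}, (1/gamma)||Zhat^Y_:j||_{q2})."""
            ZX, ZY = Zhat[:n], Zhat[n:]
            col = np.maximum(np.linalg.norm(ZX, ord=q1, axis=0),
                             np.linalg.norm(ZY, ord=q2, axis=0) / gamma)
            return np.sum(col), col

        def recover_semi_sup_sparse(Zhat, n, q1, q2, gamma):
            """Phi is diagonal with Phi_jj equal to the column-wise max above, and U = Zhat Phi^{-1}."""
            _, col = semi_sup_sparse_norm(Zhat, n, q1, q2, gamma)
            safe = np.where(col > 0, col, 1.0)
            Phi = np.diag(col)
            U = Zhat / safe
            return U, Phi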

  • Example: subspace learning formulation

    Regularizer:  ‖Φ‖_{2,1}
    Constraints:  B_2 = {b : ‖b‖_2 ≤ 1}
                  W_2 = {w : ‖w‖_2 ≤ γ}
                  U_{22} = B × W

    Theorem
    ‖Ẑ′‖*_{(U_{22},2)} = max_{ρ≥0} ‖D_ρ^{-1} Ẑ‖_tr   where   D_ρ = [ √(1+γρ) I , 0 ; 0 , √((1+γρ)/ρ) I ]
    efficiently computable: quasi-concave in ρ
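    Since the objective is quasi-concave in ρ, a 1-D search suffices to evaluate the norm; a sketch using a scalar optimizer over log ρ (the optimizer and search bounds are a convenience choice, not the method from the talk):

        import numpy as np
        from scipy.optimize import minimize_scalar

        def semi_sup_subspace_norm(Zhat, n, gamma):
            """max_{rho >= 0} ||D_rho^{-1} Zhat||_tr with
            D_rho = diag( sqrt(1 + gamma*rho) I_n , sqrt((1 + gamma*rho)/rho) I_k )."""
            k = Zhat.shape[0] - n

            def neg_obj(log_rho):
                rho = np.exp(log_rho)
                scale = np.concatenate([np.full(n, 1.0 / np.sqrt(1.0 + gamma * rho)),
                                        np.full(k, np.sqrt(rho / (1.0 + gamma * rho)))])
                return -np.sum(np.linalg.svd(scale[:, None] * Zhat, compute_uv=False))

            res = minimize_scalar(neg_obj, bounds=(-10.0, 10.0), method='bounded')
            return -res.fun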

  • Example: subspace learning formulation

    Lemma: dual norm

    ‖Λ′‖²_{(U_{22},2)} = max_{h : ‖h^X‖_2=1, ‖h^Y‖_2=γ} h′ΛΛ′h
    = max_{H⪰0, tr(HI_X)=1, tr(HI_Y)=γ} tr(HΛΛ′)
    = min_{λ≥0, ν≥0 : ΛΛ′ ⪯ λI_X + νI_Y} λ + γν
    = min_{λ≥0, ν≥0 : ‖D_{ν/λ}Λ‖²_sp ≤ λ + γν} λ + γν
    = min_{λ≥0, ν≥0} ‖D_{ν/λ}Λ‖²_sp
    = min_{ρ≥0} ‖D_ρΛ‖²_sp

  • Example: subspace learning formulation

    Can easily derive target norm from dual norm

    ‖Ẑ′‖*_{(U_{22},2)} = max_{‖Λ′‖_{(U_{22},2)}≤1} tr(Λ′Ẑ)
    = max_{ρ≥0} max_{Λ:‖D_ρΛ‖_sp≤1} tr(Λ′Ẑ)
    = max_{ρ≥0} max_{Λ̃:‖Λ̃‖_sp≤1} tr(Λ̃′D_ρ^{-1}Ẑ)
    = max_{ρ≥0} ‖D_ρ^{-1}Ẑ‖_tr

    (proves theorem)

  • Example: subspace learning formulation

    Computational strategy

    Solve in dual, since ‖Λ′‖_{(U_{22},2)} can be computed efficiently via partitioned power method iteration

    min_Λ L̃⋆(Λ; Z) + α⋆‖Λ′‖_{(U_{22},2)}

    Given Λ̂
    • Recover Ẑ^X and Ẑ^{Yl} by solving
      min_{Ẑ^X, Ẑ^Y} L_u(Ẑ^X; X) + L_s(Ẑ^{Yl}; Yl) − tr(Ẑ^{X′}Λ̂^X) − tr(Ẑ^{Yl′}Λ̂^{Yl})
    • Recover Ẑ^{Yu} by minimizing ‖Ẑ′‖_{(U_{22},2)} (keeping Ẑ^X, Ẑ^{Yl} fixed)

  • Example: subspace learning formulation

    Recovery

    Given optimal Ẑ, recover U and Φ iteratively by repeating:

    • (Φ^(ℓ), Λ^(ℓ)) ∈ argmin_Φ max_Λ ‖Φ‖_{2,1} + tr(Λ′(Ẑ − U^(ℓ)Φ))
    • u^(ℓ+1) ∈ argmax_{u∈U_{22}} ‖u′Λ^(ℓ)‖_2
    • U^(ℓ+1) = [U^(ℓ), u^(ℓ+1)]

    Converges to optimal U and Φ
    • U^(ℓ)Φ^(ℓ) = Ẑ for all ℓ
    • ‖Φ^(ℓ)‖_{2,1} → ‖Ẑ′‖*_{(U_{22},2)}

  • Some simple experiments

  • Experimental results: Subspace learning

    Staged : first locally optimize B, Φ, then optimize W
    Alternate : repeatedly optimize over B, W, Φ successively
    Global : recover joint global minimizer over B, Φ, W

  • Experimental results: Subspace learning

    Objective value achieved

    Method      COIL    WBC     BCI     Ionos   G241N
    Staged      1.384   1.321   0.799   0.769   1.381
    Alternate   0.076   0.122   0.609   0.081   0.076
    Global      0.070   0.113   0.069   0.078   0.070

    (1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

  • Experimental results: Subspace learning

    Run time (seconds)

    Method      COIL   WBC   BCI   Ionos   G241N
    Staged       272    73    45     28     290
    Alternate   2352   324   227    112    2648
    Global       106     8    25     61      94

    (1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

  • Experimental results: Subspace learning

    Transductive generalization error

    Method                  COIL    WBC     BCI     Ionos   G241N
    Staged                  0.476   0.200   0.452   0.335   0.484
    Alternate               0.464   0.388   0.440   0.457   0.478
    Global                  0.388   0.134   0.380   0.243   0.380
    (Lee et al. 2009)       0.414   0.168   0.436   0.350   0.452
    (Goldberg et al. 2010)  0.484   0.288   0.540   0.338   0.524

    (1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

  • Conclusion

    Global training can be more efficient than local training

    Alternation is inherently slow to converge

    Global training simplifies practical application

    • no under-training
    • only need to guard against over-fitting
    • can use standard regularization techniques

    Outline: Introduction · Unsupervised feature discovery · A useful observation · Experiments (unsupervised) · Semi-supervised feature discovery · Experiments (semi-supervised) · Conclusion