Transcript of Part 3: Latent representations and unsupervised learning · 2012-07-11

Part 3: Latent representations and unsupervised learning
Dale Schuurmans
University of Alberta
Supervised versus unsupervised learning
Prominent training principles
Discriminative
[diagram: model predicts y from x through features φ(x)]
typical for supervised

Generative
[diagram: model reconstructs x through latent features φ]
typical for unsupervised
Unsupervised representation learning
Consider generative training
[diagram: generative model reconstructing x through latent features φ]
Unsupervised representation learning
Examples
• dimensionality reduction (PCA, exponential family PCA)
• sparse coding
• independent component analysis
• deep learning ...

Usually involves learning both a latent representation for data and a data reconstruction model

Context could be: unsupervised, semi-supervised, or supervised
Challenge
Optimal feature discovery appears to be generally intractable
Have to jointly train
• latent representation
• data reconstruction model

Usually resort to alternating minimization (sole exception: PCA)
First consider unsupervised feature discovery
Unsupervised feature discovery
Single layer case = matrix factorization
    X    ≈    B    Φ
  (n×t)    (n×m) (m×t)
original data ≈ learned dictionary × new representation

t = # training examples, n = # original features, m = # new features

Choose B and Φ to minimize data reconstruction loss
    L(BΦ; X) = ∑_{i=1}^t L(BΦ_{:i}; X_{:i})

Seek desired structure in latent feature representation
    Φ low rank : dimensionality reduction
    Φ sparse : sparse coding
    Φ rows independent : independent component analysis
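For concreteness, here is a minimal numpy sketch (my addition, not from the talk) of the factorization objective; names and sizes are illustrative, and squared loss is assumed:

```python
import numpy as np

def reconstruction_loss(B, Phi, X):
    # Squared-error instance of L(B Phi; X) = sum_i L(B Phi_{:i}; X_{:i}).
    return 0.5 * np.sum((B @ Phi - X) ** 2)

rng = np.random.default_rng(0)
n, m, t = 10, 3, 50                  # original features, new features, examples
X = rng.standard_normal((n, t))      # original data
B = rng.standard_normal((n, m))      # learned dictionary
Phi = rng.standard_normal((m, t))    # new representation
print(reconstruction_loss(B, Phi, X))
```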
Generalized matrix factorization

Assume reconstruction loss L(x̂; x) is convex in first argument

Bregman divergence
    L(x̂; x) = D_F(x̂ ‖ x) = D_{F*}(f(x) ‖ f(x̂))
    (F strictly convex potential with transfer f = ∇F)
Tries to make x̂ ≈ x

Matching loss
    L(x̂; x) = D_F(x ‖ f⁻¹(x̂)) = D_{F*}(x̂ ‖ f(x))
Tries to make f⁻¹(x̂) ≈ x, i.e. x̂ ≈ f(x)
(a nonlinear predictor, but loss still convex in x̂)

Regular exponential family
    L(x̂; x) = −log p_B(x | φ) = D_F(x ‖ f⁻¹(x̂)) − F*(x) − const
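As an illustration of the Bregman form (my addition, not the speaker's), two standard potentials: F(x) = ½‖x‖² recovers squared loss, and F(x) = ∑ᵢ xᵢ log xᵢ recovers KL divergence on the simplex.

```python
import numpy as np

def bregman(F, f, xhat, x):
    # D_F(xhat || x) = F(xhat) - F(x) - <f(x), xhat - x>, with transfer f = grad F.
    return F(xhat) - F(x) - f(x) @ (xhat - x)

F_sq = lambda v: 0.5 * v @ v               # potential for squared loss
f_sq = lambda v: v                         # transfer = identity
F_kl = lambda v: np.sum(v * np.log(v))     # negative entropy potential
f_kl = lambda v: np.log(v) + 1.0           # transfer = log + 1

xhat = np.array([0.2, 0.5, 0.3]); x = np.array([0.3, 0.4, 0.3])
print(bregman(F_sq, f_sq, xhat, x))        # = 0.5 * ||xhat - x||^2
print(bregman(F_kl, f_kl, xhat, x))        # = KL(xhat || x) on the simplex
```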
Training problem

    min_{B∈R^{n×m}} min_{Φ∈R^{m×t}} L(BΦ; X)

How to impose desired structure on Φ?
Dimensionality reduction
• Fix # features m < min(n, t)
• But only known to be tractable if L(X̂; X) = ‖X − X̂‖²_F (PCA)
• No known efficient algorithm for other standard losses

Problem
rank(Φ) = m constraint is too hard
Training problem

    min_{B∈B₂^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖_{2,1}

How to impose desired structure on Φ?

Relaxed dimensionality reduction (subspace learning)
Add rank reducing regularizer
    ‖Φ‖_{2,1} = ∑_{j=1}^m ‖Φ_{j:}‖₂
Favors null rows in Φ
But need to add constraint to B
    B_{:j} ∈ B₂ = {b : ‖b‖₂ ≤ 1}
(Otherwise can make Φ small just by making B large)
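Why the (2,1)-norm favors null rows can be seen from its proximal operator, sketched below (my illustration, not from the talk): each row's 2-norm is shrunk, and rows whose norm falls below the threshold are zeroed entirely.

```python
import numpy as np

def prox_l21(Phi, alpha):
    # prox of alpha * ||Phi||_{2,1}: rows with small 2-norm become exactly zero,
    # which is what drives down the rank of Phi.
    norms = np.linalg.norm(Phi, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - alpha / np.maximum(norms, 1e-12)) * Phi
```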
Training problem

    min_{B∈B_q^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖_{1,1}

How to impose desired structure on Φ?

Sparse coding
Use sparsity inducing regularizer
    ‖Φ‖_{1,1} = ∑_{j=1}^m ∑_{i=1}^t |Φ_{ji}|
Favors sparse entries in Φ
Need to add constraint to B
    B_{:j} ∈ B_q = {b : ‖b‖_q ≤ 1}
(Otherwise can make Φ small just by making B large)
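By contrast, the (1,1)-norm's proximal operator (again my illustration) soft-thresholds individual entries, producing entrywise rather than row-wise sparsity.

```python
import numpy as np

def prox_l11(Phi, alpha):
    # prox of alpha * ||Phi||_{1,1}: entrywise soft thresholding.
    return np.sign(Phi) * np.maximum(np.abs(Phi) - alpha, 0.0)
```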
Training problem

    min_{B∈R^{n×m}} min_{Φ∈R^{m×t}} L(BΦ; X) + α D(Φ)

How to impose desired structure on Φ?

Independent component analysis
• Usually enforces BΦ = X as a constraint, but interpolation is generally a bad idea
• Instead just minimize reconstruction loss plus a dependence measure D(Φ) as a regularizer

Difficulty
Formulating a reasonable convex dependence penalty
Training problem

Consider subspace learning and sparse coding

    min_{B∈B^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖

Choice of ‖Φ‖ and B determines type of representation recovered

Problem
Still have rank constraint imposed by # new features m
Idea
Just relax m → ∞
• Rely on sparsity inducing norm ‖Φ‖ to select features

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖
Training problem

Consider subspace learning and sparse coding

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

Still have a problem
Optimization problem is not jointly convex in B and Φ

Idea 1: Alternate! (see the sketch after this list)
• convex in B given Φ
• convex in Φ given B
• could use any other form of local training

Idea 2: Boost!
• implicitly fix B to universal dictionary
• keep row-wise sparse Φ
• incrementally select column of B (“weak learning problem”)
• update sparse Φ
• can prove convergence under broad conditions

Idea 3: Solve!
• can easily solve for globally optimal joint B and Φ
• but requires a significant reformulation
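A minimal sketch of Idea 1 for squared loss (my simplification: small ridge terms replace the talk's norm constraint and regularizer, so each step is an unconstrained least-squares solve):

```python
import numpy as np

def alternate(X, m, alpha=1e-3, iters=50, seed=0):
    # Alternating minimization: each subproblem is convex given the other block.
    rng = np.random.default_rng(seed)
    n, t = X.shape
    B = rng.standard_normal((n, m))
    Phi = None
    for _ in range(iters):
        # Phi-step: min_Phi ||B Phi - X||_F^2 + alpha ||Phi||_F^2
        Phi = np.linalg.solve(B.T @ B + alpha * np.eye(m), B.T @ X)
        # B-step: min_B ||B Phi - X||_F^2 + alpha ||B||_F^2
        B = np.linalg.solve(Phi @ Phi.T + alpha * np.eye(m), Phi @ X.T).T
    return B, Phi   # a local solution only: the joint problem is non-convex
```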
A useful observation

Equivalent reformulation

Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1} = min_{X̂∈R^{n×t}} L(X̂; X) + α|‖X̂‖|
• |‖·‖| is an induced matrix norm on X̂ determined by B and ‖·‖_{p,1}

Important fact
Norms are always convex

Computational strategy
1. Solve for optimal response matrix X̂ first (convex minimization)
2. Then recover optimal B and Φ from X̂
Example: subspace learning

    min_{B∈B₂^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{2,1} = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂‖_tr

Recovery
• Let UΣV′ = svd(X̂)
• Set B = U and Φ = ΣV′

Preserves optimality
• ‖B_{:j}‖₂ = 1, hence B ∈ B₂^n
• ‖Φ‖_{2,1} = ‖ΣV′‖_{2,1} = ∑_j σ_j ‖V_{:j}‖₂ = ∑_j σ_j = ‖X̂‖_tr

Thus L(X̂; X) + α‖X̂‖_tr = L(BΦ; X) + α‖Φ‖_{2,1}
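For squared loss the convex step even has a closed form: the trace-norm problem is solved by singular value soft-thresholding, after which B and Φ are read off the SVD exactly as above. A sketch (mine, assuming L(X̂; X) = ½‖X̂ − X‖²_F):

```python
import numpy as np

def global_subspace(X, alpha):
    # min_Xhat 0.5*||Xhat - X||_F^2 + alpha*||Xhat||_tr
    # is solved by soft-thresholding the singular values of X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_thr = np.maximum(s - alpha, 0.0)
    B = U                          # unit 2-norm columns, so B is feasible
    Phi = np.diag(s_thr) @ Vt      # ||Phi||_{2,1} = sum of thresholded singular values
    return B, Phi
```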
Example: sparse coding

    min_{B∈B_q^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{1,1} = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖_{q,1}

Recovery
    B = [ X̂_{:1}/‖X̂_{:1}‖_q , ... , X̂_{:t}/‖X̂_{:t}‖_q ]   (rescaled columns)
    Φ = diag( ‖X̂_{:1}‖_q , ... , ‖X̂_{:t}‖_q )              (diagonal matrix)

Preserves optimality
• ‖B_{:j}‖_q = 1, hence B ∈ B_q^t
• ‖Φ‖_{1,1} = ∑_j ‖X̂_{:j}‖_q = ‖X̂′‖_{q,1}

Thus L(X̂; X) + α‖X̂′‖_{q,1} = L(BΦ; X) + α‖Φ‖_{1,1}
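The recovery step is direct to implement; a sketch (mine) for a given optimal response matrix X̂:

```python
import numpy as np

def recover_sparse_coding(Xhat, q=2):
    # B = columns of Xhat rescaled to unit q-norm; Phi = diagonal of those norms.
    # Columns of Xhat driven to zero correspond to dropped examples.
    norms = np.linalg.norm(Xhat, ord=q, axis=0)
    keep = norms > 1e-12
    B = Xhat[:, keep] / norms[keep]
    Phi = np.diag(norms[keep])     # B @ Phi reproduces the kept columns of Xhat
    return B, Phi
```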
Example: sparse coding

Outcome
Sparse coding with ‖·‖_{1,1} regularization = vector quantization
• drops some examples
• memorizes remaining examples

Optimal solution is not overcomplete
Could not make these observations using local solvers
Simple extensions
• Missing observations in X
• Robustness to outliers in X

    min_{S∈R^{n×t}} min_{X̂∈R^{n×t}} L( (X̂ + S)_Ω ; X_Ω ) + α|‖X̂‖| + β‖S‖_{1,1}

Ω = observed indices in X
S = speckled outlier noise
(jointly convex in X̂ and S)
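A sketch of this jointly convex objective (mine, taking squared loss and the trace norm as the induced norm):

```python
import numpy as np

def robust_objective(Xhat, S, X, mask, alpha, beta):
    # Loss is measured only on observed indices Omega (mask == 1);
    # S absorbs speckled outliers via the l1 penalty.
    resid = (Xhat + S - X) * mask
    return (0.5 * np.sum(resid ** 2)
            + alpha * np.linalg.norm(Xhat, 'nuc')   # |||Xhat||| = trace norm here
            + beta * np.sum(np.abs(S)))
```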
Explaining the useful result

Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1} = min_{X̂∈R^{n×t}} L(X̂; X) + α|‖X̂‖|
for an induced matrix norm |‖X̂‖| = ‖X̂′‖*_{(B,p*)}

A dual norm
    ‖X̂′‖*_{(B,p*)} = max_{‖Λ′‖_{(B,p*)}≤1} tr(Λ′X̂)
(standard definition of a dual norm)

of a vector-norm induced matrix norm
    ‖Λ′‖_{(B,p*)} = max_{b∈B} ‖Λ′b‖_{p*}
(easy to prove this yields a norm on matrices)
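A quick numerical illustration (mine): for B = B₂ and p* = 2, the induced norm ‖Λ′‖_{(B₂,2)} = max_{‖b‖₂≤1} ‖Λ′b‖₂ is just the spectral norm of Λ, which a random search over unit vectors approaches from below.

```python
import numpy as np

rng = np.random.default_rng(1)
Lam = rng.standard_normal((5, 8))

exact = np.linalg.norm(Lam, 2)              # spectral norm sigma_max(Lam)
b = rng.standard_normal((5, 100000))
b /= np.linalg.norm(b, axis=0)              # random points on the boundary of B_2
approx = np.linalg.norm(Lam.T @ b, axis=0).max()
print(exact, approx)                        # approx <= exact, nearly equal
```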
Proof outline

    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
    = min_{X̂∈R^{n×t}} min_{B∈B^∞} min_{Φ:BΦ=X̂} L(X̂; X) + α‖Φ‖_{p,1}
    = min_{X̂∈R^{n×t}} L(X̂; X) + α min_{B∈B^∞} min_{Φ:BΦ=X̂} ‖Φ‖_{p,1}

For any B ∈ B^∞ that spans the columns of X̂:

    min_{Φ:BΦ=X̂} ‖Φ‖_{p,1}
    = min_Φ max_Λ max_{‖V‖_{p*,∞}≤1} tr(V′Φ) + tr(Λ′(X̂ − BΦ))
    = max_{‖V‖_{p*,∞}≤1} max_Λ min_Φ tr(Λ′X̂) + tr(Φ′(V − B′Λ))
    = max_{‖V‖_{p*,∞}≤1} max_{Λ:B′Λ=V} tr(Λ′X̂)
    = max_{Λ:‖B′Λ‖_{p*,∞}≤1} tr(Λ′X̂)

Substituting the inner minimization back into the chain:

    = min_{X̂∈R^{n×t}} L(X̂; X) + α min_{B∈B^∞} max_{Λ:‖B′Λ‖_{p*,∞}≤1} tr(Λ′X̂)
    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖B′Λ‖_{p*,∞}≤1 ∀B∈B^∞} tr(Λ′X̂)
      (adding columns to B only adds constraints on Λ, so the min over B ∈ B^∞
      is attained by the universal dictionary containing every b ∈ B)
    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖b′Λ‖_{p*}≤1 ∀b∈B} tr(Λ′X̂)
    = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ:‖Λ′‖_{(B,p*)}≤1} tr(Λ′X̂)
    = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

done
Closed form induced norms

Theorem
    min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1} = min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

Special cases
    B₂,  ‖Φ‖_{2,1}  ↦  ‖X̂′‖*_{(B₂,2)} = ‖X̂‖_tr         (subspace learning)
    B_q, ‖Φ‖_{1,1}  ↦  ‖X̂′‖*_{(B_q,∞)} = ‖X̂′‖_{q,1}    (sparse coding)
    B₁,  ‖Φ‖_{p,1}  ↦  ‖X̂′‖*_{(B₁,p*)} = ‖X̂‖_{p,1}
Some simple experiments
Experimental results
Alternate : repeatedly optimize over B, Φ successively
Global : recover global joint minimizer over B, Φ

Experimental results: Sparse coding

Objective value achieved (×10⁻²)

            COIL    WBC     BCI     Ionos   G241N
Alternate   1.314   4.918   0.898   1.612   1.312
Global      0.207   0.659   0.306   0.330   0.207

(squared loss, q = 2, α = 10⁻⁵)

Run time (seconds)

            COIL    WBC     BCI     Ionos   G241N
Alternate   1.95    10.54   0.88    1.71    2.37
Global      0.06    0.01    0.01    0.01    0.09

(squared loss, q = 2, α = 10⁻⁵)
Experimental results: Subspace learning

Objective value achieved (×10⁻²)

            COIL    WBC     BCI     Ionos   G241N
Alternate   1.314   4.957   0.903   1.632   1.313
Global      0.072   0.072   0.092   0.079   0.205

(squared loss, α = 10⁻⁵)

Run time (seconds)

            COIL    WBC     BCI     Ionos   G241N
Alternate   2.40    9.31    1.12    0.47    2.43
Global      2.18    0.06    0.19    0.06    2.11

(squared loss, α = 10⁻⁵)
Catch

Every norm is convex
But not every induced matrix norm is tractable

    ‖X‖₂ = σ_max(X)
    ‖X‖₁ = max_j ∑_i |X_ij|
    ‖X‖_∞ = max_i ∑_j |X_ij|
    ‖X‖_p NP-hard to approximate for p ≠ 1, 2, ∞
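The three tractable cases are all available directly in numpy, for instance:

```python
import numpy as np

X = np.array([[1.0, -2.0, 3.0],
              [0.5,  4.0, -1.0]])

print(np.linalg.norm(X, 2))       # ||X||_2   = sigma_max(X)
print(np.linalg.norm(X, 1))       # ||X||_1   = max_j sum_i |X_ij|
print(np.linalg.norm(X, np.inf))  # ||X||_inf = max_i sum_j |X_ij|
```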
Question
Any other useful induced matrix norms that are tractable?
Yes!
Semi-supervised feature discovery

    [ Xl  Xu ]   ≈   [ B ] [ Φl  Φu ]        ((n+k) × (tl+tu))
    [ Yl  ∅  ]       [ W ]

inputs: Xl (labeled), Xu (unlabeled); outputs: Yl
tl = # labeled, tu = # unlabeled, t = tl + tu
n = # original features, k = # output dimensions

[diagram: graphical model with B reconstructing x and W predicting y from φ]

Learn
Φ = [Φl, Φu]  data representation
B  = input reconstruction model,  f(BΦ) ≈ X
W  = output reconstruction model, h(WΦl) ≈ Yl
Semi-supervised feature discovery

Let
    Z = [ Xl  Xu ]        U = [ B ]
        [ Yl  ∅  ]            [ W ]

Formulation
    min_{B∈B^∞} min_{W∈W^∞} min_{Φ∈R^{∞×t}} Lu(BΦ; X) + β Ls(WΦl; Yl) + α‖Φ‖_{p,1}
    = min_{Ẑ∈R^{(n+k)×t}} L(Ẑ; Z) + α‖Ẑ′‖*_{(U,p*)}

Note
Imposing separate constraints on B and W

Questions
• Is the induced norm ‖Ẑ′‖*_{(U,p*)} efficiently computable?
• Can optimal B, W, Φ be recovered from optimal Ẑ?
Example: sparse coding formulation

Regularizer:  ‖Φ‖_{1,1}
Constraints:  B_{q1} = {b : ‖b‖_{q1} ≤ 1},  W_{q2} = {w : ‖w‖_{q2} ≤ γ},  U_{q1,q2} = B_{q1} × W_{q2}

Theorem
    ‖Ẑ′‖*_{(U_{q1,q2},∞)} = ∑_j max( ‖Ẑ^X_{:j}‖_{q1} , (1/γ)‖Ẑ^Y_{:j}‖_{q2} )
efficiently computable

Recovery
    Φ_{jj} = max( ‖Ẑ^X_{:j}‖_{q1} , (1/γ)‖Ẑ^Y_{:j}‖_{q2} )   (diagonal matrix)
    U = Ẑ Φ⁻¹
Preserves optimality

But still reduces to a form of vector quantization
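The closed form and recovery are a few lines of numpy; a sketch (mine), with Ẑ stacked as [Ẑ^X; Ẑ^Y] and the first n rows the input block:

```python
import numpy as np

def ss_sparse_norm(Zhat, n, gamma, q1=2, q2=2):
    # Per-column max of input/output block norms, summed over columns.
    cx = np.linalg.norm(Zhat[:n], ord=q1, axis=0)         # ||Zhat^X_{:j}||_{q1}
    cy = np.linalg.norm(Zhat[n:], ord=q2, axis=0) / gamma # (1/gamma)||Zhat^Y_{:j}||_{q2}
    phi_diag = np.maximum(cx, cy)
    U = Zhat / np.maximum(phi_diag, 1e-12)                # recovery: U = Zhat Phi^{-1}
    return phi_diag.sum(), U, np.diag(phi_diag)
```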
Example: subspace learning formulation

Regularizer:  ‖Φ‖_{2,1}
Constraints:  B₂ = {b : ‖b‖₂ ≤ 1},  W₂ = {w : ‖w‖₂ ≤ γ},  U_{2,2} = B₂ × W₂

Theorem
    ‖Ẑ′‖*_{(U_{2,2},2)} = max_{ρ≥0} ‖D_ρ⁻¹ Ẑ‖_tr    where    D_ρ = diag( √(1+γρ)·I_X , √((1+γρ)/ρ)·I_Y )
efficiently computable: quasi-concave in ρ
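Since the objective is quasi-concave in ρ, the norm can be evaluated by a one-dimensional search; a sketch (mine, using scipy and a bounded search over log ρ as a heuristic):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ss_subspace_norm(Zhat, n, gamma):
    # max_{rho >= 0} || D_rho^{-1} Zhat ||_tr with D_rho as defined above.
    def neg(log_rho):
        rho = np.exp(log_rho)
        dx = 1.0 / np.sqrt(1.0 + gamma * rho)        # scales the input rows
        dy = np.sqrt(rho / (1.0 + gamma * rho))      # scales the output rows
        Zs = np.vstack([dx * Zhat[:n], dy * Zhat[n:]])
        return -np.linalg.norm(Zs, 'nuc')            # nuclear = trace norm
    res = minimize_scalar(neg, bounds=(-10.0, 10.0), method='bounded')
    return -res.fun
```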
Example: subspace learning formulation

Lemma: dual norm
    ‖Λ′‖²_{(U_{2,2},2)} = max_{h: ‖h^X‖₂=1, ‖h^Y‖₂=γ} h′ΛΛ′h
    = max_{H⪰0, tr(H I_X)=1, tr(H I_Y)=γ} tr(HΛΛ′)
    = min_{λ≥0, ν≥0: ΛΛ′ ⪯ λI_X + νI_Y} λ + γν
    = min_{λ≥0, ν≥0: ‖D_{ν/λ}Λ‖²_sp ≤ λ+γν} λ + γν
    = min_{λ≥0, ν≥0} ‖D_{ν/λ}Λ‖²_sp
    = min_{ρ≥0} ‖D_ρΛ‖²_sp
Example: subspace learning formulation

Can easily derive target norm from dual norm
    ‖Ẑ′‖*_{(U_{2,2},2)} = max_{‖Λ′‖_{(U_{2,2},2)}≤1} tr(Λ′Ẑ)
    = max_{ρ≥0} max_{Λ: ‖D_ρΛ‖_sp≤1} tr(Λ′Ẑ)
    = max_{ρ≥0} max_{Λ: ‖Λ‖_sp≤1} tr(Λ′ D_ρ⁻¹ Ẑ)
    = max_{ρ≥0} ‖D_ρ⁻¹ Ẑ‖_tr
(proves theorem)
Example: subspace learning formulation

Computational strategy
Solve in the dual, since ‖Λ′‖_{(U_{2,2},2)} can be computed efficiently via a partitioned power method iteration:

    min_Λ L⋆(Λ; Z) + α⋆‖Λ′‖_{(U_{2,2},2)}

Given Λ
• Recover Ẑ^X and Ẑ^Y_l by solving
    min_{Ẑ^X, Ẑ^Y_l} Lu(Ẑ^X; X) + Ls(Ẑ^Y_l; Yl) − tr(Ẑ^X′ Λ^X) − tr(Ẑ^Y_l′ Λ^Y_l)
• Recover Ẑ^Y_u by minimizing ‖Ẑ′‖*_{(U_{2,2},2)} (keeping Ẑ^X, Ẑ^Y_l fixed)
Example: subspace learning formulation

Recovery
Given optimal Ẑ, recover U and Φ iteratively by repeating:
• (Φ⁽ℓ⁾, Λ⁽ℓ⁾) ∈ arg min_Φ max_Λ ‖Φ‖_{2,1} + tr(Λ′(Ẑ − U⁽ℓ⁾Φ))
• u⁽ℓ⁺¹⁾ ∈ arg max_{u∈U_{2,2}} ‖u′Λ⁽ℓ⁾‖₂
• U⁽ℓ⁺¹⁾ = [U⁽ℓ⁾, u⁽ℓ⁺¹⁾]

Converges to optimal U and Φ
• U⁽ℓ⁾Φ⁽ℓ⁾ = Ẑ for all ℓ
• ‖Φ⁽ℓ⁾‖_{2,1} → ‖Ẑ′‖*_{(U_{2,2},2)}
Some simple experiments
Experimental results: Subspace learning
Staged : first locally optimize B, Φ, then optimize W
Alternate : repeatedly optimize over B, W, Φ successively
Global : recover joint global minimizer over B, Φ, W
Experimental results: Subspace learning

Objective value achieved

            COIL    WBC     BCI     Ionos   G241N
Staged      1.384   1.321   0.799   0.769   1.381
Alternate   0.076   0.122   0.609   0.081   0.076
Global      0.070   0.113   0.069   0.078   0.070

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

Run time (seconds)

            COIL    WBC     BCI     Ionos   G241N
Staged      272     73      45      28      290
Alternate   2352    324     227     112     2648
Global      106     8       25      61      94

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

Transductive generalization error

                        COIL    WBC     BCI     Ionos   G241N
Staged                  0.476   0.200   0.452   0.335   0.484
Alternate               0.464   0.388   0.440   0.457   0.478
Global                  0.388   0.134   0.380   0.243   0.380
(Lee et al. 2009)       0.414   0.168   0.436   0.350   0.452
(Goldberg et al. 2010)  0.484   0.288   0.540   0.338   0.524

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)
Conclusion
Global training can be more efficient than local training
Alternation is inherently slow to converge
Global training simplifies practical application
• no under-training
• only need to guard against over-fitting
• can use standard regularization techniques