Download - Tensor Methods for Feature Learning

Transcript

Tensor Methods for Feature Learning

Anima Anandkumar

U.C. Irvine

Page 2: Tensor Methods for Feature Learning

Feature Learning For Efficient Classification

Find good transformations of input for improved classification

Figures used attributed to Fei-Fei Li, Rob Fergus, Antonio Torralba, et al.

Page 3: Tensor Methods for Feature Learning

Principles Behind Feature Learning

Classification/regression tasks: Predict y given x.

Find feature transform φ(x) to better predict y.

Feature learning: Learn φ(·) from data.

Page 4: Tensor Methods for Feature Learning

Principles Behind Feature Learning

Classification/regression tasks: Predict y given x.

Find feature transform φ(x) to better predict y.

φ(x)

Feature learning: Learn φ(·) from data.

Page 5: Tensor Methods for Feature Learning

Principles Behind Feature Learning

Classification/regression tasks: Predict y given x.

Find feature transform φ(x) to better predict y.

φ(x)

Feature learning: Learn φ(·) from data.

Learning φ(x) from Labeled vs. Unlabeled Samples

Labeled samples {xi, yi} and unlabeled samples {xi}.

Labeled samples should lead to better feature learning φ(·) but areharder to obtain.

Page 6: Tensor Methods for Feature Learning

Principles Behind Feature Learning

Classification/regression tasks: Predict y given x.

Find feature transform φ(x) to better predict y.

φ(x)

Feature learning: Learn φ(·) from data.

Learning φ(x) from Labeled vs. Unlabeled Samples

Labeled samples {xi, yi} and unlabeled samples {xi}.

Labeled samples should lead to better feature learning φ(·) but areharder to obtain.

Learn features φ(x) through latent variables related to x, y.

Page 7: Tensor Methods for Feature Learning

Conditional Latent Variable Models: Two Cases

Page 8: Tensor Methods for Feature Learning

Conditional Latent Variable Models: Two Cases

φ(x)

φ(φ(x))

Page 9: Tensor Methods for Feature Learning

Conditional Latent Variable Models: Two Cases

Multi-layer Neural Networks

φ(x)

φ(φ(x))

Page 10: Tensor Methods for Feature Learning

Conditional Latent Variable Models: Two Cases

Multi-layer Neural Networks

E[y|x] = σ(Ad σ(Ad−1 σ(· · ·A2 σ(A1x))))

φ(x)

φ(φ(x))

Page 11: Tensor Methods for Feature Learning

Conditional Latent Variable Models: Two Cases

Multi-layer Neural Networks

E[y|x] = σ(Ad σ(Ad−1 σ(· · ·A2 σ(A1x))))

φ(x)

φ(φ(x))

Mixture of Classifiers or GLMs

G(x) := E[y|x, h] = σ(〈Uh, x〉 + 〈b, h〉)

Page 12: Tensor Methods for Feature Learning

Conditional Latent Variable Models: Two Cases

Multi-layer Neural Networks

E[y|x] = σ(Ad σ(Ad−1 σ(· · ·A2 σ(A1x))))

φ(x)

φ(φ(x))

Mixture of Classifiers or GLMs

G(x) := E[y|x, h] = σ(〈Uh, x〉 + 〈b, h〉)

Page 13: Tensor Methods for Feature Learning

Challenges in Learning LVMs

Challenge: Identifiability Conditions

When can model be identified (given infinite computation and data)?

Does identifiability also lead to tractable algorithms?

Computational Challenges

Maximum likelihood is NP-hard in most scenarios.

Practice: Local search approaches such as Back-propagation, EM,Variational Bayes have no consistency guarantees.

Sample Complexity

Sample complexity needs to be low for high-dimensional regime.

Guaranteed and efficient learning through tensor methods

Page 14: Tensor Methods for Feature Learning

Outline

1 Introduction

2 Spectral and Tensor Methods

3 Generative Models for Feature Learning

4 Proposed Framework

5 Conclusion

Page 15: Tensor Methods for Feature Learning

Classical Spectral Methods: Matrix PCA and CCA

Unsupervised Setting: PCAFor centered samples {xi}, find projection P withRank(P ) = k s.t.

minP

∑

i∈[n]

‖xi − Pxi‖2.

Result: Eigen-decomposition of S = Cov(X).

Supervised Setting: CCAFor centered samples {xi, yi}, find

maxa,b

a⊤E[xy⊤]b√

a⊤E[xx⊤]a b⊤E[yy⊤]b.

Result: Generalized eigen decomposition.

x y

〈a, x〉〈b, y〉

Page 16: Tensor Methods for Feature Learning

Beyond SVD: Spectral Methods on Tensors

How to learn the mixture models without separation constraints?

◮ PCA uses covariance matrix of data. Are higher order moments helpful?

Unified framework?

◮ Moment-based estimation of probabilistic latent variable models?

SVD gives spectral decomposition of matrices.◮ What are the analogues for tensors?

Page 17: Tensor Methods for Feature Learning

Moment Matrices and Tensors

Multivariate Moments

M1 := E[x], M2 := E[x⊗ x], M3 := E[x⊗ x⊗ x].

Matrix

E[x⊗ x] ∈ Rd×d is a second order tensor.

E[x⊗ x]i1,i2 = E[xi1xi2 ].

For matrices: E[x⊗ x] = E[xx⊤].

Tensor

E[x⊗ x⊗ x] ∈ Rd×d×d is a third order tensor.

E[x⊗ x⊗ x]i1,i2,i3 = E[xi1xi2xi3 ].

Page 18: Tensor Methods for Feature Learning

Spectral Decomposition of Tensors

M2 =∑

λiui ⊗ vi

= + ....

Matrix M2 λ1u1 ⊗ v1 λ2u2 ⊗ v2

M3 =∑

λiui ⊗ vi ⊗ wi

= + ....

Tensor M3 λ1u1 ⊗ v1 ⊗ w1 λ2u2 ⊗ v2 ⊗ w2

u⊗ v ⊗ w is a rank-1 tensor since its (i1, i2, i3)th entry is ui1vi2wi3 .

Guaranteed recovery. (Anandkumar et al 2012, Zhang & Golub 2001).

Page 19: Tensor Methods for Feature Learning

Moment Tensors for Conditional Models

Multivariate Moments: Many possibilities...

E[x⊗ y],E[x⊗ x⊗ y],E[φ(x)⊗ y] . . . .

Feature Transformations of the Input: x 7→ φ(x)

How to exploit them?

Are moments E[φ(x)⊗ y] useful?

If φ(x) is a matrix/tensor, we have matrix/tensor moments.

Can carry out spectral decomposition of the moments.

Page 20: Tensor Methods for Feature Learning

Moment Tensors for Conditional Models

Multivariate Moments: Many possibilities...

E[x⊗ y],E[x⊗ x⊗ y],E[φ(x)⊗ y] . . . .

Feature Transformations of the Input: x 7→ φ(x)

How to exploit them?

Are moments E[φ(x)⊗ y] useful?

If φ(x) is a matrix/tensor, we have matrix/tensor moments.

Can carry out spectral decomposition of the moments.

Construct φ(x) based on input distribution?

Page 21: Tensor Methods for Feature Learning

Outline

1 Introduction

2 Spectral and Tensor Methods

3 Generative Models for Feature Learning

4 Proposed Framework

5 Conclusion

Page 22: Tensor Methods for Feature Learning

Score Function of Input Distribution

Score function S(x) := −∇ log p(x)

1-d PDF

(a) p(x) = 1

Zexp(−E(x))

1-d Score

(b) ∂

∂xlog p(x) = − ∂

∂xE(x)

Figures from Alain and Bengio 2014.

Page 23: Tensor Methods for Feature Learning

Score Function of Input Distribution

Score function S(x) := −∇ log p(x)

1-d PDF

(a) p(x) = 1

Zexp(−E(x))

1-d Score

(b) ∂

∂xlog p(x) = − ∂

∂xE(x)

2-d Score

Figures from Alain and Bengio 2014.

Page 24: Tensor Methods for Feature Learning

Why Score Function Features?

S(x) := −∇ log p(x)

Utilizes generative models for input.

Can be learnt from unlabeled data.

Score matching methods work for non-normalized models.

Page 25: Tensor Methods for Feature Learning

Why Score Function Features?

S(x) := −∇ log p(x)

Utilizes generative models for input.

Can be learnt from unlabeled data.

Score matching methods work for non-normalized models.

Approximation of score function using denoising auto-encoders

∇ log p(x) ≈r∗(x+ n)− x

σ2

Page 26: Tensor Methods for Feature Learning

Why Score Function Features?

S(x) := −∇ log p(x)

Utilizes generative models for input.

Can be learnt from unlabeled data.

Score matching methods work for non-normalized models.

Approximation of score function using denoising auto-encoders

∇ log p(x) ≈r∗(x+ n)− x

σ2

Recall our goal: construct moments E[y ⊗ φ(x)]

Page 27: Tensor Methods for Feature Learning

Why Score Function Features?

S(x) := −∇ log p(x)

Utilizes generative models for input.

Can be learnt from unlabeled data.

Score matching methods work for non-normalized models.

Approximation of score function using denoising auto-encoders

∇ log p(x) ≈r∗(x+ n)− x

σ2

Recall our goal: construct moments E[y ⊗ φ(x)]

Beyond vector features?

Page 28: Tensor Methods for Feature Learning

Matrix and Tensor-valued Features

Higher order score functions

Sm(x) := (−1)m∇(m)p(x)

p(x)

Can be a matrix or a tensor instead of a vector.

Can be used to construct matrix and tensor moments E[y ⊗ φ(x)].

Page 29: Tensor Methods for Feature Learning

Outline

1 Introduction

2 Spectral and Tensor Methods

3 Generative Models for Feature Learning

4 Proposed Framework

5 Conclusion

Page 30: Tensor Methods for Feature Learning

Operations on Score Function Features

Form the cross-moments: E [y ⊗ Sm(x)].

Page 31: Tensor Methods for Feature Learning

Operations on Score Function Features

Form the cross-moments: E [y ⊗ Sm(x)].

Our result

E [y ⊗ Sm(x)] = E

[

∇(m)G(x)]

, G(x) := E[y|x].

Page 32: Tensor Methods for Feature Learning

Operations on Score Function Features

Form the cross-moments: E [y ⊗ Sm(x)].

Our result

E [y ⊗ Sm(x)] = E

[

∇(m)G(x)]

, G(x) := E[y|x].

Extension of Stein’s lemma.

Page 33: Tensor Methods for Feature Learning

Operations on Score Function Features

Form the cross-moments: E [y ⊗ Sm(x)].

Our result

E [y ⊗ Sm(x)] = E

[

∇(m)G(x)]

, G(x) := E[y|x].

Extension of Stein’s lemma.

Extract discriminative directions through spectral decomposition

E [y ⊗ Sm(x)] = E

[

∇(m)G(x)]

=∑

j∈[k]

λj · uj ⊗ uj . . .⊗ uj︸︷︷︸

m times.

Page 34: Tensor Methods for Feature Learning

Operations on Score Function Features

Form the cross-moments: E [y ⊗ Sm(x)].

Our result

E [y ⊗ Sm(x)] = E

[

∇(m)G(x)]

, G(x) := E[y|x].

Extension of Stein’s lemma.

Extract discriminative directions through spectral decomposition

E [y ⊗ Sm(x)] = E

[

∇(m)G(x)]

=∑

j∈[k]

λj · uj ⊗ uj . . .⊗ uj︸︷︷︸

m times.

Construct σ(u⊤j x) for some nonlinearity σ.

Page 35: Tensor Methods for Feature Learning

Automated Extraction of Discriminative Features

Page 36: Tensor Methods for Feature Learning

Learning Mixtures of Classifiers/GLMs

A mixture of r classifiers, hidden choice variable h ∈ {e1, . . . , er}.

E[y|x, h] = g(〈Uh, x〉 + 〈b, h〉)

∗ U = [u1|u2 . . . |ur] are the weight vectors of GLMs.

∗ b is the vector of biases.

M3 = E[y · S3(x)] =∑

i∈[r]

λi · ui ⊗ ui ⊗ ui.

First results for learning non-linear mixtures using spectral methods

Page 37: Tensor Methods for Feature Learning

Learning Multi-layer Neural Networks

E[y|x] = 〈a2, σ(A⊤1 x)〉 .

Our result

M3 = E[y · S3(x)] =∑

i∈[r]

λi · A1,i ⊗A1,i ⊗A1,i.

Page 38: Tensor Methods for Feature Learning

Framework Applied to MNISTUnlabeled data: {xi}

Auto-Encoder

Train SVM withσ(〈xi, uj〉)

r(x)Labeled data:{(xi, yi)}

uj ’s

Compute score functionSm(x) using r(x)

Form cross-moment:E [y · Sm(x)]

Spectral/tensor method:

= + ....

Tensor T = u⊗3

1+ u

⊗3

Page 39: Tensor Methods for Feature Learning

Outline

1 Introduction

2 Spectral and Tensor Methods

3 Generative Models for Feature Learning

4 Proposed Framework

5 Conclusion

Conclusion: Learning Conditional Models using

Tensor Methods

Tensor Decomposition

Efficient sample and computational complexities

Better performance compared to EM, Variational Bayes etc.

Scalable and embarrassingly parallel: handle large datasets.

Score function features

Score function features crucial for learning conditional models.

Related: Guaranteed Non-convex Methods

Overcomplete Dictionary Learning/Sparse Coding: Decompose datainto a sparse combination of unknown dictionary elements.

Non-convex robust PCA: Same guarantees as convex relaxationmethods, lower computational complexity. Extensions to tensorsetting.

Page 41: Tensor Methods for Feature Learning

Co-authors and Resources

Majid Janzamin Hanie Sedghi

Niranjan UN

Papers available at http://newport.eecs.uci.edu/anandkumar/

http://newport.eecs.uci.edu/anandkumar/

Top Related

Symmetry Energy, Tensor Force & Maximum … Energy, Tensor Force & Maximum Rotation of Neutron Stars ECT* EMMI Workshop “Neutron-rich matter & Neutron Stars” September 30 th –

Tensor Analysis - Universität Wien · Tensor Analysis Author: Harald Höller ... the simple Euclidean vector space won't provide the ... The nabla operator applied on a tensor of

Managing Feature Models

2.6 Tensor Analysis 133 - Panjab University

Foundations of tensor algebra and analysis 1 Tensor algebra · Symmetry-properties of a fourth-order tensor: A tensor E fulﬁlls minor symmetry if: E = Eti = Eto or E = Edl = Edr.

Machine Learning (CSE 446): Learning as Minimizing Loss ...

Mining Large Time-evolving Graphs Using Matrix and Tensor ...christos/TALKS/SIGMOD-07-tutorial/tensor3.pdf · • Tensor tools •Cseudi sesta • Tensor Basics •Tucker – Tucker

Isospin dependent interactions in RMF & ρ NN tensor coupling