When do neural networks outperform kernel methods?

Song Mei

Stanford University

June 29, 2020

Joint work with Behrooz Ghorbani, Theodor Misiakiewicz, and Andrea Montanari


Neural tangent model

- Multi-layer NN: f_N(x; \theta), with x \in \mathbb{R}^d, \theta \in \mathbb{R}^N.

- Expanding around \theta_0:

  f_N(x; \theta) = f_N(x; \theta_0) + \langle \theta - \theta_0, \nabla_\theta f_N(x; \theta_0) \rangle + o(\| \theta - \theta_0 \|_2).

- Neural tangent model:

  f_{NT,N}(x; \beta; \theta_0) = \langle \beta, \nabla_\theta f_N(x; \theta_0) \rangle.

- Coupled gradient flows:

  \frac{d}{dt} \theta_t = - \nabla_\theta \mathbb{E}[(y - f_N(x; \theta_t))^2], \quad \theta_{t=0} = \theta_0,
  \frac{d}{dt} \beta_t = - \nabla_\beta \mathbb{E}[(y - f_{NT,N}(x; \beta_t; \theta_0))^2], \quad \beta_{t=0} = 0.

- Under proper initialization and over-parameterization:

  \lim_{N \to \infty} | f_N(x; \theta_t) - f_{NT,N}(x; \beta_t) | = 0.

[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Chizat, Bach, 2018b], ...
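To make the linearization concrete, here is a minimal JAX sketch (not from the slides) that builds the neural tangent features \nabla_\theta f_N(x; \theta_0) for a small two-layer ReLU network; the width, input dimension, and initialization scaling are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def f(params, x):
    """Two-layer network f_N(x; theta) = sum_i a_i * relu(<w_i, x>)."""
    a, W = params
    return jnp.dot(a, jax.nn.relu(W @ x))

key = jax.random.PRNGKey(0)
d, N = 8, 256                                    # illustrative sizes
k1, k2, k3 = jax.random.split(key, 3)
theta0 = (jax.random.normal(k1, (N,)) / jnp.sqrt(N),
          jax.random.normal(k2, (N, d)) / jnp.sqrt(d))
x = jax.random.normal(k3, (d,))

# Neural tangent features: gradient of f with respect to the parameters at theta_0.
grads = jax.grad(f)(theta0, x)                   # pytree with the same structure as theta0
phi_nt = jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])

# Neural tangent model: f_NT(x; beta) = <beta, grad_theta f(x; theta_0)>,
# used on top of the first-order expansion around f(x; theta_0).
beta = jnp.zeros_like(phi_nt)
f_nt = f(theta0, x) + jnp.dot(beta, phi_nt)
```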


How about generalization?

- [Arora, Du, Hu, Li, Salakhutdinov, Wang, 2019]: CIFAR-10 experiments. NT: 23% test error. NN: less than 5% test error.

- [Arora, Du, Li, Salakhutdinov, Wang, Yu, 2019]: On small datasets, NT sometimes generalizes better than NN.

- [Shankar, Fang, Guo, Fridovich-Keil, Schmidt, Ragan-Kelley, Recht, 2020], [Li, Wang, Yu, Du, Hu, Salakhutdinov, Arora, 2019]: Smaller gap between NT and NN on CIFAR-10 (about 10% test error for NT).

Sometimes there is a large gap, while sometimes the gap is small.



Focus of this talk

When is there a large performance gap between NN and NT?


Two-layer neural networks

Neural networks:

  F_{NN,N} = \{ f_N(x; \Theta) = \sum_{i=1}^{N} a_i \sigma(\langle w_i, x \rangle) : a_i \in \mathbb{R}, w_i \in \mathbb{R}^d \}.

Linearization:

  f_N(x; \Theta) = f_N(x; \Theta_0)
      + \sum_{i=1}^{N} \delta a_i \, \sigma(\langle w_i^0, x \rangle)                                  (top-layer linearization)
      + \sum_{i=1}^{N} a_i^0 \, \sigma'(\langle w_i^0, x \rangle) \langle \delta w_i, x \rangle        (bottom-layer linearization)
      + o(\delta).

Linearized neural networks (W = (w_i)_{i \in [N]} \sim_{iid} Unif(S^{d-1})):

  F_{RF,N}(W) = \{ f = \sum_{i=1}^{N} a_i \sigma(\langle w_i, x \rangle) : a_i \in \mathbb{R}, i \in [N] \},

  F_{NT,N}(W) = \{ f = \sum_{i=1}^{N} \sigma'(\langle w_i, x \rangle) \langle b_i, x \rangle : b_i \in \mathbb{R}^d, i \in [N] \}.

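A minimal sketch (not from the slides) of the two linearized function classes, assuming a ReLU activation and first-layer weights drawn uniformly on the sphere; fitting the coefficients (a_i for RF, b_i for NT) by least squares or ridge regression gives the two linearized models.

```python
import jax
import jax.numpy as jnp

def rf_features(X, W):
    """RF features: phi_RF(x)_i = sigma(<w_i, x>); returns shape (n, N)."""
    return jax.nn.relu(X @ W.T)

def nt_features(X, W):
    """NT features: phi_NT(x)_{i,:} = sigma'(<w_i, x>) * x; returns shape (n, N*d)."""
    act = (X @ W.T > 0).astype(X.dtype)                 # ReLU derivative, shape (n, N)
    return (act[:, :, None] * X[:, None, :]).reshape(X.shape[0], -1)

# Example: random first-layer weights on the unit sphere S^{d-1}.
key = jax.random.PRNGKey(0)
d, N, n = 16, 64, 200
kW, kX = jax.random.split(key)
W = jax.random.normal(kW, (N, d))
W = W / jnp.linalg.norm(W, axis=1, keepdims=True)       # rows on S^{d-1}
X = jax.random.normal(kX, (n, d))
print(rf_features(X, W).shape, nt_features(X, W).shape)  # (200, 64) (200, 1024)
```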


Spiked features model

- Signal features and junk features:

  x = (x_1, x_2) \in \mathbb{R}^d, \quad x_1 \in \mathbb{R}^{d_s}, \quad x_2 \in \mathbb{R}^{d - d_s},
  d_s = d^{\eta}, \quad 0 \le \eta \le 1,
  Cov(x_1) = snr_f \cdot I_{d_s}, \quad Cov(x_2) = I_{d - d_s},
  snr_f = d^{\kappa}, \quad 0 \le \kappa < \infty \quad (feature SNR).

- Response depends only on the signal features:

  y = f_\star(x) + \varepsilon, \quad f_\star(x) = \varphi(x_1).

Figure: the target f_\star(x_1, x_2) = \varphi(x_1) varies only along the signal features x_1; the features are isotropic when \kappa = 0 (snr_f = 1) and anisotropic when \kappa > 0 (snr_f > 1).

- Feature SNR: snr_f = d^{\kappa} \ge 1.
- Effective dimension: d_eff = d_s \vee (d / snr_f). We have d_s \le d_eff \le d.
- Larger snr_f induces smaller d_eff.

More precisely: x \sim Unif(S^{d_s}(r \sqrt{d_s})) \times Unif(S^{d - d_s}(\sqrt{d})). Generalizable to multiple spheres.

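A minimal sketch (not from the slides) of a data generator for the spiked features model; for simplicity it uses independent Gaussian coordinates rather than the product-of-spheres distribution, and the target \varphi, exponents, and noise level are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def phi(x1):
    """Illustrative target: a degree-2 polynomial of the signal features."""
    return jnp.sum(x1, axis=1) + 0.5 * (jnp.sum(x1 ** 2, axis=1) - x1.shape[1])

def sample_spiked(key, n, d, eta=0.5, kappa=1.0, noise_std=0.1):
    d_s = int(round(d ** eta))                 # number of signal coordinates, d_s = d^eta
    snr_f = d ** kappa                         # feature SNR, snr_f = d^kappa
    k1, k2, k3 = jax.random.split(key, 3)
    x1 = jnp.sqrt(snr_f) * jax.random.normal(k1, (n, d_s))    # Cov(x1) = snr_f * I
    x2 = jax.random.normal(k2, (n, d - d_s))                   # Cov(x2) = I
    X = jnp.concatenate([x1, x2], axis=1)
    y = phi(x1) + noise_std * jax.random.normal(k3, (n,))      # y depends only on x1
    return X, y

X, y = sample_spiked(jax.random.PRNGKey(0), n=500, d=256)
```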


Approximation error with N neurons

Approximation error: R(f_\star; F) = \inf_{f \in F} \| f_\star - f \|_{L^2}^2.

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2020)
Assume d_eff^{\ell + \delta} \le N \le d_eff^{\ell + 1 - \delta} and a "generic condition" on \sigma. We have

  R(f_\star; F_{RF,N}(W)) = \| P_{> \ell} f_\star \|_{L^2}^2 + o_{d,P}(1),
  R(f_\star; F_{NT,N}(W)) = \| P_{> \ell + 1} f_\star \|_{L^2}^2 + o_{d,P}(1).

On the contrary, assume d_s^{\ell + \delta} \le N \le d_s^{\ell + 1 - \delta}. We have

  R(f_\star; F_{NN,N}) \le \| P_{> \ell + 1} f_\star \|_{L^2}^2 + o_d(1).

Moreover, R(f_\star; F_{NN,N}) is independent of snr_f.

P_{> \ell}: projection orthogonal to the space of degree-\ell polynomials.

Approximation error with N neurons

Effective dimension: d_eff = d_s \vee (d / snr_f), with d_s \le d_eff \le d.

To approximate a degree-\ell polynomial in x_1:

- NN needs at most d_s^{\ell} parameters*.
- RF needs d_eff^{\ell} parameters.
- NT needs d_eff^{\ell - 1} \cdot d parameters.

Approximation power: NN \ge RF \ge NT.

* If we don't count parameters with value 0.
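As a quick sanity check on these scalings, a short Python sketch (not from the slides) computes the three parameter counts for given d, \eta, \kappa, and degree \ell; the specific values below are illustrative, and constants are ignored.

```python
def param_counts(d, eta, kappa, ell):
    """Parameter counts needed to approximate a degree-ell polynomial of x_1."""
    d_s = d ** eta                      # signal dimension d_s = d^eta
    snr_f = d ** kappa                  # feature SNR
    d_eff = max(d_s, d / snr_f)         # effective dimension
    return {
        "NN": d_s ** ell,               # at most d_s^ell parameters (up to zeros)
        "RF": d_eff ** ell,             # d_eff^ell parameters
        "NT": d_eff ** (ell - 1) * d,   # d_eff^(ell-1) * d parameters
    }

# Example: d = 1024, d_s = d^0.4 = 16, degree-2 target.
print(param_counts(1024, eta=0.4, kappa=0.0, ell=2))   # low snr_f: RF and NT scale like d^2
print(param_counts(1024, eta=0.4, kappa=1.0, ell=2))   # high snr_f: RF scales like d_s^2
```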

Extreme case: low feature SNR

Fix 0 < \eta < 1, low snr_f: \kappa = 0.

To approximate a degree-\ell polynomial in x_1:

- NN needs at most d^{\eta \ell} parameters*.
- RF needs d^{\ell} parameters.
- NT needs d^{\ell} parameters.

Approximation power: NN > RF = NT.

Figure: Isotropic features: \kappa = 0, snr_f = 1.

* If we don't count parameters with value 0.

Extreme case: high feature SNR

Fix 0 < \eta < 1, high snr_f: \kappa \ge 1.

To approximate a degree-\ell polynomial in x_1:

- NN needs at most d^{\eta \ell} parameters*.
- RF needs d^{\eta \ell} parameters.
- NT needs d^{\eta (\ell - 1) + 1} parameters.

Approximation power: NN \approx RF > NT.

Figure: Anisotropic features: \kappa > 0, snr_f > 1.

* If we don't count parameters with value 0.

Numerical simulations

Figure: normalized risk R / R_0 versus log(Params) / log(d) for linear, quadratic, cubic, and quartic targets. Dot-dashed: NN; dashed lines: RF; continuous lines: NT. Colorbar: \kappa \in [0, 1]. Dimension d = 1024, d_s = 16.

Conclusions:
(a) Power: NN \ge RF \ge NT.
(b) Risk of NN is independent of snr_f.
(c) Larger snr_f induces larger power of {RF, NT}.


Similar results hold for the generalization error with finite sample size n.

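As a stand-in for the finite-sample kernel methods compared in the next slides, here is a minimal kernel ridge regression sketch (not from the slides) using the kernel induced by a feature map such as the RF or NT features above; the ridge parameter lam is an illustrative choice.

```python
import jax.numpy as jnp

def krr_fit(Phi_train, y, lam=1e-3):
    """Dual solution alpha = (K + lam I)^{-1} y, with K = Phi Phi^T."""
    K = Phi_train @ Phi_train.T
    n = K.shape[0]
    return jnp.linalg.solve(K + lam * jnp.eye(n), y)

def krr_predict(Phi_test, Phi_train, alpha):
    """Prediction f(x) = k(x, X_train) alpha, with k(x, x') = <Phi(x), Phi(x')>."""
    return (Phi_test @ Phi_train.T) @ alpha
```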

Extreme case: low feature SNR

Fix 0 < \eta < 1, low snr_f: \kappa = 0.

To fit a degree-\ell polynomial in x_1:

- For some activation \sigma, NN needs at most d^{\eta \ell} samples.
- {RF, NT} kernel methods need d^{\ell} samples.

Potential generalization power: NN > kernel methods.

Figure: Isotropic features: \kappa = 0, snr_f = 1.


Extreme case: high feature SNR

Fix 0 < \eta < 1, high snr_f: \kappa \ge 1.

To fit a degree-\ell polynomial in x_1:

- For some activation \sigma, NN needs at most d^{\eta \ell} samples.
- {RF, NT} kernel methods need d^{\eta \ell} samples.

Potential generalization power: NN \approx kernel methods.

Figure: Anisotropic features: \kappa > 0, snr_f > 1.


Implications

Adding isotropic noise to the features (i.e., decreasing snr_f) enlarges the performance gap between NN and {RF, NT}.

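The noisy-feature experiment on the next slide can be mimicked with a short sketch (not from the slides): adding isotropic Gaussian noise of strength \tau to the inputs lowers snr_f, so larger \tau should widen the gap between NN and {RF, NT}. The input shape and \tau grid below are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def add_isotropic_noise(key, X, tau):
    """Return X + tau * g with g ~ N(0, I); this decreases the feature SNR."""
    return X + tau * jax.random.normal(key, X.shape)

# Example: sweep the noise strength as in the figure below.
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (100, 32 * 32))       # stand-in for flattened images
for i, tau in enumerate((0.0, 1.0, 2.0, 3.0)):
    X_noisy = add_isotropic_noise(jax.random.fold_in(key, i), X, tau)
```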

Numerical simulations

Figure: test-set classification accuracy versus noise strength \tau (\tau = 0.0, 1.0, 2.0, 3.0) for NN, RF, NT, RF KRR, and NT KRR, with linear and quadratic baselines. Underlying assumption: labels depend on the low-frequency components of the images.


Message

In the spiked features model, a controlling parameter of the performance gap between NN and {RF, NT} is

  snr_f = feature SNR = (variance of the signal features) / (variance of the junk features).

- Small snr_f: there is a large separation.
- Large snr_f: {RF, NT} perform closer to NN.

Somewhat implicitly, NN first finds the signal features (PCA), and then performs kernel methods on these features.

Note that snr_f \ne SNR = \| f_\star \|_{L^2}^2 / \mathbb{E}[\varepsilon^2].


Thank you!
