When do neural networks outperform kernel methods?

Transcript of the presentation "When do neural networks outperform kernel methods?" by Song Mei (songmei/Presentation/...)

Page 1:

When do neural networks outperform kernel methods?

Song Mei

Stanford University

June 29, 2020

Joint work with Behrooz Ghorbani, Theodor Misiakiewicz, and Andrea Montanari


Page 2:

Neural tangent model

- Multi-layer NN: $f_N(x; \theta)$, $x \in \mathbb{R}^d$, $\theta \in \mathbb{R}^N$.

- Expanding around $\theta_0$:

  $f_N(x; \theta) = f_N(x; \theta_0) + \langle \theta - \theta_0, \nabla_\theta f_N(x; \theta_0) \rangle + o(\|\theta - \theta_0\|_2).$

- Neural tangent model:

  $f_{\mathrm{NT},N}(x; \beta; \theta_0) = \langle \beta, \nabla_\theta f_N(x; \theta_0) \rangle.$

- Coupled gradient flow:

  $\tfrac{d}{dt}\theta_t = -\nabla_\theta \mathbb{E}[(y - f_N(x; \theta_t))^2], \quad \theta_{t=0} = \theta_0;$
  $\tfrac{d}{dt}\beta_t = -\nabla_\beta \mathbb{E}[(y - f_{\mathrm{NT},N}(x; \beta_t; \theta_0))^2], \quad \beta_{t=0} = 0.$

- Under proper initialization and over-parameterization:

  $\lim_{N \to \infty} |f_N(x; \theta_t) - f_{\mathrm{NT},N}(x; \beta_t)| = 0.$

[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Chizat, Bach, 2018b], ...
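The linearization above is easy to check numerically. Below is a minimal NumPy sketch (not from the talk; the tanh activation, the $1/\sqrt{N}$ scaling, and the size of the displacement are illustrative assumptions) comparing the change in a two-layer network's output under a small parameter displacement $\theta - \theta_0$ against the neural tangent value $\langle \beta, \nabla_\theta f_N(x; \theta_0)\rangle$ at $\beta = \theta - \theta_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 2000  # input dimension, hidden width

# Two-layer net f_N(x; theta) = (1/sqrt(N)) sum_i a_i tanh(<w_i, x>),
# with theta = (a, W); tanh and 1/sqrt(N) are illustrative choices.
a0 = rng.standard_normal(N)
W0 = rng.standard_normal((N, d)) / np.sqrt(d)

def f(x, a, W):
    return a @ np.tanh(W @ x) / np.sqrt(N)

def grad_theta(x):
    """Gradient features nabla_theta f_N(x; theta_0)."""
    pre = W0 @ x
    g_a = np.tanh(pre) / np.sqrt(N)                                 # df/da_i
    g_W = (a0 * (1 - np.tanh(pre) ** 2))[:, None] * x / np.sqrt(N)  # df/dw_i
    return np.concatenate([g_a, g_W.ravel()])

# Small displacement theta - theta_0, playing the role of beta
da, dW = 1e-2 * rng.standard_normal(N), 1e-2 * rng.standard_normal((N, d))
beta = np.concatenate([da, dW.ravel()])

x = rng.standard_normal(d) / np.sqrt(d)
nn_change = f(x, a0 + da, W0 + dW) - f(x, a0, W0)  # f_N(x;theta) - f_N(x;theta_0)
nt_value = beta @ grad_theta(x)                    # f_NT,N(x; beta; theta_0)
print(abs(nn_change - nt_value))                   # higher-order in ||theta - theta_0||
```

The gap is higher-order in $\|\theta - \theta_0\|_2$, which is why wide networks whose parameters barely move during training track the linearized model.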


Page 7:

How about generalization?

- [Arora, Du, Hu, Li, Salakhutdinov, Wang, 2019]: CIFAR-10 experiments. NT: 23% test error; NN: less than 5% test error.

- [Arora, Du, Li, Salakhutdinov, Wang, Yu, 2019]: On small datasets, NT sometimes generalizes better than NN.

- [Shankar, Fang, Guo, Fridovich-Keil, Schmidt, Ragan-Kelley, Recht, 2020], [Li, Wang, Yu, Du, Hu, Salakhutdinov, Arora, 2019]: Smaller gap between NT and NN on CIFAR-10 (about 10% test error for NT).

Sometimes the gap is large; sometimes it is small.


Page 9:

Focus of this talk

When is there a large performance gap between NN and NT?


Page 10:

Two-layer neural networks

Neural networks:

  $\mathcal{F}_{\mathrm{NN},N} = \Big\{ f_N(x; \Theta) = \sum_{i=1}^N a_i \,\sigma(\langle w_i, x\rangle) : a_i \in \mathbb{R},\; w_i \in \mathbb{R}^d \Big\}.$

Linearization:

  $f_N(x; \Theta) = f_N(x; \Theta_0) + \underbrace{\sum_{i=1}^N \delta a_i \,\sigma(\langle w_i^0, x\rangle)}_{\text{top-layer linearization}} + \underbrace{\sum_{i=1}^N a_i^0 \,\sigma'(\langle w_i^0, x\rangle)\,\langle \delta w_i, x\rangle}_{\text{bottom-layer linearization}} + o(\delta).$

Linearized neural networks ($W = (w_i)_{i \in [N]} \sim_{\mathrm{iid}} \mathrm{Unif}(\mathbb{S}^{d-1})$):

  $\mathcal{F}_{\mathrm{RF},N}(W) = \Big\{ f = \sum_{i=1}^N a_i \,\sigma(\langle w_i, x\rangle) : a_i \in \mathbb{R},\; i \in [N] \Big\},$

  $\mathcal{F}_{\mathrm{NT},N}(W) = \Big\{ f = \sum_{i=1}^N \sigma'(\langle w_i, x\rangle)\,\langle b_i, x\rangle : b_i \in \mathbb{R}^d,\; i \in [N] \Big\}.$
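To make the two linearized classes concrete, here is a short sketch (illustrative, not the paper's experimental code; ReLU for $\sigma$, the problem sizes, and the ridge penalty are assumptions) that materializes the RF features $\sigma(\langle w_i, x\rangle)$ and the NT features $\sigma'(\langle w_i, x\rangle)\,x$ with frozen first-layer weights, then fits each class by ridge regression:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, N = 400, 20, 150

# First-layer weights frozen at random initialization: w_i ~ Unif(S^{d-1})
W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

relu = lambda z: np.maximum(z, 0.0)      # sigma (illustrative choice)
drelu = lambda z: (z > 0).astype(float)  # sigma'

def rf_features(X):
    """F_RF: feature i is sigma(<w_i, x>); shape (n, N)."""
    return relu(X @ W.T)

def nt_features(X):
    """F_NT: feature (i, j) is sigma'(<w_i, x>) * x_j; shape (n, N*d)."""
    return (drelu(X @ W.T)[:, :, None] * X[:, None, :]).reshape(len(X), -1)

def ridge(Phi, y, Phi_test, lam=1e-3):
    p = Phi.shape[1]
    c = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
    return Phi_test @ c

X, X_test = rng.standard_normal((n, d)), rng.standard_normal((n, d))
f_star = lambda X: X[:, 0] * X[:, 1]      # placeholder degree-2 target
y, y_test = f_star(X), f_star(X_test)

for name, phi in [("RF", rf_features), ("NT", nt_features)]:
    err = np.mean((ridge(phi(X), y, phi(X_test)) - y_test) ** 2)
    print(name, "test MSE:", err)
```

Note the parameter counts: RF has $N$ trainable coefficients, while NT has $N \cdot d$, which is exactly the asymmetry the approximation results below exploit.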


Page 12:

Spiked features model

- Signal features and junk features:

  $x = (x_1, x_2) \in \mathbb{R}^d, \quad x_1 \in \mathbb{R}^{d_s}, \quad x_2 \in \mathbb{R}^{d - d_s},$
  $d_s = d^\eta, \quad 0 \le \eta \le 1,$
  $\mathrm{Cov}(x_1) = \mathrm{snr}_f \cdot I_{d_s}, \quad \mathrm{Cov}(x_2) = I_{d - d_s},$
  $\mathrm{snr}_f = d^\kappa, \quad 0 \le \kappa < \infty \quad \text{(feature SNR)}.$

- Response depends only on the signal features:

  $y = f_\star(x) + \varepsilon, \quad f_\star(x) = \varphi(x_1).$

[Figure: anisotropic features ($\kappa > 0$, $\mathrm{snr}_f > 1$); $f_\star(x_1, x_2) = \varphi(x_1)$ varies only along $x_1$.]

- Feature SNR: $\mathrm{snr}_f = d^\kappa \ge 1$.
- Effective dimension: $d_{\mathrm{eff}} = d_s \vee (d/\mathrm{snr}_f)$. We have $d_s \le d_{\mathrm{eff}} \le d$.
- Larger $\mathrm{snr}_f$ induces smaller $d_{\mathrm{eff}}$.

More precisely: $x \sim \mathrm{Unif}(\mathbb{S}^{d_s}(r\sqrt{d_s})) \times \mathrm{Unif}(\mathbb{S}^{d - d_s}(\sqrt{d}))$. Generalizable to multi-spheres.
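A minimal sampler for this model, following the uniform-on-spheres definition above (a sketch only; it assumes $r^2 = \mathrm{snr}_f$ so that $\mathrm{Cov}(x_1) = \mathrm{snr}_f \cdot I_{d_s}$, and the target $\varphi$ is an arbitrary placeholder):

```python
import numpy as np

def sample_spiked(n, d, eta, kappa, noise=0.1, rng=None):
    """Draw (X, y) from the spiked features model:
    x1 ~ Unif(sphere of radius r*sqrt(d_s)) with r**2 = snr_f = d**kappa,
    x2 ~ Unif(sphere of radius sqrt(d)), and y = phi(x1) + eps."""
    rng = rng or np.random.default_rng()
    d_s = max(1, round(d ** eta))        # signal dimension d_s = d^eta
    snr_f = d ** kappa                   # feature SNR
    def unif_sphere(m, dim, radius):
        g = rng.standard_normal((m, dim))
        return radius * g / np.linalg.norm(g, axis=1, keepdims=True)
    x1 = unif_sphere(n, d_s, np.sqrt(snr_f * d_s))
    x2 = unif_sphere(n, d - d_s, np.sqrt(d))
    phi = lambda z: np.tanh(z.sum(axis=1) / np.sqrt(snr_f * d_s))  # placeholder
    y = phi(x1) + noise * rng.standard_normal(n)
    return np.concatenate([x1, x2], axis=1), y

X, y = sample_spiked(n=1000, d=1024, eta=0.4, kappa=1.0,
                     rng=np.random.default_rng(3))
print(X.shape, y.shape)   # (1000, 1024) (1000,)
```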


Page 15:

Approximation error with N neurons

Approximation error: $R(f_\star, \mathcal{F}) = \inf_{f \in \mathcal{F}} \|f_\star - f\|_{L^2}^2$.

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2020). Assume $d_{\mathrm{eff}}^{\ell + \delta} \le N \le d_{\mathrm{eff}}^{\ell + 1 - \delta}$ and a "generic condition" on $\sigma$. Then

  $R(f_\star, \mathcal{F}_{\mathrm{RF},N}(W)) = \|P_{>\ell} f_\star\|_{L^2}^2 + o_{d,\mathbb{P}}(1),$
  $R(f_\star, \mathcal{F}_{\mathrm{NT},N}(W)) = \|P_{>\ell+1} f_\star\|_{L^2}^2 + o_{d,\mathbb{P}}(1).$

On the contrary, assume $d_s^{\ell + \delta} \le N \le d_s^{\ell + 1 - \delta}$. Then

  $R(f_\star, \mathcal{F}_{\mathrm{NN},N}) \le \|P_{>\ell+1} f_\star\|_{L^2}^2 + o_d(1).$

Moreover, $R(f_\star, \mathcal{F}_{\mathrm{NN},N})$ is independent of $\mathrm{snr}_f$.

$P_{>\ell}$: projection orthogonal to the space of degree-$\ell$ polynomials.

Page 16:

Approximation error with N neurons

Dimensions: $d_{\mathrm{eff}} = d_s \vee (d/\mathrm{snr}_f)$ and $d_s \le d_{\mathrm{eff}} \le d$.

To approximate a degree-$\ell$ polynomial in $x_1$ (a numeric instance follows below):

- NN needs at most $d_s^{\ell}$ parameters*.
- RF needs $d_{\mathrm{eff}}^{\ell}$ parameters.
- NT needs $d_{\mathrm{eff}}^{\ell - 1} \cdot d$ parameters.

Approximation power: NN ≥ RF ≥ NT.

* If we don't count parameters with value 0.
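Plugging in the scale used in the simulations later ($d = 1024$, $d_s = 16$, i.e. $\eta = 0.4$), a quick numeric instance of these counts (a sketch; constants are ignored and the exponents follow the list above):

```python
d, d_s, ell = 1024, 16, 2           # ambient dim, signal dim, target degree

for kappa in [0.0, 0.5, 1.0]:       # feature SNR exponent: snr_f = d**kappa
    snr_f = d ** kappa
    d_eff = max(d_s, d / snr_f)     # effective dimension d_s v (d / snr_f)
    print(f"kappa={kappa}: d_eff={d_eff:7.1f} | "
          f"NN ~ {d_s ** ell:.0e}, RF ~ {d_eff ** ell:.0e}, "
          f"NT ~ {d_eff ** (ell - 1) * d:.0e} params")
```

At $\kappa = 0$ all three budgets except NN's are of order $d^\ell$, while at $\kappa = 1$ the RF budget collapses to NN's and only NT keeps a factor of $d$.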

Page 17:

Extreme case: low feature SNR

Fix $0 < \eta < 1$; low $\mathrm{snr}_f$: $\kappa = 0$.

To approximate a degree-$\ell$ polynomial in $x_1$:

- NN needs at most $d^{\eta\ell}$ parameters*.
- RF needs $d^{\ell}$ parameters.
- NT needs $d^{\ell}$ parameters.

Approximation power: NN > RF = NT.

[Figure: isotropic features ($\kappa = 0$, $\mathrm{snr}_f = 1$); $f_\star(x_1, x_2) = \varphi(x_1)$.]

* If we don't count parameters with value 0.

Page 18:

Extreme case: high feature SNR

Fix $0 < \eta < 1$; high $\mathrm{snr}_f$: $\kappa \gg 1$.

To approximate a degree-$\ell$ polynomial in $x_1$:

- NN needs at most $d^{\eta\ell}$ parameters*.
- RF needs $d^{\eta\ell}$ parameters.
- NT needs $d^{\eta(\ell - 1) + 1}$ parameters.

Approximation power: NN ≈ RF > NT.

[Figure: anisotropic features ($\kappa > 0$, $\mathrm{snr}_f > 1$); $f_\star(x_1, x_2) = \varphi(x_1)$.]

* If we don't count parameters with value 0.

Page 19:

Numerical simulations

[Figure: normalized risk $R/R_0$ (0 to 1) vs. $\log(\text{Params})/\log(d)$ (1.2 to 2.2) for linear, quadratic, cubic, and quartic targets. Colorbar: $\kappa \in [0, 1]$. Dot-dashed: NN; dashed lines: RF; continuous lines: NT. Dimension: $d = 1024$; eff. dim: $d_s = 16$.]

Conclusion:
(a) Power: NN ≥ RF ≥ NT.
(b) Risk of NN is independent of $\mathrm{snr}_f$.
(c) Larger $\mathrm{snr}_f$ induces larger power of {RF, NT}.

Page 20:

Similar results hold for the generalization error with a finite number of samples n.


Page 21:

Extreme case: low feature SNR

Fix $0 < \eta < 1$; low $\mathrm{snr}_f$: $\kappa = 0$.

To fit a degree-$\ell$ polynomial in $x_1$:

- There exists $\sigma$ such that NN needs at most $d^{\eta\ell}$ samples.
- {RF, NT} kernel methods need $d^{\ell}$ samples.

Potential generalization power: NN > kernel methods.

[Figure: isotropic features ($\kappa = 0$, $\mathrm{snr}_f = 1$); $f_\star(x_1, x_2) = \varphi(x_1)$.]

Page 22:

Extreme case: high feature SNR

Fix $0 < \eta < 1$; high $\mathrm{snr}_f$: $\kappa \gg 1$.

To fit a degree-$\ell$ polynomial in $x_1$:

- There exists $\sigma$ such that NN needs at most $d^{\eta\ell}$ samples.
- {RF, NT} kernel methods need $d^{\eta\ell}$ samples.

Potential generalization power: NN ≈ kernel methods.

[Figure: anisotropic features ($\kappa > 0$, $\mathrm{snr}_f > 1$); $f_\star(x_1, x_2) = \varphi(x_1)$.]
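The two sample-complexity extremes, numerically (a rough sketch that ignores constants and log factors; $\eta = 0.5$ and $\ell = 2$ are arbitrary choices):

```python
d, eta, ell = 1024, 0.5, 2       # ambient dim, signal exponent, target degree

nn_samples = d ** (eta * ell)    # NN (for some sigma): ~ d^(eta*ell)
kernel_low = d ** ell            # kernel methods, kappa = 0:  ~ d^ell
kernel_high = d ** (eta * ell)   # kernel methods, kappa >> 1: ~ d^(eta*ell)
print(f"NN ~ {nn_samples:.0e} | kernel ~ {kernel_low:.0e} (low snr_f), "
      f"~ {kernel_high:.0e} (high snr_f) samples")
```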

Page 23:

Implications

Adding isotropic noise to the features (i.e., decreasing $\mathrm{snr}_f$) makes the performance gap between NN and {RF, NT} larger.
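To see why added noise decreases $\mathrm{snr}_f$: with the variance-ratio definition of $\mathrm{snr}_f$ on the Message slide below, coordinate-wise noise of strength $\tau$ inflates the signal variance from $\mathrm{snr}_f$ to $\mathrm{snr}_f + \tau^2$ and the junk variance from $1$ to $1 + \tau^2$. A back-of-the-envelope sketch (assuming noise is added to every coordinate; not the talk's computation):

```python
# Coordinate-wise noise of strength tau: signal variance snr_f -> snr_f + tau^2,
# junk variance 1 -> 1 + tau^2, so the feature SNR shrinks toward 1.
snr_f = 100.0
for tau in [0.0, 1.0, 2.0, 3.0]:
    print(f"tau={tau}: snr_f = {(snr_f + tau ** 2) / (1 + tau ** 2):6.1f}")
```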


Page 24:

Numerical simulations

[Figure: test-set classification accuracy (roughly 0.82 to 0.92) vs. noise strength $\tau \in [0, 3]$, for NN, RF, NT, RF KRR, and NT KRR on linear and quadratic tasks, with example images shown at $\tau = 0.0, 1.0, 2.0, 3.0$. Underlying assumption: labels depend on low-frequency components of images.]

Page 25:

Message

In the spiked features model, a controlling parameter of the performance gap between NN and {RF, NT} is

  $\mathrm{snr}_f = \text{Feature SNR} = \frac{\text{signal features variance}}{\text{junk features variance}}.$

- Small $\mathrm{snr}_f$: there is a large separation.
- Large $\mathrm{snr}_f$: {RF, NT} perform closer to NN.

Somewhat implicitly, NN first finds the signal features (PCA), and then performs kernel methods on these features.

Note: $\mathrm{snr}_f \ne \mathrm{SNR} = \|f_\star\|_{L^2}^2 / \mathbb{E}[\varepsilon^2]$.
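The "find the signal features, then run a kernel method" picture can be sketched end-to-end (illustrative only, not the talk's experiment; PCA recovers the signal coordinates here precisely because $\mathrm{Cov}(x_1) = \mathrm{snr}_f \cdot I_{d_s}$ dominates $\mathrm{Cov}(x_2) = I$; the target and bandwidth heuristic are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_s, snr_f = 1200, 200, 5, 25.0

# Spiked data: the first d_s coordinates carry variance snr_f, the rest 1
X = rng.standard_normal((n, d))
X[:, :d_s] *= np.sqrt(snr_f)
y = np.tanh(X[:, :d_s].sum(axis=1) / np.sqrt(d_s * snr_f))  # depends on x1 only

# "PCA step": top-d_s eigenvectors of the empirical covariance
_, evecs = np.linalg.eigh(X.T @ X / n)
V = evecs[:, -d_s:]            # estimated signal subspace
Z = X @ V                      # projected features

# "Kernel step": Gaussian kernel ridge regression on the projected features
def krr(Ztr, ytr, Zte, lam=1e-3):
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    D = sq(Ztr, Ztr)
    bw2 = np.median(D)         # median-heuristic bandwidth
    K, Kte = np.exp(-D / bw2), np.exp(-sq(Zte, Ztr) / bw2)
    return Kte @ np.linalg.solve(K + lam * np.eye(len(Ztr)), ytr)

pred = krr(Z[:900], y[:900], Z[900:])
print("test MSE:", np.mean((pred - y[900:]) ** 2), "| target var:", y.var())
```

Running the kernel step on the $d_s$-dimensional projection rather than the raw $d$-dimensional input is exactly what the low-effective-dimension regime buys.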

Page 26:

Thank you!
