
Neural Tangent Kernel: Convergence and Generalization of DNNs

Arthur Jacot, Franck Gabriel, Clément Hongler

Ecole Polytechnique Fédérale de Lausanne

October 28, 2019


Neural Networks

- $L+1$ layers of $n_\ell$ neurons with activations $\alpha^{(\ell)}(x) \in \mathbb{R}^{n_\ell}$:

  $\alpha^{(0)}(x) = x$

  $\tilde\alpha^{(\ell+1)}(x) = \frac{\sqrt{1-\beta^2}}{\sqrt{n_\ell}}\, W^{(\ell)} \alpha^{(\ell)}(x) + \beta\, b^{(\ell)}$

  $\alpha^{(\ell+1)}(x) = \sigma\big(\tilde\alpha^{(\ell+1)}(x)\big)$

- Parameters: connection weights $W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}$ and biases $b^{(\ell)} \in \mathbb{R}^{n_{\ell+1}}$.
- Weights / bias balance: $\beta \in [0,1]$.
- Non-linearity: $\sigma : \mathbb{R} \to \mathbb{R}$.
- Network function: $f_\theta(x) = \tilde\alpha^{(L)}(x)$.
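As a concrete reference, here is a minimal NumPy sketch of this parameterization. The layer widths, the choice σ = tanh, the value of β, and the function names are illustrative, not taken from the slides.

```python
import numpy as np

def init_params(widths, rng):
    """i.i.d. standard normal weights and biases, one (W, b) pair per layer."""
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(params, x, beta=0.1, sigma=np.tanh):
    """NTK parameterization: pre-activations are rescaled by sqrt(1-beta^2)/sqrt(n_l)."""
    alpha = x
    for l, (W, b) in enumerate(params):
        n_l = alpha.shape[0]
        alpha_tilde = np.sqrt(1 - beta**2) / np.sqrt(n_l) * W @ alpha + beta * b
        # the last pre-activation is the network output f_theta(x), no non-linearity
        alpha = sigma(alpha_tilde) if l < len(params) - 1 else alpha_tilde
    return alpha

rng = np.random.default_rng(0)
params = init_params([3, 512, 512, 1], rng)   # n_0 = 3, two hidden layers, n_L = 1
print(forward(params, rng.standard_normal(3)))
```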


Initialization: DNNs as Gaussian processes

- In the infinite-width limit $n_1, \dots, n_{L-1} \to \infty$.
- Initialize the parameters $\theta \sim \mathcal{N}(0, \mathrm{Id}_P)$.
- The preactivations $\tilde\alpha^{(\ell)}_i(\cdot\,;\theta) : \mathbb{R}^{n_0} \to \mathbb{R}$ converge to iid Gaussian processes of covariance $\Sigma^{(\ell)}$ (Cho and Saul, 2009; Lee et al., 2018):

  $\Sigma^{(1)}(x, y) = \frac{1-\beta^2}{n_0}\, x^T y + \beta^2$

  $\Sigma^{(\ell+1)}(x, y) = (1-\beta^2)\, \mathbb{E}_{\alpha \sim \mathcal{N}(0, \Sigma^{(\ell)})}\big[\sigma(\alpha(x))\, \sigma(\alpha(y))\big] + \beta^2$

- The network function $f_\theta = \tilde\alpha^{(L)}$ is also asymptotically Gaussian.
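A small sketch of the $\Sigma^{(\ell)}$ recursion on a pair of inputs, estimating the Gaussian expectation by Monte Carlo. The choice σ = tanh, the value of β, and the sample size are placeholders; only the recursion itself comes from the slide.

```python
import numpy as np

def sigma_recursion(x, y, L, beta=0.1, sigma=np.tanh, n_mc=200_000, seed=0):
    """Iterate the covariance Sigma^(l) restricted to the pair {x, y},
    estimating E[sigma(a(x)) sigma(a(y))] by Monte Carlo."""
    rng = np.random.default_rng(seed)
    n0 = x.size
    # Sigma^(1) on {x, y}: a 2x2 covariance matrix
    K = (1 - beta**2) / n0 * np.array([[x @ x, x @ y],
                                       [x @ y, y @ y]]) + beta**2
    for _ in range(2, L + 1):
        a = rng.multivariate_normal(np.zeros(2), K, size=n_mc)  # samples of (a(x), a(y))
        s = sigma(a)
        K = (1 - beta**2) * (s.T @ s) / n_mc + beta**2
    return K  # K[0, 1] approximates Sigma^(L)(x, y)

x = np.array([1.0, 0.0]); y = np.array([0.0, 1.0])
print(sigma_recursion(x, y, L=4))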


Training: Neural Tangent Kernel

- Training set of size $N$: inputs $X \in \mathbb{R}^{N \times n_0}$ and outputs $Y \in \mathbb{R}^{N \times n_L}$.
- Gradient descent on the MSE $C(\theta) = \frac{1}{N} \|f_\theta(X) - Y\|^2$:

  $\partial_t \theta = -\nabla C(\theta) = \frac{2}{N} \sum_{i=1}^{N} \nabla f_\theta(x_i)\,\big(Y_i - f_\theta(x_i)\big)$

- Evolution of $f_\theta$:

  $\partial_t f_\theta(x) = (\nabla f_\theta(x))^T \partial_t \theta = \frac{2}{N} \sum_{i=1}^{N} (\nabla f_\theta(x))^T \nabla f_\theta(x_i)\,\big(Y_i - f_\theta(x_i)\big)$

- Neural Tangent Kernel (NTK):

  $\Theta^{(L)}(x, y) := (\nabla f_\theta(x))^T\, \nabla f_\theta(y)$
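For a finite network the NTK can be computed directly from this definition. Below is a self-contained sketch using a tiny two-layer scalar network and finite-difference gradients; the network, widths, helper names, and β value are illustrative stand-ins (in practice one would use automatic differentiation).

```python
import numpy as np

def f(theta, x, n_hidden=64, beta=0.1):
    """Tiny 2-layer scalar network in the slides' parameterization,
    with all parameters packed into one flat vector theta (illustrative)."""
    n0 = x.size
    W1 = theta[:n_hidden * n0].reshape(n_hidden, n0)
    b1 = theta[n_hidden * n0:n_hidden * (n0 + 1)]
    W2 = theta[n_hidden * (n0 + 1):n_hidden * (n0 + 2)]
    b2 = theta[-1]
    h = np.tanh(np.sqrt(1 - beta**2) / np.sqrt(n0) * W1 @ x + beta * b1)
    return np.sqrt(1 - beta**2) / np.sqrt(n_hidden) * W2 @ h + beta * b2

def num_grad(fn, theta, eps=1e-5):
    """Central finite-difference gradient of theta -> fn(theta)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = eps
        g[i] = (fn(theta + d) - fn(theta - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n0, n_hidden = 3, 64
theta = rng.standard_normal(n_hidden * (n0 + 2) + 1)
xs = [rng.standard_normal(n0) for _ in range(4)]

# NTK Gram matrix: Theta(x_i, x_j) = <grad_theta f(x_i), grad_theta f(x_j)>
G = np.stack([num_grad(lambda t: f(t, x), theta) for x in xs])
print(G @ G.T)
```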


Asymptotics of the NTK

Theorem (Arora et al., 2019; Jacot et al., 2018)
Let $n_1, \dots, n_{L-1} \to \infty$. For any $t$:

  $\Theta^{(L)}(t) \to \Theta^{(L)}_\infty$

where

  $\Theta^{(L)}_\infty(x, y) = \sum_{\ell=1}^{L} \Sigma^{(\ell)}(x, y)\, \dot\Sigma^{(\ell+1)}(x, y) \cdots \dot\Sigma^{(L)}(x, y)$

with

  $\dot\Sigma^{(\ell)}(x, x') = (1-\beta^2)\, \mathbb{E}_{\alpha \sim \mathcal{N}(0, \Sigma^{(\ell-1)})}\big[\dot\sigma(\alpha(x))\, \dot\sigma(\alpha(x'))\big]$.
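Since $\Theta^{(L)}_\infty$ only depends on the kernels $\Sigma^{(\ell)}$ and $\dot\Sigma^{(\ell)}$, it can be computed by the recursion $\Theta^{(1)} = \Sigma^{(1)}$, $\Theta^{(\ell+1)} = \Theta^{(\ell)}\dot\Sigma^{(\ell+1)} + \Sigma^{(\ell+1)}$, which telescopes into the sum above. Here is a Monte Carlo sketch for a pair of inputs; σ = tanh, β, and the sample size are example choices, not from the slides.

```python
import numpy as np

def limiting_ntk(x, y, L, beta=0.1,
                 sigma=np.tanh, sigma_dot=lambda z: 1 - np.tanh(z)**2,
                 n_mc=200_000, seed=0):
    """Theta^(L)_inf on the pair {x, y} via
       Theta^(1) = Sigma^(1),  Theta^(l+1) = Theta^(l) * Sigma_dot^(l+1) + Sigma^(l+1),
    with the Gaussian expectations estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    n0 = x.size
    Sig = (1 - beta**2) / n0 * np.array([[x @ x, x @ y],
                                         [x @ y, y @ y]]) + beta**2
    Theta = Sig.copy()
    for _ in range(2, L + 1):
        a = rng.multivariate_normal(np.zeros(2), Sig, size=n_mc)
        Sig_next = (1 - beta**2) * (sigma(a).T @ sigma(a)) / n_mc + beta**2
        Sig_dot = (1 - beta**2) * (sigma_dot(a).T @ sigma_dot(a)) / n_mc
        Theta = Theta * Sig_dot + Sig_next   # elementwise: kernels evaluated pairwise
        Sig = Sig_next
    return Theta  # Theta[0, 1] approximates Theta^(L)_inf(x, y)

x = np.array([1.0, 0.0]); y = np.array([0.0, 1.0])
print(limiting_ntk(x, y, L=4))
```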


Convergence

The continuous-time dynamics of the outputs $Y_{\theta(t)} = f_{\theta(t)}(X)$ are described, for each output component $k$, by the linear ODE

  $\partial_t Y_{\theta(t),k} = \frac{2}{N}\, \Theta^{(L)}_\infty(X, X)\, \big(Y_k - Y_{\theta(t),k}\big)$

with solution

  $Y_{\theta(t)} = Y - e^{-\frac{2t}{N} \Theta^{(L)}_\infty(X, X)}\, \big(Y - Y_{\theta(0)}\big)$.

Convergence and loss surface (Jacot et al., 2019a):

1. Eigenvalues: speed of convergence along the corresponding eigenvector.
2. Narrow valley structure:
   2.1 Large eigenvalues are the 'cliffs'.
   2.2 Small eigenvalues are the 'bottom'.
3. Condition number $\kappa = \lambda_{\max}/\lambda_{\min}$.

Figure: Narrow valley in the loss surface.
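A sketch of these closed-form dynamics; the Gram matrix and targets below are random stand-ins for $\Theta^{(L)}_\infty(X, X)$ and $Y$. Each eigenvalue $\lambda_k$ gives an error mode decaying like $e^{-2\lambda_k t / N}$, and the spread $\lambda_{\max}/\lambda_{\min}$ is the condition number above.

```python
import numpy as np

# Hypothetical stand-ins for the NTK Gram matrix Theta^(L)_inf(X, X) and the targets Y
rng = np.random.default_rng(0)
N = 5
A = rng.standard_normal((N, N))
K = A @ A.T + 0.1 * np.eye(N)          # symmetric positive definite
Y = rng.standard_normal(N)
Y0 = np.zeros(N)                        # outputs at initialization

lams, V = np.linalg.eigh(K)             # eigenvalue lambda_k => rate 2*lambda_k/N

def outputs_at(t):
    """Closed-form solution Y(t) = Y - exp(-(2t/N) K) (Y - Y0)."""
    E = V @ np.diag(np.exp(-2 * t / N * lams)) @ V.T
    return Y - E @ (Y - Y0)

print("rates 2*lambda/N :", 2 * lams / N)
print("condition number :", lams.max() / lams.min())
print("outputs at t=50  :", outputs_at(50.0))
print("targets Y        :", Y)
```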


Freeze and Chaos

Two regimes appear (as in Poole et al. (2016); Schoenholz et al. (2017)).

Theorem (Jacot et al., 2019b)
Consider a twice differentiable $\sigma$ satisfying $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[\sigma(z)^2] = 1$ and two inputs $x, y$ with $\|x\| = \|y\| = \sqrt{n_0}$. The characteristic value

  $r_{\sigma,\beta} = (1 - \beta^2)\, \mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[\dot\sigma(z)^2\big]$

determines two regimes:

FREEZE ($r < 1$): for $x, y$ on the sphere of radius $\sqrt{n_0}$,

  $1 \ge \frac{\Theta^{(L)}_\infty(x, y)}{\Theta^{(L)}_\infty(x, x)} \ge 1 - C\,L\,r^L$

CHAOS ($r > 1$): if $x \ne \pm y$,

  $\frac{\Theta^{(L)}_\infty(x, y)}{\Theta^{(L)}_\infty(x, x)} \to 0$ as $L \to \infty$.
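The characteristic value is a one-dimensional Gaussian expectation, so it is easy to estimate numerically. A sketch follows; the Monte Carlo estimator, β value, and the two example non-linearities are illustrative. The √2-rescaled ReLU is the standardized version, matching the r = 1 − β² statement on the next slide.

```python
import numpy as np

def characteristic_value(sigma_dot, beta, n_mc=1_000_000, seed=0):
    """r_{sigma, beta} = (1 - beta^2) * E_{z ~ N(0,1)}[sigma_dot(z)^2], by Monte Carlo."""
    z = np.random.default_rng(seed).standard_normal(n_mc)
    return (1 - beta**2) * np.mean(sigma_dot(z) ** 2)

beta = 0.1
# tanh lands in the FREEZE regime (r < 1) ...
print("tanh        :", characteristic_value(lambda z: 1 - np.tanh(z)**2, beta))
# ... while the standardized ReLU sqrt(2)*max(z, 0) sits at the EDGE, r = 1 - beta^2
print("sqrt(2)*ReLU:", characteristic_value(lambda z: np.sqrt(2) * (z > 0), beta))
```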


Freeze and Chaos: Properties

- FREEZE ($r < 1$):
  - almost constant NTK Gram matrix,
  - bad conditioning $\lambda_{\max}/\lambda_{\min}$,
  - slow convergence.
- CHAOS ($r > 1$):
  - almost identity NTK Gram matrix,
  - fast convergence,
  - generalization?
- EDGE: for the ReLU, $r = 1 - \beta^2 \approx 1$:
  - converges to $\mathrm{Id} + c$,
  - strong constant mode for large $N$,
  - slow convergence.

Figure: The NTK $\Theta^{(6)}(x, y)$ as a function of the angle between $x$ and $y$ in the different regimes ($r = 0.75$, $r = 0.99$, $r = 1.2$, Batch Norm), for $L = 6$.
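A toy illustration of the conditioning point above; the Gram matrix is a synthetic stand-in, not an actual NTK. A nearly constant Gram matrix has one dominating eigenvalue and a huge condition number, while a nearly identity one is well conditioned.

```python
import numpy as np

def toy_gram(N, c):
    """Synthetic NTK-like Gram matrix (1-c)*Id + c*(all ones): off-diagonal entries
    equal to c. c near 1 mimics FREEZE, c near 0 mimics CHAOS."""
    return (1 - c) * np.eye(N) + c * np.ones((N, N))

for c in [0.99, 0.5, 0.01]:
    lams = np.linalg.eigvalsh(toy_gram(100, c))
    print(f"c = {c}: condition number = {lams.max() / lams.min():.1f}")
```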


Chaos: normalization

- Normalize the non-linearity (over $z \sim \mathcal{N}(0,1)$):

  $\bar\sigma(x) = \frac{\sigma(x) - \mathbb{E}[\sigma(z)]}{\sqrt{\mathrm{Var}[\sigma(z)]}}$.

- If $\sigma \ne \mathrm{id}$ we have $\mathbb{E}\big[\dot{\bar\sigma}(z)^2\big] > 1$ (by the Gaussian Poincaré inequality, $\mathrm{Var}[\sigma(z)] \le \mathbb{E}[\dot\sigma(z)^2]$, with equality only for affine $\sigma$), and for small enough $\beta > 0$

  $r = (1 - \beta^2)\, \mathbb{E}\big[\dot{\bar\sigma}(z)^2\big] > 1$.

- Asymptotically equivalent to Layer Normalization.
- For Batch Normalization we only have the following: applying BN after the last non-linearity controls the constant mode,

  $\frac{1}{N^2} \sum_{i,j} \Theta^{(L)}(x_i, x_j) = \beta^2$.
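A sketch of this normalization for σ = tanh, checking that r > 1 after standardization. The Monte Carlo standardization, the finite-difference derivative, and the β value are all illustrative choices.

```python
import numpy as np

def standardize(sigma, n_mc=1_000_000, seed=0):
    """sigma_bar(x) = (sigma(x) - E[sigma(z)]) / sqrt(Var[sigma(z)]), z ~ N(0,1)."""
    z = np.random.default_rng(seed).standard_normal(n_mc)
    mu, std = sigma(z).mean(), sigma(z).std()
    return lambda x: (sigma(x) - mu) / std

beta = 0.1
sigma_bar = standardize(np.tanh)

# Estimate E[sigma_bar'(z)^2] with a finite-difference derivative and check r > 1
z = np.random.default_rng(1).standard_normal(1_000_000)
eps = 1e-4
deriv = (sigma_bar(z + eps) - sigma_bar(z - eps)) / (2 * eps)
r = (1 - beta**2) * np.mean(deriv ** 2)
print(r)   # > 1: the standardized tanh network lies in the chaotic regime
```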


Generative Adversarial Networks

- Two networks (Goodfellow et al., 2014):
  - Generator G: generates data G(z) from a random code z.
  - Discriminator D: classifies real/generated data to guide the generator.
- If the generator lies in the FREEZE regime:
  - Mode collapse: the generator G becomes constant.
- Solutions:
  - Batch Normalization
  - Chaotic generator


Deconvolutional Generator

For G a deconvolutional network with stride s:

Theorem
In the FREEZE regime ($r < 1$):

  $\frac{1 - r^{v+1}}{1 - r^L} - C_1 (v+1)\, r^v \;\le\; \frac{\Theta^{(L)}_{p,p'}(x, y)}{\Theta^{(L)}_{p,p}(x, x)} \;\le\; \frac{1 - r^{v+1}}{1 - r^L}$

for $v$ the maximal value in $\{0, \dots, L-1\}$ such that $s^v$ divides $p - p'$.

- Checkerboard patterns: images which repeat every $s^v$ pixels.
- Solution: Batch Norm / Chaos.
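A small sketch of the quantities in the theorem; the constant $C_1$ is left as a placeholder since the slides do not give it. Here $v$ is the largest power of the stride dividing the pixel offset $p - p'$, and the bound on the NTK ratio grows with $v$: pixels whose offset is divisible by a high power of the stride stay more strongly correlated, which is the checkerboard pattern.

```python
def checkerboard_bound(p, p_prime, s, r, L, C1=0.0):
    """Bounds of the theorem for output pixels p, p' of a stride-s deconvolutional
    generator in the FREEZE regime; C1 is an unspecified constant (placeholder)."""
    # v = largest exponent in {0, ..., L-1} such that s^v divides p - p'
    v = 0
    while v + 1 <= L - 1 and (p - p_prime) % s ** (v + 1) == 0:
        v += 1
    upper = (1 - r ** (v + 1)) / (1 - r ** L)
    lower = upper - C1 * (v + 1) * r ** v
    return v, lower, upper

# Larger offsets divisible by higher powers of the stride give ratios closer to 1
for dp in [1, 2, 4, 8]:
    print(dp, checkerboard_bound(p=dp, p_prime=0, s=2, r=0.8, L=6))
```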


NTK PCA

Figure: NTK principal components in the FREEZE, CHAOS, and Batch Norm regimes.


Conclusion

1. Certain architectures lead to dominating eigenvalues:
   1.1 Constant mode
   1.2 Checkerboard patterns
2. These slow down training.
3. Mode collapse in GANs.
4. Use the NTK to identify and fix them.


Bibliography I

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955.

Cho, Y. and Saul, L. K. (2009). Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Networks. NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, pages 2672–2680.

Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems 31, pages 8580–8589. Curran Associates, Inc.


Bibliography II

Jacot, A., Gabriel, F., and Hongler, C. (2019a). The asymptotic spectrum of the Hessian of DNN throughout training.

Jacot, A., Gabriel, F., and Hongler, C. (2019b). Freeze and chaos for DNNs: an NTK view of batch normalization, checkerboard and boundary effects. CoRR, abs/1907.05715.

Lee, J. H., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2018). Deep Neural Networks as Gaussian Processes. ICLR.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 3360–3368. Curran Associates, Inc.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation.