Neural Tangent Kernel - GitHub Pages · 2019. 12. 5. · Neural Tangent Kernel Convergence and...
Neural Tangent Kernel: Convergence and Generalization of DNNs
Arthur Jacot, Franck Gabriel, Clément Hongler
École Polytechnique Fédérale de Lausanne
October 28, 2019
Neural Networks
- L + 1 layers of n_ℓ neurons with activations α^(ℓ)(x) ∈ R^{n_ℓ}:

  α^(0)(x) = x
  α̃^(ℓ+1)(x) = √((1 − β²)/n_ℓ) · W^(ℓ) α^(ℓ)(x) + β b^(ℓ)
  α^(ℓ+1)(x) = σ(α̃^(ℓ+1)(x))

- Parameters: connection weights W^(ℓ) ∈ R^{n_ℓ × n_{ℓ+1}} and biases b^(ℓ) ∈ R^{n_{ℓ+1}}.
- Weight/bias balance: β ∈ [0, 1].
- Non-linearity: σ : R → R.
- Network function: f_θ(x) = α̃^(L)(x).
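As a concrete illustration, here is a minimal NumPy sketch of this forward pass under the NTK parametrization. The layer widths, β = 0.1, and σ = tanh are arbitrary choices for the example, not values from the talk:

```python
import numpy as np

def init_params(layers, seed=0):
    # layers = [n0, n1, ..., nL]; all parameters are drawn i.i.d. N(0, 1)
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((layers[l], layers[l + 1])),
             rng.standard_normal(layers[l + 1])) for l in range(len(layers) - 1)]

def forward(params, x, beta=0.1, sigma=np.tanh):
    # alpha^(0)(x) = x; each preactivation uses the sqrt((1 - beta^2)/n_l) scaling
    alpha = x
    for l, (W, b) in enumerate(params):
        n_l = W.shape[0]
        pre = np.sqrt((1 - beta**2) / n_l) * (alpha @ W) + beta * b
        # the network function is the last *pre*activation: f_theta(x) = tilde_alpha^(L)(x)
        alpha = sigma(pre) if l < len(params) - 1 else pre
    return alpha

params = init_params([3, 100, 100, 1])
out = forward(params, np.ones(3))
```

The 1/√n_ℓ factor is what keeps the preactivations of order one as the widths grow, which is the regime all the following slides work in.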
Initialization: DNNs as Gaussian processes
- In the infinite-width limit n₁, …, n_{L−1} → ∞, initialize the parameters θ ∼ N(0, Id_P).
- The preactivations α̃_i^(ℓ)(·; θ) : R^{n₀} → R converge to iid Gaussian processes of covariance Σ^(ℓ) (Cho and Saul, 2009; Lee et al., 2018):

  Σ^(1)(x, y) = (1 − β²) xᵀy + β²
  Σ^(ℓ+1)(x, y) = (1 − β²) E_{α∼N(0,Σ^(ℓ))}[σ(α(x)) σ(α(y))] + β²

- The network function f_θ = α̃^(L) is also asymptotically Gaussian.
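The covariance recursion can be evaluated numerically. The sketch below (a Monte Carlo estimate with σ = tanh and an illustrative β; these choices are mine, not the talk's) propagates the 2×2 covariance of the preactivations at two inputs x, y through the layers:

```python
import numpy as np

def sigma_recursion(x, y, L, beta=0.1, sigma=np.tanh, n_mc=100_000, seed=0):
    # Sigma^(1)(x, y) = (1 - beta^2) x^T y + beta^2, as on the slide
    rng = np.random.default_rng(seed)
    b2 = beta**2
    K = np.array([[(1 - b2) * (x @ x) + b2, (1 - b2) * (x @ y) + b2],
                  [(1 - b2) * (x @ y) + b2, (1 - b2) * (y @ y) + b2]])
    for _ in range(L - 1):
        # Monte Carlo estimate of E_{alpha ~ N(0, Sigma^(l))}[sigma(alpha(x)) sigma(alpha(y))]
        z = rng.multivariate_normal([0.0, 0.0], K, size=n_mc)
        s = sigma(z)
        K = (1 - b2) * (s.T @ s) / n_mc + b2
    return K  # K[0, 1] approximates Sigma^(L)(x, y)

K = sigma_recursion(np.array([1.0, 0.0]), np.array([0.0, 1.0]), L=3)
```

For specific non-linearities (e.g. ReLU) the expectation has a closed form (Cho and Saul, 2009); the Monte Carlo version works for any σ.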
Training: Neural Tangent Kernel
- Training set of size N: inputs X ∈ R^{N×n₀} and outputs Y ∈ R^{N×n_L}.
- Gradient descent on the MSE C(θ) = (1/N) ‖f_θ(X) − Y‖²:

  ∂_t θ = −∇C(θ) = (2/N) Σ_{i=1}^N ∇f_θ(x_i) (Y_i − f_θ(x_i))

- Evolution of f_θ:

  ∂_t f_θ(x) = (∇f_θ(x))ᵀ ∂_t θ = (2/N) Σ_{i=1}^N (∇f_θ(x))ᵀ ∇f_θ(x_i) (Y_i − f_θ(x_i))

- Neural Tangent Kernel (NTK):

  Θ^(L)(x, y) := (∇f_θ(x))ᵀ ∇f_θ(y)
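For a finite network this kernel can be computed directly from the definition. The sketch below estimates ∇f_θ by central finite differences; the toy model and inputs are made up for illustration (in practice one would use automatic differentiation):

```python
import numpy as np

def empirical_ntk(f, theta, xs, eps=1e-5):
    # Theta(x, y) = grad_theta f(x) . grad_theta f(y); gradients by central differences
    grads = []
    for x in xs:
        g = np.zeros_like(theta)
        for i in range(theta.size):
            tp, tm = theta.copy(), theta.copy()
            tp[i] += eps
            tm[i] -= eps
            g[i] = (f(tp, x) - f(tm, x)) / (2 * eps)
        grads.append(g)
    G = np.stack(grads)  # N x P matrix of parameter gradients
    return G @ G.T       # N x N NTK Gram matrix

# hypothetical toy model: f_theta(x) = theta_2 * tanh(theta_0 * x + theta_1)
f = lambda th, x: th[2] * np.tanh(th[0] * x + th[1])
theta = np.array([0.5, -0.2, 1.0])
K = empirical_ntk(f, theta, [0.3, -1.0])
```

By construction the Gram matrix is symmetric positive semi-definite, which is what makes the training ODE below well-behaved.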
Asymptotics of the NTK
Theorem (Arora et al., 2019; Jacot et al., 2018)
Let n₁, …, n_{L−1} → ∞. For any t:

  Θ^(L)(t) → Θ^(L)_∞

where

  Θ^(L)_∞(x, y) = Σ_{ℓ=1}^L Σ^(ℓ)(x, y) Σ̇^(ℓ+1)(x, y) ⋯ Σ̇^(L)(x, y)

with, for each ℓ,

  Σ̇^(ℓ)(x, y) = (1 − β²) E_{α∼N(0,Σ^(ℓ−1))}[σ̇(α(x)) σ̇(α(y))].
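The sum of products can be computed by the recursion Θ^(1) = Σ^(1), Θ^(ℓ+1) = Θ^(ℓ) · Σ̇^(ℓ+1) + Σ^(ℓ+1), which telescopes into the formula of the theorem. A Monte Carlo sketch (σ = tanh and β = 0.1 are my illustrative choices):

```python
import numpy as np

def ntk_limit(x, y, L, beta=0.1, n_mc=100_000, seed=0):
    # recursion: Theta^(1) = Sigma^(1); Theta^(l+1) = Theta^(l) * dotSigma^(l+1) + Sigma^(l+1)
    # (elementwise products of 2x2 matrices of kernel values at the pair (x, y))
    rng = np.random.default_rng(seed)
    b2 = beta**2
    sigma = np.tanh
    dsigma = lambda z: 1 - np.tanh(z)**2
    K = np.array([[(1 - b2) * (x @ x) + b2, (1 - b2) * (x @ y) + b2],
                  [(1 - b2) * (x @ y) + b2, (1 - b2) * (y @ y) + b2]])
    Theta = K.copy()
    for _ in range(L - 1):
        z = rng.multivariate_normal([0.0, 0.0], K, size=n_mc)
        Knew = (1 - b2) * (sigma(z).T @ sigma(z)) / n_mc + b2   # Sigma^(l+1)
        Kdot = (1 - b2) * (dsigma(z).T @ dsigma(z)) / n_mc      # dotSigma^(l+1)
        Theta = Theta * Kdot + Knew
        K = Knew
    return Theta  # Theta[0, 1] approximates Theta^(L)_inf(x, y)

Th = ntk_limit(np.array([1.0, 0.0]), np.array([0.0, 1.0]), L=3)
```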
Convergence
The continuous-time dynamics of the outputs Y_{θ(t)} = f_{θ(t)}(X) are described by the linear ODE

  ∂_t Y_{θ(t)} = (2/N) Θ^(L)_∞(X, X) (Y − Y_{θ(t)})

with solution

  Y_{θ(t)} = Y − e^{−(2t/N) Θ^(L)_∞(X,X)} (Y − Y_{θ(0)}).
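A quick numerical check of this closed-form solution, computing the matrix exponential by eigendecomposition (the toy Gram matrix and labels below are made up for the example):

```python
import numpy as np

def outputs_at_time(Theta, Y, Y0, t, N):
    # Y(t) = Y - exp(-(2t/N) Theta) (Y - Y0); Theta is symmetric PSD, so we can
    # form the matrix exponential via its eigendecomposition
    lam, V = np.linalg.eigh(Theta)
    E = (V * np.exp(-2 * t / N * lam)) @ V.T   # V diag(exp(-2t lam / N)) V^T
    return Y - E @ (Y - Y0)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Theta = A @ A.T + 0.5 * np.eye(4)   # toy SPD NTK Gram matrix
Y = rng.standard_normal(4)          # targets
Y0 = rng.standard_normal(4)         # outputs at initialization
```

At t = 0 the formula returns Y_{θ(0)}, and for Θ positive definite the outputs converge to Y, each error mode decaying at rate 2λ_k/N.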
Convergence and loss surface (Jacot et al., 2019a):
1. Eigenvalues: speed of convergence along the corresponding eigenvector.
2. Narrow valley structure:
   2.1 Large eigenvalues are the 'cliffs'.
   2.2 Small eigenvalues are the 'bottom'.
3. Condition number κ = λ_max / λ_min.
Figure: Narrow valley in the loss surface (contour plot, levels 0.1, 1, 10).
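The role of the spectrum can be made concrete on a toy SPD Gram matrix (made up for the example): along the eigenvector of eigenvalue λ_k, the error decays as e^{−2λ_k t / N}, so λ_min sets the overall training time.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Theta = A @ A.T + 0.1 * np.eye(5)   # toy SPD "NTK Gram matrix"

lam = np.linalg.eigvalsh(Theta)      # ascending eigenvalues
kappa = lam[-1] / lam[0]             # condition number kappa = lambda_max / lambda_min
N = 5
slowest_timescale = N / (2 * lam[0]) # time for the slowest error mode to decay by e^{-1}
```

A large κ is exactly the narrow valley: fast decay along the cliffs, slow crawl along the bottom.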
Freeze and Chaos
Two regimes appear (as in Poole et al. (2016); Schoenholz et al. (2017)).

Theorem (Jacot et al., 2019b)
Consider a twice-differentiable σ satisfying E_{z∼N(0,1)}[σ(z)²] = 1 and two inputs x, y with ‖x‖ = ‖y‖ = √n₀. The characteristic value

  r_{σ,β} = (1 − β²) E_{z∼N(0,1)}[σ̇(z)²]

determines two regimes:

FREEZE (r < 1): for x, y on the sphere of radius √n₀,

  1 ≥ Θ^(L)_∞(x, y) / Θ^(L)_∞(x, x) ≥ 1 − C L r^L.

CHAOS (r > 1): if x ≠ ±y,

  Θ^(L)_∞(x, y) / Θ^(L)_∞(x, x) → 0 as L → ∞.
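The characteristic value r_{σ,β} is easy to estimate by Monte Carlo. The sketch below assumes the standardized ReLU √2·max(z, 0) (so that E[σ(z)²] = 1, matching the theorem's normalization) and compares it with tanh; β = 0.1 is an illustrative choice:

```python
import numpy as np

def char_value(dsigma, beta, n_mc=1_000_000, seed=0):
    # r_{sigma,beta} = (1 - beta^2) * E_{z ~ N(0,1)}[sigma'(z)^2], by Monte Carlo
    z = np.random.default_rng(seed).standard_normal(n_mc)
    return (1 - beta**2) * np.mean(dsigma(z)**2)

# standardized ReLU sqrt(2)*max(z,0): derivative sqrt(2)*1_{z>0}, so E[sigma'(z)^2] = 1
r_relu = char_value(lambda z: np.sqrt(2) * (z > 0), beta=0.1)  # ~ 1 - beta^2: the EDGE
r_tanh = char_value(lambda z: 1 - np.tanh(z)**2, beta=0.1)     # < 1: the FREEZE regime
```

This is why the next slide places the (standardized) ReLU at the edge r = 1 − β² ≈ 1, while saturating non-linearities like tanh sit in the freeze.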
Freeze and Chaos: Properties
- FREEZE (r < 1):
  - Almost constant NTK Gram matrix
  - Bad conditioning λ_max/λ_min
  - Slow convergence
- CHAOS (r > 1):
  - Almost identity NTK Gram matrix
  - Fast convergence
  - Generalization?
- EDGE: for the ReLU, r = 1 − β² ≈ 1
  - The Gram matrix converges to Id + c (identity plus a constant mode)
  - Strong constant mode for large N
  - Slow convergence
Figure: The NTK Θ^(6)_∞(x, y) as a function of the angle between x and y, in the different regimes (r = 0.75, r = 0.99, r = 1.2, Batch Norm), for L = 6.
Chaos: normalization
- Normalize the non-linearity (over z ∼ N(0,1)):

  σ̄(x) = (σ(x) − E[σ(z)]) / √Var[σ(z)].

- If σ ≠ id we have E[σ̄′(z)²] > 1, and for small enough β > 0,

  r = (1 − β²) E[σ̄′(z)²] > 1.

- Asymptotically equivalent to Layer Normalization.
- For Batch Normalization we only have: applying BN after the last non-linearity controls the constant mode:

  (1/N²) Σ_{i,j} Θ^(L)(x_i, x_j) = β².
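A numerical check that normalization pushes r above 1 (β = 0.05 is an illustrative choice). The inequality E[σ̄′(z)²] ≥ 1 is the Gaussian Poincaré inequality E[σ′(z)²] ≥ Var[σ(z)], with equality only for affine σ:

```python
import numpy as np

def r_after_norm(sigma, dsigma, beta, n_mc=1_000_000, seed=0):
    # bar_sigma = (sigma - E[sigma(z)]) / sqrt(Var[sigma(z)]),
    # so bar_sigma'(z) = sigma'(z) / sqrt(Var[sigma(z)]) and
    # r = (1 - beta^2) * E[sigma'(z)^2] / Var[sigma(z)]
    z = np.random.default_rng(seed).standard_normal(n_mc)
    return (1 - beta**2) * np.mean(dsigma(z)**2) / np.var(sigma(z))

r = r_after_norm(np.tanh, lambda z: 1 - np.tanh(z)**2, beta=0.05)  # lands in CHAOS
```

So the same tanh that sits in the freeze without normalization crosses into the chaotic regime once standardized, which is the point of this slide.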
Generative Adversarial Networks
- Two networks (Goodfellow et al., 2014):
  - Generator G: generates data G(z) from a random code z.
  - Discriminator D: classifies real/generated data to guide the generator.
- If the generator lies in the FREEZE regime:
  - Mode collapse: the generator G becomes constant.
- Solutions:
  - Batch Normalization
  - Chaotic generator
Deconvolutional Generator
For G a deconvolutional network with stride s.

Theorem
In the FREEZE (r < 1):

  (1 − r^{v+1})/(1 − r^L) − C₁ (v + 1) r^v ≤ Θ^(L)_{p,p′}(x, y) / Θ^(L)_{p,p}(x, x) ≤ (1 − r^{v+1})/(1 − r^L)

for v the maximal value in {0, …, L − 1} such that s^v divides p − p′.

- Checkerboard patterns: images which repeat every s^v pixels.
- Solution: Batch Norm / Chaos.
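The bound can be evaluated directly to see the checkerboard periodicity; the values r = 0.5, s = 2, L = 6 below are illustrative, not from the talk:

```python
def max_power_dividing(s, d, L):
    # v: the largest value in {0, ..., L-1} such that s^v divides d (v = L-1 when d == 0)
    if d == 0:
        return L - 1
    v = 0
    while v + 1 <= L - 1 and d % s ** (v + 1) == 0:
        v += 1
    return v

def upper_bound(r, s, p, q, L):
    # the theorem's upper bound (1 - r^{v+1}) / (1 - r^L) on Theta_{p,q} / Theta_{p,p}
    v = max_power_dividing(s, p - q, L)
    return (1 - r ** (v + 1)) / (1 - r ** L)

# pixel pairs whose offset is divisible by a higher power of the stride correlate more:
vals = [upper_bound(r=0.5, s=2, p=0, q=q, L=6) for q in range(1, 9)]
```

The correlations jump at offsets 2, 4, 8, …: exactly the pattern repeating every s^v pixels.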
NTK PCA
Figure: Leading principal components of the NTK Gram matrix in the FREEZE, CHAOS, and Batch Norm regimes.
Conclusion
1. Certain architectures lead to dominating eigenvalues:
   1.1 Constant mode
   1.2 Checkerboard patterns
2. Slows down training
3. Mode collapse in GANs
4. Use the NTK to identify and fix them
Bibliography I
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955.
Cho, Y. and Saul, L. K. (2009). Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Networks. In NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, pages 2672–2680.
Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems 31, pages 8580–8589. Curran Associates, Inc.
Bibliography II
Jacot, A., Gabriel, F., and Hongler, C. (2019a). The asymptotic spectrum of the Hessian of DNNs throughout training.
Jacot, A., Gabriel, F., and Hongler, C. (2019b). Freeze and chaos for DNNs: an NTK view of batch normalization, checkerboard and boundary effects. CoRR, abs/1907.05715.
Lee, J. H., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2018). Deep Neural Networks as Gaussian Processes. ICLR.
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360–3368. Curran Associates, Inc.
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation.