Neural Tangent Kernel - GitHub Pages · 2019. 12. 5. · Neural Tangent Kernel Convergence and...
Neural Tangent Kernel: Convergence and Generalization of DNNs
Arthur Jacot, Franck Gabriel, Clément Hongler
École Polytechnique Fédérale de Lausanne
October 28, 2019
Neural Networks
- L + 1 layers of n_ℓ neurons with activations α^(ℓ)(x) ∈ R^{n_ℓ}:

  α^(0)(x) = x
  α̃^(ℓ+1)(x) = √((1 − β²)/n_ℓ) · W^(ℓ) α^(ℓ)(x) + β b^(ℓ)
  α^(ℓ+1)(x) = σ(α̃^(ℓ+1)(x))

- Parameters: connection weights W^(ℓ) ∈ R^{n_ℓ × n_{ℓ+1}} and biases b^(ℓ) ∈ R^{n_{ℓ+1}}.
- Weight/bias balance: β ∈ [0, 1].
- Non-linearity: σ : R → R.
- Network function: f_θ(x) = α̃^(L)(x).
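As a concrete illustration, here is a minimal NumPy sketch of this forward pass under the NTK parametrization. The layer widths, β = 0.1, and σ = tanh are arbitrary choices for the example, not values from the talk:

```python
import numpy as np

def init_params(layers, seed=0):
    # layers = [n0, n1, ..., nL]; all parameters are drawn i.i.d. N(0, 1)
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((layers[l], layers[l + 1])),
             rng.standard_normal(layers[l + 1])) for l in range(len(layers) - 1)]

def forward(params, x, beta=0.1, sigma=np.tanh):
    # alpha^(0)(x) = x; each preactivation uses the sqrt((1 - beta^2)/n_l) scaling
    alpha = x
    for l, (W, b) in enumerate(params):
        n_l = W.shape[0]
        pre = np.sqrt((1 - beta**2) / n_l) * (alpha @ W) + beta * b
        # the network function is the last *pre*activation: f_theta(x) = tilde_alpha^(L)(x)
        alpha = sigma(pre) if l < len(params) - 1 else pre
    return alpha

params = init_params([3, 100, 100, 1])
out = forward(params, np.ones(3))
```

The 1/√n_ℓ factor is what keeps the preactivations of order one as the widths grow, which is the regime all the following slides work in.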
Initialization: DNNs as Gaussian processes
- In the infinite-width limit n₁, …, n_{L−1} → ∞, initialize the parameters θ ∼ N(0, Id_P).
- The preactivations α̃_i^(ℓ)(·; θ) : R^{n₀} → R converge to iid Gaussian processes of covariance Σ^(ℓ) (Cho and Saul, 2009; Lee et al., 2018):

  Σ^(1)(x, y) = (1 − β²) xᵀy + β²
  Σ^(ℓ+1)(x, y) = (1 − β²) E_{α∼N(0,Σ^(ℓ))}[σ(α(x)) σ(α(y))] + β²

- The network function f_θ = α̃^(L) is also asymptotically Gaussian.
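The covariance recursion can be evaluated numerically. The sketch below (a Monte Carlo estimate with σ = tanh and an illustrative β; these choices are mine, not the talk's) propagates the 2×2 covariance of the preactivations at two inputs x, y through the layers:

```python
import numpy as np

def sigma_recursion(x, y, L, beta=0.1, sigma=np.tanh, n_mc=100_000, seed=0):
    # Sigma^(1)(x, y) = (1 - beta^2) x^T y + beta^2, as on the slide
    rng = np.random.default_rng(seed)
    b2 = beta**2
    K = np.array([[(1 - b2) * (x @ x) + b2, (1 - b2) * (x @ y) + b2],
                  [(1 - b2) * (x @ y) + b2, (1 - b2) * (y @ y) + b2]])
    for _ in range(L - 1):
        # Monte Carlo estimate of E_{alpha ~ N(0, Sigma^(l))}[sigma(alpha(x)) sigma(alpha(y))]
        z = rng.multivariate_normal([0.0, 0.0], K, size=n_mc)
        s = sigma(z)
        K = (1 - b2) * (s.T @ s) / n_mc + b2
    return K  # K[0, 1] approximates Sigma^(L)(x, y)

K = sigma_recursion(np.array([1.0, 0.0]), np.array([0.0, 1.0]), L=3)
```

For specific non-linearities (e.g. ReLU) the expectation has a closed form (Cho and Saul, 2009); the Monte Carlo version works for any σ.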
Training: Neural Tangent Kernel
- Training set of size N: inputs X ∈ R^{N×n₀} and outputs Y ∈ R^{N×n_L}.
- Gradient descent on the MSE C(θ) = (1/N) ‖f_θ(X) − Y‖²:

  ∂_t θ = −∇C(θ) = (2/N) Σ_{i=1}^N ∇f_θ(x_i) (Y_i − f_θ(x_i))

- Evolution of f_θ:

  ∂_t f_θ(x) = (∇f_θ(x))ᵀ ∂_t θ = (2/N) Σ_{i=1}^N (∇f_θ(x))ᵀ ∇f_θ(x_i) (Y_i − f_θ(x_i))

- Neural Tangent Kernel (NTK):

  Θ^(L)(x, y) := (∇f_θ(x))ᵀ ∇f_θ(y)
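For a finite network this kernel can be computed directly from the definition. The sketch below estimates ∇f_θ by central finite differences; the toy model and inputs are made up for illustration (in practice one would use automatic differentiation):

```python
import numpy as np

def empirical_ntk(f, theta, xs, eps=1e-5):
    # Theta(x, y) = grad_theta f(x) . grad_theta f(y); gradients by central differences
    grads = []
    for x in xs:
        g = np.zeros_like(theta)
        for i in range(theta.size):
            tp, tm = theta.copy(), theta.copy()
            tp[i] += eps
            tm[i] -= eps
            g[i] = (f(tp, x) - f(tm, x)) / (2 * eps)
        grads.append(g)
    G = np.stack(grads)  # N x P matrix of parameter gradients
    return G @ G.T       # N x N NTK Gram matrix

# hypothetical toy model: f_theta(x) = theta_2 * tanh(theta_0 * x + theta_1)
f = lambda th, x: th[2] * np.tanh(th[0] * x + th[1])
theta = np.array([0.5, -0.2, 1.0])
K = empirical_ntk(f, theta, [0.3, -1.0])
```

By construction the Gram matrix is symmetric positive semi-definite, which is what makes the training ODE below well-behaved.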
Asymptotics of the NTK
Theorem (Arora et al., 2019; Jacot et al., 2018)
Let n₁, …, n_{L−1} → ∞. For any t:

  Θ^(L)(t) → Θ^(L)_∞

where

  Θ^(L)_∞(x, y) = Σ_{ℓ=1}^L Σ^(ℓ)(x, y) Σ̇^(ℓ+1)(x, y) ⋯ Σ̇^(L)(x, y)

with, for each ℓ,

  Σ̇^(ℓ)(x, y) = (1 − β²) E_{α∼N(0,Σ^(ℓ−1))}[σ̇(α(x)) σ̇(α(y))].
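The sum of products can be computed by the recursion Θ^(1) = Σ^(1), Θ^(ℓ+1) = Θ^(ℓ) · Σ̇^(ℓ+1) + Σ^(ℓ+1), which telescopes into the formula of the theorem. A Monte Carlo sketch (σ = tanh and β = 0.1 are my illustrative choices):

```python
import numpy as np

def ntk_limit(x, y, L, beta=0.1, n_mc=100_000, seed=0):
    # recursion: Theta^(1) = Sigma^(1); Theta^(l+1) = Theta^(l) * dotSigma^(l+1) + Sigma^(l+1)
    # (elementwise products of 2x2 matrices of kernel values at the pair (x, y))
    rng = np.random.default_rng(seed)
    b2 = beta**2
    sigma = np.tanh
    dsigma = lambda z: 1 - np.tanh(z)**2
    K = np.array([[(1 - b2) * (x @ x) + b2, (1 - b2) * (x @ y) + b2],
                  [(1 - b2) * (x @ y) + b2, (1 - b2) * (y @ y) + b2]])
    Theta = K.copy()
    for _ in range(L - 1):
        z = rng.multivariate_normal([0.0, 0.0], K, size=n_mc)
        Knew = (1 - b2) * (sigma(z).T @ sigma(z)) / n_mc + b2   # Sigma^(l+1)
        Kdot = (1 - b2) * (dsigma(z).T @ dsigma(z)) / n_mc      # dotSigma^(l+1)
        Theta = Theta * Kdot + Knew
        K = Knew
    return Theta  # Theta[0, 1] approximates Theta^(L)_inf(x, y)

Th = ntk_limit(np.array([1.0, 0.0]), np.array([0.0, 1.0]), L=3)
```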
Convergence
The continuous-time dynamics of the outputs Y_{θ(t)} = f_{θ(t)}(X) are described by the linear ODE

  ∂_t Y_{θ(t)} = (2/N) Θ^(L)_∞(X, X) (Y − Y_{θ(t)})

with solution

  Y_{θ(t)} = Y − e^{−(2t/N) Θ^(L)_∞(X,X)} (Y − Y_{θ(0)}).
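A quick numerical check of this closed-form solution, computing the matrix exponential by eigendecomposition (the toy Gram matrix and labels below are made up for the example):

```python
import numpy as np

def outputs_at_time(Theta, Y, Y0, t, N):
    # Y(t) = Y - exp(-(2t/N) Theta) (Y - Y0); Theta is symmetric PSD, so we can
    # form the matrix exponential via its eigendecomposition
    lam, V = np.linalg.eigh(Theta)
    E = (V * np.exp(-2 * t / N * lam)) @ V.T   # V diag(exp(-2t lam / N)) V^T
    return Y - E @ (Y - Y0)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Theta = A @ A.T + 0.5 * np.eye(4)   # toy SPD NTK Gram matrix
Y = rng.standard_normal(4)          # targets
Y0 = rng.standard_normal(4)         # outputs at initialization
```

At t = 0 the formula returns Y_{θ(0)}, and for Θ positive definite the outputs converge to Y, each error mode decaying at rate 2λ_k/N.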
Convergence and loss surface (Jacot et al., 2019a):
1. Eigenvalues: speed of convergence along the corresponding eigenvector.
2. Narrow valley structure:
   2.1 Large eigenvalues are the 'cliffs'.
   2.2 Small eigenvalues are the 'bottom'.
3. Condition number κ = λ_max / λ_min.
Figure: Narrow valley in the loss surface (contour plot, levels 0.1, 1, 10).
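The role of the spectrum can be made concrete on a toy SPD Gram matrix (made up for the example): along the eigenvector of eigenvalue λ_k, the error decays as e^{−2λ_k t / N}, so λ_min sets the overall training time.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Theta = A @ A.T + 0.1 * np.eye(5)   # toy SPD "NTK Gram matrix"

lam = np.linalg.eigvalsh(Theta)      # ascending eigenvalues
kappa = lam[-1] / lam[0]             # condition number kappa = lambda_max / lambda_min
N = 5
slowest_timescale = N / (2 * lam[0]) # time for the slowest error mode to decay by e^{-1}
```

A large κ is exactly the narrow valley: fast decay along the cliffs, slow crawl along the bottom.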
Freeze and Chaos
Two regimes appear (as in Poole et al. (2016); Schoenholz et al. (2017)).

Theorem (Jacot et al., 2019b)
Consider a twice-differentiable σ satisfying E_{z∼N(0,1)}[σ(z)²] = 1 and two inputs x, y with ‖x‖ = ‖y‖ = √n₀. The characteristic value

  r_{σ,β} = (1 − β²) E_{z∼N(0,1)}[σ̇(z)²]

determines two regimes:

FREEZE (r < 1): for x, y on the sphere of radius √n₀,

  1 ≥ Θ^(L)_∞(x, y) / Θ^(L)_∞(x, x) ≥ 1 − C L r^L.

CHAOS (r > 1): if x ≠ ±y,

  Θ^(L)_∞(x, y) / Θ^(L)_∞(x, x) → 0 as L → ∞.
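The characteristic value r_{σ,β} is easy to estimate by Monte Carlo. The sketch below assumes the standardized ReLU √2·max(z, 0) (so that E[σ(z)²] = 1, matching the theorem's normalization) and compares it with tanh; β = 0.1 is an illustrative choice:

```python
import numpy as np

def char_value(dsigma, beta, n_mc=1_000_000, seed=0):
    # r_{sigma,beta} = (1 - beta^2) * E_{z ~ N(0,1)}[sigma'(z)^2], by Monte Carlo
    z = np.random.default_rng(seed).standard_normal(n_mc)
    return (1 - beta**2) * np.mean(dsigma(z)**2)

# standardized ReLU sqrt(2)*max(z,0): derivative sqrt(2)*1_{z>0}, so E[sigma'(z)^2] = 1
r_relu = char_value(lambda z: np.sqrt(2) * (z > 0), beta=0.1)  # ~ 1 - beta^2: the EDGE
r_tanh = char_value(lambda z: 1 - np.tanh(z)**2, beta=0.1)     # < 1: the FREEZE regime
```

This is why the next slide places the (standardized) ReLU at the edge r = 1 − β² ≈ 1, while saturating non-linearities like tanh sit in the freeze.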
Freeze and Chaos: Properties
- FREEZE (r < 1):
  - Almost constant NTK Gram matrix
  - Bad conditioning λ_max/λ_min
  - Slow convergence
- CHAOS (r > 1):
  - Almost identity NTK Gram matrix
  - Fast convergence
  - Generalization?
- EDGE: for the ReLU, r = 1 − β² ≈ 1
  - The Gram matrix converges to Id + c (identity plus a constant mode)
  - Strong constant mode for large N
  - Slow convergence
Figure: The NTK Θ^(6)_∞(x, y) as a function of the angle between x and y, in the different regimes (r = 0.75, r = 0.99, r = 1.2, Batch Norm), for L = 6.
Chaos: normalization
- Normalize the non-linearity (over z ∼ N(0,1)):

  σ̄(x) = (σ(x) − E[σ(z)]) / √Var[σ(z)].

- If σ ≠ id we have E[σ̄′(z)²] > 1, and for small enough β > 0,

  r = (1 − β²) E[σ̄′(z)²] > 1.

- Asymptotically equivalent to Layer Normalization.
- For Batch Normalization we only have: applying BN after the last non-linearity controls the constant mode:

  (1/N²) Σ_{i,j} Θ^(L)(x_i, x_j) = β².
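A numerical check that normalization pushes r above 1 (β = 0.05 is an illustrative choice). The inequality E[σ̄′(z)²] ≥ 1 is the Gaussian Poincaré inequality E[σ′(z)²] ≥ Var[σ(z)], with equality only for affine σ:

```python
import numpy as np

def r_after_norm(sigma, dsigma, beta, n_mc=1_000_000, seed=0):
    # bar_sigma = (sigma - E[sigma(z)]) / sqrt(Var[sigma(z)]),
    # so bar_sigma'(z) = sigma'(z) / sqrt(Var[sigma(z)]) and
    # r = (1 - beta^2) * E[sigma'(z)^2] / Var[sigma(z)]
    z = np.random.default_rng(seed).standard_normal(n_mc)
    return (1 - beta**2) * np.mean(dsigma(z)**2) / np.var(sigma(z))

r = r_after_norm(np.tanh, lambda z: 1 - np.tanh(z)**2, beta=0.05)  # lands in CHAOS
```

So the same tanh that sits in the freeze without normalization crosses into the chaotic regime once standardized, which is the point of this slide.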
Generative Adversarial Networks
- Two networks (Goodfellow et al., 2014):
  - Generator G: generates data G(z) from a random code z.
  - Discriminator D: classifies real/generated data to guide the generator.
- If the generator lies in the FREEZE regime:
  - Mode collapse: the generator G becomes constant.
- Solutions:
  - Batch Normalization
  - Chaotic generator
Deconvolutional Generator
For G a deconvolutional network with stride s.

Theorem
In the FREEZE (r < 1):

  (1 − r^{v+1})/(1 − r^L) − C₁ (v + 1) r^v ≤ Θ^(L)_{p,p′}(x, y) / Θ^(L)_{p,p}(x, x) ≤ (1 − r^{v+1})/(1 − r^L)

for v the maximal value in {0, …, L − 1} such that s^v divides p − p′.

- Checkerboard patterns: images which repeat every s^v pixels.
- Solution: Batch Norm / Chaos.
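The bound can be evaluated directly to see the checkerboard periodicity; the values r = 0.5, s = 2, L = 6 below are illustrative, not from the talk:

```python
def max_power_dividing(s, d, L):
    # v: the largest value in {0, ..., L-1} such that s^v divides d (v = L-1 when d == 0)
    if d == 0:
        return L - 1
    v = 0
    while v + 1 <= L - 1 and d % s ** (v + 1) == 0:
        v += 1
    return v

def upper_bound(r, s, p, q, L):
    # the theorem's upper bound (1 - r^{v+1}) / (1 - r^L) on Theta_{p,q} / Theta_{p,p}
    v = max_power_dividing(s, p - q, L)
    return (1 - r ** (v + 1)) / (1 - r ** L)

# pixel pairs whose offset is divisible by a higher power of the stride correlate more:
vals = [upper_bound(r=0.5, s=2, p=0, q=q, L=6) for q in range(1, 9)]
```

The correlations jump at offsets 2, 4, 8, …: exactly the pattern repeating every s^v pixels.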
NTK PCA
Figure: Leading principal components of the NTK Gram matrix in the FREEZE, CHAOS, and Batch Norm regimes.
Conclusion
1. Certain architectures lead to dominating eigenvalues:
   1.1 Constant mode
   1.2 Checkerboard patterns
2. Slows down training
3. Mode collapse in GANs
4. Use the NTK to identify and fix them
Bibliography I
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955.
Cho, Y. and Saul, L. K. (2009). Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Networks. In NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, pages 2672–2680.
Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems 31, pages 8580–8589. Curran Associates, Inc.
Bibliography II
Jacot, A., Gabriel, F., and Hongler, C. (2019a). The asymptotic spectrum of the Hessian of DNNs throughout training.
Jacot, A., Gabriel, F., and Hongler, C. (2019b). Freeze and chaos for DNNs: an NTK view of batch normalization, checkerboard and boundary effects. CoRR, abs/1907.05715.
Lee, J. H., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2018). Deep Neural Networks as Gaussian Processes. ICLR.
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360–3368. Curran Associates, Inc.
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation.