
Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

Song Mei, Theodor Misiakiewicz, and Andrea Montanari

Stanford University

June 26, 2019

COLT 2019


Gradient dynamics of two-layers neural network

- Two-layers neural network (a small code sketch follows this slide):

  \Theta = (\theta_1, \ldots, \theta_N), \qquad \theta_i = (a_i, w_i) \in \mathbb{R}^D,

  \hat{y}(x; \Theta) = \frac{1}{N} \sum_{i=1}^N a_i\, \sigma(\langle w_i, x \rangle).

- Risk function:

  R_N(\Theta) = \mathbb{E}_{x,y}\Big[ \Big( y - \frac{1}{N} \sum_{i=1}^N a_i\, \sigma(\langle w_i, x \rangle) \Big)^2 \Big].

- SGD / gradient flow:

  \Theta^{k+1} = \Theta^k - \varepsilon_k\, \nabla \ell_N(\Theta^k; x_k, y_k), \qquad \frac{d}{dt}\Theta^t = -\nabla R_N(\Theta^t).
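For concreteness, here is a minimal Python sketch (my addition, not from the slides) of the 1/N-scaled network \hat{y}(x; \Theta) and of one SGD step on the squared loss; the choice \sigma = tanh and all function names are illustrative assumptions.

import numpy as np

def forward(a, W, x):
    # y_hat(x; Theta) = (1/N) * sum_i a_i * sigma(<w_i, x>), with sigma = tanh
    return np.mean(a * np.tanh(W @ x))          # a: (N,), W: (N, d), x: (d,)

def sgd_step(a, W, x, y, lr):
    # One step Theta^{k+1} = Theta^k - lr * grad ell(Theta^k; x, y), ell = (y - y_hat)^2
    z = np.tanh(W @ x)                          # hidden activations, shape (N,)
    residual = forward(a, W, x) - y             # scalar prediction error
    N = a.shape[0]
    grad_a = 2.0 * residual * z / N
    grad_W = 2.0 * residual * (a * (1.0 - z**2))[:, None] * x[None, :] / N
    return a - lr * grad_a, W - lr * grad_W

A full training loop just repeats sgd_step over fresh samples (x_k, y_k).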


Two-layers neural networks

[Figure: a two-layers network with an input layer, a hidden layer with first-layer weights w_1, ..., w_4, and an output layer with second-layer weights a_1, ..., a_4.]

Figure: Architecture for N = 4, \theta_i = (a_i, w_i).


Related literature

- Mean field distributional dynamics:

  \partial_t \rho_t(\theta) = \nabla \cdot \big( \nabla \Psi(\theta; \rho_t)\, \rho_t \big).

  - Non-linear dynamics. Converges in some cases.
  - [Mei, Montanari, Nguyen, 2018], [Rotskoff and Vanden-Eijnden, 2018], [Chizat and Bach, 2018a], [Sirignano and Spiliopoulos, 2018].

- Neural tangent kernel (NTK) dynamics (a numerical illustration follows this slide):

  \partial_t \| u_t \|_2^2 = -\langle u_t, H u_t \rangle.

  - Linear dynamics. Always converges to 0 empirical risk.
  - [Jacot, Gabriel, and Hongler, 2018], [Li and Liang, 2018], [Du, Zhai, Poczos, Singh, 2018].
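As a rough numerical illustration (my addition, not from the slides) of the second point: with a fixed positive-definite kernel Gram matrix H, the linear residual dynamics \partial_t u_t = -H u_t only shrinks the residual, so \|u_t\|_2^2 decays towards 0. The matrix below is a random stand-in for the NTK Gram matrix.

import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
H = A @ A.T / n + 1e-3 * np.eye(n)     # random positive-definite stand-in for the NTK Gram matrix
u = rng.standard_normal(n)             # initial residual u_0

dt = 0.01
for step in range(2001):
    u = u - dt * (H @ u)               # forward-Euler step of du/dt = -H u
    if step % 500 == 0:
        print(step, np.linalg.norm(u) ** 2)   # ||u_t||_2^2 decreases monotonically toward 0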


This work

(a) An improved bound for the SGD-PDE interpolation.

(b) The relationship between the mean field limit and the kernel limit.


SGD and distributional dynamics (DD)

- SGD for \Theta^k, with (x_k, y_k) \sim P_{x,y}, i \in [N] (a particle-level sketch of the limiting dynamics follows this slide):

  \theta_i^{k+1} = \theta_i^k - 2 s_k N\, \nabla_{\theta_i} \ell_N(\Theta^k; x_k, y_k).   (SGD)

- [MMN18]: with s_k = \varepsilon\, \xi(k\varepsilon), k = t/\varepsilon, N \to \infty, \varepsilon \to 0:

  \hat{\rho}^{(N)}_k \equiv \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i^k} \;\Rightarrow\; \rho_t \in \mathcal{P}(\mathbb{R}^D), \qquad t \in [0, \infty).

- Distributional dynamics (DD) for \rho_t:

  \partial_t \rho_t(\theta) = 2 \xi(t)\, \nabla_\theta \cdot \big( \rho_t(\theta)\, \nabla_\theta \Psi(\theta; \rho_t) \big),   (DD)

  where \Psi(\theta; \rho) = \frac{\delta R(\rho)}{\delta \rho(\theta)} = V(\theta) + \int U(\theta, \theta')\, \rho(d\theta').
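The (DD) equation describes a cloud of interacting particles driven by the velocity field -2\xi(t)\nabla_\theta\Psi. The Python sketch below (my addition) discretizes that particle system with toy choices V(\theta) = \|\theta\|^2/2 and U(\theta, \theta') = \exp(-\|\theta - \theta'\|^2/2); the slides' V and U come from the population risk and are not written out here.

import numpy as np

rng = np.random.default_rng(1)
N, D = 200, 2
theta = rng.standard_normal((N, D))           # particles theta_i ~ rho_0

def grad_V(th_i):                             # toy confining potential V(theta) = ||theta||^2 / 2
    return th_i

def grad1_U(th_i, th_all):                    # gradient in the first argument of the toy
    diff = th_i - th_all                      # interaction U(t, t') = exp(-||t - t'||^2 / 2),
    w = np.exp(-0.5 * np.sum(diff**2, axis=1))
    return -(w[:, None] * diff).mean(axis=0)  # averaged over the empirical measure

dt, xi = 0.05, 1.0                            # constant schedule xi(t) = 1 for simplicity
for step in range(100):
    forces = np.array([grad_V(t_i) + grad1_U(t_i, theta) for t_i in theta])
    theta = theta - dt * 2.0 * xi * forces    # Euler step of d theta_i/dt = -2 xi grad Psi(theta_i; rho_hat)

# The empirical measure rho_hat_t = (1/N) sum_i delta_{theta_i} tracks rho_t; for example,
# a particle average of the test function f(theta) = ||theta||:
print(np.mean(np.linalg.norm(theta, axis=1)))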


An improved bound

Assumption
(i) \sigma bounded; (ii) \nabla_w \sigma(\langle x, w \rangle) sub-Gaussian; (iii) \nabla \Psi bounded and Lipschitz.

Theorem (Mei, Misiakiewicz, Montanari, 2019)
Let (\theta_i^0)_{i \le N} \sim_{iid} \rho_0. Then, for every bounded Lipschitz f, with high probability,

  \sup_{t \le T} \Big| \frac{1}{N} \sum_{i=1}^N f\big(\theta_i^{\lfloor t/\varepsilon \rfloor}\big) - \int f(\theta)\, \rho_t(d\theta) \Big| \;\le\; \mathrm{Func}(T) \cdot \sqrt{\frac{1}{N} \vee D\varepsilon}.

An example: learning a spherically symmetric Lipschitz function using N = O_d(1) neurons and n = O_d(d) samples. (A small numerical check of the 1/\sqrt{N} term follows this slide.)

Caveat: this improved bound is not always strong; in other cases the factor Func(T) could potentially be huge.
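The \sqrt{1/N} part of the bound can already be seen at initialization: for iid \theta_i^0 \sim \rho_0 and a bounded Lipschitz f, the gap between the particle average and \int f\, d\rho_0 shrinks like 1/\sqrt{N}. The short Python check below (my addition; it illustrates only this t = 0 fluctuation, not the uniform-in-time statement) uses \rho_0 = N(0, I_D) and f(\theta) = tanh(\theta_1) as illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
D = 20
f = lambda th: np.tanh(th[..., 0])            # a bounded Lipschitz test function

# Reference value of int f d rho_0 for rho_0 = N(0, I_D), via a large sample.
Ef = f(rng.standard_normal((200_000, D))).mean()

for N in [100, 1_000, 10_000]:
    gaps = []
    for _ in range(200):                      # repeat over independent initializations
        theta0 = rng.standard_normal((N, D))  # (theta_i^0)_{i <= N} iid from rho_0
        gaps.append(abs(f(theta0).mean() - Ef))
    print(N, np.mean(gaps))                   # the average gap shrinks roughly like 1/sqrt(N)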


This work

(a) An improved bound for the SGD-PDE interpolation.

(b) The relationship between the mean field limit and the kernel limit.


Recovering the kernel limit

The same idea appeared in [Chizat and Bach, 2018b], where the kernel limit was called "lazy training".

Setup (a small numerical sketch of the \alpha-scaling follows this slide):

  Prediction function:  f_{\alpha,N}(x; \theta) = \frac{\alpha}{N} \sum_{j=1}^N \sigma_*(x; \theta_j);

  Risk function:  R_{\alpha,N}(\theta) = \mathbb{E}_x\Big[ \big( f(x) - f_{\alpha,N}(x; \theta) \big)^2 \Big];

  Gradient flow:  \frac{d\theta_j^t}{dt} = -\frac{N}{2\alpha^2}\, \nabla_{\theta_j} R_{\alpha,N}(\theta^t).
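To make the \alpha-scaling concrete, here is a self-contained Python toy (my addition; the target, the unit \sigma_*(x; \theta) = a\,\mathrm{tanh}(\langle w, x \rangle), the symmetrized initialization and all constants are illustrative assumptions). It discretizes the gradient flow above on an empirical risk and reports the final risk together with how far the parameters moved: for large \alpha the risk still drops while the parameters barely move, which is the "lazy" behaviour.

import numpy as np

rng = np.random.default_rng(3)
d, N, n = 5, 200, 500
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])                              # toy target f(x) = sin(x_1)

def train(alpha, steps=200, dt=0.5):
    # Symmetrized initialization (a standard trick in lazy-training experiments)
    # so that f_{alpha,N}( . ; theta_0) = 0 regardless of alpha.
    a_half, W_half = rng.standard_normal(N // 2), rng.standard_normal((N // 2, d))
    a, W = np.concatenate([a_half, -a_half]), np.vstack([W_half, W_half])
    a0, W0 = a.copy(), W.copy()
    for _ in range(steps):
        Z = np.tanh(X @ W.T)                     # (n, N) hidden features
        r = (alpha / N) * (Z @ a) - y            # residual f_{alpha,N}(x) - f(x) on the data
        grad_a = (2 * alpha / (n * N)) * (Z.T @ r)
        grad_W = (2 * alpha / (n * N)) * a[:, None] * (((1 - Z**2) * r[:, None]).T @ X)
        lr = dt * N / (2 * alpha**2)             # Euler step of d theta/dt = -(N / (2 alpha^2)) grad R
        a, W = a - lr * grad_a, W - lr * grad_W
    risk = np.mean(((alpha / N) * (np.tanh(X @ W.T) @ a) - y) ** 2)
    move = np.sqrt(np.sum((a - a0) ** 2) + np.sum((W - W0) ** 2))
    return risk, move

for alpha in [1.0, 10.0, 100.0]:
    print(alpha, train(alpha))                   # risk drops for every alpha; movement shrinks ~ 1/alpha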


The coupled dynamics

Denote \rho^{\alpha,N}_t = \frac{1}{N} \sum_{j=1}^N \delta_{\theta_j^t}. Distributional dynamics:

  \partial_t \rho^{\alpha,N}_t = \frac{1}{\alpha}\, \nabla_\theta \cdot \big( \rho^{\alpha,N}_t\, \nabla_\theta \Psi(\theta; \rho^{\alpha,N}_t) \big).

Denote u^{\alpha,N}_t(z) = f(z) - f_{\alpha,N}(z; \theta^t). Residual dynamics:

  \partial_t \| u^{\alpha,N}_t \|_{L^2}^2 = -\big\langle u^{\alpha,N}_t,\; H_{\rho^{\alpha,N}_t}\, u^{\alpha,N}_t \big\rangle.

Here

  H_\rho(x, z) \equiv \int \big\langle \nabla_\theta \sigma_*(x; \theta),\; \nabla_\theta \sigma_*(z; \theta) \big\rangle\, \rho(d\theta),

  \Psi(\theta; \rho^{\alpha,N}) = -\mathbb{E}_x\big[ u^{\alpha,N}_t(x)\, \sigma_*(x; \theta) \big].
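The kernel H_\rho is just an average of gradient inner products over \rho. For the empirical measure of N particles and the illustrative unit \sigma_*(x; \theta) = a\,\mathrm{tanh}(\langle w, x \rangle) (my choice, not fixed by the slides), it can be evaluated directly:

import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 10
a = rng.standard_normal(N)                    # output weights a_j
W = rng.standard_normal((N, d))               # input weights w_j, so theta_j = (a_j, w_j)

def H_emp(x, z):
    # H_rho(x, z) = (1/N) sum_j <grad_theta sigma*(x; theta_j), grad_theta sigma*(z; theta_j)>
    tx, tz = np.tanh(W @ x), np.tanh(W @ z)
    # grad wrt a contributes tanh(<w,x>) tanh(<w,z>); grad wrt w contributes a^2 sech^2 sech^2 <x,z>
    per_particle = tx * tz + a**2 * (1 - tx**2) * (1 - tz**2) * (x @ z)
    return per_particle.mean()

x, z = rng.standard_normal(d), rng.standard_normal(d)
print(H_emp(x, z), H_emp(x, x))               # any Gram matrix built from H_emp is PSD by construction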


The mean field limit and kernel limit

  \partial_t \rho^{\alpha,N}_t = \frac{1}{\alpha}\, \nabla_\theta \cdot \big( \rho^{\alpha,N}_t\, [\nabla_\theta \Psi(\theta; \rho^{\alpha,N}_t)] \big),

  \partial_t \| u^{\alpha,N}_t \|_{L^2}^2 = -\big\langle u^{\alpha,N}_t,\; H_{\rho^{\alpha,N}_t}\, u^{\alpha,N}_t \big\rangle.

- The mean field limit: fix \alpha = O(1) and let N \to \infty.
- The kernel limit: let \alpha \to \infty after N \to \infty.
- The benefit of the kernel limit: the kernel does not change, so the residual dynamics becomes self-contained, and the empirical risk converges to 0 (a short calculation follows this slide). For the full derivation, see Appendix H of [Mei, Misiakiewicz, Montanari, 2019].
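To spell out the last bullet (my addition, under the extra assumption that the frozen kernel H_{\rho_0} has smallest eigenvalue \lambda_{\min} > 0 on the relevant finite-dimensional space of residuals):

  \partial_t \| u_t \|^2 = -\langle u_t, H_{\rho_0} u_t \rangle \le -\lambda_{\min} \| u_t \|^2
  \quad \Longrightarrow \quad
  \| u_t \|^2 \le e^{-\lambda_{\min} t}\, \| u_0 \|^2 \longrightarrow 0,

by Grönwall's inequality; the argument uses exactly that in the kernel limit \rho^{\alpha,N}_t stays at \rho_0, so the same kernel controls the decay for all t.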


Summary

- Gave an interpretation of the neural tangent kernel from the mean field point of view. (Also in [Chizat and Bach, 2018b].)

- Whether the NTK can explain the success of neural networks is still an open problem. [Arora, Du, Hu, Li, Salakhutdinov, Wang, 2019], [Ghorbani, Mei, Misiakiewicz, and Montanari, 2019a, 2019b], [Allen-Zhu and Li, 2019].
