
A Mean Field View of the Landscape of Two-Layers Neural Networks

Andrea Montanari [with Song Mei, Phan-Minh Nguyen]

Stanford University

May 14, 2018

Andrea Montanari (Stanford) Two layers May 14, 2018 1 / 40


Non-convex high-dimensional statistics

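Data $\{(y_1, x_1), (y_2, x_2), \dots, (y_n, x_n)\} \sim_{\rm iid} P \in \mathcal{P}(\mathbb{R} \times \mathbb{R}^d)$

Empirical risk minimization

$$\text{minimize} \quad \hat{R}_n(\theta) \equiv \frac{1}{n} \sum_{i=1}^n \ell(\theta; x_i, y_i)$$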


Example: ‘One-neuron neural network’

[Figure: the loss $\ell(\theta; y = 1, x)$ as a function of $\langle \theta, x \rangle \in [-4, 4]$.]

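$z_i = (y_i, x_i)$, $y_i \in \{0, 1\}$, $x_i \in \mathbb{R}^d$, $\mathbb{P}(y_i = 1 \mid x_i) = \sigma(\langle \theta_0, x_i \rangle)$

$$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \big( y_i - \sigma(\langle \theta, x_i \rangle) \big)^2, \qquad \sigma(u) = \frac{1}{1 + e^{-u}}.$$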
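A minimal sketch of this one-neuron model, assuming Gaussian covariates $x_i \sim \mathsf{N}(0, I_d)$ and an arbitrary `theta0` (neither is specified on the slide). It only evaluates the square-loss empirical risk at two points; it is not the authors' code.

```python
import numpy as np

def sigmoid(u):
    """Logistic function sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def empirical_risk(theta, X, y):
    """Square-loss empirical risk (1/n) * sum_i (y_i - sigma(<theta, x_i>))^2."""
    return float(np.mean((y - sigmoid(X @ theta)) ** 2))

def sample_data(n, theta0, rng):
    """Draw x_i ~ N(0, I_d) (an assumption) and y_i ~ Bernoulli(sigma(<theta0, x_i>))."""
    X = rng.standard_normal((n, theta0.shape[0]))
    y = (rng.random(n) < sigmoid(X @ theta0)).astype(float)
    return X, y

rng = np.random.default_rng(0)
theta0 = np.array([1.0, -2.0, 0.5])   # hypothetical true parameter
X, y = sample_data(2000, theta0, rng)
risk_at_truth = empirical_risk(theta0, X, y)
risk_at_zero = empirical_risk(np.zeros(3), X, y)
```

At $\theta = 0$ every prediction is $\sigma(0) = 1/2$, so the risk there is exactly $1/4$ for binary labels; the risk at `theta0` should come out lower.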


Population vs. empirical risk

[Figure: two contour plots over $(w_1, w_2) \in [-3, 3]^2$; the first panel marks $w_0$, the second marks $w_0$ and $w_n$.]

- Population risk $\approx$ Empirical risk.
- Population risk has unique global minimum or...
- ... can construct a good initialization.

[Mei, Bai, Montanari 2017]


This talk

- More complicated models (two-layers NNs)

- Higher level description


Outline

1 Two-layers neural networks

2 Concrete examples

3 Noisy stochastic gradient descent

4 Conclusion

[Mei, M., Nguyen, arXiv:1804.06561]


Two layers neural networks


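$N$ hidden units (neurons)

$\theta = (\theta_1, \dots, \theta_N) \in \mathbb{R}^{D \times N}$

$$\hat{y}(x; \theta) = \frac{1}{N} \sum_{i=1}^N \sigma_*(x; \theta_i), \qquad \sigma_* : \mathbb{R}^d \times \mathbb{R}^D \to \mathbb{R}$$

Classical example: $\theta_i = (w_i, a_i, b_i) \in \mathbb{R}^{d+2}$

$$\sigma_*(x; \theta_i) = a_i \, \sigma(\langle w_i, x \rangle + b_i).$$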
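The classical example can be sketched in a few lines. The choice $\sigma = \tanh$ and the array shapes are assumptions made only for illustration; the slide leaves $\sigma$ generic.

```python
import numpy as np

def sigma_star(x, w, a, b):
    """One hidden unit: a * sigma(<w, x> + b), with sigma = tanh (an assumption)."""
    return a * np.tanh(x @ w + b)

def predict(X, W, a, b):
    """Mean-field output: average over N units. X: (n, d), W: (N, d), a, b: (N,)."""
    hidden = np.tanh(X @ W.T + b)    # shape (n, N)
    return hidden @ a / W.shape[0]   # note the 1/N normalization

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
W = rng.standard_normal((4, 3))
a = rng.standard_normal(4)
b = rng.standard_normal(4)
y_hat = predict(X, W, a, b)
```

Because the output is an average over units, relabeling the neurons leaves `y_hat` unchanged.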


Empirical risk

$$\hat{R}_{n,N}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell\Big( y_i, \frac{1}{N} \sum_{j=1}^N \sigma_*(x_i; \theta_j) \Big)$$


Empirical risk

- For simplicity $\ell(y, \hat{y}) = (y - \hat{y})^2$

$$\hat{R}_{n,N}(\theta) = \frac{1}{n} \sum_{i=1}^n \Big( y_i - \frac{1}{N} \sum_{j=1}^N \sigma_*(x_i; \theta_j) \Big)^2$$

Landscape analysis?

Partial success... [Arora, Bhaskara, Ge, Ma, 2014; Janzamin, Sedghi, Anandkumar, 2015; Ge, Lee, Ma, 2017; Soltanolkotabi, Javanmard, Lee, 2017; Zhang, Lee, Jordan, 2017; Zhong, Song, Jain, Bartlett, Dhillon, 2017; ...]
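For concreteness, the square-loss empirical risk can be evaluated as below. The helper `tanh_unit` is a hypothetical choice of $\sigma_*$ (with $D = d$), introduced only for this sketch.

```python
import numpy as np

def empirical_risk(theta, X, y, unit):
    """(1/n) sum_i (y_i - (1/N) sum_j unit(x_i, theta_j))^2 for theta of shape (N, D)."""
    outputs = np.stack([unit(X, theta_j) for theta_j in theta], axis=1)  # (n, N)
    y_hat = outputs.mean(axis=1)
    return float(np.mean((y - y_hat) ** 2))

def tanh_unit(X, theta_j):
    """Hypothetical unit sigma_*(x; w) = tanh(<w, x>), so D = d here."""
    return np.tanh(X @ theta_j)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
theta = np.zeros((3, 2))                      # three units, all at the origin
risk_zero_labels = empirical_risk(theta, X, np.array([0.0, 0.0]), tanh_unit)
risk_unit_labels = empirical_risk(theta, X, np.array([1.0, 1.0]), tanh_unit)
```

With all units at the origin the network outputs zero, so the risk is just the mean of $y_i^2$.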


Seeking a simpler structure

1. Stochastic gradient descent (SGD)

$$\theta_i^{k+1} = \theta_i^k + 2 s_k \big( y_k - \hat{y}(x_k; \theta^k) \big) \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k).$$

2. One-pass: each example visited once

$$\{(y_k, x_k)\}_{k \ge 1} \sim_{\rm iid} P$$

Not a bad assumption! More on this later!
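A minimal sketch of one-pass SGD for the square loss, where each step moves every unit along $(y_k - \hat{y}_k)\,\nabla_{\theta_i}\sigma_*$. The unit $\sigma_*(x; w) = \tanh(\langle w, x \rangle)$, the constant step size, and the teacher `w_true` are all assumptions chosen for illustration.

```python
import numpy as np

def one_pass_sgd(sample, W0, step, n_steps, rng):
    """One-pass SGD for y_hat(x) = (1/N) sum_i tanh(<w_i, x>) under square loss.

    Each iteration draws a fresh (y_k, x_k), so no example is reused.
    """
    W = W0.copy()
    for _ in range(n_steps):
        y_k, x_k = sample(rng)
        pre = W @ x_k                                   # <w_i, x_k>, shape (N,)
        y_hat = np.tanh(pre).mean()
        grad_units = (1.0 - np.tanh(pre) ** 2)[:, None] * x_k[None, :]
        W += 2.0 * step * (y_k - y_hat) * grad_units    # same residual for all units
    return W

def risk(W, X, y):
    """Monte Carlo estimate of the population risk on held-out data."""
    return float(np.mean((y - np.tanh(X @ W.T).mean(axis=1)) ** 2))

# Hypothetical teacher: y = tanh(<w_true, x>) with x ~ N(0, I_d).
d, N = 5, 50
w_true = np.ones(d) / np.sqrt(d)

def sample(rng):
    x = rng.standard_normal(d)
    return np.tanh(w_true @ x), x

rng = np.random.default_rng(0)
W0 = 0.1 * rng.standard_normal((N, d))
W = one_pass_sgd(sample, W0, step=0.05, n_steps=2000, rng=rng)

X_test = rng.standard_normal((5000, d))
y_test = np.tanh(X_test @ w_true)
initial_risk, final_risk = risk(W0, X_test, y_test), risk(W, X_test, y_test)
```

Since every iteration consumes a new sample, the number of iterations equals the sample size, which is the accounting used in the rest of the talk.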


Questions

- Number of iterations $k$ to achieve risk $R_N(\theta^k) \le R_{\rm target}$?

- Scaling with $N$, $D$?

- Can achieve $R_{\rm target} = \inf_\theta R_N(\theta) + \varepsilon$?


SGD minimizes population risk

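$$R_N(\theta) = \mathbb{E}\Big\{ \Big( y - \frac{1}{N} \sum_{j=1}^N \sigma_*(x; \theta_j) \Big)^2 \Big\} = R_\# + \frac{2}{N} \sum_{i=1}^N V(\theta_i) + \frac{1}{N^2} \sum_{i,j=1}^N U(\theta_i, \theta_j),$$

$$V(\theta) \equiv -\mathbb{E}\{ y \, \sigma_*(x; \theta) \}, \qquad U(\theta_1, \theta_2) \equiv \mathbb{E}\{ \sigma_*(x; \theta_1) \, \sigma_*(x; \theta_2) \}.$$

- Gas of $N$ particles in $D$ dimensions
- Exchangeable!
- $U(\cdot, \cdot) \succeq 0$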
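The particle decomposition follows from expanding the square, with $R_\# = \mathbb{E}\{y^2\}$. The sketch below checks it numerically, replacing every expectation by an average over one fixed sample so that the identity holds to floating-point accuracy; the data distribution and $\sigma_*(x; w) = \tanh(\langle w, x \rangle)$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, N = 500, 4, 6
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                  # hypothetical labels
W = rng.standard_normal((N, d))       # particles theta_1, ..., theta_N

S = np.tanh(X @ W.T)                  # S[m, i] = sigma_*(x_m; theta_i)
risk_direct = np.mean((y - S.mean(axis=1)) ** 2)

R_sharp = np.mean(y ** 2)                 # constant term E[y^2]
V = -(y[:, None] * S).mean(axis=0)        # V(theta_i) = -E[y sigma_*(x; theta_i)]
U = S.T @ S / n                           # U(theta_i, theta_j) as an N x N matrix
risk_particles = R_sharp + 2.0 * V.mean() + U.mean()

min_eig_U = np.linalg.eigvalsh(U).min()   # U is a Gram matrix, hence PSD
```

`V.mean()` is $\frac{1}{N}\sum_i V(\theta_i)$ and `U.mean()` is $\frac{1}{N^2}\sum_{i,j} U(\theta_i, \theta_j)$, so the last line is the right-hand side of the decomposition.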


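Exchangeability $\Rightarrow$ $R_N(\theta)$ depends on $\theta_1, \dots, \theta_N$ only through $\rho^{(N)} = \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i}$:

$R : \mathcal{P}(\mathbb{R}^D) \to \mathbb{R}$

$$R(\rho) \equiv R_\# + 2 \int V(\theta) \, \rho(d\theta) + \int U(\theta_1, \theta_2) \, \rho(d\theta_1) \, \rho(d\theta_2)$$

Lemma. If $\int U(\theta, \theta) \, \rho_{\rm opt}(d\theta) < \infty$, then

$$\inf_\rho R(\rho) \le \inf_\theta R_N(\theta) \le \inf_\rho R(\rho) + \frac{C}{N}.$$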

[cf convnets, Bengio et al. 2006]
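A sketch of why both inequalities are plausible, assuming the infimum over $\rho$ is attained at some $\rho_{\rm opt}$ (this is the standard argument, not necessarily the proof used in the paper). Evaluating $R$ at an empirical measure recovers $R_N$, which gives the lower bound; drawing the $\theta_i$ i.i.d. from $\rho_{\rm opt}$ and averaging gives the upper bound, since only the $N$ diagonal terms of the double sum differ from the mean-field value.

```latex
% Lower bound: an empirical measure is a particular choice of rho.
R\big(\rho^{(N)}\big)
  = R_\# + \frac{2}{N}\sum_{i=1}^N V(\theta_i)
    + \frac{1}{N^2}\sum_{i,j=1}^N U(\theta_i,\theta_j)
  = R_N(\theta)
  \;\Longrightarrow\;
  \inf_\rho R(\rho) \le \inf_\theta R_N(\theta).

% Upper bound: draw theta_1, ..., theta_N i.i.d. from rho_opt.
\inf_\theta R_N(\theta) \le \mathbb{E}\, R_N(\theta)
  = R(\rho_{\rm opt})
    + \frac{1}{N}\Big[\int U(\theta,\theta)\,\rho_{\rm opt}(d\theta)
    - \int U(\theta_1,\theta_2)\,\rho_{\rm opt}(d\theta_1)\,\rho_{\rm opt}(d\theta_2)\Big].
```

The double integral is nonnegative, so under this assumption one may take $C = \int U(\theta, \theta)\,\rho_{\rm opt}(d\theta)$.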


Puzzle

$$R(\rho) \equiv R_\# + 2 \int V(\theta) \, \rho(d\theta) + \int U(\theta_1, \theta_2) \, \rho(d\theta_1) \, \rho(d\theta_2)$$

- Hard nonconvex problem $R_N(\theta)$ $\Rightarrow$ Convex $R(\rho)$

- Did we trivialize the problem?

Solution
- Not all 'small changes' in $\rho$ can be realized by SGD dynamics
- Mass must be conserved locally
- Is there a scaling limit for SGD dynamics?


‘Distributional dynamics’

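$$\partial_t \rho_t = 2 \xi(t) \, \nabla_\theta \cdot \big( \rho_t \nabla_\theta \Psi(\theta; \rho_t) \big),$$

$$\Psi(\theta; \rho) \equiv \frac{1}{2} \frac{\delta R(\rho)}{\delta \rho(\theta)} = V(\theta) + \int U(\theta, \theta') \, \rho(d\theta').$$

Claim: $s_k = \varepsilon \, \xi(k \varepsilon)$, $k = t / \varepsilon$, $N \to \infty$, $\varepsilon \to 0$:

$$\rho^{(N)}_k \equiv \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i^k} \Rightarrow \rho_t$$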
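The potential $\Psi$ comes from the first variation of $R$. As a short check (a sketch, with $\nu$ a signed perturbation of zero total mass and $U$ symmetric), perturbing $\rho$ along $\nu$ gives:

```latex
R(\rho + \epsilon\nu) - R(\rho)
  = 2\epsilon \int \Big[ V(\theta) + \int U(\theta,\theta')\,\rho(d\theta') \Big]\,\nu(d\theta)
    + \epsilon^2 \int U(\theta_1,\theta_2)\,\nu(d\theta_1)\,\nu(d\theta_2),
\qquad\text{so}\qquad
\frac{\delta R(\rho)}{\delta\rho(\theta)}
  = 2\,V(\theta) + 2\int U(\theta,\theta')\,\rho(d\theta')
  = 2\,\Psi(\theta;\rho).
```

The factor 2 is the same one that appears in front of $\xi(t)$ in the PDE and in front of $s_k$ in the SGD update.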


More precisely

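Theorem (Mei, Montanari, Nguyen, 2018)

Assume $\nabla V$, $\nabla U$ bounded Lipschitz, and $\nabla_\theta \sigma_*(x; \theta)$ subgaussian. Let $(\theta_i^0)_{i \le N} \sim_{\rm iid} \rho_0$. Then, for any bounded Lipschitz function $f$:

$$\sup_{t \le T} \Big| \frac{1}{N} \sum_{i=1}^N f(\theta_i^{\lfloor t/\varepsilon \rfloor}, t) - \int f(\theta, t) \, \rho_t(d\theta) \Big| \le K e^{KT} \, \mathrm{err}_{N,D}(z),$$

$$\mathrm{err}_{N,D}(z) \equiv \sqrt{\frac{1}{N} \vee \varepsilon} \cdot \Big[ \sqrt{D \vee \log \frac{N}{\varepsilon}} + z \Big],$$

with probability at least $1 - 4 e^{-z^2/2}$.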


Proof idea

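- Propagation-of-chaos argument.

- Intuition: let $\mathcal{F}_k = \sigma(\{y_i, x_i\}_{i<k})$ be the $\sigma$-algebra generated by the samples seen before step $k$. Then

$$\begin{aligned}
\mathbb{E}\{\theta_i^{k+1} \mid \mathcal{F}_k\} &= \theta_i^k + 2 \varepsilon \xi(k\varepsilon) \, \mathbb{E}\big\{ \big( y_k - \hat{y}(x_k; \theta^k) \big) \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k) \mid \mathcal{F}_k \big\} \\
&= \theta_i^k - 2 \varepsilon \xi(k\varepsilon) \nabla V(\theta_i^k) - 2 \varepsilon \xi(k\varepsilon) \frac{1}{N} \sum_{j=1}^N \nabla_1 U(\theta_i^k, \theta_j^k) \\
&= \theta_i^k - 2 \varepsilon \xi(k\varepsilon) \nabla \Psi(\theta_i^k; \rho^{(N)}_k).
\end{aligned}$$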


Overparametrization

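$$\mathrm{err}_{N,D}(z) \equiv \sqrt{\frac{1}{N} \vee \varepsilon} \cdot \Big[ \sqrt{D \vee \log \frac{N}{\varepsilon}} + z \Big].$$

- Small if $N \gg D \log D$, $\varepsilon \ll 1/D$.

- Number of parameters $ND$ can be much larger than sample size $k = t/\varepsilon = O(D)$!!

- Overparametrization does not slow down convergence!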
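To get a feel for the scales involved, the error term can be evaluated numerically. The function below assumes the reading $\sqrt{(1/N) \vee \varepsilon}\,\big[\sqrt{D \vee \log(N/\varepsilon)} + z\big]$ of the formula; the parameter values are arbitrary illustrations.

```python
import math

def err_bound(N, D, eps, z):
    """sqrt(max(1/N, eps)) * (sqrt(max(D, log(N / eps))) + z), as read from the slide."""
    return math.sqrt(max(1.0 / N, eps)) * (math.sqrt(max(D, math.log(N / eps))) + z)

# Hypothetical regime: D = 100, N = 10^6 neurons, step size eps = 10^-6.
D, N, eps, z = 100, 10**6, 1e-6, 1.0
bound = err_bound(N, D, eps, z)   # sqrt(1e-6) * (sqrt(100) + 1)
n_params = N * D                  # 10^8 parameters

# Too few neurons or too large a step makes the bound useless.
bound_small_N = err_bound(D, D, eps, z)
bound_large_eps = err_bound(N, D, 1.0 / D, z)
```

In this regime the bound is about $0.011$ even though the network has $10^8$ parameters, while taking $N = D$ or $\varepsilon = 1/D$ pushes it above 1.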


Related work

Last 2 weeks
- Rotskoff, Vanden-Eijnden, arXiv:1805.00915
- Sirignano, Spiliopoulos, arXiv:1805.01053


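Minima of $R(\rho)$ $\subseteq$ Fixed points

$$R(\rho) \equiv R_\# + 2 \int V(\theta) \, \rho(d\theta) + \int U(\theta_1, \theta_2) \, \rho(d\theta_1) \, \rho(d\theta_2)$$

$$\partial_t \rho_t = 2 \xi(t) \, \nabla_\theta \cdot \big( \rho_t \nabla_\theta \Psi(\theta; \rho_t) \big).$$

- Minima

$$\mathrm{supp}(\rho_*) \subseteq \arg\min_{\theta \in \mathbb{R}^D} \Psi(\theta; \rho_*).$$

- Fixed points

$$\mathrm{supp}(\rho_*) \subseteq \big\{ \theta \in \mathbb{R}^D : \nabla \Psi(\theta; \rho_*) = 0 \big\}.$$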


What is this?

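$$\partial_t \rho_t = 2 \xi(t) \, \nabla_\theta \cdot \big( \rho_t \nabla_\theta \Psi(\theta; \rho_t) \big).$$

- Gradient flow of $R(\rho)$...
- ... in the metric space $(\mathcal{P}(\mathbb{R}^D), W_2)$

[Jordan, Kinderlehrer, Otto, 1998; Ambrosio, Gigli, Savaré, 2006; Carrillo, McCann, Villani, 2013; ...]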


What to do with this?

$$\partial_t \rho_t = 2 \xi(t) \, \nabla_\theta \cdot \big( \rho_t \nabla_\theta \Psi(\theta; \rho_t) \big).$$

- Concrete examples
- A general result for noisy SGD


Concrete examples


Simplest example requiring more than one neuron

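With probability 1/2: $y = +1$, $x \sim \mathsf{N}(0, \Sigma_+)$.
With probability 1/2: $y = -1$, $x \sim \mathsf{N}(0, \Sigma_-)$.

$$\Sigma_\pm = \begin{pmatrix} \tau_\pm^2 I_{s_0} & 0 \\ 0 & I_{d - s_0} \end{pmatrix}$$

Invariant under $O(s_0) \times O(d - s_0)$ $\Rightarrow$ Reduced PDE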
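A sketch of a sampler for this data model, assuming the block structure $\Sigma_\pm = \mathrm{diag}(\tau_\pm^2 I_{s_0}, I_{d - s_0})$. The function name and argument order are illustrative; note that it takes the standard deviations $\tau_\pm$, not the variances.

```python
import numpy as np

def sample_anisotropic(n, d, s0, tau_plus, tau_minus, rng):
    """Draw n labelled points: y = +1 or -1 with probability 1/2 each, x ~ N(0, Sigma_y)."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.standard_normal((n, d))
    tau = np.where(y > 0, tau_plus, tau_minus)
    X[:, :s0] *= tau[:, None]    # first s0 coordinates get variance tau_{+/-}^2
    return X, y

rng = np.random.default_rng(0)
X, y = sample_anisotropic(20000, d=4, s0=2,
                          tau_plus=np.sqrt(1.8), tau_minus=np.sqrt(0.2), rng=rng)
var_plus = X[y > 0][:, :2].var()     # should be close to 1.8
var_minus = X[y < 0][:, :2].var()    # should be close to 0.2
var_rest = X[:, 2:].var()            # should be close to 1
```

Both classes are centered at the origin and differ only in the variance of the first $s_0$ coordinates, so the label is a function of the norm of that block rather than of any single direction.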


Activation functions

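Simple: $\theta_i = w_i \in \mathbb{R}^d$ (no offset, no weight)

$$\sigma_*(x; \theta_i) = \sigma(\langle w_i, x \rangle)$$

ReLU: $\theta_i = (w_i, a_i, b_i) \in \mathbb{R}^{d+2}$

$$\sigma_*(x; \theta_i) = a_i \max(\langle w_i, x \rangle + b_i, 0)$$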


Distributional dynamics

- $s_0 = d = 40$, $N = 800$, $\tau_+^2 = 1.8$, $\tau_-^2 = 0.2$

- Simple activation

- Histogram: empirical results. Continuous lines: PDE solution.


Evolution of the risk

[Figure: risk versus iteration (logarithmic axis, $10^0$ to $10^7$), and a second panel with axes a (mean), b (mean), r1 (mean); curves for PDE and SGD at $\Delta = 0.2, 0.4, 0.6$.]

- $N = 800$, $d = 320$, $s_0 = 60$, $\tau_\pm^2 = 1 \pm \Delta$

- ReLU activation


Classifying anisotropic Gaussians: Analysis

Theorem (Mei, Montanari, Nguyen, 2018)

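Assume: (i) $\sigma : \mathbb{R} \to \mathbb{R}$ non-decreasing bounded Lipschitz; (ii) $s_0 = \gamma d$, $\gamma \in (0, 1)$ fixed; (iii) $\rho_0 \in \mathcal{P}(\mathbb{R}_{\ge 0})$ has bounded density and $R(\rho_0) < 1$.

For $d, s_0 \ge d_0(\eta, \rho_0, \Delta)$, $N \ge C_0 d \log d$, $C_0(\eta, \rho_0, \Delta, \delta)$, consider SGD initialized with $(w_i^0)_{i \le N} \sim_{\rm iid} \rho_0$ and step size $\varepsilon \le 1/(C_0 d)$. Then, for any $k \in [T/\varepsilon, 10T/\varepsilon]$, whp

$$R_N(\theta^k) \le \inf_{\theta \in \mathbb{R}^{d \times N}} R_N(\theta) + \eta.$$

- Learning from $O(1/\varepsilon) = O(d)$ samples.
- Independent of number of neurons $N \ge d \log d$.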


Predicting failure

[Figure: risk versus iteration and $r$ versus iteration (logarithmic axes, $10^0$ to $10^7$); curves for PDE and SGD at $\kappa = 0.1$ and $\kappa = 0.4$.]

- $s_0 = d$, $\tau_+^2 = 1.5$, $\tau_-^2 = 0.5$
- $N = 800$, $d = 320$
- Non-monotone activation
- Two different initializations


Predicting failure

- SGD does not necessarily converge to global min

- Can we fix it?


Noisy stochastic gradient descent


Regularized noisy SGD

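$(g_i^k)_{i \le N, k \ge 0} \sim_{\rm iid} \mathsf{N}(0, I)$

$$\theta_i^{k+1} = (1 - 2 \lambda s_k) \, \theta_i^k + 2 s_k \big( y_k - \hat{y}_k \big) \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k) + \sqrt{s_k / \beta} \; g_i^k.$$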
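One step of the regularized noisy update can be sketched as follows. The noise scale $\sqrt{s_k/\beta}$ is transcribed from the slide as read; the helpers `predict` and `grad_unit` assume $\sigma_*(x; w) = \tanh(\langle w, x \rangle)$ and are hypothetical.

```python
import numpy as np

def predict(x, theta):
    """Network output (1/N) sum_i tanh(<theta_i, x>) for a single input x."""
    return np.tanh(theta @ x).mean()

def grad_unit(x, theta):
    """Gradient of tanh(<theta_i, x>) with respect to theta_i, stacked as (N, D)."""
    return (1.0 - np.tanh(theta @ x) ** 2)[:, None] * x[None, :]

def noisy_sgd_step(theta, x_k, y_k, s_k, lam, beta, rng):
    """One update: weight decay (1 - 2 lam s_k), gradient step, Gaussian noise."""
    residual = y_k - predict(x_k, theta)
    noise = rng.standard_normal(theta.shape)
    return ((1.0 - 2.0 * lam * s_k) * theta
            + 2.0 * s_k * residual * grad_unit(x_k, theta)
            + np.sqrt(s_k / beta) * noise)

rng = np.random.default_rng(0)
# Zero input and zero label: only the weight decay acts when beta is infinite.
shrunk = noisy_sgd_step(np.ones((2, 3)), np.zeros(3), 0.0,
                        s_k=0.1, lam=0.5, beta=np.inf, rng=rng)
# No decay, no signal: the step is pure noise with standard deviation sqrt(s_k / beta).
diffused = noisy_sgd_step(np.zeros((2000, 5)), np.zeros(5), 0.0,
                          s_k=0.5, lam=0.0, beta=2.0, rng=rng)
```

Setting `lam = 0` and `beta = np.inf` recovers the plain SGD update, which makes the two extra terms easy to test in isolation.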


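Distributional dynamics (wlog $\xi(t) = 1/2$)

$$\partial_t \rho_t = \nabla_\theta \cdot \big( \rho_t \nabla_\theta \Psi_\lambda(\theta; \rho_t) \big) + \beta^{-1} \Delta_\theta \rho_t,$$

$$\Psi_\lambda(\theta; \rho) \equiv V(\theta) + \frac{\lambda}{2} \|\theta\|_2^2 + \int U(\theta, \theta') \, \rho(d\theta').$$

Theorem. Same approximation theorem: noisy SGD $\leftrightarrow$ PDE.

What does this minimize?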


Wasserstein gradient flow

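$$F_{\beta,\lambda}(\rho) = \frac{1}{2} R(\rho) + \frac{\lambda}{2} \int \|\theta\|_2^2 \, \rho(d\theta) - \beta^{-1} \mathrm{Ent}(\rho),$$

$$\mathrm{Ent}(\rho) \equiv -\int \rho(\theta) \log \rho(\theta) \, d\theta.$$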


Wasserstein gradient flow

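Proposition. Assume $\nabla V(\cdot), \nabla U(\cdot, \cdot) \in C_b^1(\mathbb{R}^D)$, $\rho_0 \in C_b^1(\mathbb{R}^D)$. If $\rho_t$ is a solution of the distributional dynamics (DD), then $F_{\beta,\lambda}(\rho_t)$ is non-increasing:

$$\partial_t F_{\beta,\lambda}(\rho_t) = -\int \Big\| \nabla_\theta \Big( \Psi_\lambda(\theta; \rho_t) + \frac{1}{\beta} \log \rho_t(\theta) \Big) \Big\|_2^2 \, \rho_t(d\theta).$$

In particular, any fixed point $\rho_*$ with $F_{\beta,\lambda}(\rho_*) < \infty$ satisfies

$$\rho_*(\theta) = \frac{1}{Z(\beta; \rho_*)} \exp\big\{ -\beta \Psi_\lambda(\theta; \rho_*) \big\}.$$

[Jordan, Kinderlehrer, Otto, 1998; Carrillo, McCann, Villani, 2013; ...]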


Key remark

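- The fixed-point condition

$$\rho_*(\theta) = \frac{1}{Z(\beta)} \exp\big\{ -\beta \Psi_\lambda(\theta; \rho_*) \big\}$$

is the Euler-Lagrange equation for

$$F_{\beta,\lambda}(\rho) = \frac{1}{2} R(\rho) + \frac{\lambda}{2} \int \|\theta\|_2^2 \, \rho(d\theta) - \beta^{-1} \mathrm{Ent}(\rho),$$

- $F_{\beta,\lambda}(\cdot)$ is strongly convex

- $\Rightarrow$ Unique fixed point!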
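The Euler-Lagrange claim can be checked in one line. This is a sketch assuming $\rho$ has a density and writing $\Psi_\lambda(\theta; \rho) = V(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 + \int U(\theta, \theta')\,\rho(d\theta')$; the constant $c$ is the Lagrange multiplier for the constraint $\int \rho = 1$.

```latex
\frac{\delta F_{\beta,\lambda}(\rho)}{\delta\rho(\theta)}
  = \Psi_\lambda(\theta;\rho) + \beta^{-1}\big(\log\rho(\theta) + 1\big) = c
\quad\Longrightarrow\quad
\rho(\theta) = \exp\{\beta c - 1\}\,\exp\{-\beta\,\Psi_\lambda(\theta;\rho)\}
  = \frac{1}{Z}\exp\{-\beta\,\Psi_\lambda(\theta;\rho)\}.
```

The first term uses $\frac{1}{2}\,\delta R/\delta\rho = V + \int U \rho$, and the entropy contributes $\beta^{-1}(\log \rho + 1)$ because $-\beta^{-1}\mathrm{Ent}(\rho) = \beta^{-1} \int \rho \log \rho$.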


General convergence for noisy SGD

Theorem (Mei, Montanari, Nguyen, 2018)

Initialization $(\theta_i^0)_{i \le N} \sim_{\rm iid} \rho_0$. Assumptions of previous theorems, with $N \ge C_0 D \log D$, $\varepsilon \le 1/(C_0 D)$. Then there exists $\beta_0 = \beta_0(D, U, V, \eta)$ such that, for $\beta > \beta_0$, there exists $T = T(D, U, V, \beta, \delta)$ such that for any $k \in [T/\varepsilon, 10T/\varepsilon]$ we have, with probability at least $1 - \delta$,

$$R_{\lambda,N}(\theta^k) \le \inf_{\theta \in \mathbb{R}^{D \times N}} R_{\lambda,N}(\theta) + \eta.$$

- General!
- Convergence time depends on $D$, but not on $N$!


Conclusion


Conclusion

Correspondence

- Learning in two-layers neural networks

- Gradient flows in measure spaces

Thanks!
