A Mean Field View of the Landscape of Two-Layers Neural Networks

Andrea Montanari [with Song Mei, Phan-Minh Nguyen]
Stanford University
May 14, 2018
Non-convex high-dimensional statistics
Data: $\{(y_1, x_1), (y_2, x_2), \dots, (y_n, x_n)\} \sim_{iid} \mathbb{P} \in \mathcal{P}(\mathbb{R} \times \mathbb{R}^d)$

Empirical risk minimization:

minimize $\hat{R}_n(\theta) \equiv \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; x_i, y_i)$
Example: ‘One-neuron neural network’
[Figure: the loss $\ell(\theta; y = 1, x)$ as a function of $\langle\theta, x\rangle \in [-4, 4]$.]
$z_i = (y_i, x_i)$, $y_i \in \{0, 1\}$, $x_i \in \mathbb{R}^d$, $\mathbb{P}(y_i = 1 \,|\, x_i) = \sigma(\langle\theta_0, x_i\rangle)$

$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \sigma(\langle\theta, x_i\rangle) \big)^2\,, \qquad \sigma(u) = \frac{1}{1 + e^{-u}}\,.$
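To make the setup concrete, here is a minimal NumPy sketch of this empirical risk, assuming synthetic data generated from a planted $\theta_0$ (the sample sizes and seed are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def empirical_risk(theta, X, y):
    """Squared-loss empirical risk for the one-neuron model."""
    preds = sigmoid(X @ theta)            # sigma(<theta, x_i>) for each row x_i
    return np.mean((y - preds) ** 2)

# Toy usage: n samples in d dimensions, labels drawn from a planted theta_0.
rng = np.random.default_rng(0)
n, d = 1000, 20
theta0 = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = (rng.uniform(size=n) < sigmoid(X @ theta0)).astype(float)
print(empirical_risk(theta0, X, y))
```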
Population vs. empirical risk
[Figure: surface plots of the population risk (left) and the empirical risk (right) as functions of $(w_1, w_2) \in [-3, 3]^2$.]
- Population risk ≈ empirical risk.
- Population risk has a unique global minimum or...
- ...one can construct a good initialization.
[Mei, Bai, Montanari 2017]
This talk
- More complicated models (two-layers NNs)
- Higher-level description
Outline
1 Two-layers neural networks
2 Concrete examples
3 Noisy stochastic gradient descent
4 Conclusion
[Mei, M., Nguyen, arXiv:1804.06561]
Two-layers neural networks
N hidden units (neurons)

$\theta = (\theta_1, \dots, \theta_N) \in \mathbb{R}^{D \times N}$

$\hat{y}(x; \theta) = \frac{1}{N} \sum_{i=1}^{N} \sigma_*(x; \theta_i)\,, \qquad \sigma_*: \mathbb{R}^d \times \mathbb{R}^D \to \mathbb{R}$

Classical example: $\theta_i = (w_i, a_i, b_i) \in \mathbb{R}^{d+2}$,

$\sigma_*(x; \theta_i) = a_i\, \sigma(\langle w_i, x \rangle + b_i)\,.$
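A minimal sketch of this predictor in NumPy, for the classical parametrization above (array shapes and names are my own, not from the talk):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def y_hat(x, W, a, b):
    """(1/N) * sum_i a_i * sigma(<w_i, x> + b_i); W is (N, d), a and b are (N,)."""
    return np.mean(a * sigmoid(W @ x + b))

rng = np.random.default_rng(0)
N, d = 50, 10
W, a, b = rng.normal(size=(N, d)), rng.normal(size=N), np.zeros(N)
print(y_hat(rng.normal(size=d), W, a, b))
```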
Empirical risk
$\hat{R}_{n,N}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\Big( y_i,\ \frac{1}{N} \sum_{j=1}^{N} \sigma_*(x_i; \theta_j) \Big)$
Empirical risk
- For simplicity, $\ell(y, \hat{y}) = (y - \hat{y})^2$:

$\hat{R}_{n,N}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \frac{1}{N} \sum_{j=1}^{N} \sigma_*(x_i; \theta_j) \Big)^2$

Landscape analysis?

Partial success... [Arora, Bhaskara, Ge, Ma, 2014; Janzamin, Sedghi, Anandkumar, 2015; Ge, Lee, Ma, 2017; Soltanolkotabi, Javanmard, Lee, 2017; Zhang, Lee, Jordan, 2017; Zhong, Song, Jain, Bartlett, Dhillon, 2017; ...]
Seeking a simpler structure
1. Stochastic gradient descent (SGD)
$\theta_i^{k+1} = \theta_i^k + 2 s_k\, \big( y_k - \hat{y}(x_k; \theta^k) \big)\, \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k)\,.$

2. One-pass: each example is visited once

$\{(y_k, x_k)\}_{k \ge 1} \sim_{iid} \mathbb{P}$

Not a bad assumption! More on this later!
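A sketch of one such SGD step for the classical sigmoid parametrization, assuming the sign convention reconstructed above (one fresh sample per step, per the one-pass assumption; the step size is illustrative):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sgd_step(W, a, b, x, y, s):
    """One SGD step on the squared loss; W is (N, d), a and b are (N,)."""
    h = sigmoid(W @ x + b)                     # sigma(<w_i, x> + b_i)
    y_hat = np.mean(a * h)
    r = 2.0 * s * (y - y_hat)                  # common factor in all updates
    dh = h * (1.0 - h)                         # sigma'(<w_i, x> + b_i)
    W += r * (a * dh)[:, None] * x[None, :]    # gradient of sigma_* w.r.t. w_i
    b += r * a * dh                            # ... w.r.t. b_i
    a += r * h                                 # ... w.r.t. a_i
    return W, a, b

# One pass over a stream of fresh samples:
#   for k, (y, x) in enumerate(stream): W, a, b = sgd_step(W, a, b, x, y, s=0.1)
```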
Questions
- Number of iterations $k$ needed to achieve risk $R_N(\theta^k) \le R_{\text{target}}$?
- Scaling with $N$, $D$?
- Can we achieve $R_{\text{target}} = \inf_\theta R_N(\theta) + \varepsilon$?
SGD minimizes population risk
$R_N(\theta) = \mathbb{E}\Big\{ \Big( y - \frac{1}{N} \sum_{j=1}^{N} \sigma_*(x; \theta_j) \Big)^2 \Big\} = R_\# + \frac{2}{N} \sum_{i=1}^{N} V(\theta_i) + \frac{1}{N^2} \sum_{i,j=1}^{N} U(\theta_i, \theta_j)\,,$

$V(\theta) \equiv -\mathbb{E}\big\{ y\, \sigma_*(x; \theta) \big\}\,, \qquad U(\theta_1, \theta_2) \equiv \mathbb{E}\big\{ \sigma_*(x; \theta_1)\, \sigma_*(x; \theta_2) \big\}\,.$

- A gas of N particles in D dimensions
- Exchangeable!
- $U(\,\cdot\,, \,\cdot\,) \succeq 0$
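Since V and U are plain expectations over the data distribution, they can be estimated by Monte Carlo; a sketch, where the sampler `draw_xy` and the choice of `sigma_star` are assumed stand-ins for $\mathbb{P}$ and the activation:

```python
import numpy as np

def sigma_star(x, theta):
    return np.tanh(x @ theta)          # illustrative choice of sigma_*

def estimate_V_U(theta1, theta2, draw_xy, n_mc=100_000):
    """Monte Carlo estimates of V(theta1) and U(theta1, theta2)."""
    y, x = draw_xy(n_mc)               # y: (n_mc,), x: (n_mc, d)
    s1, s2 = sigma_star(x, theta1), sigma_star(x, theta2)
    V1 = -np.mean(y * s1)              # V(theta) = -E{ y sigma_*(x; theta) }
    U12 = np.mean(s1 * s2)             # U = E{ sigma_*(x; t1) sigma_*(x; t2) }
    return V1, U12
```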
Exchangeability ⇒ $R_N(\theta)$ depends on $\theta_1, \dots, \theta_N$ only through $\hat{\rho}^{(N)} = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i}$:

$R: \mathcal{P}(\mathbb{R}^D) \to \mathbb{R}$

$R(\rho) \equiv R_\# + 2 \int V(\theta)\, \rho(d\theta) + \int U(\theta_1, \theta_2)\, \rho(d\theta_1)\, \rho(d\theta_2)$

Lemma. If $\int U(\theta, \theta)\, \rho_{\text{opt}}(d\theta) < \infty$, then

$\inf_\rho R(\rho) \le \inf_\theta R_N(\theta) \le \inf_\rho R(\rho) + \frac{C}{N}\,.$

[cf. convex neural networks, Bengio et al., 2006]
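The lower bound in the lemma is immediate: evaluating R at the empirical measure of the weights recovers the finite-N risk exactly. A one-line check:

```latex
R\big(\hat\rho^{(N)}\big)
  = R_\# + \frac{2}{N}\sum_{i=1}^{N} V(\theta_i)
    + \frac{1}{N^2}\sum_{i,j=1}^{N} U(\theta_i,\theta_j)
  = R_N(\theta),
\qquad
\hat\rho^{(N)} = \frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i}.
```

Hence $\inf_\rho R(\rho) \le \inf_\theta R_N(\theta)$; the $C/N$ upper bound comes from approximating a near-optimal $\rho$ by $N$ i.i.d. particles.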
Puzzle
$R(\rho) \equiv R_\# + 2 \int V(\theta)\, \rho(d\theta) + \int U(\theta_1, \theta_2)\, \rho(d\theta_1)\, \rho(d\theta_2)$

- Hard nonconvex problem $R_N(\theta)$ ⇒ convex $R(\rho)$
- Did we trivialize the problem?

Solution
- Not all 'small changes' in $\rho$ can be realized by the SGD dynamics
- Mass must be conserved locally
- Is there a scaling limit for the SGD dynamics?
‘Distributional dynamics’
$\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi(\theta; \rho_t) \big]\,,$

$\Psi(\theta; \rho) \equiv \frac{1}{2} \frac{\delta R(\rho)}{\delta \rho(\theta)} = V(\theta) + \int U(\theta, \theta')\, \rho(d\theta')\,.$

Claim: with $s_k = \varepsilon\, \xi(k\varepsilon)$, $k = t/\varepsilon$, $N \to \infty$, $\varepsilon \to 0$:

$\hat{\rho}^{(N)}_k \equiv \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i^k} \Rightarrow \rho_t$
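In the $\varepsilon \to 0$ limit the particles follow the deterministic flow $\dot\theta_i = -2\xi(t)\nabla\Psi(\theta_i; \hat\rho^{(N)})$, the natural N-particle discretization of this PDE. A sketch of one Euler step, where `grad_V` and `grad1_U` are assumed user-supplied gradients of V and of U in its first argument:

```python
import numpy as np

def dd_step(thetas, grad_V, grad1_U, xi_t, dt):
    """Euler step of d theta_i/dt = -2 xi(t) grad Psi(theta_i; rho_hat^(N))."""
    N = thetas.shape[0]
    g = grad_V(thetas)                 # (N, D): grad V at each particle
    for i in range(N):
        # (1/N) sum_j grad_1 U(theta_i, theta_j): interaction term of grad Psi
        g[i] += np.mean(grad1_U(thetas[i], thetas), axis=0)
    return thetas - 2.0 * xi_t * dt * g
```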
More precisely
Theorem (Mei, Montanari, Nguyen, 2018)
Assume $\nabla V, \nabla U$ bounded Lipschitz, and $\nabla_\theta \sigma_*(x; \theta)$ subgaussian. Let $(\theta_i^0)_{i \le N} \sim_{iid} \rho_0$. Then, for any bounded Lipschitz function f:

$\sup_{t \le T} \Big| \frac{1}{N} \sum_{i=1}^{N} f(\theta_i^{\lfloor t/\varepsilon \rfloor}, t) - \int f(\theta, t)\, \rho_t(d\theta) \Big| \le K e^{KT}\, \mathrm{err}_{N,D}(z)\,,$

$\mathrm{err}_{N,D}(z) \equiv \sqrt{\frac{1}{N} \vee \varepsilon}\, \Big[ \sqrt{D \vee \log\frac{N}{\varepsilon}} + z \Big]\,,$

with probability at least $1 - 4 e^{-z^2/2}$.
Proof idea
- Propagation-of-chaos argument.
- Intuition: with $\mathcal{F}_k = \sigma(\{(y_i, x_i)\}_{i < k})$:

$\mathbb{E}\{\theta_i^{k+1} \,|\, \mathcal{F}_k\} = \theta_i^k + 2\varepsilon\, \xi(k\varepsilon)\, \mathbb{E}\big\{ \big( y_k - \hat{y}(x_k; \theta^k) \big) \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k) \,\big|\, \mathcal{F}_k \big\}$

$= \theta_i^k - 2\varepsilon\, \xi(k\varepsilon)\, \nabla V(\theta_i^k) - 2\varepsilon\, \xi(k\varepsilon)\, \frac{1}{N} \sum_{j=1}^{N} \nabla_1 U(\theta_i^k, \theta_j^k)$

$= \theta_i^k - 2\varepsilon\, \xi(k\varepsilon)\, \nabla \Psi(\theta_i^k; \hat{\rho}^{(N)}_k)\,.$
Overparametrization
$\mathrm{err}_{N,D}(z) \equiv \sqrt{\frac{1}{N} \vee \varepsilon}\, \Big[ \sqrt{D \vee \log\frac{N}{\varepsilon}} + z \Big]\,.$

- Small if $N \gg D \log D$ and $\varepsilon \ll 1/D$.
- The number of parameters $ND$ can be much larger than the sample size $k = t/\varepsilon = O(D)$!!
- Overparametrization does not slow down convergence!
Related work
Last 2 weeks:
- Rotskoff, Vanden-Eijnden, arXiv:1805.00915
- Sirignano, Spiliopoulos, arXiv:1805.01053
Minima of R(ρ) vs. fixed points

$R(\rho) \equiv R_\# + 2 \int V(\theta)\, \rho(d\theta) + \int U(\theta_1, \theta_2)\, \rho(d\theta_1)\, \rho(d\theta_2)$

$\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi(\theta; \rho_t) \big]\,.$

- Minima:
$\mathrm{supp}(\rho_*) \subseteq \arg\min_{\theta \in \mathbb{R}^D} \Psi(\theta; \rho_*)\,.$
- Fixed points:
$\mathrm{supp}(\rho_*) \subseteq \big\{ \theta \in \mathbb{R}^D : \nabla \Psi(\theta; \rho_*) = 0 \big\}\,.$
What is this?
$\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi(\theta; \rho_t) \big]\,.$

- Gradient flow of R(ρ)...
- ...in the metric space $(\mathcal{P}(\mathbb{R}^D), W_2)$

[Jordan, Kinderlehrer, Otto, 1998; Ambrosio, Gigli, Savaré, 2006; Carrillo, McCann, Villani, 2013; ...]
What to do with this?
$\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi(\theta; \rho_t) \big]\,.$

- Concrete examples
- A general result for noisy SGD
Concrete examples
Simplest example requiring more than one neuron
With probability 1/2: $y = +1$, $x \sim \mathsf{N}(0, \Sigma_+)$. With probability 1/2: $y = -1$, $x \sim \mathsf{N}(0, \Sigma_-)$.

$\Sigma_\pm = \begin{pmatrix} \tau_\pm^2\, \mathbf{I}_{s_0} & 0 \\ 0 & \mathbf{I}_{d - s_0} \end{pmatrix}$

Invariant under $O(s_0) \times O(d - s_0)$ ⇒ reduced PDE
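A sketch of a sampler for this two-class model, with the $\tau^2$ values taken from the experiments on the following slides (the function name and seed handling are my own):

```python
import numpy as np

def sample_mixture(n, d, s0, tau_plus2=1.8, tau_minus2=0.2, seed=0):
    """Draw (y, x): y = +/-1 w.p. 1/2; first s0 coords of x get variance tau_y^2."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.uniform(size=n) < 0.5, 1.0, -1.0)
    x = rng.normal(size=(n, d))
    tau2 = np.where(y > 0, tau_plus2, tau_minus2)   # per-sample variance
    x[:, :s0] *= np.sqrt(tau2)[:, None]             # scale the informative block
    return y, x
```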
Activation functions
Simple: $\theta_i = w_i \in \mathbb{R}^d$ (no offset, no output weight),

$\sigma_*(x; \theta_i) = \sigma(\langle w_i, x \rangle)$

ReLU: $\theta_i = (w_i, a_i, b_i) \in \mathbb{R}^{d+2}$,

$\sigma_*(x; \theta_i) = a_i \max(\langle w_i, x \rangle + b_i,\, 0)$
Distributional dynamics
- $s_0 = d = 40$, $N = 800$, $\tau_+^2 = 1.8$, $\tau_-^2 = 0.2$
- Simple activation
- Histograms: empirical results. Continuous lines: PDE solution.
Evolution of the risk
[Figure: risk vs. iteration (10^0 to 10^7) for the PDE prediction and SGD at Δ ∈ {0.2, 0.4, 0.6}; right panel: trajectories of the means of a, b, and r1.]

- $N = 800$, $d = 320$, $s_0 = 60$, $\tau_\pm^2 = 1 \pm \Delta$
- ReLU activation
Classifying anisotropic Gaussians: Analysis
Theorem (Mei, Montanari, Nguyen, 2018)
Assume: (i) $\sigma: \mathbb{R} \to \mathbb{R}$ non-decreasing, bounded, Lipschitz; (ii) $s_0 = \gamma d$, $\gamma \in (0, 1)$ fixed; (iii) $\rho_0 \in \mathcal{P}(\mathbb{R}_{\ge 0})$ has a bounded density and $R(\rho_0) < 1$.

For $d, s_0 \ge d_0$ and $N \ge C_0\, d \log d$ (with $d_0, C_0$ depending on $\sigma, \rho_0, \Delta, \eta$), consider SGD initialized with $(w_i^0)_{i \le N} \sim_{iid} \rho_0$ and step size $\varepsilon \le 1/(C_0 d)$. Then, for any $k \in [T/\varepsilon, 10\,T/\varepsilon]$, w.h.p.

$R_N(\theta^k) \le \inf_{\theta \in \mathbb{R}^{d \times N}} R_N(\theta) + \eta\,.$

- Learning from $O(1/\varepsilon) = O(d)$ samples.
- Independent of the number of neurons once $N \gtrsim d \log d$.
Predicting failure
[Figure: risk and mean radius r vs. iteration (10^0 to 10^7) for the PDE prediction and SGD at κ = 0.1 and κ = 0.4.]

- $s_0 = d$, $\tau_+^2 = 1.5$, $\tau_-^2 = 0.5$
- $N = 800$, $d = 320$
- Non-monotone activation
- Two different initializations
Predicting failure
- SGD does not necessarily converge to the global minimum
- Can we fix it?
Noisy stochastic gradient descent
Regularized noisy SGD
$(g_i^k)_{i \le N,\, k \ge 0} \sim_{iid} \mathsf{N}(0, \mathbf{I}_D)$

$\theta_i^{k+1} = (1 - 2\lambda s_k)\, \theta_i^k + 2 s_k\, (y_k - \hat{y}_k)\, \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k) + \sqrt{s_k/\beta}\; g_i^k\,.$
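A sketch of one step of this update, following the formula reconstructed above (the noise scale $\sqrt{s_k/\beta}$ is transcribed from the slide; `grad_sigma_star` is an assumed per-neuron gradient of shape (N, D)):

```python
import numpy as np

def noisy_sgd_step(thetas, x, y, y_hat, grad_sigma_star, s, lam, beta, rng):
    """theta_i <- (1-2*lam*s) theta_i + 2s(y - y_hat) grad sigma_* + sqrt(s/beta) g_i."""
    g = grad_sigma_star(x, thetas)              # (N, D)
    noise = rng.normal(size=thetas.shape)
    return ((1.0 - 2.0 * lam * s) * thetas      # shrinkage from the ridge penalty
            + 2.0 * s * (y - y_hat) * g         # plain SGD term
            + np.sqrt(s / beta) * noise)        # Langevin noise, temperature 1/beta
```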
Distributional dynamics (wlog $\xi(t) = 1/2$)

$\partial_t \rho_t = \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi_\lambda(\theta; \rho_t) \big] + \beta^{-1} \Delta_\theta \rho_t\,,$

$\Psi_\lambda(\theta; \rho) \equiv V(\theta) + \lambda \|\theta\|_2^2 + \int U(\theta, \theta')\, \rho(d\theta')\,.$

Theorem. The same approximation theorem holds: noisy SGD ↔ PDE.

What does this minimize?
Wasserstein gradient flow
$F_{\beta,\lambda}(\rho) = \frac{1}{2} R(\rho) + \lambda \int \|\theta\|_2^2\, \rho(d\theta) - \beta^{-1}\, \mathrm{Ent}(\rho)\,,$

$\mathrm{Ent}(\rho) \equiv -\int \rho(\theta) \log \rho(\theta)\, d\theta\,.$
Wasserstein gradient flow
Proposition. Assume $\nabla V(\cdot), \nabla U(\cdot, \cdot) \in C^1_b$ and $\rho_0 \in C^1_b(\mathbb{R}^D)$. If $\rho_t$ is a solution of the distributional dynamics, then $F_{\beta,\lambda}(\rho_t)$ is non-increasing:

$\partial_t F_{\beta,\lambda}(\rho_t) = -\int \big\| \nabla \Psi_\lambda(\theta; \rho_t) + \beta^{-1} \nabla \log \rho_t(\theta) \big\|_2^2\, \rho_t(d\theta)\,.$

In particular, any fixed point $\rho_*$ with $F_{\beta,\lambda}(\rho_*) < \infty$ satisfies

$\rho_*(\theta) = \frac{1}{Z(\beta, \rho_*)} \exp\big\{ -\beta\, \Psi_\lambda(\theta; \rho_*) \big\}\,.$

[Jordan, Kinderlehrer, Otto, 1998; Carrillo, McCann, Villani, 2013; ...]
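The Boltzmann fixed point can be computed numerically by damped fixed-point iteration on a grid; a one-dimensional sketch with toy choices of V and U (both are my own illustrative stand-ins, with U chosen as a positive-semidefinite kernel):

```python
import numpy as np

theta = np.linspace(-4.0, 4.0, 401)
dth = theta[1] - theta[0]
V = -np.exp(-theta ** 2)                                 # toy attractive potential
U = np.exp(-(theta[:, None] - theta[None, :]) ** 2)      # toy PSD interaction kernel
beta, lam = 10.0, 0.1

rho = np.full_like(theta, 1.0 / (theta[-1] - theta[0]))  # uniform initialization
for _ in range(500):
    psi = V + lam * theta ** 2 + U @ rho * dth           # Psi_lambda(theta; rho)
    new = np.exp(-beta * (psi - psi.min()))              # shift exponent for stability
    new /= new.sum() * dth                               # normalize to a density
    rho = 0.5 * rho + 0.5 * new                          # damping helps convergence
```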
Key remark
- $\rho_*(\theta) = \frac{1}{Z(\beta)} \exp\big\{ -\beta\, \Psi_\lambda(\theta; \rho_*) \big\}$
is the Euler-Lagrange equation for
$F_{\beta,\lambda}(\rho) = \frac{1}{2} R(\rho) + \lambda \int \|\theta\|_2^2\, \rho(d\theta) - \beta^{-1}\, \mathrm{Ent}(\rho)\,;$
- $F_{\beta,\lambda}(\,\cdot\,)$ is strongly convex
- ⇒ unique fixed point!
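A sketch of why convexity yields uniqueness, along the segment $\rho_s = \rho_0 + s(\rho_1 - \rho_0)$: the V-term and the ridge term are linear in $\rho$, so only the U-term and the entropy contribute to the second derivative, and both contributions are nonnegative (the entropy term strictly so), using $U \succeq 0$ from the earlier slide:

```latex
\frac{d^2}{ds^2}\, F_{\beta,\lambda}(\rho_s)
= \iint U(\theta,\theta')\,(\rho_1-\rho_0)(d\theta)\,(\rho_1-\rho_0)(d\theta')
  + \beta^{-1}\!\int \frac{(\rho_1-\rho_0)^2(\theta)}{\rho_s(\theta)}\,d\theta
\;\ge\; 0 .
```

Hence the Boltzmann fixed point is the unique minimizer of $F_{\beta,\lambda}$.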
General convergence for noisy SGD
Theorem (Mei, Montanari, Nguyen, 2018)

Initialization $(\theta_i^0)_{i \le N} \sim_{iid} \rho_0$. Under the assumptions of the previous theorems, with $N \ge C_0\, D \log D$ and $\varepsilon \le 1/(C_0 D)$: there exists $\beta_0 = \beta_0(D, U, V, \eta)$ such that, for $\beta > \beta_0$, there exists $T = T(D, U, V, \beta, \eta)$ such that for any $k \in [T/\varepsilon, 10\,T/\varepsilon]$ we have, with probability at least $1 - \eta$,

$R_{\lambda,N}(\theta^k) \le \inf_{\theta \in \mathbb{R}^{D \times N}} R_{\lambda,N}(\theta) + \eta\,.$

- General!
- The convergence time depends on D, but not on N!
Conclusion
Conclusion
Correspondence
- Learning in two-layers neural networks
- Gradient flows in measure spaces
Thanks!