A Mean Field View of the Landscape of Two-Layers Neural Networks

Andrea Montanari [with Song Mei, Phan-Minh Nguyen]
Stanford University
May 14, 2018
Non-convex high-dimensional statistics
Data: $\{(y_1, x_1), (y_2, x_2), \dots, (y_n, x_n)\} \sim_{iid} \mathbb{P} \in \mathcal{P}(\mathbb{R} \times \mathbb{R}^d)$

Empirical risk minimization:

minimize $\hat{R}_n(\theta) \equiv \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; x_i, y_i)$
Example: ‘One-neuron neural network’
[Figure: the loss $\ell(\theta; y = 1, x)$ as a function of $\langle\theta, x\rangle \in [-4, 4]$.]
$z_i = (y_i, x_i)$, $y_i \in \{0, 1\}$, $x_i \in \mathbb{R}^d$, $\mathbb{P}(y_i = 1 \,|\, x_i) = \sigma(\langle\theta_0, x_i\rangle)$

$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \sigma(\langle\theta, x_i\rangle) \big)^2\,, \qquad \sigma(u) = \frac{1}{1 + e^{-u}}\,.$
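To make the setup concrete, here is a minimal NumPy sketch of this empirical risk, assuming synthetic data generated from a planted $\theta_0$ (the sample sizes and seed are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def empirical_risk(theta, X, y):
    """Squared-loss empirical risk for the one-neuron model."""
    preds = sigmoid(X @ theta)            # sigma(<theta, x_i>) for each row x_i
    return np.mean((y - preds) ** 2)

# Toy usage: n samples in d dimensions, labels drawn from a planted theta_0.
rng = np.random.default_rng(0)
n, d = 1000, 20
theta0 = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = (rng.uniform(size=n) < sigmoid(X @ theta0)).astype(float)
print(empirical_risk(theta0, X, y))
```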
Population vs. empirical risk
[Figure: surface plots of the population risk (left) and the empirical risk (right) as functions of $(w_1, w_2) \in [-3, 3]^2$.]
- Population risk ≈ empirical risk.
- Population risk has a unique global minimum or...
- ...one can construct a good initialization.
[Mei, Bai, Montanari 2017]
This talk
- More complicated models (two-layers NNs)
- Higher-level description
Outline
1 Two-layers neural networks
2 Concrete examples
3 Noisy stochastic gradient descent
4 Conclusion
[Mei, M., Nguyen, arXiv:1804.06561]
Two-layers neural networks
N hidden units (neurons)

$\theta = (\theta_1, \dots, \theta_N) \in \mathbb{R}^{D \times N}$

$\hat{y}(x; \theta) = \frac{1}{N} \sum_{i=1}^{N} \sigma_*(x; \theta_i)\,, \qquad \sigma_*: \mathbb{R}^d \times \mathbb{R}^D \to \mathbb{R}$

Classical example: $\theta_i = (w_i, a_i, b_i) \in \mathbb{R}^{d+2}$,

$\sigma_*(x; \theta_i) = a_i\, \sigma(\langle w_i, x \rangle + b_i)\,.$
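A minimal sketch of this predictor in NumPy, for the classical parametrization above (array shapes and names are my own, not from the talk):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def y_hat(x, W, a, b):
    """(1/N) * sum_i a_i * sigma(<w_i, x> + b_i); W is (N, d), a and b are (N,)."""
    return np.mean(a * sigmoid(W @ x + b))

rng = np.random.default_rng(0)
N, d = 50, 10
W, a, b = rng.normal(size=(N, d)), rng.normal(size=N), np.zeros(N)
print(y_hat(rng.normal(size=d), W, a, b))
```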
Empirical risk
$\hat{R}_{n,N}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\Big( y_i,\ \frac{1}{N} \sum_{j=1}^{N} \sigma_*(x_i; \theta_j) \Big)$
Empirical risk
- For simplicity, $\ell(y, \hat{y}) = (y - \hat{y})^2$:

$\hat{R}_{n,N}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \frac{1}{N} \sum_{j=1}^{N} \sigma_*(x_i; \theta_j) \Big)^2$

Landscape analysis?

Partial success... [Arora, Bhaskara, Ge, Ma, 2014; Janzamin, Sedghi, Anandkumar, 2015; Ge, Lee, Ma, 2017; Soltanolkotabi, Javanmard, Lee, 2017; Zhang, Lee, Jordan, 2017; Zhong, Song, Jain, Bartlett, Dhillon, 2017; ...]
Seeking a simpler structure
1. Stochastic gradient descent (SGD)
$\theta_i^{k+1} = \theta_i^k + 2 s_k\, \big( y_k - \hat{y}(x_k; \theta^k) \big)\, \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k)\,.$

2. One-pass: each example is visited once

$\{(y_k, x_k)\}_{k \ge 1} \sim_{iid} \mathbb{P}$

Not a bad assumption! More on this later!
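A sketch of one such SGD step for the classical sigmoid parametrization, assuming the sign convention reconstructed above (one fresh sample per step, per the one-pass assumption; the step size is illustrative):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sgd_step(W, a, b, x, y, s):
    """One SGD step on the squared loss; W is (N, d), a and b are (N,)."""
    h = sigmoid(W @ x + b)                     # sigma(<w_i, x> + b_i)
    y_hat = np.mean(a * h)
    r = 2.0 * s * (y - y_hat)                  # common factor in all updates
    dh = h * (1.0 - h)                         # sigma'(<w_i, x> + b_i)
    W += r * (a * dh)[:, None] * x[None, :]    # gradient of sigma_* w.r.t. w_i
    b += r * a * dh                            # ... w.r.t. b_i
    a += r * h                                 # ... w.r.t. a_i
    return W, a, b

# One pass over a stream of fresh samples:
#   for k, (y, x) in enumerate(stream): W, a, b = sgd_step(W, a, b, x, y, s=0.1)
```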
Questions
- Number of iterations $k$ needed to achieve risk $R_N(\theta^k) \le R_{\text{target}}$?
- Scaling with $N$, $D$?
- Can we achieve $R_{\text{target}} = \inf_\theta R_N(\theta) + \varepsilon$?
SGD minimizes population risk
$R_N(\theta) = \mathbb{E}\Big\{ \Big( y - \frac{1}{N} \sum_{j=1}^{N} \sigma_*(x; \theta_j) \Big)^2 \Big\} = R_\# + \frac{2}{N} \sum_{i=1}^{N} V(\theta_i) + \frac{1}{N^2} \sum_{i,j=1}^{N} U(\theta_i, \theta_j)\,,$

$V(\theta) \equiv -\mathbb{E}\big\{ y\, \sigma_*(x; \theta) \big\}\,, \qquad U(\theta_1, \theta_2) \equiv \mathbb{E}\big\{ \sigma_*(x; \theta_1)\, \sigma_*(x; \theta_2) \big\}\,.$

- A gas of N particles in D dimensions
- Exchangeable!
- $U(\,\cdot\,, \,\cdot\,) \succeq 0$
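Since V and U are plain expectations over the data distribution, they can be estimated by Monte Carlo; a sketch, where the sampler `draw_xy` and the choice of `sigma_star` are assumed stand-ins for $\mathbb{P}$ and the activation:

```python
import numpy as np

def sigma_star(x, theta):
    return np.tanh(x @ theta)          # illustrative choice of sigma_*

def estimate_V_U(theta1, theta2, draw_xy, n_mc=100_000):
    """Monte Carlo estimates of V(theta1) and U(theta1, theta2)."""
    y, x = draw_xy(n_mc)               # y: (n_mc,), x: (n_mc, d)
    s1, s2 = sigma_star(x, theta1), sigma_star(x, theta2)
    V1 = -np.mean(y * s1)              # V(theta) = -E{ y sigma_*(x; theta) }
    U12 = np.mean(s1 * s2)             # U = E{ sigma_*(x; t1) sigma_*(x; t2) }
    return V1, U12
```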
Exchangeability ⇒ $R_N(\theta)$ depends on $\theta_1, \dots, \theta_N$ only through $\hat{\rho}^{(N)} = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i}$:

$R: \mathcal{P}(\mathbb{R}^D) \to \mathbb{R}$

$R(\rho) \equiv R_\# + 2 \int V(\theta)\, \rho(d\theta) + \int U(\theta_1, \theta_2)\, \rho(d\theta_1)\, \rho(d\theta_2)$

Lemma. If $\int U(\theta, \theta)\, \rho_{\text{opt}}(d\theta) < \infty$, then

$\inf_\rho R(\rho) \le \inf_\theta R_N(\theta) \le \inf_\rho R(\rho) + \frac{C}{N}\,.$

[cf. convex neural networks, Bengio et al., 2006]
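The lower bound in the lemma is immediate: evaluating R at the empirical measure of the weights recovers the finite-N risk exactly. A one-line check:

```latex
R\big(\hat\rho^{(N)}\big)
  = R_\# + \frac{2}{N}\sum_{i=1}^{N} V(\theta_i)
    + \frac{1}{N^2}\sum_{i,j=1}^{N} U(\theta_i,\theta_j)
  = R_N(\theta),
\qquad
\hat\rho^{(N)} = \frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i}.
```

Hence $\inf_\rho R(\rho) \le \inf_\theta R_N(\theta)$; the $C/N$ upper bound comes from approximating a near-optimal $\rho$ by $N$ i.i.d. particles.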
Puzzle
$R(\rho) \equiv R_\# + 2 \int V(\theta)\, \rho(d\theta) + \int U(\theta_1, \theta_2)\, \rho(d\theta_1)\, \rho(d\theta_2)$

- Hard nonconvex problem $R_N(\theta)$ ⇒ convex $R(\rho)$
- Did we trivialize the problem?

Solution
- Not all 'small changes' in $\rho$ can be realized by the SGD dynamics
- Mass must be conserved locally
- Is there a scaling limit for the SGD dynamics?
‘Distributional dynamics’
$\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi(\theta; \rho_t) \big]\,,$

$\Psi(\theta; \rho) \equiv \frac{1}{2} \frac{\delta R(\rho)}{\delta \rho(\theta)} = V(\theta) + \int U(\theta, \theta')\, \rho(d\theta')\,.$

Claim: with $s_k = \varepsilon\, \xi(k\varepsilon)$, $k = t/\varepsilon$, $N \to \infty$, $\varepsilon \to 0$:

$\hat{\rho}^{(N)}_k \equiv \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i^k} \Rightarrow \rho_t$
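In the $\varepsilon \to 0$ limit the particles follow the deterministic flow $\dot\theta_i = -2\xi(t)\nabla\Psi(\theta_i; \hat\rho^{(N)})$, the natural N-particle discretization of this PDE. A sketch of one Euler step, where `grad_V` and `grad1_U` are assumed user-supplied gradients of V and of U in its first argument:

```python
import numpy as np

def dd_step(thetas, grad_V, grad1_U, xi_t, dt):
    """Euler step of d theta_i/dt = -2 xi(t) grad Psi(theta_i; rho_hat^(N))."""
    N = thetas.shape[0]
    g = grad_V(thetas)                 # (N, D): grad V at each particle
    for i in range(N):
        # (1/N) sum_j grad_1 U(theta_i, theta_j): interaction term of grad Psi
        g[i] += np.mean(grad1_U(thetas[i], thetas), axis=0)
    return thetas - 2.0 * xi_t * dt * g
```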
More precisely
Theorem (Mei, Montanari, Nguyen, 2018)
Assume $\nabla V, \nabla U$ bounded Lipschitz, and $\nabla_\theta \sigma_*(x; \theta)$ subgaussian. Let $(\theta_i^0)_{i \le N} \sim_{iid} \rho_0$. Then, for any bounded Lipschitz function f:

$\sup_{t \le T} \Big| \frac{1}{N} \sum_{i=1}^{N} f(\theta_i^{\lfloor t/\varepsilon \rfloor}, t) - \int f(\theta, t)\, \rho_t(d\theta) \Big| \le K e^{KT}\, \mathrm{err}_{N,D}(z)\,,$

$\mathrm{err}_{N,D}(z) \equiv \sqrt{\frac{1}{N} \vee \varepsilon}\, \Big[ \sqrt{D \vee \log\frac{N}{\varepsilon}} + z \Big]\,,$

with probability at least $1 - 4 e^{-z^2/2}$.
Proof idea
- Propagation-of-chaos argument.
- Intuition: with $\mathcal{F}_k = \sigma(\{(y_i, x_i)\}_{i < k})$:

$\mathbb{E}\{\theta_i^{k+1} \,|\, \mathcal{F}_k\} = \theta_i^k + 2\varepsilon\, \xi(k\varepsilon)\, \mathbb{E}\big\{ \big( y_k - \hat{y}(x_k; \theta^k) \big) \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k) \,\big|\, \mathcal{F}_k \big\}$

$= \theta_i^k - 2\varepsilon\, \xi(k\varepsilon)\, \nabla V(\theta_i^k) - 2\varepsilon\, \xi(k\varepsilon)\, \frac{1}{N} \sum_{j=1}^{N} \nabla_1 U(\theta_i^k, \theta_j^k)$

$= \theta_i^k - 2\varepsilon\, \xi(k\varepsilon)\, \nabla \Psi(\theta_i^k; \hat{\rho}^{(N)}_k)\,.$
Overparametrization
$\mathrm{err}_{N,D}(z) \equiv \sqrt{\frac{1}{N} \vee \varepsilon}\, \Big[ \sqrt{D \vee \log\frac{N}{\varepsilon}} + z \Big]\,.$

- Small if $N \gg D \log D$ and $\varepsilon \ll 1/D$.
- The number of parameters $ND$ can be much larger than the sample size $k = t/\varepsilon = O(D)$!!
- Overparametrization does not slow down convergence!
Related work
Last 2 weeks:
- Rotskoff, Vanden-Eijnden, arXiv:1805.00915
- Sirignano, Spiliopoulos, arXiv:1805.01053
Minima of R(ρ) vs. fixed points

$R(\rho) \equiv R_\# + 2 \int V(\theta)\, \rho(d\theta) + \int U(\theta_1, \theta_2)\, \rho(d\theta_1)\, \rho(d\theta_2)$

$\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi(\theta; \rho_t) \big]\,.$

- Minima:
$\mathrm{supp}(\rho_*) \subseteq \arg\min_{\theta \in \mathbb{R}^D} \Psi(\theta; \rho_*)\,.$
- Fixed points:
$\mathrm{supp}(\rho_*) \subseteq \big\{ \theta \in \mathbb{R}^D : \nabla \Psi(\theta; \rho_*) = 0 \big\}\,.$
What is this?
$\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi(\theta; \rho_t) \big]\,.$

- Gradient flow of R(ρ)...
- ...in the metric space $(\mathcal{P}(\mathbb{R}^D), W_2)$

[Jordan, Kinderlehrer, Otto, 1998; Ambrosio, Gigli, Savaré, 2006; Carrillo, McCann, Villani, 2013; ...]
What to do with this?
$\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi(\theta; \rho_t) \big]\,.$

- Concrete examples
- A general result for noisy SGD
Concrete examples
Simplest example requiring more than one neuron
With probability 1/2: $y = +1$, $x \sim \mathsf{N}(0, \Sigma_+)$. With probability 1/2: $y = -1$, $x \sim \mathsf{N}(0, \Sigma_-)$.

$\Sigma_\pm = \begin{pmatrix} \tau_\pm^2\, \mathbf{I}_{s_0} & 0 \\ 0 & \mathbf{I}_{d - s_0} \end{pmatrix}$

Invariant under $O(s_0) \times O(d - s_0)$ ⇒ reduced PDE
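A sketch of a sampler for this two-class model, with the $\tau^2$ values taken from the experiments on the following slides (the function name and seed handling are my own):

```python
import numpy as np

def sample_mixture(n, d, s0, tau_plus2=1.8, tau_minus2=0.2, seed=0):
    """Draw (y, x): y = +/-1 w.p. 1/2; first s0 coords of x get variance tau_y^2."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.uniform(size=n) < 0.5, 1.0, -1.0)
    x = rng.normal(size=(n, d))
    tau2 = np.where(y > 0, tau_plus2, tau_minus2)   # per-sample variance
    x[:, :s0] *= np.sqrt(tau2)[:, None]             # scale the informative block
    return y, x
```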
Activation functions
Simple: $\theta_i = w_i \in \mathbb{R}^d$ (no offset, no output weight),

$\sigma_*(x; \theta_i) = \sigma(\langle w_i, x \rangle)$

ReLU: $\theta_i = (w_i, a_i, b_i) \in \mathbb{R}^{d+2}$,

$\sigma_*(x; \theta_i) = a_i \max(\langle w_i, x \rangle + b_i,\, 0)$
Distributional dynamics
- $s_0 = d = 40$, $N = 800$, $\tau_+^2 = 1.8$, $\tau_-^2 = 0.2$
- Simple activation
- Histograms: empirical results. Continuous lines: PDE solution.
Evolution of the risk
[Figure: risk vs. iteration (10^0 to 10^7) for the PDE prediction and SGD at Δ ∈ {0.2, 0.4, 0.6}; right panel: trajectories of the means of a, b, and r1.]

- $N = 800$, $d = 320$, $s_0 = 60$, $\tau_\pm^2 = 1 \pm \Delta$
- ReLU activation
Classifying anisotropic Gaussians: Analysis
Theorem (Mei, Montanari, Nguyen, 2018)
Assume: (i) $\sigma: \mathbb{R} \to \mathbb{R}$ non-decreasing, bounded, Lipschitz; (ii) $s_0 = \gamma d$, $\gamma \in (0, 1)$ fixed; (iii) $\rho_0 \in \mathcal{P}(\mathbb{R}_{\ge 0})$ has a bounded density and $R(\rho_0) < 1$.

For $d, s_0 \ge d_0$ and $N \ge C_0\, d \log d$ (with $d_0, C_0$ depending on $\sigma, \rho_0, \Delta, \eta$), consider SGD initialized with $(w_i^0)_{i \le N} \sim_{iid} \rho_0$ and step size $\varepsilon \le 1/(C_0 d)$. Then, for any $k \in [T/\varepsilon, 10\,T/\varepsilon]$, w.h.p.

$R_N(\theta^k) \le \inf_{\theta \in \mathbb{R}^{d \times N}} R_N(\theta) + \eta\,.$

- Learning from $O(1/\varepsilon) = O(d)$ samples.
- Independent of the number of neurons once $N \gtrsim d \log d$.
Predicting failure
[Figure: risk and mean radius r vs. iteration (10^0 to 10^7) for the PDE prediction and SGD at κ = 0.1 and κ = 0.4.]

- $s_0 = d$, $\tau_+^2 = 1.5$, $\tau_-^2 = 0.5$
- $N = 800$, $d = 320$
- Non-monotone activation
- Two different initializations
Predicting failure
- SGD does not necessarily converge to the global minimum
- Can we fix it?
Noisy stochastic gradient descent
Regularized noisy SGD
$(g_i^k)_{i \le N,\, k \ge 0} \sim_{iid} \mathsf{N}(0, \mathbf{I}_D)$

$\theta_i^{k+1} = (1 - 2\lambda s_k)\, \theta_i^k + 2 s_k\, (y_k - \hat{y}_k)\, \nabla_{\theta_i} \sigma_*(x_k; \theta_i^k) + \sqrt{s_k/\beta}\; g_i^k\,.$
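A sketch of one step of this update, following the formula reconstructed above (the noise scale $\sqrt{s_k/\beta}$ is transcribed from the slide; `grad_sigma_star` is an assumed per-neuron gradient of shape (N, D)):

```python
import numpy as np

def noisy_sgd_step(thetas, x, y, y_hat, grad_sigma_star, s, lam, beta, rng):
    """theta_i <- (1-2*lam*s) theta_i + 2s(y - y_hat) grad sigma_* + sqrt(s/beta) g_i."""
    g = grad_sigma_star(x, thetas)              # (N, D)
    noise = rng.normal(size=thetas.shape)
    return ((1.0 - 2.0 * lam * s) * thetas      # shrinkage from the ridge penalty
            + 2.0 * s * (y - y_hat) * g         # plain SGD term
            + np.sqrt(s / beta) * noise)        # Langevin noise, temperature 1/beta
```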
Distributional dynamics (wlog $\xi(t) = 1/2$)

$\partial_t \rho_t = \nabla_\theta \cdot \big[ \rho_t\, \nabla_\theta \Psi_\lambda(\theta; \rho_t) \big] + \beta^{-1} \Delta_\theta \rho_t\,,$

$\Psi_\lambda(\theta; \rho) \equiv V(\theta) + \lambda \|\theta\|_2^2 + \int U(\theta, \theta')\, \rho(d\theta')\,.$

Theorem. The same approximation theorem holds: noisy SGD ↔ PDE.

What does this minimize?
Wasserstein gradient flow
$F_{\beta,\lambda}(\rho) = \frac{1}{2} R(\rho) + \lambda \int \|\theta\|_2^2\, \rho(d\theta) - \beta^{-1}\, \mathrm{Ent}(\rho)\,,$

$\mathrm{Ent}(\rho) \equiv -\int \rho(\theta) \log \rho(\theta)\, d\theta\,.$
Wasserstein gradient flow
Proposition. Assume $\nabla V(\cdot), \nabla U(\cdot, \cdot) \in C^1_b$ and $\rho_0 \in C^1_b(\mathbb{R}^D)$. If $\rho_t$ is a solution of the distributional dynamics, then $F_{\beta,\lambda}(\rho_t)$ is non-increasing:

$\partial_t F_{\beta,\lambda}(\rho_t) = -\int \big\| \nabla \Psi_\lambda(\theta; \rho_t) + \beta^{-1} \nabla \log \rho_t(\theta) \big\|_2^2\, \rho_t(d\theta)\,.$

In particular, any fixed point $\rho_*$ with $F_{\beta,\lambda}(\rho_*) < \infty$ satisfies

$\rho_*(\theta) = \frac{1}{Z(\beta, \rho_*)} \exp\big\{ -\beta\, \Psi_\lambda(\theta; \rho_*) \big\}\,.$

[Jordan, Kinderlehrer, Otto, 1998; Carrillo, McCann, Villani, 2013; ...]
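The Boltzmann fixed point can be computed numerically by damped fixed-point iteration on a grid; a one-dimensional sketch with toy choices of V and U (both are my own illustrative stand-ins, with U chosen as a positive-semidefinite kernel):

```python
import numpy as np

theta = np.linspace(-4.0, 4.0, 401)
dth = theta[1] - theta[0]
V = -np.exp(-theta ** 2)                                 # toy attractive potential
U = np.exp(-(theta[:, None] - theta[None, :]) ** 2)      # toy PSD interaction kernel
beta, lam = 10.0, 0.1

rho = np.full_like(theta, 1.0 / (theta[-1] - theta[0]))  # uniform initialization
for _ in range(500):
    psi = V + lam * theta ** 2 + U @ rho * dth           # Psi_lambda(theta; rho)
    new = np.exp(-beta * (psi - psi.min()))              # shift exponent for stability
    new /= new.sum() * dth                               # normalize to a density
    rho = 0.5 * rho + 0.5 * new                          # damping helps convergence
```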
Key remark
- $\rho_*(\theta) = \frac{1}{Z(\beta)} \exp\big\{ -\beta\, \Psi_\lambda(\theta; \rho_*) \big\}$
is the Euler-Lagrange equation for
$F_{\beta,\lambda}(\rho) = \frac{1}{2} R(\rho) + \lambda \int \|\theta\|_2^2\, \rho(d\theta) - \beta^{-1}\, \mathrm{Ent}(\rho)\,;$
- $F_{\beta,\lambda}(\,\cdot\,)$ is strongly convex
- ⇒ unique fixed point!
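A sketch of why convexity yields uniqueness, along the segment $\rho_s = \rho_0 + s(\rho_1 - \rho_0)$: the V-term and the ridge term are linear in $\rho$, so only the U-term and the entropy contribute to the second derivative, and both contributions are nonnegative (the entropy term strictly so), using $U \succeq 0$ from the earlier slide:

```latex
\frac{d^2}{ds^2}\, F_{\beta,\lambda}(\rho_s)
= \iint U(\theta,\theta')\,(\rho_1-\rho_0)(d\theta)\,(\rho_1-\rho_0)(d\theta')
  + \beta^{-1}\!\int \frac{(\rho_1-\rho_0)^2(\theta)}{\rho_s(\theta)}\,d\theta
\;\ge\; 0 .
```

Hence the Boltzmann fixed point is the unique minimizer of $F_{\beta,\lambda}$.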
General convergence for noisy SGD
Theorem (Mei, Montanari, Nguyen, 2018)

Initialization $(\theta_i^0)_{i \le N} \sim_{iid} \rho_0$. Under the assumptions of the previous theorems, with $N \ge C_0\, D \log D$ and $\varepsilon \le 1/(C_0 D)$: there exists $\beta_0 = \beta_0(D, U, V, \eta)$ such that, for $\beta > \beta_0$, there exists $T = T(D, U, V, \beta, \eta)$ such that for any $k \in [T/\varepsilon, 10\,T/\varepsilon]$ we have, with probability at least $1 - \eta$,

$R_{\lambda,N}(\theta^k) \le \inf_{\theta \in \mathbb{R}^{D \times N}} R_{\lambda,N}(\theta) + \eta\,.$

- General!
- The convergence time depends on D, but not on N!
Conclusion
Conclusion
Correspondence
- Learning in two-layers neural networks
- Gradient flows in measure spaces
Thanks!