
A Mean Field Analysis of Deep ResNet: Towards Provable Optimization Via Overparameterization From Depth

Joint work with Chao Ma, Yulong Lu, Jianfeng Lu and Lexing Ying

Presenter: Yiping Lu. Contact: yplu@stanford.edu, https://web.stanford.edu/~yplu/

Global Convergence Proof Of NN

Neural Tangent Kernel ([Jacot et al. 2019]): linearize the model
$$f_{NN}(\theta) = f_{NN}(\theta_{\mathrm{init}}) + \langle \nabla_\theta f_{NN}(\theta_{\mathrm{init}}),\, \theta - \theta_{\mathrm{init}} \rangle$$

Pro: provides a proof of convergence for any NN architecture ([Li et al. 2019]). Con: features are learned lazily, i.e. they are not data dependent ([Chizat and Bach 2019][Ghorbani et al. 2019]).

Mean Field Regime ([Bengio et al. 2006][Bach et al. 2014][Suzuki et al. 2015]): we consider properties of the loss landscape with respect to the distribution of weights,
$$L(\rho) = \big\| \mathbb{E}_{\theta\sim\rho}\, g(\theta, x) - f(x) \big\|_2^2,$$
and this objective is a convex function of $\rho$.

Pro: SGD = Wasserstein gradient flow ([Mei et al. 2018][Chizat et al. 2018][Rotskoff et al. 2018]). Con: hard to generalize beyond two layers.
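To make the mean-field picture concrete, here is a minimal NumPy sketch (the tanh activation, particle count, and toy data are illustrative assumptions, not taken from the slides) of a two-layer network written as an average over weight particles; the loss then depends on the weights only through their empirical distribution $\rho_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 5                          # particles (hidden units) and input dimension
theta = rng.normal(size=(n, d))        # particles theta_i, i.e. an empirical measure rho_n

def g(theta, x):
    # a single unit g(theta, x) = sigma(<theta, x>); tanh is an assumed choice
    return np.tanh(theta @ x)

def f_rho(theta, x):
    # mean-field network f(x) = E_{theta ~ rho_n} g(theta, x) = (1/n) sum_i g(theta_i, x)
    return g(theta, x).mean()

# L(rho) = || E_{theta~rho} g(theta, .) - f(.) ||_2^2, evaluated on a single toy sample
x, target = rng.normal(size=d), 0.3    # placeholder data point and label
loss = (f_rho(theta, x) - target) ** 2
print(loss)
```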


Neural Ordinary Differential Equation

Figure: ResNet can be seen as the Euler discretization of a time-evolving ODE.

Limit of depth → ∞: this analogy does not directly provide guarantees of global convergence, even in the continuum limit.
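A minimal sketch of this correspondence (the residual field and the depths below are illustrative choices, not the slides' model): the residual update $X_{\ell+1} = X_\ell + \frac{1}{L} f(X_\ell, \theta_\ell)$ is exactly a forward Euler step of size $1/L$ for $\dot X = f(X, \theta(t))$.

```python
import numpy as np

def f(x, theta):
    # an assumed residual field; any Lipschitz block works for the analogy
    return np.tanh(theta * x)

def resnet_forward(x0, thetas):
    # X_{l+1} = X_l + (1/L) f(X_l, theta_l): forward Euler with step 1/L for dX/dt = f(X, theta(t))
    L = len(thetas)
    x = x0
    for theta in thetas:
        x = x + f(x, theta) / L
    return x

# deeper networks correspond to finer time discretizations of the same ODE on [0, 1]
for L in (4, 64, 1024):
    print(L, resnet_forward(1.0, np.ones(L)))
```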


Mean Field ResNet

Our aim: provide a new continuous limit for ResNet with a good limiting landscape.
Idea: we consider properties of the loss landscape with respect to the distribution of weights.

Here the input data enters through the initial condition $X_\rho(x, 0) = \langle w_2, x\rangle$; $X$ is the feature and $t$ represents the depth.

Loss function:
$$\mathcal{E}(\rho) = \mathbb{E}_{x\sim\mu}\Big[\tfrac{1}{2}\big(\langle w_1, X_\rho(x,1)\rangle - y(x)\big)^2\Big].$$
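A minimal particle discretization of this model (the dimensions, the tanh residual field, and the lifting matrix are assumptions for illustration): the integral over $\theta$ in the continuum dynamics $\dot X_\rho = \int f(X_\rho, \theta)\,\rho(\theta,t)\,d\theta$ (stated explicitly on the next slide) is replaced by an average over particles at each time slice, the dynamics in $t$ by Euler steps of size $1/L$, and the loss $\mathcal{E}(\rho)$ is evaluated on a single sample.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, n, L = 3, 4, 32, 20                    # input dim, feature dim, particles per slice, depth
w1 = rng.normal(size=D)                      # readout vector
W2 = rng.normal(size=(D, d))                 # lifting of the input (an assumed parametrization)
rho = rng.normal(size=(L, n, D, D)) / D      # particles approximating rho(theta, t)

def f(X, theta):
    return np.tanh(theta @ X)                # residual field f(X, theta); tanh is assumed

def X_rho(x):
    # Euler steps for dX/dt = int f(X, theta) rho(theta, t) dtheta, with X(0) = W2 x
    X = W2 @ x
    for t in range(L):
        X = X + np.mean([f(X, th) for th in rho[t]], axis=0) / L
    return X

def E(x, y):
    # E(rho) = E_x 1/2 (<w1, X_rho(x, 1)> - y(x))^2, here for one sample
    return 0.5 * (w1 @ X_rho(x) - y) ** 2

print(E(rng.normal(size=d), 1.0))
```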


Adjoint Equation

To optimize the mean-field model, we calculate the gradient $\frac{\delta E}{\delta\rho}$ via the adjoint sensitivity method.

Model. The loss function can be written as
$$\mathbb{E}_{x\sim\mu}\, E(x;\rho) := \mathbb{E}_{x\sim\mu}\,\tfrac{1}{2}\big|\langle w_1, X_\rho(x,1)\rangle - y(x)\big|^2 \tag{1}$$
where $X_\rho$ satisfies the equation $\dot X_\rho(x,t) = \int_\theta f(X_\rho(x,t),\theta)\,\rho(\theta,t)\,d\theta$.

Adjoint equation. The gradient can be represented via a second, backwards-in-time augmented ODE:
$$\dot p_\rho(x,t) = -\delta_X H_\rho(p_\rho, x, t) = -p_\rho(x,t)\int \nabla_X f(X_\rho(x,t),\theta)\,\rho(\theta,t)\,d\theta,$$
where the Hamiltonian is defined as $H_\rho(p, x, t) = p(x,t)\cdot\int f(x,\theta)\,\rho(\theta,t)\,d\theta$.
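The sketch below (same assumed toy model as in the previous sketch: tanh residual field, matrix particles, one sample) integrates the forward dynamics with Euler steps, then steps the adjoint $p_\rho$ backwards in time and evaluates $\frac{\delta E}{\delta\rho}(\theta_i, t)$ at the particle locations; the terminal condition for $p_\rho$ is my reading of the gradient of the quadratic loss with respect to $X_\rho(x,1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, n, L = 3, 4, 32, 20
w1 = rng.normal(size=D); W2 = rng.normal(size=(D, d))
rho = rng.normal(size=(L, n, D, D)) / D                       # particles approximating rho(theta, t)

f  = lambda X, th: np.tanh(th @ X)                            # residual field f(X, theta) (assumed)
Jf = lambda X, th: (1 - np.tanh(th @ X) ** 2)[:, None] * th   # grad_X f(X, theta)

def forward(x):
    # store the whole trajectory X(0), X(1/L), ..., X(1)
    Xs = [W2 @ x]
    for t in range(L):
        Xs.append(Xs[-1] + np.mean([f(Xs[-1], th) for th in rho[t]], axis=0) / L)
    return Xs

def adjoint_gradient(x, y):
    Xs = forward(x)
    p = (w1 @ Xs[-1] - y) * w1                    # terminal condition: d/dX of 1/2 (<w1, X(1)> - y)^2
    grads = np.zeros((L, n))
    for t in reversed(range(L)):
        # explicit Euler step backwards in time for dp/dt = -(int grad_X f(X, theta) rho dtheta)^T p
        J = np.mean([Jf(Xs[t], th) for th in rho[t]], axis=0)
        p = p + (J.T @ p) / L
        # dE/drho(theta_i, t) ~ f(X(t), theta_i) . p(t), evaluated at the particles
        grads[t] = [f(Xs[t], th) @ p for th in rho[t]]
    return grads

print(adjoint_gradient(rng.normal(size=d), 1.0).shape)        # (L, n)
```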


Adjoint Equation

Theorem. For $\rho\in\mathcal{P}_2$ let $\frac{\delta E}{\delta\rho}(\theta,t) = \mathbb{E}_{x\sim\mu}\, f(X_\rho(x,t),\theta)\cdot p_\rho(x,t)$, where $p_\rho$ is the solution to the backward equation $\dot p_\rho(x,t) = -p_\rho(x,t)\int \nabla_X f(X_\rho(x,t),\theta)\,\rho(\theta,t)\,d\theta$. Then for every $\nu\in\mathcal{P}_2$ we have
$$E(\rho + \lambda(\nu-\rho)) = E(\rho) + \lambda\Big\langle \frac{\delta E}{\delta\rho},\, \nu-\rho\Big\rangle + o(\lambda)$$
for the convex combination $(1-\lambda)\rho + \lambda\nu\in\mathcal{P}_2$ with $\lambda\in[0,1]$.


Deep Residual Network Behaves Like an Ensemble of Shallow Models

$$X_1 = X_0 + \frac{1}{L}\int_{\theta_0}\sigma(\theta_0 X_0)\,\rho_0(\theta_0)\,d\theta_0.$$

$$
\begin{aligned}
X_2 &= X_0 + \frac{1}{L}\int_{\theta_0}\sigma(\theta_0 X_0)\,\rho_0(\theta_0)\,d\theta_0
      + \frac{1}{L}\int_{\theta_1}\sigma\Big(\theta_1\Big(X_0 + \frac{1}{L}\int_{\theta_0}\sigma(\theta_0 X_0)\,\rho_0(\theta_0)\,d\theta_0\Big)\Big)\rho_1(\theta_1)\,d\theta_1\\
    &= X_0 + \frac{1}{L}\int_{\theta_0}\sigma(\theta_0 X_0)\,\rho_0(\theta_0)\,d\theta_0
      + \frac{1}{L}\int_{\theta_1}\sigma(\theta_1 X_0)\,\rho_1(\theta_1)\,d\theta_1\\
    &\quad + \frac{1}{L^2}\int_{\theta_1}\nabla\sigma(\theta_1 X_0)\,\theta_1\Big(\int_{\theta_0}\sigma(\theta_0 X_0)\,\rho_0(\theta_0)\,d\theta_0\Big)\rho_1(\theta_1)\,d\theta_1
      + \text{h.o.t.}
\end{aligned}
$$

Iterating this expansion gives rise to

$$X_L \approx X_0 + \frac{1}{L}\sum_{a=0}^{L-1}\int \sigma(\theta X_0)\,\rho_a(\theta)\,d\theta
+ \frac{1}{L^2}\sum_{b>a}\int\!\!\int \nabla\sigma(\theta_b X_0)\,\theta_b\,\sigma(\theta_a X_0)\,\rho_b(\theta_b)\,\rho_a(\theta_a)\,d\theta_b\,d\theta_a + \text{h.o.t.}$$

Veit A., Wilber M. J., Belongie S. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems, 2016: 550-558.
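As a sanity check on the two-block expansion above, here is a small numerical sketch (scalar features, tanh activation, Gaussian particles; all of these choices are assumptions): for the exact two-step update, the error of the first-order "ensemble" approximation decays like $1/L^2$, and adding the $1/L^2$ cross term improves it to roughly $1/L^3$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, X0 = 2000, 0.7
th0 = rng.normal(1.0, 0.5, size=n)       # particles for rho_0 (non-symmetric so the cross term is nonzero)
th1 = rng.normal(1.0, 0.5, size=n)       # particles for rho_1
sigma, dsigma = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

for L in (10, 100, 1000):
    X1 = X0 + sigma(th0 * X0).mean() / L                         # exact block 1
    X2 = X1 + sigma(th1 * X1).mean() / L                         # exact block 2
    first = X0 + (sigma(th0 * X0).mean() + sigma(th1 * X0).mean()) / L
    cross = (dsigma(th1 * X0) * th1).mean() * sigma(th0 * X0).mean() / L**2
    print(L, abs(X2 - first), abs(X2 - first - cross))
# the first-order error shrinks like 1/L^2, the corrected one like 1/L^3
```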


Deep Residual Network Behaves Like an Ensemble of Shallow Models

Difference between the backpropagation processes of a two-layer network and a ResNet.

Two-layer network:
$$\frac{\delta E}{\delta\rho}(\theta, t) = \mathbb{E}_{x\sim\mu}\, f(x,\theta)\,(X_\rho - y(x))$$

ResNet:
$$\frac{\delta E}{\delta\rho}(\theta, t) = \mathbb{E}_{x\sim\mu}\, f(X_\rho(x,t),\theta)\cdot p_\rho(x,t)$$

We aim to show that the two gradients are similar.

Lemma

The norm of the solution to the adjoint equation can be bounded from below by the loss:
$$\|p_\rho(\cdot, t)\|_\mu \ge e^{-(C_1 + C_2 r)}\, E(\rho), \quad \forall\, t\in[0,1].$$


Local = Global

Theorem.

If $E(\rho) > 0$ for a distribution $\rho\in\mathcal{P}_2$ that is supported on one of the nested sets $Q_r$, then we can always construct a descent direction $\nu\in\mathcal{P}_2$, i.e.

$$\inf_{\nu\in\mathcal{P}_2}\Big\langle \frac{\delta E}{\delta\rho},\, \nu - \rho\Big\rangle < 0.$$

Corollary

Any stationary solution of the Wasserstein gradient flow that has full support (informal statement) is a global minimizer.


Numerical Scheme

We may consider a parametrization of $\rho$ with $n$ particles:

$$\rho_n(\theta, t) = \sum_{i=1}^{n} \delta_{\theta_i}(\theta)\,\mathbf{1}_{[\tau_i, \tau_i']}(t).$$

The characteristic function $\mathbf{1}_{[\tau_i, \tau_i']}$ can be viewed as a relaxation of the Dirac delta mass $\delta_{\tau_i}(t)$.

Given: a collection of residual blocks $(\theta_i, \tau_i)_{i=1}^n$.
while training do
    Sort $(\theta_i, \tau_i)$ based on $\tau_i$ so that $\tau_0 \le \cdots \le \tau_n$.
    Define the ResNet as $X^{\ell+1} = X^{\ell} + (\tau_\ell - \tau_{\ell-1})\,\sigma(\theta_\ell X^{\ell})$ for $0 \le \ell < n$.
    Use gradient descent to update both $\theta_i$ and $\tau_i$.
end while
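A PyTorch sketch of this training loop (the block count, feature width, activation, the $\tau_{-1} = 0$ convention, and the toy regression objective are my assumptions; the slide leaves them unspecified): blocks are re-sorted by $\tau$ at every step, and both $\theta_i$ and $\tau_i$ receive gradients.

```python
import torch

n, D = 16, 8                                       # residual blocks and feature dimension
theta = torch.nn.Parameter(torch.randn(n, D, D) / D)
tau = torch.nn.Parameter(torch.rand(n))            # block positions in (0, 1)
opt = torch.optim.SGD([theta, tau], lr=1e-2)

def forward(X):
    # sort blocks by tau_i, then X^{l+1} = X^l + (tau_l - tau_{l-1}) * sigma(theta_l X^l)
    order = torch.argsort(tau)
    prev = torch.zeros(())                         # tau_{-1} = 0 (an assumed convention)
    for i in order:
        X = X + (tau[i] - prev) * torch.tanh(X @ theta[i].T)
        prev = tau[i]
    return X

X, y = torch.randn(32, D), torch.randn(32, D)      # toy regression data
for step in range(200):
    opt.zero_grad()
    loss = ((forward(X) - y) ** 2).mean()
    loss.backward()                                # gradients w.r.t. both theta_i and tau_i
    opt.step()
# note: this sketch does not project tau back into [0, 1] after each update
print(loss.item())
```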


Numerical Results

Model                Vanilla   Mean-field   Dataset
ResNet20             8.75      8.19         CIFAR10
ResNet32             7.51      7.15         CIFAR10
ResNet44             7.17      6.91         CIFAR10
ResNet56             6.97      6.72         CIFAR10
ResNet110            6.37      6.10         CIFAR10
ResNet164            5.46      5.19         CIFAR10
ResNeXt29 (8x64d)    17.92     17.53        CIFAR100
ResNeXt29 (16x64d)   17.65     16.81        CIFAR100

Table: Comparison of stochastic gradient descent (Vanilla) and mean-field training (Algorithm 1) of ResNets on the CIFAR datasets; numbers are test errors in %. The results indicate that our method consistently outperforms vanilla SGD.


Take Home Message

We propose a new continuous limit for deep ResNet,
$$\dot X_\rho(x,t) = \int_\theta f(X_\rho(x,t),\theta)\,\rho(\theta,t)\,d\theta,$$
with initial condition $X_\rho(x,0) = \langle w_2, x\rangle$.

A local minimizer is global in $\ell_2$ space.

A candidate numerical scheme to approximate the limit (Algorithm 1).

To-do list:
- Analysis of the Wasserstein gradient flow (global existence).
- Refined analysis of the numerical scheme.
- Higher-order terms (h.o.t.) in the expansion from ResNet to an ensemble of small networks.


Thanks

Reference: Yiping Lu*, Chao Ma, Yulong Lu, Jianfeng Lu, Lexing Ying. "A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth." arXiv:2003.05508. ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations (Contributed Talk).

Contact: yplu@stanford.edu