Some statistical theory for deep neural networks
Johannes Schmidt-Hieber
Deep neural networks
A network architecture (L, p) consists of
- a positive integer L, the number of hidden layers (the depth),
- a width vector p = (p_0, ..., p_{L+1}) ∈ N^{L+2}.

A neural network with architecture (L, p) is a function
    f : R^{p_0} → R^{p_{L+1}},   x ↦ f(x) = W_{L+1} σ_{v_L} W_L σ_{v_{L-1}} ··· W_2 σ_{v_1} W_1 x.

Network parameters:
- W_i is a p_i × p_{i-1} weight matrix
- v_i ∈ R^{p_i} is a shift vector

Activation function:
- We study the ReLU activation function σ(x) = max(x, 0).
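The composition above can be written out directly as code; a minimal numpy sketch (function names are mine), where the shifted activation acts componentwise as σ_v(y) = σ(y − v):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def network(x, weights, shifts):
    """Evaluate f(x) = W_{L+1} sigma_{v_L} W_L ... sigma_{v_1} W_1 x.

    weights = [W_1, ..., W_{L+1}], shifts = [v_1, ..., v_L];
    sigma_v(y) = sigma(y - v) is applied componentwise.
    """
    h = weights[0] @ x                      # W_1 x
    for W, v in zip(weights[1:], shifts):   # remaining layers
        h = W @ relu(h - v)                 # W_{i+1} sigma_{v_i}(h)
    return h
```

With L = 1 hidden layer and width vector p = (2, 2, 1) this already reproduces a small ReLU network.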
Depth
Source: Kaiming He, Deep Residual Networks
- networks are deep
- a version of ResNet has 152 hidden layers
- networks keep becoming deeper
High-dimensionality
Source: arxiv.org/pdf/1605.07678.pdf
- the number of network parameters is larger than the sample size
- AlexNet uses 60 million parameters for 1.2 million training samples
Network sparsity
[Figure: the ramp function x ↦ (ax + b)+, with a kink at x = −b/a]

- there is some sort of sparsity on the parameters
- units get deactivated at initialization or during learning
- we assume instead sparsely connected networks
The large parameter trick
- with arbitrarily large network parameters we can approximate the indicator function via
    x ↦ σ(ax) − σ(ax − 1)
- it is common in approximation theory to use networks with network parameters tending to infinity
- in practice, network parameters are typically small

⇒ we restrict all network parameters to be at most one in absolute value
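The approximation above is easy to verify numerically; a small sketch (soft_indicator is my name for the construction):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def soft_indicator(x, a):
    """x -> sigma(a*x) - sigma(a*x - 1): equals 0 for x <= 0, equals 1
    for x >= 1/a, and interpolates linearly in between. As a -> infinity
    this approaches the indicator 1{x > 0}."""
    return relu(a * x) - relu(a * x - 1.0)
```

The slope a is exactly the large network parameter the slide refers to; bounding all parameters by one in absolute value rules this construction out.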
Statistical analysis
- we observe n i.i.d. copies (X_1, Y_1), ..., (X_n, Y_n) with
    Y_i = f(X_i) + ε_i,   ε_i ∼ N(0, 1)
- X_i ∈ R^d, Y_i ∈ R
- the goal is to reconstruct the function f : R^d → R
- this problem has been studied extensively (kernel smoothing, wavelets, splines, ...)
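The observation model can be simulated in a few lines; a sketch, where the uniform design distribution is my choice (the slide does not specify one):

```python
import numpy as np

def simulate(f, n, d, rng):
    """Draw n i.i.d. pairs (X_i, Y_i) with Y_i = f(X_i) + eps_i,
    eps_i ~ N(0, 1); X_i uniform on [0, 1]^d (an assumption)."""
    X = rng.uniform(size=(n, d))
    Y = np.apply_along_axis(f, 1, X) + rng.standard_normal(n)
    return X, Y
```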
The estimator
- choose a network architecture (L, p) and a sparsity level s
- denote by F(L, p, s) the class of all networks with
  - architecture (L, p)
  - at most s active (i.e. non-zero) parameters
- our theory applies to any estimator f̂n taking values in F(L, p, s)
- prediction error
    R(f̂n, f) := E_f[(f̂n(X) − f(X))²],
  with X distributed as X_1 and independent of the sample
- we study how R(f̂n, f) depends on n
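The sparsity s counts active parameters; a helper sketch (name mine) matching the weights/shifts representation of a network:

```python
import numpy as np

def sparsity(weights, shifts):
    """Number of active (non-zero) entries among all W_i and v_i."""
    s = sum(int(np.count_nonzero(W)) for W in weights)
    s += sum(int(np.count_nonzero(v)) for v in shifts)
    return s
```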
Function class
- classical idea: assume that the regression function is β-smooth
- the optimal nonparametric estimation rate is then n^{−2β/(2β+d)}
- this rate suffers from the curse of dimensionality
- to understand deep learning this setting is therefore useless
- instead, make a good structural assumption on f
Hierarchical structure
- Important: only a few objects are combined on deeper abstraction levels
  - few letters in one word
  - few words in one sentence
Function class
- We assume that
    f = g_q ∘ ... ∘ g_0
  with
  - g_i : R^{d_i} → R^{d_{i+1}}
  - each of the d_{i+1} components of g_i is β_i-smooth and depends only on t_i variables
  - t_i can be much smaller than d_i
  - effective smoothness
      β*_i := β_i ∏_{ℓ=i+1}^{q} (β_ℓ ∧ 1)
- we show that the rate depends on the pairs (t_i, β*_i), i = 0, ..., q
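The effective smoothness is a simple product; a sketch computing β*_i from (β_0, ..., β_q):

```python
def effective_smoothness(betas):
    """beta*_i = beta_i * prod_{l=i+1}^{q} min(beta_l, 1)."""
    q = len(betas) - 1
    result = []
    for i in range(q + 1):
        prod = 1.0
        for l in range(i + 1, q + 1):
            prod *= min(betas[l], 1.0)
        result.append(betas[i] * prod)
    return result
```

For (β_0, β_1, β_2) = (2, 0.5, 3) this gives β* = (1, 0.5, 3): a rough intermediate layer drags down the effective smoothness of the earlier functions but not of the later ones.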
Example: Additive models
- In an additive model,
    f(x) = ∑_{i=1}^{d} f_i(x_i)
- This can be written as f = g_1 ∘ g_0 with
    g_0(x) = (f_i(x_i))_{i=1,...,d},   g_1(y) = ∑_{i=1}^{d} y_i.
  Hence t_0 = 1 and d_1 = t_1 = d.
- The decomposition splits an additive function into
  - one function that can be non-smooth but has only one-dimensional components
  - one function that has a high-dimensional input but is smooth
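The two-layer factorisation is easy to write down concretely; a sketch with hypothetical component functions f_i:

```python
import numpy as np

# Hypothetical one-dimensional components f_1, f_2, f_3 of an additive model
fs = [np.sin, np.cos, np.square]

def g0(x):
    """Applies each f_i to its own coordinate: every component
    depends on one variable only (t_0 = 1)."""
    return np.array([fi(xi) for fi, xi in zip(fs, x)])

def g1(y):
    """Smooth (here: linear) aggregation of d components (d_1 = t_1 = d)."""
    return float(y.sum())

def f(x):
    return g1(g0(x))
```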
Main result

Theorem: If
(i) depth ≍ log n,
(ii) width ≥ network sparsity ≍ max_{i=0,...,q} n^{t_i/(2β*_i + t_i)} log n,
then, for any network reconstruction method f̂n,
    prediction error ≲ φn + Δn
(up to log n-factors), with
    Δn := E[ (1/n) ∑_{i=1}^{n} (Y_i − f̂n(X_i))² − inf_{f ∈ F(L,p,s)} (1/n) ∑_{i=1}^{n} (Y_i − f(X_i))² ]
and
    φn := max_{i=0,...,q} n^{−2β*_i/(2β*_i + t_i)}.
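The rate φn is directly computable from the pairs (t_i, β*_i); a sketch:

```python
def phi_n(n, pairs):
    """phi_n = max over pairs (t, beta_star) of
    n^(-2*beta_star / (2*beta_star + t))."""
    return max(n ** (-2.0 * b / (2.0 * b + t)) for t, b in pairs)
```

The maximum picks out the slowest of the componentwise rates, so the hardest (t_i, β*_i)-pair determines the overall rate.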
Consequences
- the assumption that depth ≍ log n appears naturally
- in particular, the depth scales with the sample size
- the networks can have many more parameters than the sample size

⇒ what matters for statistical performance is not the size of the network but the amount of regularization
Consequences (ctd.)
A paradox:
- we obtain good rates for all smoothness indices
- existing piecewise linear methods only give good rates up to smoothness two
- here the non-linearity of the function class helps

⇒ non-linearity is essential!
Suboptimality of wavelet estimators
- consider f(x) = h(x_1 + ... + x_d) for some α-smooth function h
- rate for DNNs: ≲ n^{−α/(2α+1)} (up to logarithmic factors)
- rate for the best wavelet thresholding estimator: ≳ n^{−α/(2α+d)}
- reason: the low-dimensional structure does not affect the decay of the wavelet coefficients
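The gap between the two rates is striking already for moderate d; a quick numerical sketch of the two exponents from the slide:

```python
def dnn_rate(n, alpha):
    """Upper rate n^(-alpha/(2*alpha + 1)) for DNNs (log factors dropped)."""
    return n ** (-alpha / (2 * alpha + 1))

def wavelet_rate(n, alpha, d):
    """Lower bound n^(-alpha/(2*alpha + d)) for wavelet thresholding."""
    return n ** (-alpha / (2 * alpha + d))
```

For α = 1, d = 10 and n = 10^6 the DNN bound is 10^-2 while the wavelet lower bound is about 0.32: the dimension-free exponent is what matters.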
MARS
- consider products of ramp functions
    h_{I,t}(x_1, ..., x_d) = ∏_{j∈I} (±(x_j − t_j))+
- each such function is piecewise linear in each component
- MARS (multivariate adaptive regression splines) fits linear combinations of such functions to data
- the fit is computed by a greedy algorithm
- MARS has depth- and width-type parameters
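A MARS basis function is just a product of ramps; a sketch (the parametrisation by index set, signs and knots is mine):

```python
import numpy as np

def ramp(x):
    return np.maximum(x, 0.0)

def mars_basis(x, idx, signs, t):
    """h_{I,t}(x) = product over j in I of (sign_j * (x_j - t_j))_+ ;
    idx lists the active coordinates I, signs the +/- choices,
    t the knots t_j."""
    out = 1.0
    for j, s, tj in zip(idx, signs, t):
        out *= ramp(s * (x[j] - tj))
    return out
```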
Comparison with MARS
- how does MARS compare to ReLU networks?
- functions that can be represented by s parameters with respect to the MARS function system can be represented by s·log(1/ε)-sparse DNNs up to sup-norm error ε
Comparison with MARS (ctd.)
Figure: Reconstruction using MARS (left) and networks (right)
- the opposite is not true; one counterexample is
    f(x_1, x_2) = (x_1 + x_2 − 1)+
- we need ≳ ε^{−1/2} many parameters to get ε-close with MARS functions
- conclusion: DNNs work better for correlated design
References:
- Schmidt-Hieber (2017). Nonparametric regression using deep neural networks with ReLU activation function. arXiv:1708.06633
- Eckle, Schmidt-Hieber (2018). A comparison of deep networks with ReLU activation function and linear spline-type methods. arXiv:1804.02253