
TTIC 31230, Fundamentals of Deep Learning

David McAllester, April 2017

Architectures and Universality

Review:

• η_t > 0, η_t → 0, and Σ_t η_t = ∞ implies convergence.

• Momentum: g_{t+1} = µ g_t + (1 − µ) ∇_Θ ℓ_t ;  Θ -= η g

• RMSprop: Θ_i -= (η / (RMS_i + ε)) (∇_Θ ℓ_t)_i

• Adam: RMSprop + momentum: Θ_i -= (η / (RMS_i + ε)) g_i, with RMS_i smoothed
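As a concrete reference, here is a minimal numpy sketch of these three updates as written above. The step size η and the decay constants µ and ρ are illustrative defaults, not tuned values, and Adam’s bias correction is omitted, as on the slide.

    import numpy as np

    def momentum_step(theta, g, grad, eta=0.01, mu=0.9):
        g = mu * g + (1 - mu) * grad               # g_{t+1} = mu g_t + (1 - mu) grad
        theta -= eta * g                           # Theta -= eta g
        return theta, g

    def rmsprop_step(theta, ms, grad, eta=0.01, rho=0.9, eps=1e-8):
        ms = rho * ms + (1 - rho) * grad ** 2      # running mean of squared gradients
        theta -= eta * grad / (np.sqrt(ms) + eps)  # Theta_i -= eta / (RMS_i + eps) * grad_i
        return theta, ms

    def adam_step(theta, g, ms, grad, eta=0.01, mu=0.9, rho=0.999, eps=1e-8):
        g = mu * g + (1 - mu) * grad               # smoothed gradient (momentum)
        ms = rho * ms + (1 - rho) * grad ** 2      # smoothed squared gradient
        theta -= eta * g / (np.sqrt(ms) + eps)     # Theta_i -= eta / (RMS_i + eps) * g_i
        return theta, g, ms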

Architecture:

A Pattern of Feed-Forward Computation

Example: Convolution

For filter f and signal x

y = Convolve(f, x)

y[i] = Σ_{j=0}^{|f|−1} x[i − j] f[j]
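A minimal numpy sketch of this sum. The name conv1d and the choice to treat out-of-range x[i − j] as zero are illustrative assumptions, not part of the slide:

    import numpy as np

    def conv1d(f, x):
        # y[i] = sum_{j=0}^{|f|-1} x[i-j] f[j]; x[i-j] is taken as 0 outside the signal
        y = np.zeros(len(x))
        for i in range(len(x)):
            for j in range(len(f)):
                if 0 <= i - j < len(x):
                    y[i] += x[i - j] * f[j]
        return y

    print(conv1d(np.array([1.0, -1.0]), np.array([0.0, 1.0, 3.0, 2.0])))  # [ 0.  1.  2. -1.]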

Another Example: Recurrent Neural Networks (RNNs)

h[t+1] = σ(W_R h[t] + W_I x[t])
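A one-step numpy sketch of this recurrence. The sigmoid choice for σ and the random W_R and W_I are illustrative assumptions:

    import numpy as np

    def rnn_step(h, x, W_R, W_I):
        # h[t+1] = sigma(W_R h[t] + W_I x[t]) with a sigmoid nonlinearity
        return 1.0 / (1.0 + np.exp(-(W_R @ h + W_I @ x)))

    rng = np.random.default_rng(0)
    d, n = 3, 2                                    # hidden size, input size
    W_R, W_I = rng.normal(size=(d, d)), rng.normal(size=(d, n))
    h = np.zeros(d)
    for x in rng.normal(size=(5, n)):              # unroll the recurrence over 5 inputs
        h = rnn_step(h, x, W_R, W_I)
    print(h)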

Issues in Architecture

• Expressive Power

• Ease of Learning

• Embodiment of Domain Knowledge

Well Established Architectural Motifs

• Perceptron (a linear sum into an activation function)

• CNNs

• RNNs

• Pooling (max pooling or average pooling)

• Softmax

• Dropout and other Stochastic Model Perturbations

• Explicit Ensembles

Well Established Architectural Motifs (continued)

• Batch Normalization

• ReLUs

• Highway Architectures: LSTMs, Residual Networks

• Sequence to Sequence and Image to Image Architectures

• Gating and Attention

Speculative Architectures

• Generative Adversarial Networks (GANs)

• Neural Turing Machines

• Neural Stack Machines

• Neural Logic Machines

Is There a Universal Architecture?

Noam Chomsky: By the no free lunch theorem natural language grammar is unlearnable without an innate linguistic capacity. In any domain a strong prior (a learning bias) is required.

Leonid Levin, Andrey Kolmogorov, Geoff Hinton and Jürgen Schmidhuber: Universal learning algorithms exist. No domain-specific innate knowledge is required.

Is Domain-Specific Insight Valuable?

Fred Jelinek: Every time I fire a linguist our recognition rate improves.

C++ as a Universal Architecture

Let h be any C++ procedure that can be run on a problem instance to get a loss, where the loss is scaled to be in [0, 1].

Let |h| be the number of bits in the Zip compression of h.

Occam’s Razor Theorem: With probability at least 1 − δ over the draw of the sample the following holds simultaneously for all h and all λ > 1/2.

ℓ(h) ≤ (1 / (1 − 1/(2λ))) ( ℓ̂(h) + (λ/N) ( (ln 2)|h| + ln(1/δ) ) )

where ℓ̂(h) is the training loss of h and N is the number of training examples.

See “A PAC-Bayesian Tutorial with a Dropout Bound”, McAllester (2013).
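A small worked example of evaluating the bound. The numbers below (program size, sample size, δ, λ, training loss) are made up purely for illustration:

    import math

    def occam_bound(train_loss, h_bits, N, delta, lam):
        # ell(h) <= (1 / (1 - 1/(2 lam))) * (train_loss + (lam/N) * ((ln 2) |h| + ln(1/delta)))
        assert lam > 0.5
        return (train_loss + (lam / N) * (math.log(2) * h_bits + math.log(1 / delta))) / (1 - 1 / (2 * lam))

    # a 10-kilobit program with 10% training loss on one million samples
    print(occam_bound(train_loss=0.10, h_bits=10_000, N=1_000_000, delta=0.01, lam=2.0))  # ~0.152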

The Occam’s Razor Theorem

It suffices to find any regularity in the training data where the regularity can be expressed concisely relative to the amount of training data.

The VGG vision architecture has 138 million parameters.

ImageNet contains approximately 1 trillion pixels.

The Turing Tarpit Theorem

The choice of programming language does not matter.

For any two Turing universal languages L1 and L2 there exists a compiler C : L1 → L2 such that

|C(h)| ≤ |h| + |C|

Universality of Shallow Architectures

We will now consider theorems that apply to any continuous f : [0, 1]^m → R.

Note that the values at the corners are independent; we can use multilinear interpolation to fill in the interior of the cube.

There are 2^m corners. Each corner has an independent value.

So it must take at least 2^m bits of information to specify f.

The Kolmogorov-Arnold representation theorem (1956)

Any continuous function of m inputs can be represented exactly by a small (polynomial sized) two-layer network.

f(x_1, . . . , x_m) = Σ_{i=1}^{2m+1} g_i ( Σ_{j=1}^{m} h_{i,j}(x_j) )

where g_i and h_{i,j} are continuous scalar-to-scalar functions.
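The shape of this representation is easy to write down. In the sketch below the particular g_i and h_{i,j} are arbitrary placeholders; the theorem only asserts that suitable continuous ones exist for any given f:

    import math

    def kolmogorov_arnold_form(x, g, h):
        # f(x_1,...,x_m) = sum_{i=1}^{2m+1} g_i( sum_{j=1}^{m} h_{i,j}(x_j) )
        m = len(x)
        return sum(g[i](sum(h[i][j](x[j]) for j in range(m))) for i in range(2 * m + 1))

    m = 2
    g = [math.sin] * (2 * m + 1)                    # placeholder outer functions g_i
    h = [[math.cos] * m for _ in range(2 * m + 1)]  # placeholder inner functions h_{i,j}
    print(kolmogorov_arnold_form([0.3, 0.7], g, h))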

A Simpler, Similar Theorem

For any (possibly discontinuous) f : [0, 1]^m → R we have

f(x_1, . . . , x_m) = g ( Σ_i h_i(x_i) )

for (discontinuous) scalar-to-scalar functions g and h_i.

Proof: Any single real number contains an infinite amount of information.

Select h_i to spread out the digits of its argument so that Σ_i h_i(x_i) contains all the digits of all the x_i.
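A finite-precision sketch of this digit-spreading trick. The helper names h and decode and the DIGITS constant are illustrative, and the actual theorem needs all infinitely many digits of each x_i:

    DIGITS = 4  # digits kept per coordinate; the real construction keeps them all

    def h(i, x, m, digits=DIGITS):
        # place digit k of x in [0, 1) at decimal position k*m + i + 1 of the output
        d = f"{round(x * 10 ** digits):0{digits}d}"
        return sum(int(c) * 10.0 ** (-(k * m + i + 1)) for k, c in enumerate(d))

    def decode(y, m, digits=DIGITS):
        # read the interleaved digits of y back out as (x_1, ..., x_m)
        s = f"{y:.{m * digits}f}"[2:]
        return [float("0." + s[i::m]) for i in range(m)]

    xs = [0.1415, 0.7182, 0.5772]
    y = sum(h(i, xs[i], len(xs)) for i in range(len(xs)))   # sum_i h_i(x_i)
    print(decode(y, len(xs)))                               # recovers xs (to 4 digits)

The outer function g is then just f composed with this decoding.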

A Reference

F. Girosi and T. Poggio. Representation properties of networks: Kolmogorov’s theorem is irrelevant. Neural Computation, 1:465–469, 1989.

Cybenko’s Universal Approximation Theorem (1989):

Again consider any continuous f : [0, 1]^m → R.

Again note that f must contain an exponential amount of information (an independent value at every corner).

Cybenko’s Universal Approximation Theorem (1989)

Any continuous function can be approximated arbitrarily well by a two-layer perceptron.

For any continuous f : [0, 1]^m → R and any ε > 0, there exists

F(x) = α · σ(Wx + β) = Σ_i α_i σ ( Σ_j W_{i,j} x_j + β_i )

such that for all x in [0, 1]^m we have |F(x) − f(x)| < ε.
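A minimal numpy sketch of the form of F. The weights below are random; the theorem only asserts that some choice of α, W and β achieves the ε-approximation, it does not say how to find it:

    import numpy as np

    def F(x, alpha, W, beta):
        # F(x) = alpha · sigma(W x + beta), with sigma a sigmoidal activation
        sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
        return alpha @ sigma(W @ x + beta)

    m, k = 4, 16                                   # input dimension, hidden units
    rng = np.random.default_rng(0)
    alpha, W, beta = rng.normal(size=k), rng.normal(size=(k, m)), rng.normal(size=k)
    print(F(rng.random(m), alpha, W, beta))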

How Many Hidden Units?

Consider Boolean functions f : {0, 1}^m → {0, 1}.

For Boolean functions we can simply list the inputs x_0, . . . , x_k where the function is true.

f(x) = Σ_k I[x = x_k]

I[x = x_k] ≈ σ ( Σ_i W_{k,i} x_i + b_k )

This is analogous to observing that every Boolean function can be put in disjunctive normal form.
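A small sketch of this construction: one hidden threshold unit per input on which f is true, with weights chosen so the pre-activation is +0.5 exactly at that input and at most −0.5 everywhere else. The weight recipe and the sharpness constant are standard choices for a sketch, not taken from the slide:

    import numpy as np

    def dnf_network(true_inputs, sharpness=20.0):
        X = np.array(true_inputs, dtype=float)     # the inputs x_k where f is true
        W = np.where(X == 1.0, 1.0, -1.0)          # W_{k,i} = +1 if x_k[i] = 1, else -1
        b = 0.5 - X.sum(axis=1)                    # so that W_k · x_k + b_k = 0.5
        def F(x):
            z = sharpness * (W @ np.asarray(x, dtype=float) + b)
            return (1.0 / (1.0 + np.exp(-z))).sum()   # ~1 iff x is a true input
        return F

    F = dnf_network([[0, 1, 1], [1, 0, 0]])        # f true exactly on these inputs
    print(round(F([0, 1, 1])), round(F([1, 1, 1])))   # 1 0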

The Cybenko Theorem is Irrelevant

The number of inputs where a Boolean function f is true is typically exponential in the number of arguments to f.

The number of hidden units (channels) needed for Cybenko’s theorem is exponential.

Shallow Circuit Inexpressibility Theorems

Building on work of Ajtai, Sipser and others, Håstad proved (1987) that any bounded-depth Boolean circuit computing the parity function must have exponential size.

Matus Telgarsky recently gave some formal conditions under which shallow networks provably require exponentially more parameters than deeper networks (COLT 2016).

Circuit Complexity Theory

Circuit complexity theory, the study of circuit size as a function of the components allowed and the depth allowed, is a special case of the study of deep real-valued computation.

Depth and size requirements are hard to prove. Little has been proven about depth and size requirements when additive threshold gates are allowed.

Still, we believe that deeper circuits can be smaller in total size than shallow circuits.

Finding the Program

The Occam’s Razor theorem says that a concise program that does well on the training data will do well on the test data.

But the theorem does not tell us how to find such a program.

Levin’s Universal Problem Solver (Levin Search)

Leonid Levin observed that one can construct a universal solver that takes as input an oracle for testing proposed solutions and, if a solution exists, returns it.

One can of course enumerate all candidate solutions.

However, Levin’s solver is universal in the sense that it is not more than a constant factor slower than any other solver.

Levin’s Universal Solver

We time-share all programs, giving time slice 2^−|h| to program h, where |h| is the length in bits of h.

The run time of the universal solver is at most

O(2^|h| (h(n) + T(n)))

where h(n) is the time required to run program h on a problem of size n and T(n) is the time required to check the solution.

Here 2^|h| is independent of n and is technically a constant.
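A toy sketch of the time-sharing idea: in phase k each program h runs until it has used about 2^(k − |h|) steps, so h receives roughly a 2^−|h| fraction of the compute. Here a “program” is just a Python generator of candidate solutions, bits stands in for |h|, and treating one yielded candidate as one step is a deliberate simplification; real Levin search enumerates all programs of a universal language:

    import itertools

    def levin_search(programs, check, max_phase=30):
        # programs: list of (bits, generator_fn); in phase k, program h is run
        # until it has used about 2^(k - bits) steps in total
        gens = [(bits, fn()) for bits, fn in programs]
        used = [0] * len(gens)
        for k in range(1, max_phase + 1):
            for idx, (bits, gen) in enumerate(gens):
                budget = 2 ** (k - bits) if k >= bits else 0
                while used[idx] < budget:
                    used[idx] += 1
                    try:
                        candidate = next(gen)      # one "step" of program h
                    except StopIteration:
                        used[idx] = float("inf")
                        break
                    if check(candidate):           # the oracle for testing solutions
                        return candidate
        return None

    programs = [
        (3, lambda: (n for n in itertools.count(2))),         # a short program: count up
        (5, lambda: (n for n in itertools.count(200, -1))),   # a longer one: count down
    ]
    print(levin_search(programs, check=lambda n: 1 < n < 91 and 91 % n == 0))  # 7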

Bootstrapping Levin Search

While Levin search sounds like a joke, Jürgen Schmidhuber (inventor of LSTMs and other deep architectural motifs) takes it seriously.

He has proposed ways of accelerating a search over all programs and has something called the Optimal Ordered Problem Solver (OOPS).

The basic idea is bootstrapping — we automate a search formethods of efficient search.

Deep Learning and Evolution

The Baldwin Effect

In a 1987 paper entitled “How Learning Can Guide Evolution”, Geoffrey Hinton and Steve Nowlan brought attention to a paper by Baldwin (1896).

The basic idea is that learning facilitates modularity.

For example, longer arms are easier to evolve if arm control is learned; arm control is then independent of arm length. Arm control and arm structure become more modular.

If each “module” is learning to participate in the “society of mind” then each module can more easily accommodate (exploit) changes (improvements) in other modules.

Recent Neuroscience: Quanta Magazine, January 10, 2017

[Image from the Department of Brain and Cognitive Sciences and the McGovern Institute, Massachusetts Institute of Technology]

The g factor

END