
TTIC 31230, Fundamentals of Deep Learning

David McAllester, April 2017

Architectures and Universality

Review:

• η_t > 0, η_t → 0, and Σ_t η_t = ∞ implies convergence.

• Momentum: g_{t+1} = µ g_t + (1 − µ) ∇_Θ ℓ_t ;  Θ -= η g

• RMSprop: Θ_i -= (η / (RMS_i + ε)) (∇_Θ ℓ_t)_i

• Adam: RMSprop + momentum: Θ_i -= (η / (RMS_i + ε)) g_i, with RMS_i smoothed
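As a concrete reference, here is a minimal numpy sketch of these three updates as written above. The step size η and the decay constants µ and ρ are illustrative defaults, not tuned values, and Adam’s bias correction is omitted, as on the slide.

    import numpy as np

    def momentum_step(theta, g, grad, eta=0.01, mu=0.9):
        g = mu * g + (1 - mu) * grad               # g_{t+1} = mu g_t + (1 - mu) grad
        theta -= eta * g                           # Theta -= eta g
        return theta, g

    def rmsprop_step(theta, ms, grad, eta=0.01, rho=0.9, eps=1e-8):
        ms = rho * ms + (1 - rho) * grad ** 2      # running mean of squared gradients
        theta -= eta * grad / (np.sqrt(ms) + eps)  # Theta_i -= eta / (RMS_i + eps) * grad_i
        return theta, ms

    def adam_step(theta, g, ms, grad, eta=0.01, mu=0.9, rho=0.999, eps=1e-8):
        g = mu * g + (1 - mu) * grad               # smoothed gradient (momentum)
        ms = rho * ms + (1 - rho) * grad ** 2      # smoothed squared gradient
        theta -= eta * g / (np.sqrt(ms) + eps)     # Theta_i -= eta / (RMS_i + eps) * g_i
        return theta, g, ms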

Architecture:

A Pattern of Feed-Forward Computation

Example: Convolution

For filter f and signal x

y = Convolve(f, x)

y[i] = Σ_{j=0}^{|f|−1} x[i − j] f[j]
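A minimal numpy sketch of this sum. The name conv1d and the choice to treat out-of-range x[i − j] as zero are illustrative assumptions, not part of the slide:

    import numpy as np

    def conv1d(f, x):
        # y[i] = sum_{j=0}^{|f|-1} x[i-j] f[j]; x[i-j] is taken as 0 outside the signal
        y = np.zeros(len(x))
        for i in range(len(x)):
            for j in range(len(f)):
                if 0 <= i - j < len(x):
                    y[i] += x[i - j] * f[j]
        return y

    print(conv1d(np.array([1.0, -1.0]), np.array([0.0, 1.0, 3.0, 2.0])))  # [ 0.  1.  2. -1.]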

Another Example: Recurrent Neural Networks (RNNs)

h[t+1] = σ(W_R h[t] + W_I x[t])
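A one-step numpy sketch of this recurrence. The sigmoid choice for σ and the random W_R and W_I are illustrative assumptions:

    import numpy as np

    def rnn_step(h, x, W_R, W_I):
        # h[t+1] = sigma(W_R h[t] + W_I x[t]) with a sigmoid nonlinearity
        return 1.0 / (1.0 + np.exp(-(W_R @ h + W_I @ x)))

    rng = np.random.default_rng(0)
    d, n = 3, 2                                    # hidden size, input size
    W_R, W_I = rng.normal(size=(d, d)), rng.normal(size=(d, n))
    h = np.zeros(d)
    for x in rng.normal(size=(5, n)):              # unroll the recurrence over 5 inputs
        h = rnn_step(h, x, W_R, W_I)
    print(h)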

Issues in Architecture

• Expressive Power

• Ease of Learning

• Embodiment of Domain Knowledge

Well Established Architectural Motifs

• Perceptron (a linear sum into an activation function)

• CNNs

• RNNs

• Pooling (max pooling or average pooling)

• Softmax

• Dropout and other Stochastic Model Perturbations

• Explicit Ensembles

Well Established Architectural Motifs (continued)

• Batch Normalization

• ReLUs

• Highway Architectures: LSTMs, Residual Networks

• Sequence to Sequence and Image to Image Architectures

• Gating and Attention

Speculative Architectures

• Generative Adversarial Networks (GANs)

• Neural Turing Machines

• Neural Stack Machines

• Neural Logic Machines

Is There a Universal Architecture?

Noam Chomsky: By the no free lunch theorem natural language grammar is unlearnable without an innate linguistic capacity. In any domain a strong prior (a learning bias) is required.

Leonid Levin, Andrey Kolmogorov, Geoff Hinton and Jürgen Schmidhuber: Universal learning algorithms exist. No domain-specific innate knowledge is required.

Is Domain-Specific Insight Valuable?

Fred Jelinek: Every time I fire a linguist our recognition rate improves.

C++ as a Universal Architecture

Let h be any C++ procedure that can be run on a problem instance to get a loss, where the loss is scaled to be in [0, 1].

Let |h| be the number of bits in the Zip compression of h.

Occam’s Razor Theorem: With probability at least 1 − δ over the draw of the sample the following holds simultaneously for all h and all λ > 1/2.

ℓ(h) ≤ (1 / (1 − 1/(2λ))) ( ℓ̂(h) + (λ/N) ( (ln 2)|h| + ln(1/δ) ) )

where ℓ̂(h) is the training loss of h and N is the number of training examples.

See “A PAC-Bayesian Tutorial with a Dropout Bound”, McAllester (2013).
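A small worked example of evaluating the bound. The numbers below (program size, sample size, δ, λ, training loss) are made up purely for illustration:

    import math

    def occam_bound(train_loss, h_bits, N, delta, lam):
        # ell(h) <= (1 / (1 - 1/(2 lam))) * (train_loss + (lam/N) * ((ln 2) |h| + ln(1/delta)))
        assert lam > 0.5
        return (train_loss + (lam / N) * (math.log(2) * h_bits + math.log(1 / delta))) / (1 - 1 / (2 * lam))

    # a 10-kilobit program with 10% training loss on one million samples
    print(occam_bound(train_loss=0.10, h_bits=10_000, N=1_000_000, delta=0.01, lam=2.0))  # ~0.152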

The Occam’s Razor Theorem

It suffices to find any regularity in the training data where the regularity can be expressed concisely relative to the amount of training data.

The VGG vision architecture has 138 million parameters.

ImageNet contains approximately 1 trillion pixels.

The Turing Tarpit Theorem

The choice of programming language does not matter.

For any two Turing universal languages L1 and L2 there exists a compiler C : L1 → L2 such that

|C(h)| ≤ |h| + |C|

Universality of Shallow Architectures

We will now consider theorems that apply to any continuous f : [0, 1]^m → R.

Note that the values at the corners are independent; we can use multilinear interpolation to fill in the interior of the cube.

There are 2^m corners. Each corner has an independent value.

So it must take at least 2^m bits of information to specify f.

The Kolmogorov-Arnold representation theorem (1956)

Any continuous function of m inputs can be represented exactly by a small (polynomial sized) two-layer network.

f(x_1, . . . , x_m) = Σ_{i=1}^{2m+1} g_i ( Σ_{j=1}^{m} h_{i,j}(x_j) )

where g_i and h_{i,j} are continuous scalar-to-scalar functions.
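The shape of this representation is easy to write down. In the sketch below the particular g_i and h_{i,j} are arbitrary placeholders; the theorem only asserts that suitable continuous ones exist for any given f:

    import math

    def kolmogorov_arnold_form(x, g, h):
        # f(x_1,...,x_m) = sum_{i=1}^{2m+1} g_i( sum_{j=1}^{m} h_{i,j}(x_j) )
        m = len(x)
        return sum(g[i](sum(h[i][j](x[j]) for j in range(m))) for i in range(2 * m + 1))

    m = 2
    g = [math.sin] * (2 * m + 1)                    # placeholder outer functions g_i
    h = [[math.cos] * m for _ in range(2 * m + 1)]  # placeholder inner functions h_{i,j}
    print(kolmogorov_arnold_form([0.3, 0.7], g, h))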

A Simpler, Similar Theorem

For any (possibly discontinuous) f : [0, 1]^m → R we have

f(x_1, . . . , x_m) = g ( Σ_i h_i(x_i) )

for (discontinuous) scalar-to-scalar functions g and h_i.

Proof: Any single real number contains an infinite amount of information.

Select h_i to spread out the digits of its argument so that Σ_i h_i(x_i) contains all the digits of all the x_i.
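A finite-precision sketch of this digit-spreading trick. The helper names h and decode and the DIGITS constant are illustrative, and the actual theorem needs all infinitely many digits of each x_i:

    DIGITS = 4  # digits kept per coordinate; the real construction keeps them all

    def h(i, x, m, digits=DIGITS):
        # place digit k of x in [0, 1) at decimal position k*m + i + 1 of the output
        d = f"{round(x * 10 ** digits):0{digits}d}"
        return sum(int(c) * 10.0 ** (-(k * m + i + 1)) for k, c in enumerate(d))

    def decode(y, m, digits=DIGITS):
        # read the interleaved digits of y back out as (x_1, ..., x_m)
        s = f"{y:.{m * digits}f}"[2:]
        return [float("0." + s[i::m]) for i in range(m)]

    xs = [0.1415, 0.7182, 0.5772]
    y = sum(h(i, xs[i], len(xs)) for i in range(len(xs)))   # sum_i h_i(x_i)
    print(decode(y, len(xs)))                               # recovers xs (to 4 digits)

The outer function g is then just f composed with this decoding.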

A Reference

F. Girosi and T. Poggio. Representation properties of networks: Kolmogorov’s theorem is irrelevant. Neural Computation, 1:465–469, 1989.

Cybenko’s Universal Approximation Theorem (1989):

Again consider any continuous f : [0, 1]^m → R.

Again note that f must contain an exponential amount of information (an independent value at every corner).

Cybenko’s Universal Approximation Theorem (1989)

Any continuous function can be approximated arbitrarily well by a two-layer perceptron.

For any continuous f : [0, 1]^m → R and any ε > 0, there exists

F(x) = α · σ(Wx + β) = Σ_i α_i σ ( Σ_j W_{i,j} x_j + β_i )

such that for all x in [0, 1]^m we have |F(x) − f(x)| < ε.
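A minimal numpy sketch of the form of F. The weights below are random; the theorem only asserts that some choice of α, W and β achieves the ε-approximation, it does not say how to find it:

    import numpy as np

    def F(x, alpha, W, beta):
        # F(x) = alpha · sigma(W x + beta), with sigma a sigmoidal activation
        sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
        return alpha @ sigma(W @ x + beta)

    m, k = 4, 16                                   # input dimension, hidden units
    rng = np.random.default_rng(0)
    alpha, W, beta = rng.normal(size=k), rng.normal(size=(k, m)), rng.normal(size=k)
    print(F(rng.random(m), alpha, W, beta))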

How Many Hidden Units?

Consider Boolean functions f : {0, 1}^m → {0, 1}.

For Boolean functions we can simply list the inputs x_0, . . . , x_k where the function is true.

f(x) = Σ_k I[x = x_k]

I[x = x_k] ≈ σ ( Σ_i W_{k,i} x_i + b_k )

This is analogous to observing that every Boolean function can be put in disjunctive normal form.
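A small sketch of this construction: one hidden threshold unit per input on which f is true, with weights chosen so the pre-activation is +0.5 exactly at that input and at most −0.5 everywhere else. The weight recipe and the sharpness constant are standard choices for a sketch, not taken from the slide:

    import numpy as np

    def dnf_network(true_inputs, sharpness=20.0):
        X = np.array(true_inputs, dtype=float)     # the inputs x_k where f is true
        W = np.where(X == 1.0, 1.0, -1.0)          # W_{k,i} = +1 if x_k[i] = 1, else -1
        b = 0.5 - X.sum(axis=1)                    # so that W_k · x_k + b_k = 0.5
        def F(x):
            z = sharpness * (W @ np.asarray(x, dtype=float) + b)
            return (1.0 / (1.0 + np.exp(-z))).sum()   # ~1 iff x is a true input
        return F

    F = dnf_network([[0, 1, 1], [1, 0, 0]])        # f true exactly on these inputs
    print(round(F([0, 1, 1])), round(F([1, 1, 1])))   # 1 0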

The Cybenko Theorem is Irrelevant

The number of inputs where a Boolean function f is true is typically exponential in the number of arguments to f.

The number of hidden units (channels) needed for Cybenko’s theorem is exponential.

Shallow Circuit Inexpressibility Theorems

Building on work of Ajtai, Sipser and others, Håstad proved (1987) that any bounded-depth Boolean circuit computing the parity function must have exponential size.

Matus Telgarsky recently gave some formal conditions under which shallow networks provably require exponentially more parameters than deeper networks (COLT 2016).

Circuit Complexity Theory

Circuit complexity theory, the study of circuit size as a function of the components allowed and the depth allowed, is a special case of the study of deep real-valued computation.

Depth and size requirements are hard to prove. Little has been proven about depth and size requirements when additive threshold gates are allowed.

Still, we believe that deeper circuits can be smaller in total size than shallow circuits.

Finding the Program

The Occam’s Razor theorem says that a concise program that does well on the training data will do well on the test data.

But the theorem does not tell us how to find such a program.

Levin’s Universal Problem Solver (Levin Search)

Leonid Levin observed that one can construct a universal solver that takes as input an oracle for testing proposed solutions and, if a solution exists, returns it.

One can of course enumerate all candidate solutions.

However, Levin’s solver is universal in the sense that it is not more than a constant factor slower than any other solver.

Levin’s Universal Solver

We time-share all programs, giving time slice 2^−|h| to program h, where |h| is the length in bits of h.

The run time of the universal solver is at most

O(2^|h| (h(n) + T(n)))

where h(n) is the time required to run program h on a problem of size n and T(n) is the time required to check the solution.

Here 2^|h| is independent of n and is technically a constant.
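A toy sketch of the time-sharing idea: in phase k each program h runs until it has used about 2^(k − |h|) steps, so h receives roughly a 2^−|h| fraction of the compute. Here a “program” is just a Python generator of candidate solutions, bits stands in for |h|, and treating one yielded candidate as one step is a deliberate simplification; real Levin search enumerates all programs of a universal language:

    import itertools

    def levin_search(programs, check, max_phase=30):
        # programs: list of (bits, generator_fn); in phase k, program h is run
        # until it has used about 2^(k - bits) steps in total
        gens = [(bits, fn()) for bits, fn in programs]
        used = [0] * len(gens)
        for k in range(1, max_phase + 1):
            for idx, (bits, gen) in enumerate(gens):
                budget = 2 ** (k - bits) if k >= bits else 0
                while used[idx] < budget:
                    used[idx] += 1
                    try:
                        candidate = next(gen)      # one "step" of program h
                    except StopIteration:
                        used[idx] = float("inf")
                        break
                    if check(candidate):           # the oracle for testing solutions
                        return candidate
        return None

    programs = [
        (3, lambda: (n for n in itertools.count(2))),         # a short program: count up
        (5, lambda: (n for n in itertools.count(200, -1))),   # a longer one: count down
    ]
    print(levin_search(programs, check=lambda n: 1 < n < 91 and 91 % n == 0))  # 7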

Bootstrapping Levin Search

While Levin search sounds like a joke, Jürgen Schmidhuber (inventor of LSTMs and other deep architectural motifs) takes it seriously.

He has proposed ways of accelerating a search over all programs and has something called the Optimal Ordered Problem Solver (OOPS).

The basic idea is bootstrapping — we automate a search formethods of efficient search.

Deep Learning and Evolution

The Baldwin Effect

In a 1987 paper entitled “How Learning Can Guide Evolution”, Geoffrey Hinton and Steve Nowlan brought attention to a paper by Baldwin (1896).

The basic idea is that learning facilitates modularity.

For example, longer arms are easier to evolve if arm control is learned; arm control is then independent of arm length. Arm control and arm structure become more modular.

If each “module” is learning to participate in the “society of mind” then each module can more easily accommodate (exploit) changes (improvements) in other modules.

Recent Neuroscience: Quanta Magazine, January 10, 2017

[Image from the Department of Brain and Cognitive Sciences and the McGovern Institute, Massachusetts Institute of Technology]

The g factor

END