Representation Power of Feedforward Neural Networksdasgupta/254-deep/matus.pdf · Representation...

Representation Power of Feedforward NeuralNetworks

Based on work by Barron (1993), Cybenko (1989),Kolmogorov (1957)

Matus Telgarsky <mtelgars@cs.ucsd.edu>

Feedforward Neural Networks

I Two node types:I Linear combinations:

x 7→∑i

wixi + w0.

I Sigmoid thresholded linear combinations:

x 7→ σ (〈w, x〉+ w0) .

I What can a network of these nodes represent?n∑i=1

wixi one layer,

n∑i=1

ni∑j=1

wjixj + wj0

two layers,

......

Feedforward Neural Networks

I Two node types:I Linear combinations:

x 7→∑i

wixi + w0.

I Sigmoid thresholded linear combinations:

x 7→ σ (〈w, x〉+ w0) .

I What can a network of these nodes represent?n∑i=1

wixi one layer,

n∑i=1

ni∑j=1

wjixj + wj0

two layers,

......

Forget about 1 layer.

I Target set [0, 3]; target function 1[x ∈ [1, 2]].

I Standard sigmoid σs(x) := 1/(1 + e−x).

I Consider sigmoid output at x = 2.

I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.

I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.

I Can symmetrize (w < 0); no matter what, error ≥ 1/2.

I Consider sigmoid output at x = 2.I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.

Meaning of “Universal Approximation”

Target set [0, 1]n; target function f ∈ C([0, 1]n).

I For any ε > 0, exists NN f ,

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I This gives NNs fi → f pointwise.

I For any ε > 0, exists NN f and S ⊆ [0, 1]n, m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

I If (for instance) bounded on Sc, gives NNs fi → f m-a.e..

Goal: 2-NNs approximate continuous functions over [0, 1]n.

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

x ∈ S =⇒ |f(x)− f(x)| < ε.

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

x ∈ S =⇒ |f(x)− f(x)| < ε.

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

x ∈ S =⇒ |f(x)− f(x)| < ε.

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

x ∈ S =⇒ |f(x)− f(x)| < ε.

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

x ∈ S =⇒ |f(x)− f(x)| < ε.

Outline

I 2-nn via functional analysis (Cybenko, 1989).

I 2-nn via greedy approx (Barron, 1993).

I 3-nn via histograms (Folklore).

I 3-nn via wizardry (Kolmogorov, 1957).

Overview of Functional Analysis proof (Cybenko, 1989)

I Hidden layer as a basis:

B := σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R .

I Want to show cl(span(B)) = C([0, 1]n).

I Work via contradiction: if f ∈ C([0, 1]n) far fromcl(span(B)), can bridge the gap with a sigmoid.

B := σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R .

Abstracting σ

I Cybenko needs σ discriminates:

µ = 0 ⇐⇒ ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I Satisfied for the standard choices

σs(x) =1

1 + e−x,

2(tanh(x) + 1) =

(ex − e−x

ex + e−x+ 1

)= σs(2x).

I Most results today only need σ approximates 1[x ≥ 0]:

σ(x)→

1 as x→ +∞,

0 as x→ −∞.

Combined with σ bounded&measurable givesdiscriminatory (Cybenko, 1989, Lemma 1).

Abstracting σ

µ = 0 ⇐⇒ ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

σs(x) =1

1 + e−x,

2(tanh(x) + 1) =

(ex − e−x

ex + e−x+ 1

)= σs(2x).

σ(x)→

1 as x→ +∞,

0 as x→ −∞.

Abstracting σ

µ = 0 ⇐⇒ ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

σs(x) =1

1 + e−x,

2(tanh(x) + 1) =

(ex − e−x

ex + e−x+ 1

)= σs(2x).

σ(x)→

1 as x→ +∞,

0 as x→ −∞.

Proof of Cybenko (1989)

I Consider the

closed

subspace

span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.

I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

I Consider the closed subspace

S := cl