Representation Power of Feedforward Neural Networksdasgupta/254-deep/matus.pdf · Representation...

Post on 05-Apr-2018

225 views 2 download

Transcript of Representation Power of Feedforward Neural Networksdasgupta/254-deep/matus.pdf · Representation...

Representation Power of Feedforward NeuralNetworks

Based on work by Barron (1993), Cybenko (1989),Kolmogorov (1957)

Matus Telgarsky <mtelgars@cs.ucsd.edu>

Feedforward Neural Networks

I Two node types:I Linear combinations:

x 7→∑i

wixi + w0.

I Sigmoid thresholded linear combinations:

x 7→ σ (〈w, x〉+ w0) .

I What can a network of these nodes represent?n∑i=1

wixi one layer,

n∑i=1

wiσ

ni∑j=1

wjixj + wj0

two layers,

......

Feedforward Neural Networks

I Two node types:I Linear combinations:

x 7→∑i

wixi + w0.

I Sigmoid thresholded linear combinations:

x 7→ σ (〈w, x〉+ w0) .

I What can a network of these nodes represent?n∑i=1

wixi one layer,

n∑i=1

wiσ

ni∑j=1

wjixj + wj0

two layers,

......

Forget about 1 layer.

I Target set [0, 3]; target function 1[x ∈ [1, 2]].

I Standard sigmoid σs(x) := 1/(1 + e−x).

I Consider sigmoid output at x = 2.

I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.

0 3

1

0.5

I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.

0 3

1

0.5

I Can symmetrize (w < 0); no matter what, error ≥ 1/2.

Forget about 1 layer.

I Target set [0, 3]; target function 1[x ∈ [1, 2]].

I Standard sigmoid σs(x) := 1/(1 + e−x).

I Consider sigmoid output at x = 2.

I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.

0 3

1

0.5

I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.

0 3

1

0.5

I Can symmetrize (w < 0); no matter what, error ≥ 1/2.

Forget about 1 layer.

I Target set [0, 3]; target function 1[x ∈ [1, 2]].

I Standard sigmoid σs(x) := 1/(1 + e−x).

I Consider sigmoid output at x = 2.

I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.

0 3

1

0.5

I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.

0 3

1

0.5

I Can symmetrize (w < 0); no matter what, error ≥ 1/2.

Forget about 1 layer.

I Target set [0, 3]; target function 1[x ∈ [1, 2]].

I Standard sigmoid σs(x) := 1/(1 + e−x).

I Consider sigmoid output at x = 2.I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.

0 3

1

0.5

I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.

0 3

1

0.5

I Can symmetrize (w < 0); no matter what, error ≥ 1/2.

Forget about 1 layer.

I Target set [0, 3]; target function 1[x ∈ [1, 2]].

I Standard sigmoid σs(x) := 1/(1 + e−x).

I Consider sigmoid output at x = 2.I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.

0 3

1

0.5

I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.

0 3

1

0.5

I Can symmetrize (w < 0); no matter what, error ≥ 1/2.

Forget about 1 layer.

I Target set [0, 3]; target function 1[x ∈ [1, 2]].

I Standard sigmoid σs(x) := 1/(1 + e−x).

I Consider sigmoid output at x = 2.I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.

0 3

1

0.5

I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.

0 3

1

0.5

I Can symmetrize (w < 0); no matter what, error ≥ 1/2.

Meaning of “Universal Approximation”

Target set [0, 1]n; target function f ∈ C([0, 1]n).

I For any ε > 0, exists NN f ,

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I This gives NNs fi → f pointwise.

I For any ε > 0, exists NN f and S ⊆ [0, 1]n, m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

I If (for instance) bounded on Sc, gives NNs fi → f m-a.e..

Goal: 2-NNs approximate continuous functions over [0, 1]n.

Meaning of “Universal Approximation”

Target set [0, 1]n; target function f ∈ C([0, 1]n).

I For any ε > 0, exists NN f ,

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I This gives NNs fi → f pointwise.

I For any ε > 0, exists NN f and S ⊆ [0, 1]n, m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

I If (for instance) bounded on Sc, gives NNs fi → f m-a.e..

Goal: 2-NNs approximate continuous functions over [0, 1]n.

Meaning of “Universal Approximation”

Target set [0, 1]n; target function f ∈ C([0, 1]n).

I For any ε > 0, exists NN f ,

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I This gives NNs fi → f pointwise.

I For any ε > 0, exists NN f and S ⊆ [0, 1]n, m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

I If (for instance) bounded on Sc, gives NNs fi → f m-a.e..

Goal: 2-NNs approximate continuous functions over [0, 1]n.

Meaning of “Universal Approximation”

Target set [0, 1]n; target function f ∈ C([0, 1]n).

I For any ε > 0, exists NN f ,

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I This gives NNs fi → f pointwise.

I For any ε > 0, exists NN f and S ⊆ [0, 1]n, m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

I If (for instance) bounded on Sc, gives NNs fi → f m-a.e..

Goal: 2-NNs approximate continuous functions over [0, 1]n.

Meaning of “Universal Approximation”

Target set [0, 1]n; target function f ∈ C([0, 1]n).

I For any ε > 0, exists NN f ,

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I This gives NNs fi → f pointwise.

I For any ε > 0, exists NN f and S ⊆ [0, 1]n, m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

I If (for instance) bounded on Sc, gives NNs fi → f m-a.e..

Goal: 2-NNs approximate continuous functions over [0, 1]n.

Meaning of “Universal Approximation”

Target set [0, 1]n; target function f ∈ C([0, 1]n).

I For any ε > 0, exists NN f ,

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I This gives NNs fi → f pointwise.

I For any ε > 0, exists NN f and S ⊆ [0, 1]n, m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

I If (for instance) bounded on Sc, gives NNs fi → f m-a.e..

Goal: 2-NNs approximate continuous functions over [0, 1]n.

Outline

I 2-nn via functional analysis (Cybenko, 1989).

I 2-nn via greedy approx (Barron, 1993).

I 3-nn via histograms (Folklore).

I 3-nn via wizardry (Kolmogorov, 1957).

Overview of Functional Analysis proof (Cybenko, 1989)

I Hidden layer as a basis:

B := σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R .

I Want to show cl(span(B)) = C([0, 1]n).

I Work via contradiction: if f ∈ C([0, 1]n) far fromcl(span(B)), can bridge the gap with a sigmoid.

Overview of Functional Analysis proof (Cybenko, 1989)

I Hidden layer as a basis:

B := σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R .

I Want to show cl(span(B)) = C([0, 1]n).

I Work via contradiction: if f ∈ C([0, 1]n) far fromcl(span(B)), can bridge the gap with a sigmoid.

Overview of Functional Analysis proof (Cybenko, 1989)

I Hidden layer as a basis:

B := σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R .

I Want to show cl(span(B)) = C([0, 1]n).

I Work via contradiction: if f ∈ C([0, 1]n) far fromcl(span(B)), can bridge the gap with a sigmoid.

Abstracting σ

I Cybenko needs σ discriminates:

µ = 0 ⇐⇒ ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I Satisfied for the standard choices

σs(x) =1

1 + e−x,

1

2(tanh(x) + 1) =

1

2

(ex − e−x

ex + e−x+ 1

)= σs(2x).

I Most results today only need σ approximates 1[x ≥ 0]:

σ(x)→

1 as x→ +∞,

0 as x→ −∞.

Combined with σ bounded&measurable givesdiscriminatory (Cybenko, 1989, Lemma 1).

Abstracting σ

I Cybenko needs σ discriminates:

µ = 0 ⇐⇒ ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I Satisfied for the standard choices

σs(x) =1

1 + e−x,

1

2(tanh(x) + 1) =

1

2

(ex − e−x

ex + e−x+ 1

)= σs(2x).

I Most results today only need σ approximates 1[x ≥ 0]:

σ(x)→

1 as x→ +∞,

0 as x→ −∞.

Combined with σ bounded&measurable givesdiscriminatory (Cybenko, 1989, Lemma 1).

Abstracting σ

I Cybenko needs σ discriminates:

µ = 0 ⇐⇒ ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I Satisfied for the standard choices

σs(x) =1

1 + e−x,

1

2(tanh(x) + 1) =

1

2

(ex − e−x

ex + e−x+ 1

)= σs(2x).

I Most results today only need σ approximates 1[x ≥ 0]:

σ(x)→

1 as x→ +∞,

0 as x→ −∞.

Combined with σ bounded&measurable givesdiscriminatory (Cybenko, 1989, Lemma 1).

Proof of Cybenko (1989)

I Consider the

closed

subspace

S :=

cl

(

span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

)

.

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.

I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.

I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.

I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.

I Exists µ 6= 0 so that L(h) =∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.

I Can think of projecting f onto S⊥.

I Don’t need inner products: Hahn-Banach Theorem.

I Exists µ 6= 0 so that L(h) =∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.

I Can think of projecting f onto S⊥.I Don’t need inner products: Hahn-Banach Theorem.

I Exists µ 6= 0 so that L(h) =∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.

I Exists µ 6= 0 so that L(h) =∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I In Rn, linear functions have form 〈·, y〉 for some y ∈ Rn.

I With infinite dimensions, integrals give inner products.I A form of the Riesz Representation Theorem.

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I In Rn, linear functions have form 〈·, y〉 for some y ∈ Rn.I With infinite dimensions, integrals give inner products.

I A form of the Riesz Representation Theorem.

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I In Rn, linear functions have form 〈·, y〉 for some y ∈ Rn.I With infinite dimensions, integrals give inner products.I A form of the Riesz Representation Theorem.

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.

I L|S = 0 implies ∀w,w0 ∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.I L|S = 0 implies ∀w,w0

∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

Proof of Cybenko (1989)

I Consider the closed subspace

S := cl

(span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)

).

I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =

∫h(x)dµ(x).

I Contradiction: σ is discriminatory.I L|S = 0 implies ∀w,w0

∫σ(〈w, x〉+ w0)dµ(x) = 0.

I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 ∫. . ..

What was Cybenko’s proof about?

I Set S := cl (span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)).

I Another way to look at it.

I Given f ∈ C([0, 1]n), current approximant f ∈ S.I The residue f − f must have nonzero projection onto S.I Add residue projection into f ; does this now match f?

I Can this be turned into an algorithm?

What was Cybenko’s proof about?

I Set S := cl (span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)).I Another way to look at it.

I Given f ∈ C([0, 1]n), current approximant f ∈ S.I The residue f − f must have nonzero projection onto S.I Add residue projection into f ; does this now match f?

I Can this be turned into an algorithm?

What was Cybenko’s proof about?

I Set S := cl (span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)).I Another way to look at it.

I Given f ∈ C([0, 1]n), current approximant f ∈ S.

I The residue f − f must have nonzero projection onto S.I Add residue projection into f ; does this now match f?

I Can this be turned into an algorithm?

What was Cybenko’s proof about?

I Set S := cl (span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)).I Another way to look at it.

I Given f ∈ C([0, 1]n), current approximant f ∈ S.I The residue f − f must have nonzero projection onto S.

I Add residue projection into f ; does this now match f?

I Can this be turned into an algorithm?

What was Cybenko’s proof about?

I Set S := cl (span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)).I Another way to look at it.

I Given f ∈ C([0, 1]n), current approximant f ∈ S.I The residue f − f must have nonzero projection onto S.I Add residue projection into f ; does this now match f?

I Can this be turned into an algorithm?

What was Cybenko’s proof about?

I Set S := cl (span (σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R)).I Another way to look at it.

I Given f ∈ C([0, 1]n), current approximant f ∈ S.I The residue f − f must have nonzero projection onto S.I Add residue projection into f ; does this now match f?

I Can this be turned into an algorithm?

Outline

I 2-nn via functional analysis (Cybenko, 1989).

I 2-nn via greedy approx (Barron, 1993).

I 3-nn via histograms (Folklore).

I 3-nn via wizardry (Kolmogorov, 1957).

The method of Barron (1993, Section 8)

Input: basis set B,

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar?

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar?

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar?

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar?

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar?

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar?

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar?

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar?

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar? Oh let’s see..

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar? Oh let’s see.. gradient descent, coordinatedescent, Frank-Wolfe, projection pursuit, basis pursuit,boosting.. as usual, cf. Zhang (2003).

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar? YES.

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

The method of Barron (1993, Section 8)

Input: basis set B, target f ∈ cl(conv(B)).

1. Choose arbitrary f0 ∈ B.

2. For t = 1, 2, . . .:

2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))

infα∈[0,1],g∈B

‖f − (αft−1 + (1− α)g)‖22.

2.2 Update ft := αtft−1 + (1− αt)gt.

I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)

I Is this familiar? YES.

I Barron carefully shows rate ‖f − ft‖22 = O(1/t).

Greedy approx as coordinate descent

I Intuition: linear operator A with basis B as “columns”:

ft = Awt, g = Aej .

I Split the update into two steps:

infg∈B‖f − (ft−1 + g)‖22 = inf

i‖f − (Awt−1 +Aei)‖22,

infα∈[0,1]

‖f − (αft−1 + (1− α)gt)‖22.

I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to

supg∈B

⟨g, ft−1 − f

⟩= sup

i(Aei)

>(Awt−1 − f).

I Second version is coordinate descent update!

I (Standard rate proofs use curvature of ‖ · ‖22.)

Greedy approx as coordinate descent

I Intuition: linear operator A with basis B as “columns”:

ft = Awt, g = Aej .

I Split the update into two steps:

infg∈B‖f − (ft−1 + g)‖22 = inf

i‖f − (Awt−1 +Aei)‖22,

infα∈[0,1]

‖f − (αft−1 + (1− α)gt)‖22.

I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to

supg∈B

⟨g, ft−1 − f

⟩= sup

i(Aei)

>(Awt−1 − f).

I Second version is coordinate descent update!

I (Standard rate proofs use curvature of ‖ · ‖22.)

Greedy approx as coordinate descent

I Intuition: linear operator A with basis B as “columns”:

ft = Awt, g = Aej .

I Split the update into two steps:

infg∈B‖f − (ft−1 + g)‖22 = inf

i‖f − (Awt−1 +Aei)‖22,

infα∈[0,1]

‖f − (αft−1 + (1− α)gt)‖22.

I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to

supg∈B

⟨g, ft−1 − f

⟩= sup

i(Aei)

>(Awt−1 − f).

I Second version is coordinate descent update!

I (Standard rate proofs use curvature of ‖ · ‖22.)

Greedy approx as coordinate descent

I Intuition: linear operator A with basis B as “columns”:

ft = Awt, g = Aej .

I Split the update into two steps:

infg∈B‖f − (ft−1 + g)‖22 = inf

i‖f − (Awt−1 +Aei)‖22,

infα∈[0,1]

‖f − (αft−1 + (1− α)gt)‖22.

I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to

supg∈B

⟨g, ft−1 − f

⟩= sup

i(Aei)

>(Awt−1 − f).

I Second version is coordinate descent update!

I (Standard rate proofs use curvature of ‖ · ‖22.)

Greedy approx as coordinate descent

I Intuition: linear operator A with basis B as “columns”:

ft = Awt, g = Aej .

I Split the update into two steps:

infg∈B‖f − (ft−1 + g)‖22 = inf

i‖f − (Awt−1 +Aei)‖22,

infα∈[0,1]

‖f − (αft−1 + (1− α)gt)‖22.

I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to

supg∈B

⟨g, ft−1 − f

⟩= sup

i(Aei)

>(Awt−1 − f).

I Second version is coordinate descent update!

I (Standard rate proofs use curvature of ‖ · ‖22.)

Story for 2-layer NNs so far

I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..

I ..so why aren’t we using this algorithm for neural nets?

I This is not an easy optimization problem:

infα∈[0,1],w∈Rn,w0∈R

‖f − (αft−1 + (1− α)σ(〈w, ·〉+ w0)‖22.

I For a general basis B, must check each basis element..

I From an algorithmic perspective, much remains to be done.

Story for 2-layer NNs so far

I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..

I ..so why aren’t we using this algorithm for neural nets?

I This is not an easy optimization problem:

infα∈[0,1],w∈Rn,w0∈R

‖f − (αft−1 + (1− α)σ(〈w, ·〉+ w0)‖22.

I For a general basis B, must check each basis element..

I From an algorithmic perspective, much remains to be done.

Story for 2-layer NNs so far

I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..

I ..so why aren’t we using this algorithm for neural nets?I This is not an easy optimization problem:

infα∈[0,1],w∈Rn,w0∈R

‖f − (αft−1 + (1− α)σ(〈w, ·〉+ w0)‖22.

I For a general basis B, must check each basis element..

I From an algorithmic perspective, much remains to be done.

Story for 2-layer NNs so far

I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..

I ..so why aren’t we using this algorithm for neural nets?I This is not an easy optimization problem:

infα∈[0,1],w∈Rn,w0∈R

‖f − (αft−1 + (1− α)σ(〈w, ·〉+ w0)‖22.

I For a general basis B, must check each basis element..

I From an algorithmic perspective, much remains to be done.

Story for 2-layer NNs so far

I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..

I ..so why aren’t we using this algorithm for neural nets?I This is not an easy optimization problem:

infα∈[0,1],w∈Rn,w0∈R

‖f − (αft−1 + (1− α)σ(〈w, ·〉+ w0)‖22.

I For a general basis B, must check each basis element..

I From an algorithmic perspective, much remains to be done.

Lower bound from Barron (1993, Section 10)

I The greedy algorithm gets to search whole basis.

I If an n-element basis Bn is fixed, then

supf∈cl(spanC(B))

inff∈span(Bn)

‖f − f‖2 = Ω

(1

n1/d

),

the curse of dimension!

I Can defeat this with more layers (cf. Kolmogorov (1957)).

Lower bound from Barron (1993, Section 10)

I The greedy algorithm gets to search whole basis.

I If an n-element basis Bn is fixed, then

supf∈cl(spanC(B))

inff∈span(Bn)

‖f − f‖2 = Ω

(1

n1/d

),

the curse of dimension!

I Can defeat this with more layers (cf. Kolmogorov (1957)).

Lower bound from Barron (1993, Section 10)

I The greedy algorithm gets to search whole basis.

I If an n-element basis Bn is fixed, then

supf∈cl(spanC(B))

inff∈span(Bn)

‖f − f‖2 = Ω

(1

n1/d

),

the curse of dimension!

I Can defeat this with more layers (cf. Kolmogorov (1957)).

Outline

I 2-nn via functional analysis (Cybenko, 1989).

I 2-nn via greedy approx (Barron, 1993).

I 3-nn via histograms (Folklore).

I 3-nn via wizardry (Kolmogorov, 1957).

Folklore proof of NN power

I What’s an easy way to write function as sum of others?

I Project onto favorite basis (Fourier, Chebyshev, ...)!

I Let’s use this to build NNs.

I Easiest basis: histograms!

Folklore proof of NN power

I What’s an easy way to write function as sum of others?I Project onto favorite basis (Fourier, Chebyshev, ...)!

I Let’s use this to build NNs.

I Easiest basis: histograms!

Folklore proof of NN power

I What’s an easy way to write function as sum of others?I Project onto favorite basis (Fourier, Chebyshev, ...)!

I Let’s use this to build NNs.

I Easiest basis: histograms!

Folklore proof of NN power

I What’s an easy way to write function as sum of others?I Project onto favorite basis (Fourier, Chebyshev, ...)!

I Let’s use this to build NNs.I Easiest basis: histograms!

Histogram basis for C([0, 1]n)

I Grid [0, 1]n into Rimi=1; consider

f(x) =

m∑i=1

ai1[x ∈ Ri],

where target f(x) = ai for some x ∈ Ri.

I Fact: for any ε > 0, exists histogram f with

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I Pf. Continuous function over compact set is uniformlycontinuous.

I Now just write individual histogram bars as a NNs.

Histogram basis for C([0, 1]n)

I Grid [0, 1]n into Rimi=1; consider

f(x) =

m∑i=1

ai1[x ∈ Ri],

where target f(x) = ai for some x ∈ Ri.I Fact: for any ε > 0, exists histogram f with

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I Pf. Continuous function over compact set is uniformlycontinuous.

I Now just write individual histogram bars as a NNs.

Histogram basis for C([0, 1]n)

I Grid [0, 1]n into Rimi=1; consider

f(x) =

m∑i=1

ai1[x ∈ Ri],

where target f(x) = ai for some x ∈ Ri.I Fact: for any ε > 0, exists histogram f with

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I Pf. Continuous function over compact set is uniformlycontinuous.

I Now just write individual histogram bars as a NNs.

Histogram basis for C([0, 1]n)

I Grid [0, 1]n into Rimi=1; consider

f(x) =

m∑i=1

ai1[x ∈ Ri],

where target f(x) = ai for some x ∈ Ri.I Fact: for any ε > 0, exists histogram f with

x ∈ [0, 1]n =⇒ |f(x)− f(x)| < ε.

I Pf. Continuous function over compact set is uniformlycontinuous.

I Now just write individual histogram bars as a NNs.

Histograms via 0/1 NNs

I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].

I 2-D example. First carve out axes.

2

1

1 2

3

2

3

2

2

3

4

3

I Now pass through a second layer 1[input ≥ 3.5].

1

0

Histograms via 0/1 NNs

I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].

I 2-D example. First carve out axes.

2

1

1 2

3

2

3

2

2

3

4

3

I Now pass through a second layer 1[input ≥ 3.5].

1

0

Histograms via 0/1 NNs

I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].

I 2-D example. First carve out axes.

2

1

1 2

3

2

3

2

2

3

4

3

I Now pass through a second layer 1[input ≥ 3.5].

1

0

Histograms via 0/1 NNs

I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].

I 2-D example. First carve out axes.

2

1

1

2

3

2

3

2

2

3

4

3

I Now pass through a second layer 1[input ≥ 3.5].

1

0

Histograms via 0/1 NNs

I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].

I 2-D example. First carve out axes.

2

1

1 2

3

2

3

2

2

3

4

3

I Now pass through a second layer 1[input ≥ 3.5].

1

0

Histograms via 0/1 NNs

I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].

I 2-D example. First carve out axes.

2

1

1 2

3

2

3

2

2

3

4

3

I Now pass through a second layer 1[input ≥ 3.5].

1

0

Histograms via sigmoidal NNs

I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).

I 2-D example. First carve out axes, with fuzz region.

>2-2ε

<1+2ε

<1+2ε

<2+4ε

<3+4ε

>4-4ε

I Now pass through a second layer σ(huge · (input− 3.5)).

>1-ε

Histograms via sigmoidal NNs

I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).

I 2-D example. First carve out axes, with fuzz region.

>2-2ε

<1+2ε

<1+2ε

<2+4ε

<3+4ε

>4-4ε

I Now pass through a second layer σ(huge · (input− 3.5)).

>1-ε

Histograms via sigmoidal NNs

I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).

I 2-D example. First carve out axes, with fuzz region.

>2-2ε

<1+2ε

<1+2ε

<2+4ε

<3+4ε

>4-4ε

I Now pass through a second layer σ(huge · (input− 3.5)).

>1-ε

Histograms via sigmoidal NNs

I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).

I 2-D example. First carve out axes, with fuzz region.

>2-2ε

<1+2ε

<1+2ε

<2+4ε

<3+4ε

>4-4ε

I Now pass through a second layer σ(huge · (input− 3.5)).

>1-ε

Histograms via sigmoidal NNs

I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).

I 2-D example. First carve out axes, with fuzz region.

>2-2ε

<1+2ε

<1+2ε

<2+4ε

<3+4ε

>4-4ε

I Now pass through a second layer σ(huge · (input− 3.5)).

>1-ε

Histograms via sigmoidal NNs

I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).

I 2-D example. First carve out axes, with fuzz region.

>2-2ε

<1+2ε

<1+2ε

<2+4ε

<3+4ε

>4-4ε

I Now pass through a second layer σ(huge · (input− 3.5)).

>1-ε

Histograms apx via sigmoidal NNs, some math

I Histogram region is clearly 2-layer 0/1 NN:

1[∀i xi ≥ li ∧ xi ≤ ui]

= 1

[n∑i=1

(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n

].

I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).

I ε/2-apx f with histogram ai, Rimi=1;ε/(2m)-apx each bar with sigmoids.

I Final NN f(x) :=∑m

i=1 ai(sigmoidal apx to 1[x ∈ Ri]).I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

Histograms apx via sigmoidal NNs, some math

I Histogram region is clearly 2-layer 0/1 NN:

1[∀i xi ≥ li ∧ xi ≤ ui]

= 1

[n∑i=1

(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n

].

I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).

I ε/2-apx f with histogram ai, Rimi=1;ε/(2m)-apx each bar with sigmoids.

I Final NN f(x) :=∑m

i=1 ai(sigmoidal apx to 1[x ∈ Ri]).I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

Histograms apx via sigmoidal NNs, some math

I Histogram region is clearly 2-layer 0/1 NN:

1[∀i xi ≥ li ∧ xi ≤ ui]

= 1

[n∑i=1

(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n

].

I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).

I ε/2-apx f with histogram ai, Rimi=1;ε/(2m)-apx each bar with sigmoids.

I Final NN f(x) :=∑m

i=1 ai(sigmoidal apx to 1[x ∈ Ri]).I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

Histograms apx via sigmoidal NNs, some math

I Histogram region is clearly 2-layer 0/1 NN:

1[∀i xi ≥ li ∧ xi ≤ ui]

= 1

[n∑i=1

(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n

].

I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).

I ε/2-apx f with histogram ai, Rimi=1;ε/(2m)-apx each bar with sigmoids.

I Final NN f(x) :=∑m

i=1 ai(sigmoidal apx to 1[x ∈ Ri]).

I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

Histograms apx via sigmoidal NNs, some math

I Histogram region is clearly 2-layer 0/1 NN:

1[∀i xi ≥ li ∧ xi ≤ ui]

= 1

[n∑i=1

(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n

].

I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).

I ε/2-apx f with histogram ai, Rimi=1;ε/(2m)-apx each bar with sigmoids.

I Final NN f(x) :=∑m

i=1 ai(sigmoidal apx to 1[x ∈ Ri]).I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− ε,

x ∈ S =⇒ |f(x)− f(x)| < ε.

Folklore proof discussion

I Curse of dimension!

I Still, high level features useful.

Outline

I 2-nn via functional analysis (Cybenko, 1989).

I 2-nn via greedy approx (Barron, 1993).

I 3-nn via histograms (Folklore).

I 3-nn via wizardry (Kolmogorov, 1957).

Preface to the Kolmogorov Superposition Theorem

I In 1957, at 19, Kolmogorov’s student Vladimir Arnoldsolved Hilbert’s 13th problem:

Can the solution to

x7 + ax3 + bx2 + cx+ 1 = 0

be written as a finite number of 2-variablefunctions?

I Kolmogorov’s generalization is a NN apx theorem.

I The “transfer functions” (i.e., sigmoids), are not fixedacross nodes.

Preface to the Kolmogorov Superposition Theorem

I In 1957, at 19, Kolmogorov’s student Vladimir Arnoldsolved Hilbert’s 13th problem:

Can the solution to

x7 + ax3 + bx2 + cx+ 1 = 0

be written as a finite number of 2-variablefunctions?

I Kolmogorov’s generalization is a NN apx theorem.

I The “transfer functions” (i.e., sigmoids), are not fixedacross nodes.

Preface to the Kolmogorov Superposition Theorem

I In 1957, at 19, Kolmogorov’s student Vladimir Arnoldsolved Hilbert’s 13th problem:

Can the solution to

x7 + ax3 + bx2 + cx+ 1 = 0

be written as a finite number of 2-variablefunctions?

I Kolmogorov’s generalization is a NN apx theorem.

I The “transfer functions” (i.e., sigmoids), are not fixedacross nodes.

Preface to the Kolmogorov Superposition Theorem

I In 1957, at 19, Kolmogorov’s student Vladimir Arnoldsolved Hilbert’s 13th problem:

Can the solution to

x7 + ax3 + bx2 + cx+ 1 = 0

be written as a finite number of 2-variablefunctions?

I Kolmogorov’s generalization is a NN apx theorem.

I The “transfer functions” (i.e., sigmoids), are not fixedacross nodes.

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that

forany f ∈ C([0, 1]n),

there exist continuous χq

2n+1

q=1 ,

f(x) =

2n+1

∑q=1

χq

n

∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that

forany f ∈ C([0, 1]n), there exist continuous χq

2n+1

q=1 ,

f(x) =

2n+1

∑q=1

χq

n

∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that

forany f ∈ C([0, 1]n), there exist continuous χq

2n+1

q=1 ,

f(x) =

2n+1

∑q=1

χq

n∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that

forany f ∈ C([0, 1]n), there exist continuous χq2n+1

q=1 ,

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that forany f ∈ C([0, 1]n), there exist continuous χq2n+1

q=1 ,

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that forany f ∈ C([0, 1]n), there exist continuous χq2n+1

q=1 ,

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that forany f ∈ C([0, 1]n), there exist continuous χq2n+1

q=1 ,

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.

I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that forany f ∈ C([0, 1]n), there exist continuous χq2n+1

q=1 ,

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

The Kolmogorov Superposition Theorem

Theorem. (Kolmogorov, 1957)

There exist continuous ψp,q : p ∈ [n], q ∈ [2n+ 1], so that forany f ∈ C([0, 1]n), there exist continuous χq2n+1

q=1 ,

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I The proof is also histogram based.

I It must remove the “fuzz” gaps.I It must remove dependence on a particular ε > 0.

I The magic is within ψp,q!!

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing ψp,q

I Given resolution k ∈ Z++, build a staggered gridding.

q

I Let Sqk,i1,...,im range over the cells.

I Construct ψp,q so that for each k and q,∑n

p=1 ψp,q maps

each Sqk,i1,...,im to its own specific interval of R.

I Holds for all q: gracefully handles the fuzz regions.

I Holds for all k, simultaneously handles all resolutions.

I The functions ψp,q are fractals.

Constructing χq

I Recall goal:

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I Start χq0 := 0; eventually χq := limr→∞ χqr.

I To build χqr+1 from χqr:

I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation

in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.

I (I’m leaving a lot out =( )

Constructing χq

I Recall goal:

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I Start χq0 := 0; eventually χq := limr→∞ χ

qr.

I To build χqr+1 from χqr:

I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation

in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.

I (I’m leaving a lot out =( )

Constructing χq

I Recall goal:

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I Start χq0 := 0; eventually χq := limr→∞ χ

qr.

I To build χqr+1 from χqr:

I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation

in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.

I (I’m leaving a lot out =( )

Constructing χq

I Recall goal:

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I Start χq0 := 0; eventually χq := limr→∞ χ

qr.

I To build χqr+1 from χqr:I Choose kr+1 so gridding fluctuations sufficiently small.

I Make χq some value of f in each cell, smooth interpolationin fuzz.

I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.

I (I’m leaving a lot out =( )

Constructing χq

I Recall goal:

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I Start χq0 := 0; eventually χq := limr→∞ χ

qr.

I To build χqr+1 from χqr:I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation

in fuzz.

I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.

I (I’m leaving a lot out =( )

Constructing χq

I Recall goal:

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I Start χq0 := 0; eventually χq := limr→∞ χ

qr.

I To build χqr+1 from χqr:I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation

in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.

I (I’m leaving a lot out =( )

Constructing χq

I Recall goal:

f(x) =

2n+1∑q=1

χq

n∑p=1

ψp,q(xp)

I Start χq0 := 0; eventually χq := limr→∞ χ

qr.

I To build χqr+1 from χqr:I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation

in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.

I (I’m leaving a lot out =( )

Discussion 1

Discussion 2

I It needs different transfer functions..

I ..but these can be approximated by multiple layers!

I It shows the value of powerful nodes, whereas the standardapx results just suggest a very wide hidden layer.

Discussion 2

I It needs different transfer functions..

I ..but these can be approximated by multiple layers!

I It shows the value of powerful nodes, whereas the standardapx results just suggest a very wide hidden layer.

Discussion 2

I It needs different transfer functions..

I ..but these can be approximated by multiple layers!

I It shows the value of powerful nodes, whereas the standardapx results just suggest a very wide hidden layer.

Conclusion

I Some fancy mechanisms give 2-NNs fi → f ∈ C([0, 1]n).

I Histogram constructions hint to the power of deepernetworks.

Thanks!

Andrew R. Barron. Universal approximation bounds forsuperpositions of a sigmoidal function. IEEE Transactions onInformation theory, 39(3):930–945, 1993.

George Cybenko. Approximation by superpositions of asigmoidal function. Mathematics of Control, Signals, andSystems, 2:303–314, 1989.

Andrei N. Kolmogorov. On the representation of continuousfunctions of several veriables as superpositions of continuousfunctions of one variable and addition. Dokl. Acad. NaukSSSR, 114(5):953–956, 1957. Translation to English: V. M.Volosov.

Tong Zhang. Sequential greedy approximation for certainconvex optimization problems. IEEE Transactions onInformation Theory, 49(3):682–691, 2003.