
  • Representation Power of Feedforward Neural Networks

    Based on work by Barron (1993), Cybenko (1989), Kolmogorov (1957)

    Matus Telgarsky

  • Feedforward Neural Networks

    - Two node types:
      - Linear combinations:

        x ↦ ∑_i w_i x_i + w_0.

      - Sigmoid thresholded linear combinations:

        x ↦ σ(〈w, x〉 + w_0).

    - What can a network of these nodes represent?

      ∑_{i=1}^n w_i x_i   (one layer),

      ∑_{i=1}^n w_i σ(∑_{j=1}^{n_i} w_{ji} x_j + w_{j0})   (two layers),

      ...


  • Forget about 1 layer.

    - Target set [0, 3]; target function 1[x ∈ [1, 2]].
    - Standard sigmoid σ_s(x) := 1/(1 + e^{−x}).
    - Consider the sigmoid output at x = 2.
    - w ≥ 0 and σ(2w + w_0) ≥ 1/2: mess up on the right side.

      [figure: the bump 1[x ∈ [1, 2]] on [0, 3], with a sigmoid sitting at or above height 0.5 at x = 2]

    - w ≥ 0 and σ(2w + w_0) < 1/2: mess up on the middle bump.

      [figure: the same bump, with the sigmoid below height 0.5 at x = 2]

    - Can symmetrize (w < 0); no matter what, error ≥ 1/2. (Numerical sketch below.)
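    As a quick check of this claim (my addition, not part of the original deck), the following sketch grid-searches a single sigmoid unit x ↦ σ(wx + w_0) against the bump 1[x ∈ [1, 2]] on [0, 3]; the best sup-norm error it finds is 1/2, matching the argument above. The grid ranges are arbitrary choices.

        # Sanity check: a single sigmoid unit cannot beat sup-norm error 1/2
        # against the bump 1[x in [1,2]] on [0,3].
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        xs = np.linspace(0.0, 3.0, 601)                 # sample the target set
        target = ((xs >= 1.0) & (xs <= 2.0)).astype(float)

        best = np.inf
        for w in np.linspace(-50.0, 50.0, 201):         # crude grid over the weight
            for w0 in np.linspace(-100.0, 100.0, 401):  # and the bias
                err = np.max(np.abs(target - sigmoid(w * xs + w0)))
                best = min(best, err)

        print(f"best sup-norm error over the grid: {best:.3f}")   # 0.500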


  • Meaning of “Universal Approximation”

    Target set [0, 1]^n; target function f ∈ C([0, 1]^n).

    - For any ε > 0, there exists an NN f̂ with

      x ∈ [0, 1]^n =⇒ |f(x) − f̂(x)| < ε.

    - This gives NNs f̂_i → f pointwise.
    - For any ε > 0, there exist an NN f̂ and S ⊆ [0, 1]^n with m(S) ≥ 1 − ε,

      x ∈ S =⇒ |f(x) − f̂(x)| < ε.

    - If (for instance) bounded on S^c, this gives NNs f̂_i → f m-a.e.

    Goal: 2-NNs approximate continuous functions over [0, 1]^n.


  • Outline

    I 2-nn via functional analysis (Cybenko, 1989).

    I 2-nn via greedy approx (Barron, 1993).

    I 3-nn via histograms (Folklore).

    I 3-nn via wizardry (Kolmogorov, 1957).

  • Overview of Functional Analysis proof (Cybenko, 1989)

    - Hidden layer as a basis:

      B := {σ(〈w, x〉 + w_0) : w ∈ R^n, w_0 ∈ R}.

    - Want to show cl(span(B)) = C([0, 1]^n).
    - Work via contradiction: if f ∈ C([0, 1]^n) is far from cl(span(B)), can bridge the gap with a sigmoid.


  • Abstracting σ

    - Cybenko needs that σ discriminates:

      µ = 0 ⇐⇒ ∀w, w_0 : ∫ σ(〈w, x〉 + w_0) dµ(x) = 0.

    - Satisfied for the standard choices:

      σ_s(x) = 1/(1 + e^{−x}),

      (1/2)(tanh(x) + 1) = (1/2)((e^x − e^{−x})/(e^x + e^{−x}) + 1) = σ_s(2x)   (checked numerically below).

    - Most results today only need that σ approximates 1[x ≥ 0]:

      σ(x) → 1 as x → +∞,   σ(x) → 0 as x → −∞.

      Combined with σ bounded & measurable, this gives discriminatory (Cybenko, 1989, Lemma 1).
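    A one-line numerical check of the tanh identity above (my addition; the slides assert it, and it follows since (tanh(x) + 1)/2 = e^x/(e^x + e^{−x}) = 1/(1 + e^{−2x})):

        # Verify (tanh(x) + 1)/2 == sigma_s(2x) on a grid.
        import numpy as np

        def sigma_s(z):
            return 1.0 / (1.0 + np.exp(-z))

        xs = np.linspace(-10.0, 10.0, 1001)
        gap = np.max(np.abs(0.5 * (np.tanh(xs) + 1.0) - sigma_s(2.0 * xs)))
        print(gap)   # ~1e-16: the rescaled tanh is the standard sigmoid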


  • Proof of Cybenko (1989)

    - Consider the closed subspace

      S := cl(span({σ(〈w, x〉 + w_0) : w ∈ R^n, w_0 ∈ R})).

    - Suppose (contradictorily) there exists f ∈ C([0, 1]^n) \ S.
    - There exists a bounded linear L ≠ 0 with L|_S = 0.
      - Can think of this as projecting f onto S^⊥.
      - Don’t need inner products: Hahn–Banach Theorem.
    - There exists µ ≠ 0 so that L(h) = ∫ h(x) dµ(x).
      - In R^n, linear functions have the form 〈·, y〉 for some y ∈ R^n.
      - With infinite dimensions, integrals give inner products.
      - A form of the Riesz Representation Theorem.
    - Contradiction: σ is discriminatory.
      - L|_S = 0 implies ∀w, w_0 : ∫ σ(〈w, x〉 + w_0) dµ(x) = 0.
      - But “discriminatory” means µ = 0 ⇐⇒ ∀w, w_0 : ∫ ... = 0.


  • What was Cybenko’s proof about?

    - Set S := cl(span({σ(〈w, x〉 + w_0) : w ∈ R^n, w_0 ∈ R})).
    - Another way to look at it:
      - Given f ∈ C([0, 1]^n) and a current approximant f̂ ∈ S.
      - The residue f − f̂ must have nonzero projection onto S.
      - Add the residue’s projection into f̂; does this now match f?
    - Can this be turned into an algorithm?


  • Outline

    I 2-nn via functional analysis (Cybenko, 1989).

    I 2-nn via greedy approx (Barron, 1993).

    I 3-nn via histograms (Folklore).

    I 3-nn via wizardry (Kolmogorov, 1957).

  • The method of Barron (1993, Section 8)

    Input: basis set B, target f ∈ cl(conv(B)).

    1. Choose arbitrary f̂_0 ∈ B.
    2. For t = 1, 2, . . .:
       2.1 Choose (α_t, g_t) apx minimizing (tolerance O(1/t^2))

           inf_{α ∈ [0,1], g ∈ B} ‖f − (α f̂_{t−1} + (1 − α) g)‖_2^2.

       2.2 Update f̂_t := α_t f̂_{t−1} + (1 − α_t) g_t.

    - Can rescale B so f ∈ cl(conv(B)) (or tweak the algorithm).
    - Is this familiar? YES: gradient descent, coordinate descent, Frank–Wolfe, projection pursuit, basis pursuit, boosting... as usual, cf. Zhang (2003).
    - Barron carefully shows the rate ‖f − f̂_t‖_2^2 = O(1/t). (Runnable sketch below.)
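    Here is a minimal runnable sketch of the greedy loop above (my own toy instantiation, not Barron’s construction: a finite dictionary of signed sigmoids on a grid, a discretized L2 norm, and a target that need not lie exactly in cl(conv(B)), in which case the loop converges to the best convex approximation instead).

        # Greedy (Frank–Wolfe-style) approximation with a finite sigmoid dictionary.
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        xs = np.linspace(0.0, 1.0, 400)
        f = 0.3 * np.sin(2 * np.pi * xs)                 # toy target

        # Dictionary B: signed sigmoids c * sigmoid(w*x + w0) on a parameter grid.
        B = np.array([c * sigmoid(w * xs + w0)
                      for w in np.linspace(-40, 40, 41)
                      for w0 in np.linspace(-40, 40, 41)
                      for c in (-1.0, 1.0)])

        fhat = B[0].copy()                               # step 1: arbitrary f̂_0 ∈ B
        for t in range(1, 100):
            d = B - fhat                                 # all directions g − f̂_{t−1}
            # step 2.1: for each g, the optimal β = 1 − α has a closed form.
            beta = np.sum((f - fhat) * d, axis=1) / (np.sum(d * d, axis=1) + 1e-12)
            beta = np.clip(beta, 0.0, 1.0)
            errs = np.sum((f - (fhat + beta[:, None] * d)) ** 2, axis=1)
            j = int(np.argmin(errs))
            fhat = fhat + beta[j] * d[j]                 # step 2.2: convex update

        print("final mean squared error:", float(np.mean((f - fhat) ** 2)))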


  • Greedy approx as coordinate descent

    - Intuition: linear operator A with the basis B as “columns”:

      f̂_t = A w_t,   g = A e_j.

    - Split the update into two steps:

      inf_{g ∈ B} ‖f − (f̂_{t−1} + g)‖_2^2 = inf_i ‖f − (A w_{t−1} + A e_i)‖_2^2,

      inf_{α ∈ [0,1]} ‖f − (α f̂_{t−1} + (1 − α) g_t)‖_2^2.

    - Suppose ‖g‖_2 = 1 for all g ∈ B; the first step is equivalent to

      sup_{g ∈ B} 〈g, f̂_{t−1} − f〉 = sup_i (A e_i)^⊤ (A w_{t−1} − f).

    - The second version is a coordinate descent update! (Demo below.)
    - (Standard rate proofs use the curvature of ‖·‖_2^2.)
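    A tiny concrete check of this identification (my addition), with the basis stacked as the columns of an explicit matrix:

        # With unit-norm columns, greedy basis scoring = coordinate scoring.
        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.normal(size=(50, 20))
        A /= np.linalg.norm(A, axis=0)        # enforce ‖g‖_2 = 1 for each column
        f = rng.normal(size=50)
        w = rng.normal(size=20)               # current iterate, f̂ = A w
        residual = A @ w - f                  # f̂_{t−1} − f

        greedy = np.array([g @ residual for g in A.T])   # 〈g, f̂_{t−1} − f〉, g ∈ B
        coord = A.T @ residual                           # (A e_i)^T (A w_{t−1} − f)
        assert np.allclose(greedy, coord)                # the same selection rule
        print("selected index either way:", int(np.argmax(coord)))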


  • Story for 2-layer NNs so far

    - Can L2-apx any f ∈ C([0, 1]^n) at rate O(1/√t)...
    - ...so why aren’t we using this algorithm for neural nets?
    - This is not an easy optimization problem:

      inf_{α ∈ [0,1], w ∈ R^n, w_0 ∈ R} ‖f − (α f̂_{t−1} + (1 − α) σ(〈w, ·〉 + w_0))‖_2^2.

    - For a general basis B, must check each basis element...
    - From an algorithmic perspective, much remains to be done.


  • Lower bound from Barron (1993, Section 10)

    - The greedy algorithm gets to search the whole basis.
    - If an n-element basis B_n is fixed, then

      sup_{f ∈ cl(span_C(B))} inf_{f̂ ∈ span(B_n)} ‖f − f̂‖_2 = Ω(1/n^{1/d}),

      the curse of dimension!
    - Can defeat this with more layers (cf. Kolmogorov (1957)).


  • Outline

    I 2-nn via functional analysis (Cybenko, 1989).

    I 2-nn via greedy approx (Barron, 1993).

    I 3-nn via histograms (Folklore).

    I 3-nn via wizardry (Kolmogorov, 1957).

  • Folklore proof of NN power

    - What’s an easy way to write a function as a sum of others?
    - Project onto your favorite basis (Fourier, Chebyshev, ...)!
    - Let’s use this to build NNs.
    - Easiest basis: histograms!


  • Histogram basis for C([0, 1]^n)

    - Grid [0, 1]^n into {R_i}_{i=1}^m; consider

      f̂(x) = ∑_{i=1}^m a_i 1[x ∈ R_i],

      where a_i = f(x) for some x ∈ R_i (the target’s value somewhere in the cell).
    - Fact: for any ε > 0, there exists a histogram f̂ with

      x ∈ [0, 1]^n =⇒ |f(x) − f̂(x)| < ε.

    - Pf. A continuous function over a compact set is uniformly continuous.
    - Now just write the individual histogram bars as NNs. (Numerical sketch below.)
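    A minimal sketch of this fact in code (my addition, with an arbitrary continuous target on [0, 1]^2): set each bar’s height to the target’s value at the cell center and watch the sup error shrink as the grid refines.

        # Histogram approximants: sup error shrinks as the grid refines.
        import numpy as np

        def f(x, y):
            return np.sin(2 * np.pi * x) * np.cos(np.pi * y)   # continuous target

        for k in (4, 16, 64):                                  # cells per axis
            edges = np.linspace(0.0, 1.0, k + 1)
            centers = (edges[:-1] + edges[1:]) / 2
            heights = f(centers[:, None], centers[None, :])    # a_i = f(cell center)

            xs = np.linspace(0.0, 1.0, 257)                    # fine evaluation grid
            ix = np.minimum((xs * k).astype(int), k - 1)       # cell index per point
            fhat = heights[ix[:, None], ix[None, :]]
            err = np.max(np.abs(f(xs[:, None], xs[None, :]) - fhat))
            print(f"k={k:3d}  sup error = {err:.3f}")          # decreases with k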


  • Histograms via 0/1 NNs

    - Write 1[x ∈ R] as a sum/composition of 1[〈w, x〉 ≥ w_0].
    - 2-D example. First carve out the axes.

      [figure: the four half-plane indicators for the cell [1, 2]^2 are summed; the count is 2 on the outer corner regions, 3 along the edge regions, and 4 on the cell itself]

    - Now pass through a second layer 1[input ≥ 3.5].

      [figure: the result is 1 on the cell and 0 everywhere else]

    (Runnable version below.)
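    The carving above is short enough to run directly; here is a sketch (my choice of cell and sample points) with hard threshold units:

        # 2-D cell indicator as a two-layer 0/1 network: four half-plane units,
        # then a second-layer threshold at 3.5 (= 2n − 1/2 for n = 2).
        import numpy as np

        def step(z):
            return (z >= 0).astype(float)        # the unit 1[z ≥ 0]

        def cell_indicator(x, l, u):
            s = sum(step(x[:, i] - l[i]) + step(u[i] - x[:, i]) for i in range(2))
            return step(s - 3.5)                 # fires iff all four constraints hold

        pts = np.random.default_rng(1).uniform(0.0, 3.0, size=(100000, 2))
        out = cell_indicator(pts, l=(1.0, 1.0), u=(2.0, 2.0))
        truth = np.all((pts >= 1.0) & (pts <= 2.0), axis=1).astype(float)
        print("matches the true indicator:", bool(np.all(out == truth)))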


  • Histograms via sigmoidal NNs

    - Write 1[x ∈ R] as a sum/composition of σ(〈w, x〉 + w_0).
    - 2-D example. First carve out the axes, with a fuzz region.

      [figure: the same carving with soft boundaries; away from the fuzz, each pair of sigmoids sums to > 2 − 2ε inside its slab]


  • Histograms apx via sigmoidal NNs, some math

    - A histogram region is clearly a 2-layer 0/1 NN:

      1[∀i : x_i ≥ l_i ∧ x_i ≤ u_i]
        = 1[ ∑_{i=1}^n (1[x_i ≥ l_i] + 1[x_i ≤ u_i]) ≥ 2n − 1/2 ].

    - Choose B huge; apx 1[〈w, x〉 ≥ w_0] with σ(B(〈w, x〉 − w_0)).
    - ε/2-apx f with a histogram {a_i, R_i}_{i=1}^m; ε/(2m)-apx each bar with sigmoids.
    - Final NN: f̂(x) := ∑_{i=1}^m a_i (sigmoidal apx to 1[x ∈ R_i]).
    - Have, outside the fuzz, S ⊆ [0, 1]^n with m(S) ≥ 1 − ε,

      x ∈ S =⇒ |f(x) − f̂(x)| < ε.

    (Sigmoidal sketch below.)
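    A sketch of the sigmoidal construction (B and the fuzz width below are my choices): replace each threshold with σ(B ·); the output then matches the indicator everywhere except an O(1/B)-wide fuzz around the cell boundary.

        # Soften the two-layer cell indicator with sigma(B*z) units.
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def soft_cell(x, l, u, B):
            s = sum(sigmoid(B * (x[:, i] - l[i])) + sigmoid(B * (u[i] - x[:, i]))
                    for i in range(2))
            return sigmoid(B * (s - 3.5))        # soft second-layer threshold

        B = 200.0
        pts = np.random.default_rng(2).uniform(0.0, 3.0, size=(100000, 2))
        out = soft_cell(pts, l=(1.0, 1.0), u=(2.0, 2.0), B=B)
        truth = np.all((pts >= 1.0) & (pts <= 2.0), axis=1).astype(float)

        # fuzz: points within 0.05 of any boundary line of the cell
        fuzz = np.any((np.abs(pts - 1.0) < 0.05) | (np.abs(pts - 2.0) < 0.05), axis=1)
        print(f"max error away from the fuzz: {np.max(np.abs(out - truth)[~fuzz]):.1e}")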


  • Folklore proof discussion

    - Curse of dimension!
    - Still, high-level features are useful.

  • Outline

    I 2-nn via functional analysis (Cybenko, 1989).

    I 2-nn via greedy approx (Barron, 1993).

    I 3-nn via histograms (Folklore).

    I 3-nn via wizardry (Kolmogorov, 1957).

  • Preface to the Kolmogorov Superposition Theorem

    - In 1957, at 19, Kolmogorov’s student Vladimir Arnold solved Hilbert’s 13th problem:

      Can the solution to

        x^7 + ax^3 + bx^2 + cx + 1 = 0

      be written using a finite number of 2-variable functions?

    - Kolmogorov’s generalization is a NN apx theorem.
    - The “transfer functions” (i.e., sigmoids) are not fixed across nodes.


  • The Kolmogorov Superposition Theorem

    Theorem (Kolmogorov, 1957).
    There exist continuous {ψ_{p,q} : p ∈ [n], q ∈ [2n+1]}, so that for any f ∈ C([0, 1]^n) there exist continuous {χ_q}_{q=1}^{2n+1} with

      f(x) = ∑_{q=1}^{2n+1} χ_q( ∑_{p=1}^n ψ_{p,q}(x_p) ).

    - The proof is also histogram based.
      - It must remove the “fuzz” gaps.
      - It must remove dependence on a particular ε > 0.
    - The magic is within ψ_{p,q}!!


  • Constructing ψp,q

    - Given resolution k ∈ Z_{++}, build a staggered gridding.

      [figure: the staggered grids, indexed by q]

    - Let S^q_{k,i_1,...,i_n} range over the cells.
    - Construct ψ_{p,q} so that for each k and q, ∑_{p=1}^n ψ_{p,q} maps each S^q_{k,i_1,...,i_n} to its own specific interval of R.
      - Holds for all q: gracefully handles the fuzz regions.
      - Holds for all k: simultaneously handles all resolutions.
    - The functions ψ_{p,q} are fractals.


  • Constructing χq

    - Recall the goal:

      f(x) = ∑_{q=1}^{2n+1} χ_q( ∑_{p=1}^n ψ_{p,q}(x_p) ).

    - Start χ^q_0 := 0; eventually χ^q := lim_{r→∞} χ^q_r.
    - To build χ^q_{r+1} from χ^q_r:
      - Choose k_{r+1} so the gridding fluctuations are sufficiently small.
      - Make χ^q some value of f in each cell, with smooth interpolation in the fuzz.
      - Every x ∈ [0, 1]^n hits at least n + 1 cells S^q_{k,i_1,...,i_n}; i.e., since q ∈ [2n+1], more than half the approximations are good.
    - (I’m leaving a lot out =( )


  • Discussion 1

  • Discussion 2

    - It needs different transfer functions...
    - ...but these can be approximated by multiple layers!
    - It shows the value of powerful nodes, whereas the standard apx results just suggest a very wide hidden layer.


  • Conclusion

    - Some fancy mechanisms give 2-NNs f̂_i → f ∈ C([0, 1]^n).
    - Histogram constructions hint at the power of deeper networks.

  • Thanks!

    Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

    George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.

    Andrei N. Kolmogorov. On the representation of continuous functions of several variables as superpositions of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR, 114(5):953–956, 1957. Translated to English by V. M. Volosov.

    Tong Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003.