Representation Power of Feedforward Neural Networks


  • Representation Power of Feedforward Neural Networks

    Based on work by Barron (1993), Cybenko (1989), Kolmogorov (1957)

    Matus Telgarsky

  • Feedforward Neural Networks

    ▶ Two node types:
      ▶ Linear combinations: $x \mapsto \sum_i w_i x_i + w_0$.
      ▶ Sigmoid-thresholded linear combinations: $x \mapsto \sigma(\langle w, x \rangle + w_0)$.
    ▶ What can a network of these nodes represent?

      $\sum_{i=1}^{n} w_i x_i$  (one layer),

      $\sum_{i=1}^{n} w_i \, \sigma\!\Big( \sum_{j=1}^{n_i} w_{ji} x_j + w_{j0} \Big)$  (two layers),

      ...

    (A minimal code sketch of these two node types follows this slide.)
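
    A minimal sketch (my addition, not from the slides) composing the two node types into a two-layer network; the NumPy usage and all weight values are illustrative assumptions.

        # Two-layer feedforward network: sigmoid-thresholded linear
        # combinations feeding one final linear-combination node.
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def two_layer_nn(x, W, b, v, v0):
            """x -> sum_i v_i * sigmoid(<W_i, x> + b_i) + v0."""
            hidden = sigmoid(W @ x + b)  # hidden sigmoid nodes
            return v @ hidden + v0       # output linear node

        rng = np.random.default_rng(0)
        n, m = 3, 5  # input dimension, hidden width (arbitrary)
        W, b = rng.normal(size=(m, n)), rng.normal(size=m)
        v, v0 = rng.normal(size=m), 0.0
        print(two_layer_nn(rng.normal(size=n), W, b, v, v0))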


  • Forget about 1 layer.

    ▶ Target set $[0, 3]$; target function $\mathbf{1}[x \in [1, 2]]$.
    ▶ Standard sigmoid $s(x) := 1/(1 + e^{-x})$.
    ▶ Consider the sigmoid output at $x = 2$.
    ▶ $w \geq 0$ and $\sigma(2w + w_0) \geq 1/2$: mess up on the right side.

      [Plot: a nondecreasing sigmoid over $[0, 3]$ staying above $1/2$ past $x = 2$, against the indicator target; y-axis marks at $0.5$ and $1$.]

    ▶ $w \geq 0$ and $\sigma(2w + w_0) < 1/2$: mess up on the middle bump.

      [Plot: the sigmoid staying below $1/2$ on the bump $[1, 2]$; same axes.]

    ▶ Can symmetrize ($w < 0$); no matter what, error $\geq 1/2$ (numeric check below).
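
    A quick numeric check (my addition): grid-search a single sigmoid unit $s(wx + w_0)$ against the target $\mathbf{1}[x \in [1, 2]]$; the best achievable worst-case error never drops below $1/2$, matching the slide. The grid resolutions are arbitrary.

        # One sigmoid cannot track 0 -> 1 -> 0: sup-norm error >= 1/2.
        import numpy as np

        def s(z):
            return 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))

        xs = np.linspace(0.0, 3.0, 301)
        target = ((xs >= 1.0) & (xs <= 2.0)).astype(float)

        best = np.inf
        for w in np.linspace(-20.0, 20.0, 201):
            for w0 in np.linspace(-40.0, 40.0, 201):
                best = min(best, np.max(np.abs(s(w * xs + w0) - target)))
        print(best)  # 0.5: a monotone curve must fail on one of the three pieces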


  • Meaning of Universal Approximation

    Target set $[0, 1]^n$; target function $f \in C([0, 1]^n)$.

    ▶ For any $\epsilon > 0$, there exists an NN $\hat f$ with
      $x \in [0, 1]^n \implies |f(x) - \hat f(x)| < \epsilon$.
    ▶ This gives NNs $\hat f_i \to f$ pointwise.
    ▶ For any $\epsilon > 0$, there exist an NN $\hat f$ and $S \subseteq [0, 1]^n$ with $m(S) \geq 1 - \epsilon$,
      $x \in S \implies |f(x) - \hat f(x)| < \epsilon$.
    ▶ If (for instance) bounded on $S^c$, this gives NNs $\hat f_i \to f$ $m$-a.e.

    Goal: 2-NNs approximate continuous functions over $[0, 1]^n$ (a one-dimensional illustration follows).
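
    A hedged one-dimensional illustration (my addition, not Cybenko's construction): differences of steep shifted sigmoids form near-indicator "bumps", and a weighted sum of bumps is a one-hidden-layer network that approximates a continuous $f$; the target $\sin(2\pi x)$, the cell count $k$, and the steepness $\gamma$ are arbitrary choices.

        # Bump-sum 2-NN: s(g(x-l)) - s(g(x-r)) ~ 1[l <= x <= r].
        import numpy as np

        def s(z):
            return 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))

        def bump_nn(x, f, k=100, gamma=5000.0):
            """One-hidden-layer net with 2k sigmoid units."""
            edges = np.linspace(0.0, 1.0, k + 1)
            mids = (edges[:-1] + edges[1:]) / 2
            out = np.zeros_like(x)
            for left, right, c in zip(edges[:-1], edges[1:], f(mids)):
                out += c * (s(gamma * (x - left)) - s(gamma * (x - right)))
            return out

        f = lambda x: np.sin(2 * np.pi * x)
        xs = np.linspace(0.0, 1.0, 2000)
        print(np.max(np.abs(bump_nn(xs, f) - f(xs))))  # small uniform error
        # (improves with larger k and correspondingly larger gamma)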


  • Outline

    ▶ 2-NN via functional analysis (Cybenko, 1989).
    ▶ 2-NN via greedy approximation (Barron, 1993).
    ▶ 3-NN via histograms (folklore).
    ▶ 3-NN via wizardry (Kolmogorov, 1957).

  • Overview of Functional Analysis proof (Cybenko, 1989)

    ▶ Hidden layer as a basis:
      $B := \{ \sigma(\langle w, x \rangle + w_0) : w \in \mathbb{R}^n, w_0 \in \mathbb{R} \}$.
    ▶ Want to show $\mathrm{cl}(\mathrm{span}(B)) = C([0, 1]^n)$.
    ▶ Work via contradiction: if $f \in C([0, 1]^n)$ is far from $\mathrm{cl}(\mathrm{span}(B))$, we can bridge the gap with a sigmoid.


  • Abstracting σ

    ▶ Cybenko needs σ discriminatory:
      $\Big( \forall w, w_0: \int \sigma(\langle w, x \rangle + w_0) \, d\mu(x) = 0 \Big) \implies \mu = 0$.
    ▶ Satisfied by the standard choices
      $s(x) = \frac{1}{1 + e^{-x}}$ and
      $\frac{1}{2}(\tanh(x) + 1) = \frac{1}{2}\Big( \frac{e^x - e^{-x}}{e^x + e^{-x}} + 1 \Big) = s(2x)$.
    ▶ Most results today only need σ to approximate $\mathbf{1}[x \geq 0]$:
      $\sigma(x) \to \begin{cases} 1 & \text{as } x \to +\infty, \\ 0 & \text{as } x \to -\infty. \end{cases}$
      Combined with bounded & measurable, this gives discriminatory (Cybenko, 1989, Lemma 1). A numeric sanity check follows.
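
    A small numeric sanity check (my addition) of the two claims above: the rescaled tanh coincides with $s(2x)$, and steepened sigmoids $s(\gamma x)$ approach the step $\mathbf{1}[x \geq 0]$ pointwise away from $0$.

        import numpy as np

        def s(z):
            return 1.0 / (1.0 + np.exp(-z))

        xs = np.linspace(-5.0, 5.0, 11)
        # (tanh(x) + 1)/2 == s(2x), up to floating point
        print(np.allclose((np.tanh(xs) + 1.0) / 2.0, s(2.0 * xs)))  # True

        # s(gamma * x) -> 1[x >= 0] as gamma grows (for x != 0)
        for gamma in (1.0, 10.0, 100.0):
            print(gamma, s(gamma * np.array([-0.1, 0.1])))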


  • Proof of Cybenko (1989)

    ▶ Consider the closed subspace
      $S := \mathrm{cl}\big( \mathrm{span}(\{ \sigma(\langle w, x \rangle + w_0) : w \in \mathbb{R}^n, w_0 \in \mathbb{R} \}) \big)$.
    ▶ Suppose (contradictorily) there exists $f \in C([0, 1]^n) \setminus S$.
    ▶ There exists a bounded linear $L \neq 0$ with $L|_S = 0$ (Hahn–Banach).
    ▶ There exists $\mu \neq 0$ so that $L(h) = \int h(x) \, d\mu(x)$ (Riesz representation).
    ▶ Contradiction: σ is discriminatory.
      ▶ $L|_S = 0$ implies $\forall w, w_0: \int \sigma(\langle w, x \rangle + w_0) \, d\mu(x) = 0$.
      ▶ But discriminatory means $\mu = 0$, contradicting $\mu \neq 0$.

    (The two classical facts invoked above are restated after this slide.)
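
    For reference, the two classical facts the proof invokes, restated in LaTeX (my paraphrase; see Cybenko (1989) for the exact statements used):

        % Hahn--Banach: a point outside a proper closed subspace of
        % C([0,1]^n) is separated from it by a nonzero bounded functional.
        \[
          S \subsetneq C([0,1]^n) \ \text{closed},\quad f \notin S
          \;\Longrightarrow\;
          \exists\, L \in C([0,1]^n)^*,\ L \neq 0,\ L|_S = 0.
        \]
        % Riesz representation: bounded linear functionals on C([0,1]^n)
        % are integration against finite signed regular Borel measures.
        \[
          L \in C([0,1]^n)^*
          \;\Longrightarrow\;
          \exists\, \mu \ \text{(finite signed regular Borel)}:\quad
          L(h) = \int_{[0,1]^n} h \, d\mu.
        \]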
