
### Transcript of Representation Power of Feedforward Neural Networks (dasgupta/254-deep/matus.pdf)

• Representation Power of Feedforward Neural Networks

Based on work by Barron (1993), Cybenko (1989), Kolmogorov (1957)

Matus Telgarsky

• Feedforward Neural Networks

- Two node types:
  - Linear combinations: $x \mapsto \sum_i w_i x_i + w_0$.
  - Sigmoid thresholded linear combinations: $x \mapsto \sigma(\langle w, x\rangle + w_0)$.
- What can a network of these nodes represent?
  $\sum_{i=1}^n w_i x_i$ (one layer), $\sum_{i=1}^n w_i\, \sigma\Big(\sum_{j=1}^{n_i} w_{ji} x_j + w_{j0}\Big)$ (two layers), ...
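As a concrete sketch, the two node types and their two-layer composition can be written out in a few lines of NumPy (the weights below are arbitrary illustrative values, not from the slides):

```python
import numpy as np

def s(z):
    """Standard sigmoid s(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def linear_node(x, w, w0):
    # x -> <w, x> + w0
    return np.dot(w, x) + w0

def sigmoid_node(x, w, w0):
    # x -> s(<w, x> + w0)
    return s(np.dot(w, x) + w0)

def two_layer(x, W, b, v, v0):
    # sum_i v_i * s(<W_i, x> + b_i) + v0: a hidden layer of sigmoid
    # nodes feeding one linear output node.
    return np.dot(v, s(W @ x + b)) + v0

# Arbitrary example weights for a 2-input, 2-hidden-unit network.
x = np.array([0.5, -1.0])
W = np.array([[1.0, 2.0], [0.0, -1.0]])
b = np.array([0.0, 0.5])
v = np.array([1.0, -1.0])
print(two_layer(x, W, b, v, 0.0))
```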


- Target set $[0, 3]$; target function $1[x \in [1, 2]]$.
- Standard sigmoid $s(x) := 1/(1 + e^{-x})$.
- Consider sigmoid output at $x = 2$.
- $w \geq 0$ and $\sigma(2w + w_0) \geq 1/2$: mess up on right side.

  (figure: sigmoid vs. target on $[0, 3]$; the curve stays at or above $0.5$ past $x = 2$, where the target is $0$)

- $w \geq 0$ and $\sigma(2w + w_0) < 1/2$: mess up on middle bump.

  (figure: sigmoid vs. target on $[0, 3]$; the curve stays below $0.5$ on part of the bump, where the target is $1$)

- Can symmetrize ($w < 0$); no matter what, error $\geq 1/2$.
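The failure above can also be checked numerically. The sketch below (grid ranges and resolution are arbitrary choices, not from the slides) searches over single sigmoids $s(wx + w_0)$ and confirms the mean absolute error against the bump stays bounded away from zero:

```python
import numpy as np

def s(z):
    """Standard sigmoid s(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(0.0, 3.0, 601)
target = ((xs >= 1.0) & (xs <= 2.0)).astype(float)  # 1[x in [1, 2]]

# Grid-search a single sigmoid's weights; no choice fits the bump.
best_err = np.inf
for w in np.linspace(-20.0, 20.0, 81):
    for w0 in np.linspace(-30.0, 30.0, 121):
        err = np.mean(np.abs(s(w * xs + w0) - target))
        best_err = min(best_err, err)

print(best_err)  # a monotone curve must misfit at least one side of the bump
```

Since a sigmoid is monotone while the target rises and then falls, the best it can do is match one edge of the bump and give up the other side.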


• Meaning of Universal Approximation

Target set $[0, 1]^n$; target function $f \in C([0, 1]^n)$.

- For any $\epsilon > 0$, exists NN $\hat f$,
  $x \in [0, 1]^n \implies |f(x) - \hat f(x)| < \epsilon$.
- This gives NNs $\hat f_i \to f$ pointwise.
- For any $\epsilon > 0$, exists NN $\hat f$ and $S \subseteq [0, 1]^n$, $m(S) \geq 1 - \epsilon$,
  $x \in S \implies |f(x) - \hat f(x)| < \epsilon$.
- If (for instance) bounded on $S^c$, gives NNs $\hat f_i \to f$ $m$-a.e.

Goal: 2-NNs approximate continuous functions over $[0, 1]^n$.
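In contrast to the single-sigmoid failure, a 2-NN with just two hidden units already handles the earlier bump target. The weights below are hand-picked for illustration (an assumption, not from the slides): the difference of two steep sigmoids rises near $1$ and falls near $2$, approximating $1[x \in [1, 2]]$ well away from the jumps.

```python
import numpy as np

def s(z):
    """Standard sigmoid s(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def bump_nn(x, k=50.0):
    # A 2-NN (one hidden layer, two sigmoid units, linear output):
    # s(k(x - 1)) - s(k(x - 2)) rises sharply near 1 and falls near 2.
    return s(k * (x - 1.0)) - s(k * (x - 2.0))

xs = np.linspace(0.0, 3.0, 601)
target = ((xs >= 1.0) & (xs <= 2.0)).astype(float)

# Uniform error away from the two discontinuities of the target.
away = (np.abs(xs - 1.0) > 0.2) & (np.abs(xs - 2.0) > 0.2)
print(np.max(np.abs(bump_nn(xs) - target)[away]))
```

Increasing the steepness `k` shrinks the neighborhoods of $1$ and $2$ where the approximation is poor, which matches the "except a set of small measure" formulation above.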


• Outline

- 2-NN via functional analysis (Cybenko, 1989).
- 2-NN via greedy approximation (Barron, 1993).
- 3-NN via histograms (folklore).
- 3-NN via wizardry (Kolmogorov, 1957).

• Overview of Functional Analysis proof (Cybenko, 1989)

- Hidden layer as a basis:
  $B := \{\sigma(\langle w, x\rangle + w_0) : w \in \mathbb{R}^n,\ w_0 \in \mathbb{R}\}$.
- Want to show $\mathrm{cl}(\mathrm{span}(B)) = C([0, 1]^n)$.
- Work via contradiction: if $f \in C([0, 1]^n)$ far from $\mathrm{cl}(\mathrm{span}(B))$, can bridge the gap with a sigmoid.


• Abstracting

- Cybenko needs $\sigma$ discriminates:
  $\mu = 0 \iff \forall w, w_0\quad \int \sigma(\langle w, x\rangle + w_0)\,d\mu(x) = 0$.
- Satisfied for the standard choices
  $s(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{1}{2}(\tanh(x) + 1) = \frac{1}{2}\left(\frac{e^x - e^{-x}}{e^x + e^{-x}} + 1\right) = s(2x)$.
- Most results today only need $\sigma$ approximates $1[x \geq 0]$:
  $\sigma(x) \to \begin{cases} 1 & \text{as } x \to +\infty, \\ 0 & \text{as } x \to -\infty. \end{cases}$

  Combined with bounded & measurable, gives discriminatory (Cybenko, 1989, Lemma 1).
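The identity relating tanh to the standard sigmoid is easy to verify numerically; this quick check confirms $(\tanh(x) + 1)/2 = s(2x)$ to machine precision:

```python
import numpy as np

def s(z):
    """Standard sigmoid s(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# (tanh(x) + 1)/2 is just the standard sigmoid with its input doubled,
# so either activation yields the same representational results.
xs = np.linspace(-5.0, 5.0, 101)
gap = np.max(np.abs((np.tanh(xs) + 1.0) / 2.0 - s(2.0 * xs)))
print(gap)
```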


• Proof of Cybenko (1989)

- Consider the closed subspace
  $S := \mathrm{cl}\big(\mathrm{span}(\{\sigma(\langle w, x\rangle + w_0) : w \in \mathbb{R}^n,\ w_0 \in \mathbb{R}\})\big)$.
- Suppose (contradictorily) exists $f \in C([0, 1]^n) \setminus S$.
- Exists bounded linear $L \neq 0$ with $L|_S = 0$ (Hahn–Banach).
- Exists $\mu \neq 0$ so that $L(h) = \int h(x)\,d\mu(x)$ (Riesz representation).
- $L|_S = 0$ implies $\forall w, w_0\ \int \sigma(\langle w, x\rangle + w_0)\,d\mu(x) = 0$.
- But $\sigma$ discriminatory means $\mu = 0$, a contradiction.
