
Deep Learning

Vazgen Mikayelyan

December 22, 2020



WGANs

Let $\mathcal{X}$ be a compact metric space (e.g., $[0,1]^d$) and let $\Sigma$ denote the set of all Borel subsets of $\mathcal{X}$.

Total Variation (TV) distance:
$$\delta(P_r, P_g) = \sup_{A \in \Sigma} \left| P_r(A) - P_g(A) \right|.$$

Kullback-Leibler (KL) divergence:
$$KL(P_r \,\|\, P_g) = \int_{\mathcal{X}} \log\!\left(\frac{p_r(x)}{p_g(x)}\right) p_r(x)\, d\mu(x),$$
where $p_r$ and $p_g$ are the densities of $P_r$ and $P_g$ with respect to a common measure $\mu$ on $\mathcal{X}$.

Jensen-Shannon (JS) distance:
$$JS(P_r \,\|\, P_g) = \frac{1}{2}\, KL(P_r \,\|\, P_m) + \frac{1}{2}\, KL(P_g \,\|\, P_m), \qquad \text{where } P_m = \frac{P_r + P_g}{2}.$$

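These quantities are easiest to get a feel for on a finite sample space, where the supremum becomes a maximum over subsets and the integral becomes a sum. Below is a minimal NumPy sketch (not from the slides; the distributions `p` and `q` are made up for illustration) that evaluates all three.

```python
import numpy as np

def tv(p, q):
    # For discrete laws the sup over events A equals half of the L1 distance.
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)); it is +inf if q(x)=0 while p(x)>0.
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.1, 0.4, 0.5])
print(tv(p, q), kl(p, q), js(p, q))  # KL(p||q) is finite here; KL(q||p) would be +inf
```

Note that JS is always finite (bounded by $\log 2$), while KL blows up as soon as one distribution puts mass where the other has none; this is exactly the behaviour in the example a few slides below.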



WGANs

The Earth-Mover (EM) distance, or Wasserstein-1:
$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y)\sim\gamma}\big[\,\|x - y\|\,\big],$$
where $\Pi(P_r, P_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $P_r$ and $P_g$.

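For one-dimensional distributions the infimum over couplings has a closed form in terms of quantile functions, and SciPy exposes it. The following small sketch (illustrative, not from the slides) estimates $W_1$ between two Gaussians from samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples from P_r
y = rng.normal(loc=2.0, scale=1.0, size=100_000)  # samples from P_g

# For two normals with the same scale, W_1 equals the distance between the means, here 2.
print(wasserstein_distance(x, y))  # ~2.0
```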


Example

Let $Z \sim U[0,1]$, let $P_0$ be the distribution of the points $(0, Z) \in \mathbb{R}^2$, let $g_\theta(z) = (\theta, z)$, and let $P_\theta$ denote the distribution of $g_\theta(Z)$. Then
$$\delta(P_0, P_\theta) = \begin{cases} 1, & \theta \ne 0,\\ 0, & \theta = 0,\end{cases} \qquad
KL(P_0 \,\|\, P_\theta) = \begin{cases} \infty, & \theta \ne 0,\\ 0, & \theta = 0,\end{cases}$$
$$JS(P_0 \,\|\, P_\theta) = \begin{cases} \log 2, & \theta \ne 0,\\ 0, & \theta = 0,\end{cases} \qquad
W(P_0, P_\theta) = |\theta|.$$

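A short justification of the EM value, not spelled out on the slide: under any coupling $\gamma \in \Pi(P_0, P_\theta)$ the first coordinates of the paired points always differ by $\theta$, so
$$W(P_0, P_\theta) = \inf_{\gamma \in \Pi(P_0, P_\theta)} \mathbb{E}_{(x,y)\sim\gamma}\,\|x - y\| \;\ge\; |\theta|,$$
while the deterministic coupling $(0, Z) \mapsto (\theta, Z)$ moves every point horizontally by exactly $|\theta|$, so the bound is attained and $W(P_0, P_\theta) = |\theta|$. For $\theta \ne 0$ the two distributions live on disjoint parallel segments, which is why $\delta = 1$, $KL = \infty$, and $JS = \log 2$ no matter how small $|\theta|$ is.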



WGANs

Note that:

Unlike the EM distance, the TV, KL, and JS distances are not continuous in $\theta$ in the example above.

When $\theta_t \to 0$, the sequence $(P_{\theta_t})_{t\in\mathbb{N}}$ converges to $P_0$ under the EM distance, but does not converge at all under the JS, KL, reverse KL, or TV divergences.

Only the EM distance provides an informative gradient.

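To see why only EM gives a useful gradient in this example, compare the derivatives with respect to $\theta$ (this computation is not on the slide):
$$\nabla_\theta\, W(P_0, P_\theta) = \nabla_\theta\, |\theta| = \operatorname{sign}(\theta) \quad (\theta \ne 0), \qquad
\nabla_\theta\, JS(P_0 \,\|\, P_\theta) = \nabla_\theta\, \log 2 = 0 \quad (\theta \ne 0),$$
so a generator trained on the JS (or TV, or KL) criterion receives no signal telling it to move $\theta$ toward $0$, whereas the EM criterion always points in the right direction.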



Lipschitz functions

Definition 1
Let $X$ and $Y$ be normed vector spaces. A function $f : X \to Y$ is called

$K$-Lipschitz if there exists a real constant $K > 0$ such that, for all $x_1, x_2 \in X$,
$$\|f(x_1) - f(x_2)\| \le K \|x_1 - x_2\|;$$

locally Lipschitz if for every $x \in X$ there exists a neighbourhood $U$ of $x$ such that $f$ is Lipschitz on $U$.

Theorem 1
If a function $f : \mathbb{R}^n \to \mathbb{R}$ has a bounded gradient, then $f$ is Lipschitz.

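A one-line proof sketch of Theorem 1 (not on the slide), assuming $f$ is differentiable with $\|\nabla f(z)\| \le K$ for all $z$: for $x, y \in \mathbb{R}^n$, apply the mean value theorem to $t \mapsto f(x + t(y - x))$ to get some $t_0 \in (0,1)$ with
$$f(y) - f(x) = \nabla f\big(x + t_0 (y - x)\big) \cdot (y - x),$$
and Cauchy-Schwarz then gives $|f(y) - f(x)| \le K \|y - x\|$, i.e. $f$ is $K$-Lipschitz. This is the property weight clipping exploits later: bounding the critic's gradient bounds its Lipschitz constant.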


EM properties

Summary
The Wasserstein (EM) loss of a neural-network generator is continuous in the parameters and differentiable almost everywhere. Moreover, convergence in KL implies convergence in TV and JS, which in turn implies convergence in EM.



Training

Kantorovich-Rubinstein duality:
$$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \big( \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_\theta}[f(x)] \big),$$
where the supremum is taken over all 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$.

Note that if we replace $\|f\|_L \le 1$ with $\|f\|_L \le K$ (i.e., consider $K$-Lipschitz functions for some constant $K$), we end up with $K \cdot W(P_r, P_g)$.

Therefore, if we have a parameterized family of functions $\{f_w\}_{w\in\mathcal{W}}$ that are all $K$-Lipschitz for some $K$, we could consider solving the problem
$$\max_{w\in\mathcal{W}} \big( \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{z\sim p(z)}[f_w(g_\theta(z))] \big)$$
to estimate $W(P_r, P_\theta)$ up to the multiplicative constant $K$.

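In practice $f_w$ is a neural network (the critic) and the two expectations are replaced by minibatch averages. A minimal PyTorch sketch of this objective, assuming some critic and generator modules are given (the names below are illustrative, not from the slides):

```python
import torch

def critic_objective(critic, gen, real_batch, z_batch):
    """Minibatch estimate of E_{x~P_r}[f_w(x)] - E_{z~p(z)}[f_w(g_theta(z))].

    Maximizing this over the critic's weights estimates K * W(P_r, P_theta);
    the generator output is detached because only the critic is updated here.
    """
    fake_batch = gen(z_batch).detach()
    return critic(real_batch).mean() - critic(fake_batch).mean()
```

The critic is trained by gradient ascent on this quantity (equivalently, descent on its negative), and the generator by descent on $-\mathbb{E}_{z}[f_w(g_\theta(z))]$.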



Training

Note that:

By clipping the critic's weight parameters we force the critic to have bounded gradients, and hence to be $K$-Lipschitz for some $K$ (a compressed training sketch is given after this list).

We can train the critic to optimality.

If the clipping parameter is large, it can take a long time for the weights to reach their limit, making it harder to train the critic to optimality; if the clipping parameter is small, this can easily lead to vanishing gradients when the number of layers is large.

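Putting the pieces together, here is a compressed, self-contained training-loop sketch in PyTorch on toy 2-D data. It only illustrates the recipe above, not the original implementation: the tiny MLPs, the toy data, and the latent dimension are made up, while the clip value 0.01, the 5 critic steps per generator step, and RMSprop with learning rate 5e-5 follow the defaults reported in the WGAN paper.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

latent_dim = 64
gen = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 2))
critic = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))

n_critic, clip_value, lr = 5, 0.01, 5e-5           # WGAN paper defaults
opt_c = optim.RMSprop(critic.parameters(), lr=lr)  # no momentum, as recommended
opt_g = optim.RMSprop(gen.parameters(), lr=lr)

real_data = torch.randn(10_000, 2) + torch.tensor([3.0, 0.0])  # toy "real" distribution
loader = DataLoader(real_data, batch_size=64, shuffle=True)

for real in loader:
    # Several critic steps: the critic should be trained close to optimality.
    for _ in range(n_critic):
        z = torch.randn(real.size(0), latent_dim)
        loss_c = -(critic(real).mean() - critic(gen(z).detach()).mean())
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        # Weight clipping keeps the weights in a compact set, hence the critic K-Lipschitz.
        for p in critic.parameters():
            p.data.clamp_(-clip_value, clip_value)

    # One generator step: raise the critic's score on generated samples.
    z = torch.randn(real.size(0), latent_dim)
    loss_g = -critic(gen(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```

RMSprop (rather than a momentum-based optimizer such as Adam) matches the observation on the results slide below.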



WGAN Results

A meaningful loss metric that correlates with the generator's convergence and sample quality.

Improved stability of the optimization process.

WGAN results are better than DCGAN results (with the same generator and discriminator architectures).

WGAN does not work well with momentum-based optimizers such as Adam.

It is slower to converge than the standard KL-based loss.

