
Deep Learning

Vazgen Mikayelyan

December 22, 2020


WGANs

Let X be a compact metric set (e.g. [0, 1]^d) and let Σ denote the set of all Borel subsets of X.

Total Variation (TV) distance

δ(P_r, P_g) = sup_{A ∈ Σ} |P_r(A) − P_g(A)|.

Kullback-Leibler (KL) divergence

KL(P_r ‖ P_g) = ∫_R log(p_r(x) / p_g(x)) p_r(x) dμ(x),

where p_r and p_g denote densities of P_r and P_g with respect to a measure μ.

Jensen-Shannon (JS) divergence

JS(P_r ‖ P_g) = (1/2) (KL(P_r ‖ P_m) + KL(P_g ‖ P_m)), where P_m = (P_r + P_g) / 2.

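For concreteness, a minimal numeric sketch of the three quantities above for two discrete distributions on a common finite support (an addition to the slides; the probability vectors are illustrative assumptions):

import numpy as np

# Two discrete distributions on the same three-point support (illustrative values).
p_r = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])

# TV: sup_A |P_r(A) - P_g(A)| equals half the L1 distance in the discrete case.
tv = 0.5 * np.abs(p_r - p_g).sum()

# KL(P_r || P_g): finite here because p_g > 0 wherever p_r > 0.
kl = np.sum(p_r * np.log(p_r / p_g))

# JS: average of the two KL terms against the mixture P_m = (P_r + P_g) / 2.
p_m = 0.5 * (p_r + p_g)
js = 0.5 * (np.sum(p_r * np.log(p_r / p_m)) + np.sum(p_g * np.log(p_g / p_m)))

print(tv, kl, js)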

WGANs

The Earth-Mover (EM) distance or Wasserstein-1

W(P_r, P_g) = inf_{γ ∈ Π(P_r, P_g)} E_{(x,y)∼γ} [‖x − y‖],

where Π(P_r, P_g) denotes the set of all joint distributions γ(x, y) whose marginals are respectively P_r and P_g.

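As a small added illustration, in one dimension the EM distance between empirical samples can be estimated directly; the sketch below uses SciPy's one-dimensional Wasserstein distance on two uniform samples whose true W_1 distance is 0.5 (the sample sizes and the choice of distributions are assumptions made for this example):

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)   # samples from P_r = U[0, 1]
y = rng.uniform(0.5, 1.5, size=10_000)   # samples from P_g = U[0.5, 1.5]

# For 1-D distributions the optimal coupling is monotone, so this estimate
# should be close to the true value 0.5.
print(wasserstein_distance(x, y))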

Example

Let Z ∼ U[0, 1], let P_0 be the distribution of the points (0, Z) ∈ R², let g_θ(z) = (θ, z), and let P_θ be the distribution of g_θ(Z). Then

δ(P_0, P_θ) = 1 if θ ≠ 0, and 0 if θ = 0,

KL(P_0 ‖ P_θ) = ∞ if θ ≠ 0, and 0 if θ = 0,

JS(P_0 ‖ P_θ) = log 2 if θ ≠ 0, and 0 if θ = 0,

W(P_0, P_θ) = |θ|.

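A brief justification of these values (added here): for θ ≠ 0 the supports of P_0 and P_θ are disjoint vertical segments, which is what forces δ = 1, KL = ∞ and JS = log 2. For the EM value, under any coupling γ ∈ Π(P_0, P_θ) the first coordinates of (x, y) ∼ γ are 0 and θ almost surely, so

E_{(x,y)∼γ} [‖x − y‖] ≥ E_{(x,y)∼γ} [|x_1 − y_1|] = |θ|,

hence W(P_0, P_θ) ≥ |θ|; the coupling that pairs (0, Z) with (θ, Z) attains E[‖x − y‖] = |θ|, so W(P_0, P_θ) = |θ|.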

WGANs

Note that

All distances other than EM are not continuous in θ.

When θ_t → 0, the sequence (P_{θ_t})_{t∈N} converges to P_0 under the EM distance, but does not converge at all under the JS, KL, reverse KL, or TV divergences.

Only the EM distance has an informative gradient.

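To make the last point concrete in the example above (a remark added here): for θ ≠ 0,

∂W(P_0, P_θ)/∂θ = ∂|θ|/∂θ = sign(θ),

which always points the generator parameter back towards θ = 0, whereas JS(P_0 ‖ P_θ) = log 2 and δ(P_0, P_θ) = 1 are locally constant in θ, so their gradients are 0 and provide no learning signal.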

Lipschitz functions

Definition 1. Let X and Y be normed vector spaces. A function f : X → Y is called

K-Lipschitz if there exists a real constant K > 0 such that, for all x_1 and x_2 in X, ‖f(x_1) − f(x_2)‖ ≤ K‖x_1 − x_2‖;

locally Lipschitz if for every x ∈ X there exists a neighbourhood U of x such that f is Lipschitz on U.

Theorem 1. If a function f : R^n → R has a bounded gradient, then f is a Lipschitz function.

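A one-line proof sketch of Theorem 1 (added here, assuming f is differentiable with sup_x ‖∇f(x)‖ ≤ K): by the mean value theorem, for any x_1, x_2 ∈ R^n there is a point ξ on the segment between them such that

|f(x_1) − f(x_2)| = |∇f(ξ) · (x_1 − x_2)| ≤ ‖∇f(ξ)‖ ‖x_1 − x_2‖ ≤ K ‖x_1 − x_2‖,

so f is K-Lipschitz.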


EM properties

Summary: the Wasserstein (or EM) loss for neural networks is continuous and differentiable almost everywhere. Moreover, convergence in KL implies convergence in TV and JS, which in turn implies convergence in EM.


Training

Kantorovich-Rubinstein duality

W(P_r, P_θ) = sup_{‖f‖_L ≤ 1} ( E_{x∼P_r}[f(x)] − E_{x∼P_θ}[f(x)] ),

where the supremum is taken over all 1-Lipschitz functions f : X → R.

Note that if we replace the constraint ‖f‖_L ≤ 1 by ‖f‖_L ≤ K (i.e. consider K-Lipschitz functions for some constant K), we end up with K · W(P_r, P_g). Therefore, if we have a parameterized family of functions {f_w}_{w∈W} that are all K-Lipschitz for some K, we can estimate W(P_r, P_θ) (up to the constant K) by solving

max_{w∈W} ( E_{x∼P_r}[f_w(x)] − E_{z∼p(z)}[f_w(g_θ(z))] ),

as sketched in code below.

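A minimal PyTorch sketch of this dual estimate (an illustration, not the slide's exact setup: the network sizes, dimensions, batch size, and names such as critic and generator are assumptions, and the K-Lipschitz constraint on the critic is handled separately, as discussed below):

import torch
import torch.nn as nn

latent_dim, data_dim, batch_size = 8, 2, 128

# f_w: the critic, and g_theta: the generator (tiny illustrative networks).
critic = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

real = torch.randn(batch_size, data_dim)   # stand-in for samples x ~ P_r
z = torch.randn(batch_size, latent_dim)    # z ~ p(z)
fake = generator(z)                        # g_theta(z), samples from P_theta

# Monte Carlo estimate of E_{x~P_r}[f_w(x)] - E_{z~p(z)}[f_w(g_theta(z))].
dual_estimate = critic(real).mean() - critic(fake).mean()

# The critic maximizes this estimate, so its loss is the negative;
# the generator would instead minimize -critic(fake).mean().
critic_loss = -dual_estimate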


Training

Note that

by clipping the weight parameters of the critic we force the critic to have bounded gradients (see the sketch below),

we can train the critic till optimality,

if the clipping parameter is large, it can take a long time for the weights to reach their limit, thereby making it harder to train the critic till optimality; if the clipping parameter is small, it can easily lead to vanishing gradients when the number of layers is large.

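A minimal sketch of the weight-clipping step described above (the tiny critic here is an illustrative assumption; the clip value 0.01 is the setting used in the original WGAN experiments):

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
clip_value = 0.01   # clipping parameter c

# After each critic update, clamp every parameter to [-c, c] so that the
# critic stays within a family of K-Lipschitz functions for some fixed K.
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-clip_value, clip_value)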

WGAN Results

A meaningful loss metric that correlates with the generator's convergence and sample quality,

improved stability of the optimization process,

WGAN results are better than DCGAN results (with the same generator and discriminator),

does not work well with momentum-based optimizers, e.g. Adam,

slower to converge than the standard KL-based GAN loss.
