
Deep Learning

Vazgen Mikayelyan

December 22, 2020



WGANs

Let $\mathcal{X}$ be a compact metric space (e.g., $[0,1]^d$) and let $\Sigma$ denote the set of all Borel subsets of $\mathcal{X}$.

Total Variation (TV) distance:
$$\delta(P_r, P_g) = \sup_{A \in \Sigma} \left| P_r(A) - P_g(A) \right|.$$

Kullback-Leibler (KL) divergence:
$$KL(P_r \,\|\, P_g) = \int_{\mathcal{X}} \log\!\left(\frac{p_r(x)}{p_g(x)}\right) p_r(x)\, d\mu(x),$$
where $p_r$ and $p_g$ are the densities of $P_r$ and $P_g$ with respect to a common measure $\mu$ on $\mathcal{X}$.

Jensen-Shannon (JS) distance:
$$JS(P_r \,\|\, P_g) = \frac{1}{2}\, KL(P_r \,\|\, P_m) + \frac{1}{2}\, KL(P_g \,\|\, P_m), \qquad \text{where } P_m = \frac{P_r + P_g}{2}.$$

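These quantities are easiest to get a feel for on a finite sample space, where the supremum becomes a maximum over subsets and the integral becomes a sum. Below is a minimal NumPy sketch (not from the slides; the distributions `p` and `q` are made up for illustration) that evaluates all three.

```python
import numpy as np

def tv(p, q):
    # For discrete laws the sup over events A equals half of the L1 distance.
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)); it is +inf if q(x)=0 while p(x)>0.
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.1, 0.4, 0.5])
print(tv(p, q), kl(p, q), js(p, q))  # KL(p||q) is finite here; KL(q||p) would be +inf
```

Note that JS is always finite (bounded by $\log 2$), while KL blows up as soon as one distribution puts mass where the other has none; this is exactly the behaviour in the example a few slides below.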



WGANs

The Earth-Mover (EM) distance, or Wasserstein-1:
$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y)\sim\gamma}\big[\,\|x - y\|\,\big],$$
where $\Pi(P_r, P_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $P_r$ and $P_g$.

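For one-dimensional distributions the infimum over couplings has a closed form in terms of quantile functions, and SciPy exposes it. The following small sketch (illustrative, not from the slides) estimates $W_1$ between two Gaussians from samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples from P_r
y = rng.normal(loc=2.0, scale=1.0, size=100_000)  # samples from P_g

# For two normals with the same scale, W_1 equals the distance between the means, here 2.
print(wasserstein_distance(x, y))  # ~2.0
```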


Example

Let $Z \sim U[0,1]$, let $P_0$ be the distribution of the points $(0, Z) \in \mathbb{R}^2$, let $g_\theta(z) = (\theta, z)$, and let $P_\theta$ denote the distribution of $g_\theta(Z)$. Then
$$\delta(P_0, P_\theta) = \begin{cases} 1, & \theta \ne 0,\\ 0, & \theta = 0,\end{cases} \qquad
KL(P_0 \,\|\, P_\theta) = \begin{cases} \infty, & \theta \ne 0,\\ 0, & \theta = 0,\end{cases}$$
$$JS(P_0 \,\|\, P_\theta) = \begin{cases} \log 2, & \theta \ne 0,\\ 0, & \theta = 0,\end{cases} \qquad
W(P_0, P_\theta) = |\theta|.$$

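A short justification of the EM value, not spelled out on the slide: under any coupling $\gamma \in \Pi(P_0, P_\theta)$ the first coordinates of the paired points always differ by $\theta$, so
$$W(P_0, P_\theta) = \inf_{\gamma \in \Pi(P_0, P_\theta)} \mathbb{E}_{(x,y)\sim\gamma}\,\|x - y\| \;\ge\; |\theta|,$$
while the deterministic coupling $(0, Z) \mapsto (\theta, Z)$ moves every point horizontally by exactly $|\theta|$, so the bound is attained and $W(P_0, P_\theta) = |\theta|$. For $\theta \ne 0$ the two distributions live on disjoint parallel segments, which is why $\delta = 1$, $KL = \infty$, and $JS = \log 2$ no matter how small $|\theta|$ is.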



WGANs

Note that:

Unlike the EM distance, the TV, KL, and JS distances are not continuous in $\theta$ in the example above.

When $\theta_t \to 0$, the sequence $(P_{\theta_t})_{t\in\mathbb{N}}$ converges to $P_0$ under the EM distance, but does not converge at all under the JS, KL, reverse KL, or TV divergences.

Only the EM distance provides an informative gradient.

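To see why only EM gives a useful gradient in this example, compare the derivatives with respect to $\theta$ (this computation is not on the slide):
$$\nabla_\theta\, W(P_0, P_\theta) = \nabla_\theta\, |\theta| = \operatorname{sign}(\theta) \quad (\theta \ne 0), \qquad
\nabla_\theta\, JS(P_0 \,\|\, P_\theta) = \nabla_\theta\, \log 2 = 0 \quad (\theta \ne 0),$$
so a generator trained on the JS (or TV, or KL) criterion receives no signal telling it to move $\theta$ toward $0$, whereas the EM criterion always points in the right direction.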



Lipschitz functions

Definition 1
Let $X$ and $Y$ be normed vector spaces. A function $f : X \to Y$ is called

$K$-Lipschitz if there exists a real constant $K > 0$ such that, for all $x_1, x_2 \in X$,
$$\|f(x_1) - f(x_2)\| \le K \|x_1 - x_2\|;$$

locally Lipschitz if for every $x \in X$ there exists a neighbourhood $U$ of $x$ such that $f$ is Lipschitz on $U$.

Theorem 1
If a function $f : \mathbb{R}^n \to \mathbb{R}$ has a bounded gradient, then $f$ is Lipschitz.

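A one-line proof sketch of Theorem 1 (not on the slide), assuming $f$ is differentiable with $\|\nabla f(z)\| \le K$ for all $z$: for $x, y \in \mathbb{R}^n$, apply the mean value theorem to $t \mapsto f(x + t(y - x))$ to get some $t_0 \in (0,1)$ with
$$f(y) - f(x) = \nabla f\big(x + t_0 (y - x)\big) \cdot (y - x),$$
and Cauchy-Schwarz then gives $|f(y) - f(x)| \le K \|y - x\|$, i.e. $f$ is $K$-Lipschitz. This is the property weight clipping exploits later: bounding the critic's gradient bounds its Lipschitz constant.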


EM properties

Summary
The Wasserstein (EM) loss of a neural-network generator is continuous in the parameters and differentiable almost everywhere. Moreover, convergence in KL implies convergence in TV and JS, which in turn implies convergence in EM.



Training

Kantorovich-Rubinstein duality:
$$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \big( \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_\theta}[f(x)] \big),$$
where the supremum is taken over all 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$.

Note that if we replace $\|f\|_L \le 1$ with $\|f\|_L \le K$ (i.e., consider $K$-Lipschitz functions for some constant $K$), we end up with $K \cdot W(P_r, P_g)$.

Therefore, if we have a parameterized family of functions $\{f_w\}_{w\in\mathcal{W}}$ that are all $K$-Lipschitz for some $K$, we could consider solving the problem
$$\max_{w\in\mathcal{W}} \big( \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{z\sim p(z)}[f_w(g_\theta(z))] \big)$$
to estimate $W(P_r, P_\theta)$ up to the multiplicative constant $K$.

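In practice $f_w$ is a neural network (the critic) and the two expectations are replaced by minibatch averages. A minimal PyTorch sketch of this objective, assuming some critic and generator modules are given (the names below are illustrative, not from the slides):

```python
import torch

def critic_objective(critic, gen, real_batch, z_batch):
    """Minibatch estimate of E_{x~P_r}[f_w(x)] - E_{z~p(z)}[f_w(g_theta(z))].

    Maximizing this over the critic's weights estimates K * W(P_r, P_theta);
    the generator output is detached because only the critic is updated here.
    """
    fake_batch = gen(z_batch).detach()
    return critic(real_batch).mean() - critic(fake_batch).mean()
```

The critic is trained by gradient ascent on this quantity (equivalently, descent on its negative), and the generator by descent on $-\mathbb{E}_{z}[f_w(g_\theta(z))]$.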



Training

Note that:

By clipping the critic's weight parameters we force the critic to have bounded gradients, and hence to be $K$-Lipschitz for some $K$ (a compressed training sketch is given after this list).

We can train the critic to optimality.

If the clipping parameter is large, it can take a long time for the weights to reach their limit, making it harder to train the critic to optimality; if the clipping parameter is small, this can easily lead to vanishing gradients when the number of layers is large.

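Putting the pieces together, here is a compressed, self-contained training-loop sketch in PyTorch on toy 2-D data. It only illustrates the recipe above, not the original implementation: the tiny MLPs, the toy data, and the latent dimension are made up, while the clip value 0.01, the 5 critic steps per generator step, and RMSprop with learning rate 5e-5 follow the defaults reported in the WGAN paper.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

latent_dim = 64
gen = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 2))
critic = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))

n_critic, clip_value, lr = 5, 0.01, 5e-5           # WGAN paper defaults
opt_c = optim.RMSprop(critic.parameters(), lr=lr)  # no momentum, as recommended
opt_g = optim.RMSprop(gen.parameters(), lr=lr)

real_data = torch.randn(10_000, 2) + torch.tensor([3.0, 0.0])  # toy "real" distribution
loader = DataLoader(real_data, batch_size=64, shuffle=True)

for real in loader:
    # Several critic steps: the critic should be trained close to optimality.
    for _ in range(n_critic):
        z = torch.randn(real.size(0), latent_dim)
        loss_c = -(critic(real).mean() - critic(gen(z).detach()).mean())
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        # Weight clipping keeps the weights in a compact set, hence the critic K-Lipschitz.
        for p in critic.parameters():
            p.data.clamp_(-clip_value, clip_value)

    # One generator step: raise the critic's score on generated samples.
    z = torch.randn(real.size(0), latent_dim)
    loss_g = -critic(gen(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```

RMSprop (rather than a momentum-based optimizer such as Adam) matches the observation on the results slide below.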



WGAN Results

A meaningful loss metric that correlates with the generator's convergence and sample quality.

Improved stability of the optimization process.

WGAN results are better than DCGAN results (with the same generator and discriminator architectures).

WGAN does not work well with momentum-based optimizers such as Adam.

It is slower to converge than the standard KL-based loss.

