
Deep Learning

Vazgen Mikayelyan

December 22, 2020


WGANs

Let X be a compact metric set (e.g. [0, 1]^d) and let Σ denote the set of all Borel subsets of X.

Total Variation (TV) distance

δ(P_r, P_g) = sup_{A ∈ Σ} |P_r(A) − P_g(A)|.

Kullback-Leibler (KL) divergence

KL(P_r ‖ P_g) = ∫_R log(p_r(x) / p_g(x)) p_r(x) dμ(x),

where p_r and p_g denote densities of P_r and P_g with respect to a measure μ.

Jensen-Shannon (JS) divergence

JS(P_r ‖ P_g) = (1/2) (KL(P_r ‖ P_m) + KL(P_g ‖ P_m)), where P_m = (P_r + P_g) / 2.

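For concreteness, a minimal numeric sketch of the three quantities above for two discrete distributions on a common finite support (an addition to the slides; the probability vectors are illustrative assumptions):

import numpy as np

# Two discrete distributions on the same three-point support (illustrative values).
p_r = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])

# TV: sup_A |P_r(A) - P_g(A)| equals half the L1 distance in the discrete case.
tv = 0.5 * np.abs(p_r - p_g).sum()

# KL(P_r || P_g): finite here because p_g > 0 wherever p_r > 0.
kl = np.sum(p_r * np.log(p_r / p_g))

# JS: average of the two KL terms against the mixture P_m = (P_r + P_g) / 2.
p_m = 0.5 * (p_r + p_g)
js = 0.5 * (np.sum(p_r * np.log(p_r / p_m)) + np.sum(p_g * np.log(p_g / p_m)))

print(tv, kl, js)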

WGANs

The Earth-Mover (EM) distance or Wasserstein-1

W(P_r, P_g) = inf_{γ ∈ Π(P_r, P_g)} E_{(x,y)∼γ} [‖x − y‖],

where Π(P_r, P_g) denotes the set of all joint distributions γ(x, y) whose marginals are respectively P_r and P_g.

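As a small added illustration, in one dimension the EM distance between empirical samples can be estimated directly; the sketch below uses SciPy's one-dimensional Wasserstein distance on two uniform samples whose true W_1 distance is 0.5 (the sample sizes and the choice of distributions are assumptions made for this example):

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)   # samples from P_r = U[0, 1]
y = rng.uniform(0.5, 1.5, size=10_000)   # samples from P_g = U[0.5, 1.5]

# For 1-D distributions the optimal coupling is monotone, so this estimate
# should be close to the true value 0.5.
print(wasserstein_distance(x, y))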

Example

Let Z ∼ U[0, 1], let P_0 be the distribution of the points (0, Z) ∈ R², let g_θ(z) = (θ, z), and let P_θ be the distribution of g_θ(Z). Then

δ(P_0, P_θ) = 1 if θ ≠ 0, and 0 if θ = 0,

KL(P_0 ‖ P_θ) = ∞ if θ ≠ 0, and 0 if θ = 0,

JS(P_0 ‖ P_θ) = log 2 if θ ≠ 0, and 0 if θ = 0,

W(P_0, P_θ) = |θ|.

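A brief justification of these values (added here): for θ ≠ 0 the supports of P_0 and P_θ are disjoint vertical segments, which is what forces δ = 1, KL = ∞ and JS = log 2. For the EM value, under any coupling γ ∈ Π(P_0, P_θ) the first coordinates of (x, y) ∼ γ are 0 and θ almost surely, so

E_{(x,y)∼γ} [‖x − y‖] ≥ E_{(x,y)∼γ} [|x_1 − y_1|] = |θ|,

hence W(P_0, P_θ) ≥ |θ|; the coupling that pairs (0, Z) with (θ, Z) attains E[‖x − y‖] = |θ|, so W(P_0, P_θ) = |θ|.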

WGANs

Note that

All distances other than EM are not continuous in θ.

When θ_t → 0, the sequence (P_{θ_t})_{t∈N} converges to P_0 under the EM distance, but does not converge at all under the JS, KL, reverse KL, or TV divergences.

Only the EM distance has an informative gradient.

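To make the last point concrete in the example above (a remark added here): for θ ≠ 0,

∂W(P_0, P_θ)/∂θ = ∂|θ|/∂θ = sign(θ),

which always points the generator parameter back towards θ = 0, whereas JS(P_0 ‖ P_θ) = log 2 and δ(P_0, P_θ) = 1 are locally constant in θ, so their gradients are 0 and provide no learning signal.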

Lipschitz functions

Definition 1. Let X and Y be normed vector spaces. A function f : X → Y is called

K-Lipschitz if there exists a real constant K > 0 such that, for all x_1 and x_2 in X, ‖f(x_1) − f(x_2)‖ ≤ K‖x_1 − x_2‖;

locally Lipschitz if for every x ∈ X there exists a neighbourhood U of x such that f is Lipschitz on U.

Theorem 1. If a function f : R^n → R has a bounded gradient, then f is a Lipschitz function.

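A one-line proof sketch of Theorem 1 (added here, assuming f is differentiable with sup_x ‖∇f(x)‖ ≤ K): by the mean value theorem, for any x_1, x_2 ∈ R^n there is a point ξ on the segment between them such that

|f(x_1) − f(x_2)| = |∇f(ξ) · (x_1 − x_2)| ≤ ‖∇f(ξ)‖ ‖x_1 − x_2‖ ≤ K ‖x_1 − x_2‖,

so f is K-Lipschitz.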


EM properties

Summary: the Wasserstein (or EM) loss for neural networks is continuous and differentiable almost everywhere. Moreover, convergence in KL implies convergence in TV and JS, which in turn implies convergence in EM.


Training

Kantorovich-Rubinstein duality

W(P_r, P_θ) = sup_{‖f‖_L ≤ 1} ( E_{x∼P_r}[f(x)] − E_{x∼P_θ}[f(x)] ),

where the supremum is taken over all 1-Lipschitz functions f : X → R.

Note that if we replace the constraint ‖f‖_L ≤ 1 by ‖f‖_L ≤ K (i.e. consider K-Lipschitz functions for some constant K), we end up with K · W(P_r, P_g). Therefore, if we have a parameterized family of functions {f_w}_{w∈W} that are all K-Lipschitz for some K, we can estimate W(P_r, P_θ) (up to the constant K) by solving

max_{w∈W} ( E_{x∼P_r}[f_w(x)] − E_{z∼p(z)}[f_w(g_θ(z))] ),

as sketched in code below.

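A minimal PyTorch sketch of this dual estimate (an illustration, not the slide's exact setup: the network sizes, dimensions, batch size, and names such as critic and generator are assumptions, and the K-Lipschitz constraint on the critic is handled separately, as discussed below):

import torch
import torch.nn as nn

latent_dim, data_dim, batch_size = 8, 2, 128

# f_w: the critic, and g_theta: the generator (tiny illustrative networks).
critic = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

real = torch.randn(batch_size, data_dim)   # stand-in for samples x ~ P_r
z = torch.randn(batch_size, latent_dim)    # z ~ p(z)
fake = generator(z)                        # g_theta(z), samples from P_theta

# Monte Carlo estimate of E_{x~P_r}[f_w(x)] - E_{z~p(z)}[f_w(g_theta(z))].
dual_estimate = critic(real).mean() - critic(fake).mean()

# The critic maximizes this estimate, so its loss is the negative;
# the generator would instead minimize -critic(fake).mean().
critic_loss = -dual_estimate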


Training

Note that

by clipping the weight parameters of the critic we force the critic to have bounded gradients (see the sketch below),

we can train the critic till optimality,

if the clipping parameter is large, it can take a long time for the weights to reach their limit, thereby making it harder to train the critic till optimality; if the clipping parameter is small, it can easily lead to vanishing gradients when the number of layers is large.

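A minimal sketch of the weight-clipping step described above (the tiny critic here is an illustrative assumption; the clip value 0.01 is the setting used in the original WGAN experiments):

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
clip_value = 0.01   # clipping parameter c

# After each critic update, clamp every parameter to [-c, c] so that the
# critic stays within a family of K-Lipschitz functions for some fixed K.
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-clip_value, clip_value)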

WGAN Results

A meaningful loss metric that correlates with the generator's convergence and sample quality,

improved stability of the optimization process,

WGAN results are better than DCGAN results (with the same generator and discriminator),

does not work well with momentum-based optimizers, e.g. Adam,

slower to converge than the standard KL-based GAN loss.
