Dualing GANs
Yujia Li 1,∗, Alexander Schwing 3, Kuan-Chieh Wang 1,2 and Richard Zemel 1,2

1 University of Toronto   2 Vector Institute   3 University of Illinois at Urbana-Champaign   ∗ Now at DeepMind

Introduction

GAN training suffers from instability due to its saddle point formulation:

$$\max_\theta \min_w f(\theta, w)$$

$\theta$ and $w$ are the generator and discriminator parameters respectively, and $f$ is the GAN loss. Typical GAN training alternates gradient updates to $\theta$ and $w$:

$$\theta \rightarrow \theta + \eta_\theta \nabla_\theta f(\theta, w), \qquad w \rightarrow w - \eta_w \nabla_w f(\theta, w).$$

However, to solve the saddle point problem ideally, for each $\theta$ we want to solve for $w^*(\theta) = \operatorname{argmin}_w f(\theta, w)$, and then optimize $\max_\theta f(\theta, w^*(\theta))$. For any $w$ obtained from gradient updates, we have $f(\theta, w) \ge f(\theta, w^*(\theta))$; therefore the outer optimization becomes a maximization of an upper bound, leading to instability.
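As a toy illustration of this instability (not from the poster; the bilinear loss, step sizes, and starting point below are assumptions chosen for simplicity), alternating gradient updates on $f(\theta, w) = \theta w$ orbit the saddle point at the origin instead of converging to it:

```python
# A minimal sketch (assumed setup, not the authors' code): alternating updates
# on the toy saddle-point problem max_theta min_w f(theta, w) = theta * w.
theta, w = 1.0, 1.0
eta_theta, eta_w = 0.1, 0.1
trajectory = []
for t in range(1000):
    theta = theta + eta_theta * w   # gradient ascent on theta: df/dtheta = w
    w = w - eta_w * theta           # gradient descent on w:    df/dw = theta
    trajectory.append((theta, w))

# The iterates circle the saddle point (0, 0) instead of converging to it.
print(trajectory[-1])
```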

In this paper we propose to dualize the inner part $\min_w f(\theta, w)$ into $\max_\lambda g(\theta, \lambda)$, which is always a lower bound on $f(\theta, w^*(\theta))$, and solve the much more stable maximization problem

$$\max_\theta \max_\lambda g(\theta, \lambda).$$

This formulation allows us to:
• Solve the instability problem for GANs with linear discriminators.
• Improve stability for GANs with nonlinear discriminators.

GANs with Linear Discriminators

We start from linear discriminators that rely on a scoring function $F(w, x) = w^\top x$. Any differentiable nonlinear feature $\phi(x)$ can be used in place of $x$. The discriminator

$$D_w(x) = p_w(y = 1 \mid x) = \sigma(F(w, x)) = \frac{1}{1 + e^{-w^\top x}}.$$

The GAN loss on a batch of data $\{x_1, \ldots, x_n\}$ and latent samples $\{z_1, \ldots, z_n\}$ is

$$\min_w \; \frac{C}{2}\|w\|_2^2 + \frac{1}{2n}\sum_i \log\left(1 + e^{-w^\top x_i}\right) + \frac{1}{2n}\sum_i \log\left(1 + e^{w^\top G_\theta(z_i)}\right).$$
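For concreteness, here is a small numpy sketch of this batch loss; the toy data, the regularization constant, and the generator $G_\theta(z) = z + \theta$ are assumptions made here for illustration, not part of the poster:

```python
# A toy sketch (assumed setup, not the authors' code) of the linear-discriminator
# GAN loss above, with a hypothetical generator G_theta(z) = z + theta.
import numpy as np

def linear_gan_loss(w, theta, X, Z, C=1.0):
    n = X.shape[0]
    G = Z + theta                                        # generated samples G_theta(z_i)
    reg = 0.5 * C * np.dot(w, w)                         # (C/2) ||w||^2
    real_term = np.sum(np.logaddexp(0.0, -X @ w)) / (2 * n)
    fake_term = np.sum(np.logaddexp(0.0, G @ w)) / (2 * n)
    return reg + real_term + fake_term

rng = np.random.default_rng(0)
n, d = 64, 2
X = rng.normal(loc=2.0, size=(n, d))                     # batch of "real" data
Z = rng.normal(size=(n, d))                              # latent samples
print(linear_gan_loss(rng.normal(size=d), np.zeros(d), X, Z))
```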

The loss is convex in $w$, so we can derive the standard dual problem

$$\max_\lambda \; g(\theta, \lambda) = -\frac{1}{2C}\left\|\sum_i \lambda_{x_i} x_i - \sum_i \lambda_{z_i} G_\theta(z_i)\right\|_2^2 + \frac{1}{2n}\sum_i H(2n\lambda_{x_i}) + \frac{1}{2n}\sum_i H(2n\lambda_{z_i}),$$

$$\text{s.t. } \forall i, \; 0 \le \lambda_{x_i} \le \frac{1}{2n}, \quad 0 \le \lambda_{z_i} \le \frac{1}{2n}.$$

$H(u) = -u \log u - (1 - u)\log(1 - u)$ is the binary entropy, and the optimal $w^*$ can be obtained from the optimal solution $(\lambda^*_z, \lambda^*_x)$ for this dual problem as

$$w^* = \frac{1}{C}\left(\sum_i \lambda^*_{x_i} x_i - \sum_i \lambda^*_{z_i} G_\theta(z_i)\right).$$

Properties:
• The term $\left\|\sum_i \lambda_{x_i} x_i - \sum_i \lambda_{z_i} G_\theta(z_i)\right\|_2^2$ encourages moment matching.
• The entropy terms encourage the $\lambda$'s to be close to the mean.

For training we optimize $\max_\theta \max_\lambda g(\theta, \lambda)$, which is very stable.
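A minimal sketch of how one might maximize this dual and recover $w^*$; the projected-gradient solver, step sizes, and toy data below are assumptions of this write-up (the paper works out the exact dual), while the formula for $g$, its box constraints, and the recovery of $w^*$ come from above:

```python
# Sketch (assumed solver and toy data, not the authors' code): maximize the dual
# g(theta, lambda) above by projected gradient ascent over its box constraints,
# then recover the primal discriminator w*.
import numpy as np

def solve_linear_dual(X, G, C=1.0, steps=5000, lr=1e-4, eps=1e-8):
    n = X.shape[0]
    lam_x = np.full(n, 1.0 / (4 * n))            # start inside [0, 1/(2n)]
    lam_z = np.full(n, 1.0 / (4 * n))

    def h_prime(u):                              # H'(u) = log((1 - u) / u)
        u = np.clip(u, eps, 1.0 - eps)
        return np.log((1.0 - u) / u)

    for _ in range(steps):
        v = X.T @ lam_x - G.T @ lam_z            # sum_i lam_xi x_i - sum_i lam_zi G_theta(z_i)
        grad_x = -(X @ v) / C + h_prime(2 * n * lam_x)
        grad_z = (G @ v) / C + h_prime(2 * n * lam_z)
        lam_x = np.clip(lam_x + lr * grad_x, eps, 1.0 / (2 * n) - eps)
        lam_z = np.clip(lam_z + lr * grad_z, eps, 1.0 / (2 * n) - eps)

    w_star = (X.T @ lam_x - G.T @ lam_z) / C     # recover w* from the dual solution
    return w_star

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(64, 2))            # "real" data
G = rng.normal(size=(64, 2))                     # generator samples for a fixed theta
print(solve_linear_dual(X, G))
```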

GANs with Non-Linear Discriminators

In general the scoring function $F(w, x)$ may be nonlinear in $w$ and typically implemented by a neural network. In this case the GAN loss

$$f(\theta, w) = \frac{C}{2}\|w\|_2^2 + \frac{1}{2n}\sum_i \log\left(1 + e^{-F(w, x_i)}\right) + \frac{1}{2n}\sum_i \log\left(1 + e^{F(w, G_\theta(z_i))}\right)$$

is not convex in $w$, therefore hard to dualize directly.

Proposed solution: approximate $f$ locally around any point $w_k$ using a model function $m_{k,\theta}(s) \approx f(\theta, w_k + s)$, then dualize $m_{k,\theta}(s)$. The optimization problem for the discriminator becomes

$$\min_s \; m_{k,\theta}(s) \quad \text{s.t.} \quad \frac{1}{2}\|s\|_2^2 \le \Delta_k,$$

where $\frac{1}{2}\|s\|_2^2 \le \Delta_k$ is a trust-region constraint that ensures the quality of the approximation. The overall algorithm is shown below:

GAN optimization with model function

Initialize $\theta$, $w_0$, $k = 0$ and iterate:
1. One or few gradient ascent steps on $f(\theta, w_k)$ w.r.t. $\theta$
2. Find step $s$ using $\min_s m_{k,\theta}(s)$ s.t. $\frac{1}{2}\|s\|_2^2 \le \Delta_k$
3. Update $w_{k+1} \leftarrow w_k + s$
4. $k \leftarrow k + 1$
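A runnable sketch of this loop on the toy linear setup from the earlier sketches; the generator $G_\theta(z) = z + \theta$, step sizes, and fixed trust-region radius are assumptions of this write-up, and step 2 is instantiated with the closed-form cost-linearization step of approximation (A) below:

```python
# Sketch of the training loop above on a toy linear setup (assumed data,
# generator G_theta(z) = z + theta, fixed trust-region radius, and step size;
# none of this is the authors' code). Step 2 uses the closed-form
# cost-linearization step of approximation (A) below.
import numpy as np

def grads(theta, w, X, Z, C=1.0):
    """Gradients of the GAN loss f(theta, w) w.r.t. theta and w for this toy setup."""
    n = X.shape[0]
    G = Z + theta
    p_real = 1.0 / (1.0 + np.exp(-(X @ w)))          # sigmoid(w^T x_i)
    p_fake = 1.0 / (1.0 + np.exp(-(G @ w)))          # sigmoid(w^T G_theta(z_i))
    grad_w = C * w - X.T @ (1.0 - p_real) / (2 * n) + G.T @ p_fake / (2 * n)
    grad_theta = w * p_fake.sum() / (2 * n)
    return grad_theta, grad_w

def model_step(grad_w, delta_k):
    """Trust-region step of length sqrt(2 * delta_k) along the negative gradient."""
    return -np.sqrt(2 * delta_k) * grad_w / (np.linalg.norm(grad_w) + 1e-12)

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(64, 2))                # "real" data
theta, w = np.zeros(2), rng.normal(size=2)
delta_k, eta_theta = 0.05, 0.1
for k in range(200):
    Z = rng.normal(size=(64, 2))
    grad_theta, grad_w = grads(theta, w, X, Z)
    theta = theta + eta_theta * grad_theta           # step 1: gradient ascent on theta
    w = w + model_step(grad_w, delta_k)              # steps 2-3: w_{k+1} = w_k + s
print(theta, w)
```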

We explore two approximations:

(A). Cost function linearization: Linearize $f$ as

$$m_{k,\theta}(s) = f(w_k, \theta) + \nabla_w f(w_k, \theta)^\top s.$$

We can solve for the optimal step analytically:

$$s^* = -\frac{\sqrt{2\Delta_k}}{\|\nabla_w f(w_k, \theta)\|_2} \nabla_w f(w_k, \theta).$$

This $s^*$ has the same form and direction as a gradient update used in standard GANs.
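A short sketch (with a placeholder gradient vector, an assumption for illustration) of this closed-form step, checking that it saturates the trust-region constraint and points along the negative gradient:

```python
# Tiny check (hypothetical gradient values, not real training quantities) that
# the closed-form step s* lies on the trust-region boundary and along -grad.
import numpy as np

def cost_lin_step(grad_w, delta_k):
    return -np.sqrt(2 * delta_k) * grad_w / np.linalg.norm(grad_w)

grad_w = np.array([0.3, -1.2, 0.7])      # hypothetical nabla_w f(w_k, theta)
s_star = cost_lin_step(grad_w, delta_k=0.05)
print(0.5 * np.sum(s_star ** 2))         # equals delta_k: the constraint is tight
print(np.dot(s_star, grad_w) / (np.linalg.norm(s_star) * np.linalg.norm(grad_w)))  # -1.0
```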

(B). Score function linearization: Linearize $F$ and keep the loss:

$$F(w_k + s, x) \approx \hat{F}(s, x) = F(w_k, x) + s^\top \nabla_w F(w_k, x), \quad \forall x.$$

The resulting model function is a more accurate approximation than (A):

$$m_{k,\theta}(s) = \frac{C}{2}\|w_k + s\|_2^2 + \frac{1}{2n}\sum_i \log\left(1 + e^{-F(w_k, x_i) - s^\top \nabla_w F(w_k, x_i)}\right) + \frac{1}{2n}\sum_i \log\left(1 + e^{F(w_k, G_\theta(z_i)) + s^\top \nabla_w F(w_k, G_\theta(z_i))}\right).$$

This $m$ is convex in $s$ and can be dualized; see the paper for details.
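To make the score-linearized model concrete, here is a sketch that builds $m_{k,\theta}(s)$ for a tiny hand-written nonlinear scorer and minimizes it over the trust region directly with projected gradient descent; the scorer $F(w, x) = w_2 \tanh(w_1^\top x)$, the toy data, and the direct primal solver are all assumptions of this write-up (the paper instead dualizes $m$):

```python
# Sketch (assumed scorer, data, and solver; not the authors' code): build the
# score-linearized model m_{k,theta}(s) and minimize it over the trust region
# with projected gradient descent, exploiting its convexity in s.
import numpy as np

def score_and_grad(w, X):
    """F(w, x_i) and per-example gradients dF/dw, with w = [w1 (d dims), w2]."""
    w1, w2 = w[:-1], w[-1]
    h = np.tanh(X @ w1)
    F = w2 * h
    J = np.concatenate([(w2 * (1.0 - h ** 2))[:, None] * X, h[:, None]], axis=1)
    return F, J

def model_grad(s, w_k, F_x, J_x, F_g, J_g, C=1.0):
    """Gradient in s of m_{k,theta}(s) as defined above."""
    n = F_x.shape[0]
    sig_real = 1.0 / (1.0 + np.exp(F_x + J_x @ s))       # sigma(-(F + J s)) on real data
    sig_fake = 1.0 / (1.0 + np.exp(-(F_g + J_g @ s)))    # sigma(F + J s) on generated data
    return C * (w_k + s) - J_x.T @ sig_real / (2 * n) + J_g.T @ sig_fake / (2 * n)

def solve_model(w_k, X, G, delta_k, steps=300, lr=0.1):
    F_x, J_x = score_and_grad(w_k, X)
    F_g, J_g = score_and_grad(w_k, G)
    s, radius = np.zeros_like(w_k), np.sqrt(2 * delta_k)
    for _ in range(steps):
        s = s - lr * model_grad(s, w_k, F_x, J_x, F_g, J_g)
        norm = np.linalg.norm(s)
        if norm > radius:                                 # project onto 1/2 ||s||^2 <= delta_k
            s *= radius / norm
    return s

rng = np.random.default_rng(0)
X, G = rng.normal(loc=2.0, size=(64, 2)), rng.normal(size=(64, 2))
w_k = rng.normal(size=3)                                  # w1 in R^2 plus scalar w2
print(solve_model(w_k, X, G, delta_k=0.05))
```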

With these approximations the dual is no longer exact, but solving the dual may still be better than taking gradient steps on the primal.

Experiments

Linear Discriminator, 5 Gaussians

Linear Discriminator, MNIST

[Figure panels: Dual GAN; Standard GAN; Sensitivity to Hyperparams — histogram of discretized inception scores (x-axis) vs. normalized counts (y-axis), comparing "gan" and "dual" runs on MNIST.]

Nonlinear Discriminator, MNIST

[Figure panels: Score Lin.; Cost Lin.; Standard GAN]

Nonlinear Discriminator, CIFAR-10

Score Type             Std. GAN     Score Lin    Cost Lin     Real Data
Inception (end)        5.61±0.09    5.40±0.12    5.43±0.10    10.72±0.38
Our classifier (end)   3.85±0.08    3.52±0.09    4.42±0.09    8.03±0.07
Inception (avg)        5.59±0.38    5.44±0.08    5.16±0.37    -
Our classifier (avg)   3.64±0.47    3.70±0.27    4.04±0.37    -

[Figure panels: Score Lin.; Cost Lin.; Standard GAN]