Chainer v3
seiya-tokui
Recent/coming releases
• Chainer v3.0.0 RC, v2.1.0: Sep. 12
• v3 RC was the 50th release!
• CuPy v2.0.0 RC, v1.0.3 on the same day
• Next release: Chainer v3.0.0 and v4.0.0α on Oct. 17
• CuPy v2.0.0 and v3.0.0α on the same day
• Today, I mainly talk about the features of CuPy v2.0.0 RC and
Chainer v3.0.0 RC
Chainer v3.0.0rc1
• For most users, backward compatibility is maintained
• See the release notes of v3.0.0rc1 for some small breaking changes that do
not affect most users
• The inner workings have changed greatly
• This may break some existing code that directly touches the
computational graph
• Thanks to this change, we now support double backprop
(a.k.a. gradient of gradients) as announced
Double backprop
• Automatic backpropagation through gradients
• When is it needed?
• Consider a loss function that includes a gradient computation as a
term/factor
• E.g. the loss function for WGAN-GP:
𝔼_{x̃∼ℙ_g}[D(x̃)] − 𝔼_{x∼ℙ_r}[D(x)] + λ·𝔼_{x̂∼ℙ_x̂}[(‖∇_x̂ D(x̂)‖₂ − 1)²]
• To take the gradient of this loss function, we need to do backprop
through ∇_x̂ D(x̂), which itself we want to compute with backprop!
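As a concrete scalar illustration (with a made-up critic D, not an actual WGAN-GP model) of why such a term needs double backprop: differentiating a penalty built from ∇D brings in the second derivative of D.

```python
# Toy illustration of the double-backprop need: the penalty contains
# dD/dx, so the penalty's own gradient contains d2D/dx2.
import numpy as np

D = lambda x: x ** 3            # made-up scalar "critic"
dD = lambda x: 3 * x ** 2       # dD/dx, normally produced by backprop
penalty = lambda x: (dD(x) - 1.0) ** 2

# gradient of the penalty: 2 (dD - 1) * d2D  -> needs the 2nd derivative
d2D = lambda x: 6 * x
grad_penalty = lambda x: 2 * (dD(x) - 1.0) * d2D(x)

# sanity check against a central finite difference
x0, eps = 0.7, 1e-5
fd = (penalty(x0 + eps) - penalty(x0 - eps)) / (2 * eps)
```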
Double backprop in Chainer v3
• Many functions now support double backprop
• Those functions are rewritten to implement a new interface named
FunctionNode (such functions are called new-style Functions)
• backward() takes Variable instead of ndarray as grad_outputs
and return values, which means backward() itself can be
differentiated
• Variable now has an attribute grad_var, which represents
the gradient as a Variable (so that we can use it in the
computational graph)
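The mechanism can be illustrated without Chainer. Below is a minimal scalar reverse-mode autodiff sketch (not Chainer's actual code) in which, as with FunctionNode, the backward pass works on graph variables rather than raw arrays, so the gradient is itself a graph node and can be differentiated again.

```python
# Minimal sketch of double backprop: backward contributions are Vars,
# so grad() builds new graph nodes that grad() can consume again.

class Var:
    """A scalar node in a computational graph (illustration only)."""
    def __init__(self, value, parents=()):
        self.value = value
        # parents: sequence of (parent Var, fn mapping the upstream
        # gradient Var to this parent's gradient contribution Var)
        self.parents = list(parents)

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value,
                   [(self, lambda g: g), (other, lambda g: g)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, lambda g, o=other: g * o),
                    (other, lambda g, s=self: g * s)])


def grad(output, wrt):
    """d(output)/d(wrt), returned as a Var so it can be differentiated again."""
    order, seen = [], set()

    def visit(v):                      # topological order by DFS
        if id(v) in seen:
            return
        seen.add(id(v))
        for p, _ in v.parents:
            visit(p)
        order.append(v)

    visit(output)
    grads = {id(output): Var(1.0)}
    for v in reversed(order):
        g = grads.get(id(v))
        if g is None:
            continue
        for p, fn in v.parents:
            contrib = fn(g)            # a Var: backward itself builds graph
            acc = grads.get(id(p))
            grads[id(p)] = contrib if acc is None else acc + contrib
    return grads.get(id(wrt), Var(0.0))


x = Var(2.0)
y = x * x * x        # y = x^3
dy = grad(y, x)      # 3 x^2 at x=2 -> 12, itself part of the graph
d2y = grad(dy, x)    # 6 x   at x=2 -> 12, "gradient of gradients"
```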
How to implement WGAN-GP
1. Using Variable.backward()
x_tilde = generator(z)
x_hat = x + u * (x_tilde - x)
D(x_hat).backward(enable_double_backprop=True)  # 1st diff
gp = lam * (x_hat.grad_var - 1) ** 2  # lam: penalty coefficient
                                      # (lambda is a Python keyword)
loss = D(x_tilde) - D(x) + gp
model.cleargrads()  # to clear the 1st diff of params
loss.backward()     # 2nd diff
How to implement WGAN-GP
2. Using grad()
x_tilde = generator(z)
x_hat = x + u * (x_tilde - x)
gx_hat, = chainer.grad([D(x_hat)], [x_hat],
                       enable_double_backprop=True)  # 1st diff
gp = lam * (gx_hat - 1) ** 2  # lam: penalty coefficient
loss = D(x_tilde) - D(x) + gp
loss.backward()  # 2nd diff
This version is more efficient because grad() can skip the gradient
computation for parameters (so we can also drop cleargrads()).
New-style Function support
• Most “standard” functions are now ported to the new-style
interface:
+, -, *, Convolution2D, Deconvolution2D, EmbedID, Linear,
LSTM, BatchNormalization, sigmoid, relu, leaky_relu, softmax,
log_softmax, tanh, exp, mean_squared_error,
softmax_cross_entropy, dropout, layer_normalization,
transpose, reshape, broadcast_to, sum, concat, __getitem__,
etc…
• We are still working on widening the double backprop
support. Contributions are also welcome!!
Other features
• Functions: layer_normalization, selu, arctan2, prod,
NumPy-compatible matmul
• Links: ChildSumTreeLSTM, NaryTreeLSTM,
BatchRenormalization
• Other new features: LeCunNormal, as_variable(),
Variable.array, strict option of load_npz(), etc.
CuPy v2.0.0rc1
• Sparse matrix support
• Complex number support
• Improved memory allocator
• Many new functions, esp. of linear algebra routines
Sparse matrix support
• cupy.sparse --- sparse matrix support with an API
compatible with scipy.sparse
• CSR/CSC/COO and diagonal formats
• Basic arithmetic, matrix product, element indexing
• Slicing along the major axis
• Dense <-> sparse conversion
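Since cupy.sparse mirrors scipy.sparse, the operations above can be sketched with scipy on the CPU; on a GPU machine, swapping scipy.sparse for cupy.sparse (and numpy for cupy) should give the CuPy version. The matrix values below are made up for illustration.

```python
# scipy.sparse sketch of the cupy.sparse-compatible API
import numpy as np
import scipy.sparse as sp

# CSR matrix built from COO-style triplets
row = np.array([0, 0, 1, 2])
col = np.array([0, 2, 1, 2])
data = np.array([1.0, 2.0, 3.0, 4.0])
a = sp.csr_matrix((data, (row, col)), shape=(3, 3))

b = a + a                 # basic arithmetic (elementwise)
p = a.dot(a)              # sparse matrix product
dense = a.toarray()       # sparse -> dense conversion
back = sp.csr_matrix(dense)  # dense -> sparse conversion
first_row = a[0]          # slicing along the major axis (rows for CSR)
```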
Complex number support
• CuPy now supports complex numbers!
• Dtypes complex64 and complex128 are now available
• Routines related to complex numbers:
angle, conj, imag, real
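These routines follow NumPy's API, so the NumPy sketch below (with made-up values) should carry over by swapping numpy for cupy.

```python
# NumPy sketch of the complex-number routines cupy mirrors
import numpy as np

z = np.array([3 + 4j, 1 - 1j], dtype=np.complex128)
mag = np.abs(z)        # magnitudes: [5.0, sqrt(2)]
phase = np.angle(z)    # phases in radians
conj = np.conj(z)      # conjugates: [3-4j, 1+1j]
re, im = z.real, z.imag
```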
Linear algebra routines
• Solvers, matrix inversion, determinant, eigenvalues, etc.:
solve, tensorsolve, inv, pinv, det, slogdet, eigh,
eigvalsh, matrix_rank
• All under cupy.linalg namespace
• einsum is also supported (thanks, @fukatani!)
• Flexible tensor product/reduction based on Einstein convention
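These routines follow numpy.linalg's API, so a CPU sketch with numpy (made-up matrices) should carry over to cupy.linalg by swapping the import.

```python
# NumPy sketch of the linear-algebra routines cupy.linalg mirrors
import numpy as np

a = np.array([[3.0, 1.0],
              [1.0, 2.0]])

x = np.linalg.solve(a, np.array([9.0, 8.0]))  # solver: x = [2, 3]
w = np.linalg.eigvalsh(a)                     # eigenvalues (symmetric input)
tr = np.einsum('ii->', a)                     # trace via Einstein convention
```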
Improved memory allocator
• The memory pool is greatly improved
• It now uses a “best-fit with coalescing” algorithm
• A memory region is reused even if the requested size does not exactly match
• It also contributes to speed, thanks to the reduced number of
reallocations
• Example: the new seq2seq example originally used all the memory
of a 12GB GPU; its usage is reduced to 3GB, and the execution
time is reduced by approx. 25%.
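A conceptual sketch of best-fit with coalescing (not CuPy's actual implementation): allocate from the smallest free block that fits, and merge adjacent free blocks on release so larger requests can be served from reused memory.

```python
# Toy free-list allocator illustrating "best-fit with coalescing"
class FreeListAllocator:
    def __init__(self, size):
        self.free = [(0, size)]            # sorted list of (offset, size)

    def alloc(self, size):
        # best fit: the smallest free block that is large enough
        fits = [blk for blk in self.free if blk[1] >= size]
        if not fits:
            raise MemoryError
        off, bsize = min(fits, key=lambda blk: blk[1])
        self.free.remove((off, bsize))
        if bsize > size:                   # return the remainder to the pool
            self.free.append((off + size, bsize - size))
            self.free.sort()
        return off

    def free_block(self, off, size):
        self.free.append((off, size))
        self.free.sort()
        merged = [self.free[0]]            # coalesce adjacent free blocks
        for o, s in self.free[1:]:
            po, ps = merged[-1]
            if po + ps == o:
                merged[-1] = (po, ps + s)
            else:
                merged.append((o, s))
        self.free = merged

pool = FreeListAllocator(100)
a_off = pool.alloc(30)                     # -> offset 0
b_off = pool.alloc(30)                     # -> offset 30
pool.free_block(a_off, 30)
pool.free_block(b_off, 30)                 # coalesces back into one block
c_off = pool.alloc(60)                     # served from the merged region
```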
Next versions
• As you may know, we slightly changed the release policy
again; the stable releases may now include some new
features (thus v2.1.0 instead of v2.0.3).
• v4 is scheduled based on our release policy: v4.0.0 will be
three months after v3.0.0 (which will be mid Jan. if there is no
delay).
• The core features of v4 are not determined yet; let’s have
discussions!