Chainer v3
seiya-tokui
Recent/coming releases
• Chainer v3.0.0 RC, v2.1.0: Sep. 12
• v3 RC was the 50th release!
• CuPy v2.0.0 RC, v1.0.3 on the same day
• Next release: Chainer v3.0.0 and v4.0.0α on Oct. 17
• CuPy v2.0.0 and v3.0.0α on the same day
• Today, I mainly talk about the features of CuPy v2.0.0 RC and
Chainer v3.0.0 RC
Chainer v3.0.0rc1
• For most users, backward compatibility is maintained
• See the release notes of v3.0.0rc1 for some small breaking changes that do
not affect most users
• The inner workings have changed greatly
• This may break some existing code that directly touches the
computational graph
• Thanks to this change, we now support double backprop
(a.k.a. gradient of gradients) as announced
Double backprop
• Automatic backpropagation through gradients
• When is it needed?
• Consider a loss function that includes a gradient computation as a
term/factor
• E.g. the loss function for WGAN-GP:
𝔼_{x̃∼ℙ_g}[D(x̃)] − 𝔼_{x∼ℙ_r}[D(x)] + λ·𝔼_{x̂∼ℙ_x̂}[(‖∇_x̂ D(x̂)‖₂ − 1)²]
• To take the gradient of this loss function, we need to do backprop
through ∇_x̂ D(x̂), which itself we want to compute with backprop!
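As a concrete scalar illustration (with a made-up critic D, not an actual WGAN-GP model) of why such a term needs double backprop: differentiating a penalty built from ∇D brings in the second derivative of D.

```python
# Toy illustration of the double-backprop need: the penalty contains
# dD/dx, so the penalty's own gradient contains d2D/dx2.
import numpy as np

D = lambda x: x ** 3            # made-up scalar "critic"
dD = lambda x: 3 * x ** 2       # dD/dx, normally produced by backprop
penalty = lambda x: (dD(x) - 1.0) ** 2

# gradient of the penalty: 2 (dD - 1) * d2D  -> needs the 2nd derivative
d2D = lambda x: 6 * x
grad_penalty = lambda x: 2 * (dD(x) - 1.0) * d2D(x)

# sanity check against a central finite difference
x0, eps = 0.7, 1e-5
fd = (penalty(x0 + eps) - penalty(x0 - eps)) / (2 * eps)
```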
Double backprop in Chainer v3
• Many functions now support double backprop
• Those functions are rewritten to implement a new interface named
FunctionNode (such functions are called new-style Functions)
• backward() takes Variable instead of ndarray as grad_outputs
and return values, which means backward() itself can be
differentiated
• Variable now has an attribute grad_var, which represents
the gradient as a Variable (so that we can use it in the
computational graph)
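The mechanism can be illustrated without Chainer. Below is a minimal scalar reverse-mode autodiff sketch (not Chainer's actual code) in which, as with FunctionNode, the backward pass works on graph variables rather than raw arrays, so the gradient is itself a graph node and can be differentiated again.

```python
# Minimal sketch of double backprop: backward contributions are Vars,
# so grad() builds new graph nodes that grad() can consume again.

class Var:
    """A scalar node in a computational graph (illustration only)."""
    def __init__(self, value, parents=()):
        self.value = value
        # parents: sequence of (parent Var, fn mapping the upstream
        # gradient Var to this parent's gradient contribution Var)
        self.parents = list(parents)

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value,
                   [(self, lambda g: g), (other, lambda g: g)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, lambda g, o=other: g * o),
                    (other, lambda g, s=self: g * s)])


def grad(output, wrt):
    """d(output)/d(wrt), returned as a Var so it can be differentiated again."""
    order, seen = [], set()

    def visit(v):                      # topological order by DFS
        if id(v) in seen:
            return
        seen.add(id(v))
        for p, _ in v.parents:
            visit(p)
        order.append(v)

    visit(output)
    grads = {id(output): Var(1.0)}
    for v in reversed(order):
        g = grads.get(id(v))
        if g is None:
            continue
        for p, fn in v.parents:
            contrib = fn(g)            # a Var: backward itself builds graph
            acc = grads.get(id(p))
            grads[id(p)] = contrib if acc is None else acc + contrib
    return grads.get(id(wrt), Var(0.0))


x = Var(2.0)
y = x * x * x        # y = x^3
dy = grad(y, x)      # 3 x^2 at x=2 -> 12, itself part of the graph
d2y = grad(dy, x)    # 6 x   at x=2 -> 12, "gradient of gradients"
```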
How to implement WGAN-GP
1. Using Variable.backward()
x_tilde = generator(z)
x_hat = x + u * (x_tilde - x)
D(x_hat).backward(enable_double_backprop=True)  # 1st diff
gp = lam * (x_hat.grad_var - 1) ** 2  # lam: penalty coefficient
                                      # (lambda is a Python keyword)
loss = D(x_tilde) - D(x) + gp
model.cleargrads()  # to clear the 1st diff of params
loss.backward()     # 2nd diff
How to implement WGAN-GP
2. Using grad()
x_tilde = generator(z)
x_hat = x + u * (x_tilde - x)
gx_hat, = chainer.grad([D(x_hat)], [x_hat],
                       enable_double_backprop=True)  # 1st diff
gp = lam * (gx_hat - 1) ** 2  # lam: penalty coefficient
loss = D(x_tilde) - D(x) + gp
loss.backward()  # 2nd diff
This version is more efficient because grad() can skip the gradient
computation for parameters (so we can also drop cleargrads()).
New-style Function support
• Most “standard” functions are now ported to the new-style
interface:
+, -, *, Convolution2D, Deconvolution2D, EmbedID, Linear,
LSTM, BatchNormalization, sigmoid, relu, leaky_relu, softmax,
log_softmax, tanh, exp, mean_squared_error,
softmax_cross_entropy, dropout, layer_normalization,
transpose, reshape, broadcast_to, sum, concat, __getitem__,
etc…
• We are still working on widening the double backprop
support. Contributions are also welcome!!
Other features
• Functions: layer_normalization, selu, arctan2, prod,
NumPy-compatible matmul
• Links: ChildSumTreeLSTM, NaryTreeLSTM,
BatchRenormalization
• Other new features: LeCunNormal, as_variable(),
Variable.array, strict option of load_npz(), etc.
CuPy v2.0.0rc1
• Sparse matrix support
• Complex number support
• Improved memory allocator
• Many new functions, esp. of linear algebra routines
Sparse matrix support
• cupy.sparse --- sparse matrix support with an API
compatible with scipy.sparse
• CSR/CSC/COO and diagonal formats
• Basic arithmetic, matrix product, element indexing
• Slicing along the major axis
• Dense <-> sparse conversion
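Since cupy.sparse mirrors scipy.sparse, the operations above can be sketched with scipy on the CPU; on a GPU machine, swapping scipy.sparse for cupy.sparse (and numpy for cupy) should give the CuPy version. The matrix values below are made up for illustration.

```python
# scipy.sparse sketch of the cupy.sparse-compatible API
import numpy as np
import scipy.sparse as sp

# CSR matrix built from COO-style triplets
row = np.array([0, 0, 1, 2])
col = np.array([0, 2, 1, 2])
data = np.array([1.0, 2.0, 3.0, 4.0])
a = sp.csr_matrix((data, (row, col)), shape=(3, 3))

b = a + a                 # basic arithmetic (elementwise)
p = a.dot(a)              # sparse matrix product
dense = a.toarray()       # sparse -> dense conversion
back = sp.csr_matrix(dense)  # dense -> sparse conversion
first_row = a[0]          # slicing along the major axis (rows for CSR)
```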
Complex number support
• CuPy now supports complex numbers!
• Dtypes complex64 and complex128 are now available
• Routines related to complex numbers:
angle, conj, imag, real
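These routines follow NumPy's API, so the NumPy sketch below (with made-up values) should carry over by swapping numpy for cupy.

```python
# NumPy sketch of the complex-number routines cupy mirrors
import numpy as np

z = np.array([3 + 4j, 1 - 1j], dtype=np.complex128)
mag = np.abs(z)        # magnitudes: [5.0, sqrt(2)]
phase = np.angle(z)    # phases in radians
conj = np.conj(z)      # conjugates: [3-4j, 1+1j]
re, im = z.real, z.imag
```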
Linear algebra routines
• Solvers, matrix inversion, determinant, eigenvalues, etc.:
solve, tensorsolve, inv, pinv, det, slogdet, eigh,
eigvalsh, matrix_rank
• All under cupy.linalg namespace
• einsum is also supported (thanks, @fukatani!)
• Flexible tensor product/reduction based on Einstein convention
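These routines follow numpy.linalg's API, so a CPU sketch with numpy (made-up matrices) should carry over to cupy.linalg by swapping the import.

```python
# NumPy sketch of the linear-algebra routines cupy.linalg mirrors
import numpy as np

a = np.array([[3.0, 1.0],
              [1.0, 2.0]])

x = np.linalg.solve(a, np.array([9.0, 8.0]))  # solver: x = [2, 3]
w = np.linalg.eigvalsh(a)                     # eigenvalues (symmetric input)
tr = np.einsum('ii->', a)                     # trace via Einstein convention
```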
Improved memory allocator
• The memory pool is greatly improved
• It now uses a “best-fit with coalescing” algorithm
• A memory region is reused even if the requested size does not exactly match
• It also contributes to speed, thanks to the reduced number of
reallocations
• Example: the new seq2seq example originally used all the memory
of a 12GB GPU; its usage is reduced to 3GB, and the execution
time is reduced by approx. 25%.
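A conceptual sketch of best-fit with coalescing (not CuPy's actual implementation): allocate from the smallest free block that fits, and merge adjacent free blocks on release so larger requests can be served from reused memory.

```python
# Toy free-list allocator illustrating "best-fit with coalescing"
class FreeListAllocator:
    def __init__(self, size):
        self.free = [(0, size)]            # sorted list of (offset, size)

    def alloc(self, size):
        # best fit: the smallest free block that is large enough
        fits = [blk for blk in self.free if blk[1] >= size]
        if not fits:
            raise MemoryError
        off, bsize = min(fits, key=lambda blk: blk[1])
        self.free.remove((off, bsize))
        if bsize > size:                   # return the remainder to the pool
            self.free.append((off + size, bsize - size))
            self.free.sort()
        return off

    def free_block(self, off, size):
        self.free.append((off, size))
        self.free.sort()
        merged = [self.free[0]]            # coalesce adjacent free blocks
        for o, s in self.free[1:]:
            po, ps = merged[-1]
            if po + ps == o:
                merged[-1] = (po, ps + s)
            else:
                merged.append((o, s))
        self.free = merged

pool = FreeListAllocator(100)
a_off = pool.alloc(30)                     # -> offset 0
b_off = pool.alloc(30)                     # -> offset 30
pool.free_block(a_off, 30)
pool.free_block(b_off, 30)                 # coalesces back into one block
c_off = pool.alloc(60)                     # served from the merged region
```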
Next versions
• As you may know, we slightly changed the release policy
again; the stable releases may now include some new
features (thus v2.1.0 instead of v2.0.3).
• v4 is scheduled based on our release policy: v4.0.0 will be
three months after v3.0.0 (which will be mid Jan. if there is no
delay).
• The core features of v4 are not determined yet; let’s have
discussions!