Transcript of Binarized Neural Network - Xianda Xu Presentation.pdf

Page 1:

Binarized Neural Network

Xianda ( Bryce ) Xu

[email protected]

November 7th

Using BNN to Compress DNN

Page 2:

Why is model compression so important?

Figure 1. AlexNet architecture (ImageNet Large Scale Visual Recognition Challenge: top-1 accuracy 57.1%, top-5 accuracy 80.2%)

~60 M parameters! In float-32, 60 M parameters = 240 MB of memory (a quick arithmetic check follows below).

Problem 1: Computation cost. Each layer computes $A = \sigma(X_i W^\top + B)$, and floating-point multiplication is energy- and time-consuming!

Problem 2: Memory cost. However, energy and memory are limited on mobile and embedded devices!
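A quick check of the 240 MB figure (a minimal sketch; the ~60 M parameter count is the usual round number for AlexNet):

```python
# Rough memory footprint of AlexNet's ~60M parameters at different precisions.
NUM_PARAMS = 60_000_000

bytes_fp32 = NUM_PARAMS * 4      # float-32: 4 bytes per parameter
bytes_bin = NUM_PARAMS / 8       # binary: 1 bit per parameter

print(f"float-32: {bytes_fp32 / 1e6:.0f} MB")  # 240 MB
print(f"binary  : {bytes_bin / 1e6:.1f} MB")   # 7.5 MB
```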

Page 3:

How can we compress a DNN?

Target: the parameters!

-- Pruning: removing unimportant parameters
-- Decomposition: applying Singular Value Decomposition (SVD) to W (a minimal sketch follows below)
-- Distillation: using the combined output of several large networks to train a simpler model
-- Quantization: finding an efficient representation of each parameter
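As an illustration of the decomposition idea, here is a minimal NumPy sketch (my own example, not code from any of the referenced papers): the weight matrix W is replaced by a rank-k product of two smaller factors obtained from its SVD, which cuts the parameter count when k is small.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512))   # a hypothetical fully connected weight matrix

# Truncated SVD: keep only the k largest singular values.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 64                                 # retained rank (a tunable hyperparameter)
A = U[:, :k] * S[:k]                   # shape (1024, k)
B = Vt[:k, :]                          # shape (k, 512)

W_approx = A @ B                       # low-rank approximation of W
print(W.size, A.size + B.size)         # 524288 vs 98304 parameters (~5x fewer)
```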

Page 4:

What is BNN? In brief, it binarizes parameters and activations to +1 and -1.

Why should we choose BNN?

-- It reduces memory cost: a full-precision parameter takes 32 bits, a binary parameter takes 1 bit, so the network is compressed by 32x in theory.

-- It saves energy and speeds things up: full-precision multiplication is replaced by binary multiplication (⊙, implemented with XOR), so multiply-accumulations can be replaced by XOR and bit-count (see the sketch below).
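A minimal sketch of the XOR-and-bit-count trick (my own illustration, with the ±1 values packed into an integer bit mask): the dot product of two ±1 vectors of length n equals n minus twice the number of positions where the signs disagree.

```python
import numpy as np

def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two {-1,+1} vectors of length n, each packed into an
    integer bit mask (bit 1 encodes +1, bit 0 encodes -1)."""
    diff = x_bits ^ w_bits                 # XOR: set bits mark disagreements
    return n - 2 * bin(diff).count("1")    # agreements contribute +1, disagreements -1

# Sanity check against the ordinary floating-point dot product.
rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=64)
w = rng.choice([-1, 1], size=64)
pack = lambda v: int("".join("1" if s > 0 else "0" for s in v), 2)
assert binary_dot(pack(x), pack(w), 64) == int(x @ w)
```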

Page 5:

How do we implement BNN?

Problem 1: How to binarize?
-- Stochastic binarization
-- Deterministic binarization
Though stochastic binarization seems more reasonable, we prefer deterministic binarization for its efficiency.

Problem 2: When to binarize?
-- Forward propagation:
1. First layer: we do not binarize the input, but we binarize the Ws and As.
2. Hidden layers: we binarize all the Ws and As.
3. Output layer: we binarize the Ws and only binarize the output during training.
-- Back-propagation: we do not binarize the gradients, but we have to clip the weights when we update them. (A sketch of the binarization rules and the clipping step follows below.)
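A minimal PyTorch-style sketch of the two binarization rules and the weight-clipping step described above (my own illustration, not code from the linked repo):

```python
import torch

def binarize_deterministic(x: torch.Tensor) -> torch.Tensor:
    # sign(x), with sign(0) mapped to +1
    return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

def binarize_stochastic(x: torch.Tensor) -> torch.Tensor:
    # +1 with probability p = clip((x + 1) / 2, 0, 1) (a "hard sigmoid"), else -1
    p = torch.clamp((x + 1.0) / 2.0, 0.0, 1.0)
    return torch.where(torch.rand_like(x) < p, torch.ones_like(x), -torch.ones_like(x))

def clip_weights_(w: torch.Tensor) -> None:
    # After each real-valued update, keep the weights inside [-1, 1].
    w.data.clamp_(-1.0, 1.0)
```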

Page 6:

How do we implement BNN?

Problem 3: How to do back-propagation?

Recall: we calculate the gradient of the loss function $L$ with respect to $I_l$, the input of the $l$-th layer:

$g_l = \frac{\partial L}{\partial I_l}$

The layer activation gradients: $g_{l-1} = g_l W_l^\top$

The layer weight gradients: $g_{W_l} = g_l^\top I_{l-1}$

But since we use the binarizing function $\mathrm{sign}(x)$, its gradient is zero almost everywhere!

The fix: the Straight-Through Estimator!

Straight-Through Estimator (STE): Yoshua Bengio et al. Estimating or propagating gradients through stochastic neurons for conditional computation. 15 Aug 2013.

Adapted to BNN, the backward pass is:

For the last layer: $g_{a_L} = \frac{\partial C}{\partial a_L}$

For hidden layers $k$ (propagating through $\mathrm{sign}(x)$):

-- STE: $g_{a_k} = g_{a_k^b} \, 1_{|a_k| \le 1}$
-- Back batch norm: $g_{s_k} = \mathrm{BackBatchNorm}(g_{a_k})$
-- Activation gradients: $g_{a_{k-1}^b} = g_{s_k} W_k^b$
-- Weight gradients: $g_{W_k^b} = g_{s_k}^\top a_{k-1}^b$

The STE is exactly the gradient of $\mathrm{Htanh}(x)$, so we use $\mathrm{Htanh}(x)$ as our activation function:

$\mathrm{Htanh}(x) = \mathrm{Clip}(x, -1, 1)$
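A minimal PyTorch sketch of the STE described above (my own illustration): the forward pass applies sign, while the backward pass lets the gradient through wherever |input| <= 1, which is exactly the gradient of Htanh.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Forward: sign(x). Backward: gradient of Htanh(x) = Clip(x, -1, 1)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Straight-through: pass the gradient where |x| <= 1, cancel it elsewhere.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

binarize = BinarizeSTE.apply
# Usage: a_b = binarize(a); gradients then flow as if a_b = Htanh(a).
```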

Page 7:

How was my experiment?

The architecture is the BNN from this paper, fed with CIFAR-10.

https://github.com/brycexu/BinarizedNeuralNetwork/tree/master/SecondTry

[Figure: the network's building block]

My experiment: training accuracy 95%, validation accuracy 84%.

In this paper: validation accuracy 89%.
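For reference, a minimal PyTorch sketch of the kind of building block described here (my own reconstruction, assuming a Conv -> BatchNorm -> binarized-activation ordering; see the linked repo for the actual code):

```python
import torch
import torch.nn as nn

class BinaryBlock(nn.Module):
    """Hypothetical block: convolution, batch norm, then a binarized activation."""

    def __init__(self, in_ch: int, out_ch: int, binarize=torch.sign):
        super().__init__()
        # During training, `binarize` would be the STE-backed sign function
        # sketched above; torch.sign keeps this example self-contained.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.binarize = binarize

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.binarize(self.bn(self.conv(x)))

# Usage: y = BinaryBlock(128, 256)(torch.randn(1, 128, 32, 32))
```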

Page 8:

The problems with the current BNN model: accuracy loss! Possible reasons?

Problem 1: Robustness issue. BNN always has larger output changes, which makes it more susceptible to input perturbation.

Problem 2: Stability issue. BNN is hard to optimize due to problems such as gradient mismatch, which comes from the non-smoothness of the whole architecture.

Gradient mismatch (Darryl D. Lin et al. Overcoming challenges in fixed point training of deep convolutional networks. 8 Jul 2016): the effective activation function in a fixed-point network is a discrete, non-differentiable function. That is why we cannot apply ReLU in BNN!

Page 9:

The potential ways to optimize the BNN model?

Robustness issue:

1. Adding more bits (more bits per network)?
-- Ternary model (-1, 0, +1) (a minimal ternarization sketch follows below)
-- Quantization
Research shows that having more bits at the activations improves the model's robustness.

2. Weakening the learning rate? Research shows that a higher learning rate can cause turbulence inside the model, so BNN needs finer tuning.

3. Adding more weights (more networks per bit)?
-- WRPN
-- AdaBoost (BENN)

4. Modifying the architecture?
-- Recursively using binarization

Stability issue:

1. Better activation function?

2. Better back-propagation methods?