
Implementation

Chih-Jen Lin, National Taiwan University

Last updated: June 18, 2019


Outline

1 Introduction

2 Storage

3 Generation of φ(pad(Z^{m,i}))

4 Evaluation of (v^i)^T P^m_φ

5 Discussion

Introduction

After checking the formulations for gradient calculation, we would like to get into implementation details

Take the following operation as an example

∂ξ_i/∂W^m = (∂ξ_i/∂S^{m,i}) φ(pad(Z^{m,i}))^T

It’s a matrix-matrix product

We all know that a three-level for loop does the job

Does that mean we can then write an efficient implementation?

The answer is no

To explain this, we check some details of matrix-matrix products

We also introduce optimized BLAS (Basic Linear Algebra Subprograms)

Matrix Multiplication

We know that

C = AB

is a mathematical operation with

C_ij = Σ_{k=1}^n A_ik B_kj

Optimized BLAS: an Example by Using Block Algorithms

Let’s test the matrix multiplication

A C program:

#define n 2000
double a[n][n], b[n][n], c[n][n];
int main()
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            a[i][j] = 1; b[i][j] = 1;
        }
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            c[i][j] = 0;
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    return 0;
}

The result:

cjlin@linux6:~$ gcc -O3 mat.c

cjlin@linux6:~$ time ./a.out

real 0m59.251s

user 0m58.994s

sys 0m0.096s

Let’s try another way:

#define n 2000
double a[n][n], b[n][n], c[n][n];
int main()
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            a[i][j] = 1; b[i][j] = 1;
            c[i][j] = 0;
        }
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * b[k][j];
    return 0;
}

The result:


cjlin@linux6:~$ gcc -O3 mat1.c

cjlin@linux6:~$ time ./a.out

real 2m13.199s

user 2m12.810s

sys 0m0.060s

We see that the first approach is faster. Why?

In the second program, each of

c[i][j], a[i][k], b[k][j]

is accessed in column order

C arrays are stored in row-major order, so column access strides across memory and causes many cache misses

Now we sense that memory access can be an issue
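As an aside, the same locality effect can be seen in MATLAB/Octave itself, whose arrays are column-major (the opposite of C). A minimal sketch to try (the variable names and sizes are ours, not from the slides):

n = 5000;
A = zeros(n);
% Column-wise traversal: the inner index i moves down a column,
% which is contiguous in column-major storage
tic;
for j = 1:n
  for i = 1:n
    A(i,j) = A(i,j) + 1;
  end
end
toc
% Row-wise traversal: consecutive accesses stride by n elements
tic;
for i = 1:n
  for j = 1:n
    A(i,j) = A(i,j) + 1;
  end
end
toc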

Let’s try a Matlab program

n = 2000;

A = randn(n,n); B = randn(n,n);

t = cputime; C = A*B; t = cputime -t

To remove the effect of multi-threading, use

matlab -singleCompThread

How we time the code is an issue: elapsed (wall-clock) time versus CPU time

cjlin@linux6:~$ matlab -singleCompThread

>> n = 2000;

>> A = randn(n,n); B = randn(n,n);

>> tic; C = A*B; toc

Elapsed time is 1.139780 seconds.

>> t = cputime; C = A*B; t = cputime -t

t =

1.1200

If using multiple cores,


cjlin@linux6:~$ matlab

>> tic; C = A*B; toc

Elapsed time is 0.227179 seconds.

>> t = cputime; C = A*B; t = cputime -t

t =

1.6800

Matlab is much faster than code written by ourselves. Why?

Optimized BLAS: data locality is exploited

Use the highest (fastest) level of memory as much as possible

Block algorithms: transfer sub-matrices between different levels of storage, and localize operations to achieve good performance

Memory Hierarchy

CPU
↓ Registers
↓ Cache
↓ Main Memory
↓ Secondary storage (Disk)


↑: increasing in speed

↓: increasing in capacity

When I studied computer architecture, I didn't quite understand why this setting is so useful

But from optimized BLAS I realize that it is extremely powerful

Memory Management

Page fault: operand not available in main memory, so it is transported from secondary memory

(usually) overwrites the page least recently used

I/O increases the total time

An example: C = AB + C, n = 1,024

Assumption: a page holds 65,536 doubles = 64 columns

16 pages for each matrix; 48 pages for three matrices

Assumption: available memory is 16 pages; matrix access is column oriented

A = [ 1 2 ]
    [ 3 4 ]

column oriented storage: 1 3 2 4
row oriented storage: 1 2 3 4

Accessing each row of A: 16 page faults, since 1024/64 = 16

Assumption: each access brings a contiguous segment of data into one page

Approach 1: inner product

for i = 1:n
  for j = 1:n
    for k = 1:n
      c(i,j) = a(i,k)*b(k,j) + c(i,j);
    end
  end
end

We use a matlab-like syntax here

At each (i,j): accessing the row a(i, 1:n) causes 16 page faults

Total: 1024² × 16 page faults, i.e., at least 16 million page faults

Approach 2:

for j = 1:n
  for k = 1:n
    for i = 1:n
      c(i,j) = a(i,k)*b(k,j) + c(i,j);
    end
  end
end

For each j, we access all columns of A

A needs 16 pages, but B and C take space as well

So A must be read again for every j

For each j: 16 page faults for A, so 1024 × 16 page faults in total

B and C: 16 page faults each

Approach 3: block algorithms (nb = 256)

for j = 1:nb:n
  for k = 1:nb:n
    for jj = j:j+nb-1
      for kk = k:k+nb-1
        c(:,jj) = a(:,kk)*b(kk,jj) + c(:,jj);
      end
    end
  end
end

In MATLAB, 1:256:1024 means 1, 257, 513, 769
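To see the block idea concretely, here is a runnable Octave/MATLAB sketch that folds the two innermost loops above into one sub-matrix product and checks the result against the built-in multiplication (the setup and the check are ours):

n = 1024; nb = 256;
A = randn(n); B = randn(n); C = zeros(n);
for j = 1:nb:n
  for k = 1:nb:n
    % update a block of columns of C using a block of columns of A
    C(:, j:j+nb-1) = C(:, j:j+nb-1) + A(:, k:k+nb-1)*B(k:k+nb-1, j:j+nb-1);
  end
end
norm(C - A*B, 'fro')   % should be at round-off level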


Note that we calculate

[ A11 ⋯ A14 ] [ B11 ⋯ B14 ]   [ A11B11 + ⋯ + A14B41  ⋯ ]
[  ⋮  ⋱  ⋮  ] [  ⋮  ⋱  ⋮  ] = [          ⋮           ⋱ ]
[ A41 ⋯ A44 ] [ B41 ⋯ B44 ]

Each block: 256 × 256

C11 = A11B11 + ⋯ + A14B41
C21 = A21B11 + ⋯ + A24B41
C31 = A31B11 + ⋯ + A34B41
C41 = A41B11 + ⋯ + A44B41

For each (j, k), B_{k,j} is used to add A_{:,k} B_{k,j} to C_{:,j}


Example: when j = 1, k = 1:

C11 ← C11 + A11B11
  ⋮
C41 ← C41 + A41B11

Use Approach 2 for A_{:,1}B11

A_{:,1} has 256 columns: 1024 × 256 / 65,536 = 4 pages. A_{:,1}, ..., A_{:,4}: 4 × 4 = 16 page faults in calculating C_{:,1}

For A: 16 × 4 page faults in total

B: 16 page faults; C: 16 page faults
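Summing these counts (our arithmetic, using the numbers above):

Approach 1: 1024² × 16 ≈ 16.8 million page faults
Approach 2: 1024 × 16 + 16 + 16 = 16,416 page faults
Approach 3 (blocks): 16 × 4 + 16 + 16 = 96 page faults

The block algorithm reduces page faults by orders of magnitude.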


Optimized BLAS Implementations

OpenBLAS

http://www.openblas.net/

It is an optimized BLAS library based on GotoBLAS2 (see the story below)

Intel MKL (Math Kernel Library)

https://software.intel.com/en-us/mkl


Some Past Stories about Optimized BLAS

BLAS by Kazushige Goto

https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2

See the NY Times article: “Writing the fastest code, by hand, for fun: a human computer keeps speeding up chips”

http://www.nytimes.com/2005/11/28/technology/28super.html?pagewanted=all

Discussion

This discussion roughly explains why GPUs are used for deep learning

Somehow we can do fast matrix-matrix operations on GPUs

Note that we did not touch multi-core implementations, though parallelization is possible

Anyway, the conclusion is that for some operations, using code written by experts is more efficient than our own implementation

How about other operations besides matrix-matrix products?


If they can also be done by calling others' efficient implementations, then a simple and efficient CNN implementation can be done

In the past few months my group has been building such a code

See the repository at https://github.com/cjlin1/simplenn

The purpose of that code is neither industrial use nor quickly trying various architectures

Instead, it aims to be a simple end-to-end code for

education, and


experiments on optimization algorithms

We will explain the details and use it in our subsequent projects

Storage

In the earlier discussion, we checked each individual data instance

However, for practical implementations, all (or some) instances must be considered together for memory and computational efficiency

Recall we do mini-batch stochastic gradient

We assume l is the number of data instances used to calculate the gradient (or the sub-gradient)


In our implementation, we store Z^{m,i}, ∀i = 1, ..., l as the following matrix:

[ Z^{m,1}  Z^{m,2}  ⋯  Z^{m,l} ] ∈ R^{d^m × a^m b^m l}.   (1)

Similarly, we store

∂ξ_i/∂vec(S^{m,i})^T, ∀i

as

[ ∂ξ_1/∂S^{m,1}  ⋯  ∂ξ_l/∂S^{m,l} ] ∈ R^{d^{m+1} × a^m_conv b^m_conv l}.   (2)
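With this layout, the data of one instance occupies a contiguous block of columns. A small sketch of slicing out instance i (the variable names are ours, not from the slides):

% Z stores [Z^{m,1} ... Z^{m,l}] with size d x (a*b*l)
ab = a*b;                        % columns per instance
Zi = Z(:, (i-1)*ab+1 : i*ab);    % Z^{m,i}, of size d x (a*b)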


We will explain our decision.

Note that (1)-(2) are only the main settings for storing these matrices, because for some operations they may need to be re-shaped.

For an easy description we may follow the past discussion and let

Z^{in,i} and Z^{out,i}

be the input and output images of a layer, respectively.

Operations of a Convolutional Layer

Recall that we conduct the following operations:

∂ξ_i/∂vec(S^{m,i})^T = (∂ξ_i/∂vec(Z^{m+1,i})^T ⊙ vec(I[Z^{m+1,i}])^T) P^{m,i}_pool   (3)

∂ξ_i/∂W^m = (∂ξ_i/∂S^{m,i}) φ(pad(Z^{m,i}))^T   (4)

∂ξ_i/∂vec(Z^{m,i})^T = vec((W^m)^T (∂ξ_i/∂S^{m,i}))^T P^m_φ P^m_pad   (5)

Based on the storage just discussed, we will examine two operations in detail:

Generation of φ(pad(Z^{m,i}))
vector × P^m_φ

Generation of φ(pad(Z^{m,i}))

im2col in Existing Packages

Due to the wide use of CNNs, a subroutine for φ(pad(Z^{m,i})) is available in some packages

For example, MATLAB has a built-in function im2col that can generate φ(pad(Z^{m,i})) for

s = 1 and s = h (width of filter)

But this function cannot handle general s

Can we do a reasonably efficient implementation by ourselves?


For an easy description we consider

pad(Z^{m,i}) = Z^{in,i} → Z^{out,i} = φ(Z^{in,i}).

Linear Indices and an Example

Consider the following column-oriented linear indices (i.e., counting elements in a column-oriented way) of Z^{in,i}:

[ 1     d^in + 1   ⋯  (b^in a^in − 1)d^in + 1 ]
[ 2     d^in + 2   ⋯  (b^in a^in − 1)d^in + 2 ]   ∈ R^{d^in × a^in b^in}.   (6)
[ ⋮     ⋮          ⋱  ⋮                       ]
[ d^in  2d^in      ⋯  (b^in a^in)d^in         ]

Every element in

φ(Z^{in,i}) ∈ R^{hh d^in × a^out b^out}

is extracted from Z^{in,i}

The task is to find the mapping between each element in φ(Z^{in,i}) and a linear index of Z^{in,i}

Consider an example with

a^in = 3, b^in = 2, d^in = 1.

Because d^in = 1, we omit the channel subscript.

In addition, we omit the instance index i, so the image is

[ z11 z12 ]
[ z21 z22 ]
[ z31 z32 ]

If

h = 2, s = 1,

the two sub-images are

[ z11 z12 ]       [ z21 z22 ]
[ z21 z22 ]  and  [ z31 z32 ]

By our earlier way of representing images,

Z^{in,i} = [ z^i_{1,1,1}     z^i_{2,1,1}     ⋯  z^i_{a^in,b^in,1}    ]
           [ ⋮               ⋮               ⋱  ⋮                    ]
           [ z^i_{1,1,d^in}  z^i_{2,1,d^in}  ⋯  z^i_{a^in,b^in,d^in} ]

the one we have is

Z^in = [ z11 z21 z31 z12 z22 z32 ]

The linear indices from (6) are

[ 1 2 3 4 5 6 ].

Recall that

φ(Z^{in,i}) =
[ z^i_{1,1,1}     z^i_{1+s,1,1}     ⋯  z^i_{1+(a^out−1)s, 1+(b^out−1)s, 1}    ]
[ z^i_{2,1,1}     z^i_{2+s,1,1}     ⋯  z^i_{2+(a^out−1)s, 1+(b^out−1)s, 1}    ]
[ ⋮               ⋮                 ⋱  ⋮                                      ]
[ z^i_{h,h,1}     z^i_{h+s,h,1}     ⋯  z^i_{h+(a^out−1)s, h+(b^out−1)s, 1}    ]
[ ⋮               ⋮                    ⋮                                      ]
[ z^i_{h,h,d^in}  z^i_{h+s,h,d^in}  ⋯  z^i_{h+(a^out−1)s, h+(b^out−1)s, d^in} ]

Therefore,

φ(Z^in) = [ z11 z21 ]
          [ z21 z31 ]
          [ z12 z22 ]
          [ z22 z32 ]

Linear indices of Z^m used to get the elements of φ(Z^m):

Z^{m,i}:    [ 1 2 3 4 5 6 ]^T
φ(Z^{m,i}): [ 1 2 4 5 2 3 5 6 ]^T.

Example using Matlab/Octave:

octave:8> reshape((1:6)’, 3, 2)

ans =

1 4

2 5

3 6

octave:9> im2col(reshape((1:6)', 3, 2), [2,2], "sliding")

ans =


1 2

2 3

4 5

5 6

To handle all instances together, we store

Z^{in,1}, ..., Z^{in,l}

as

[ vec(Z^{in,1}) ... vec(Z^{in,l}) ]

Denote it as a MATLAB matrix

Z

Then

[ vec(φ(Z^{m,1})) ... vec(φ(Z^{m,l})) ]

is simply

Z(P,:)

in MATLAB, where we store the mapping by

P = [ 1 2 4 5 2 3 5 6 ]^T
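A quick check of this one-liner on the running example with l = 1 (Z and P are as above; the session is ours):

octave:10> Z = (1:6)';   % vec(Z^{in,1}): entry value = linear index
octave:11> P = [1 2 4 5 2 3 5 6]';
octave:12> Z(P,:)'
ans =

   1   2   4   5   2   3   5   6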


All instances are handled in one line

Moreover, we hope Matlab's implementation of this operation is efficient

But how to obtain P?

Note that

[ 1 2 4 5 2 3 5 6 ]^T

also corresponds to the column indices of the non-zero elements in P^m_φ.

[ z11 ]   [ 1                ] [ z11 ]
[ z21 ]   [    1             ] [ z21 ]
[ z12 ]   [          1       ] [ z31 ]
[ z22 ] = [             1    ] [ z12 ]   (7)
[ z21 ]   [    1             ] [ z22 ]
[ z31 ]   [       1          ] [ z32 ]
[ z22 ]   [             1    ]
[ z32 ]   [                1 ]
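One way to verify this picture is to build P^m_φ explicitly as a sparse 0/1 matrix whose row r has its single non-zero in column P(r) (a sketch of ours; in practice the indices alone suffice and the matrix is never formed):

octave:13> P = [1 2 4 5 2 3 5 6]';
octave:14> Pphi = sparse((1:8)', P, 1, 8, 6);  % the 8x6 matrix in (7)
octave:15> z = (1:6)';                         % stand-in for the right-hand vector
octave:16> isequal(Pphi*z, z(P))
ans = 1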

A General Setting

We begin by checking how the linear indices of Z^{in,i} can be mapped to the first column of φ(Z^{in,i})

For simplicity, we consider only channel j

From

Z^{in,i} = [ z^i_{1,1,1}     z^i_{2,1,1}     ⋯  z^i_{a^in,b^in,1}    ]
           [ ⋮               ⋮               ⋱  ⋮                    ]
           [ z^i_{1,1,j}     z^i_{2,1,j}     ⋯  z^i_{a^in,b^in,j}    ]
           [ ⋮               ⋮               ⋱  ⋮                    ]
           [ z^i_{1,1,d^in}  z^i_{2,1,d^in}  ⋯  z^i_{a^in,b^in,d^in} ],

we have

linear index in Z^in                value
j                                   z^in_{1,1,j}
d^in + j                            z^in_{2,1,j}
⋮                                   ⋮
(h − 1)d^in + j                     z^in_{h,1,j}
a^in d^in + j                       z^in_{1,2,j}
⋮                                   ⋮
((h − 1) + a^in)d^in + j            z^in_{h,2,j}
⋮                                   ⋮
((h − 1) + (h − 1)a^in)d^in + j     z^in_{h,h,j}

We rewrite the linear indices in the earlier table as

[ 0 + 0a^in             ]
[ ⋮                     ]
[ (h − 1) + 0a^in       ]
[ 0 + 1a^in             ]
[ ⋮                     ]
[ (h − 1) + 1a^in       ]  d^in + j.   (8)
[ ⋮                     ]
[ 0 + (h − 1)a^in       ]
[ ⋮                     ]
[ (h − 1) + (h − 1)a^in ]

Every linear index in (8) can be represented as

(p + q a^in) d^in + j,   (9)

where

p, q ∈ {0, ..., h − 1}

Then (p + 1, q + 1) corresponds to the pixel position in the convolutional filter

Next we consider the other columns in φ(Z^{in,i}), still fixing the channel to be j.
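As a sanity check with the running example (h = 2, a^in = 3, d^in = 1, j = 1), (9) over (p, q) ∈ {0, 1}² gives

(0 + 0·3)·1 + 1 = 1,  (1 + 0·3)·1 + 1 = 2,  (0 + 1·3)·1 + 1 = 4,  (1 + 1·3)·1 + 1 = 5,

exactly the indices [1 2 4 5]^T of the first column of φ(Z^{in,i}) found earlier.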


From

φ(Z^{in,i}) =
[ z^i_{1,1,1}     z^i_{1+s,1,1}     ⋯  z^i_{1+(a^out−1)s, 1+(b^out−1)s, 1}    ]
[ z^i_{2,1,1}     z^i_{2+s,1,1}     ⋯  z^i_{2+(a^out−1)s, 1+(b^out−1)s, 1}    ]
[ ⋮               ⋮                 ⋱  ⋮                                      ]
[ z^i_{h,h,1}     z^i_{h+s,h,1}     ⋯  z^i_{h+(a^out−1)s, h+(b^out−1)s, 1}    ]
[ ⋮               ⋮                    ⋮                                      ]
[ z^i_{h,h,d^in}  z^i_{h+s,h,d^in}  ⋯  z^i_{h+(a^out−1)s, h+(b^out−1)s, d^in} ]

each column contains the following elements from the jth channel of Z^{in,i}:

z^{in,i}_{1+p+as, 1+q+bs, j},  a = 0, 1, ..., a^out − 1,  b = 0, 1, ..., b^out − 1,   (10)

where

(1 + as, 1 + bs)

is the top-left position of a sub-image in channel j of Z^{in,i}.

From (6), the linear index of each element in (10) is

((1 + p + as − 1) + (1 + q + bs − 1)a^in) d^in + j
= (a + b a^in) s d^in + (p + q a^in) d^in + j,   (11)

where the first factor on the left is the column index in Z^{in,i}, and the second term on the right is exactly (9)

Now we know, for each element of φ(Z^{in,i}), what the corresponding linear index in Z^{in,i} is

Next we discuss the implementation details

First, we compute the elements in (8) with j = 1 by applying Matlab's '+' operator, which has the implicit expansion behavior, to compute the outer sum of the following two arrays:

[ 1               ]
[ d^in + 1        ]   and   [ 0   a^in d^in   ⋯   (h − 1)a^in d^in ].
[ ⋮               ]
[ (h − 1)d^in + 1 ]

The result is the following matrix:

[ 1                a^in d^in + 1            ⋯  (h − 1)a^in d^in + 1            ]
[ d^in + 1         (1 + a^in)d^in + 1       ⋯  (1 + (h − 1)a^in)d^in + 1       ]   (12)
[ ⋮                ⋮                        ⋱  ⋮                               ]
[ (h − 1)d^in + 1  ((h − 1) + a^in)d^in + 1 ⋯  ((h − 1) + (h − 1)a^in)d^in + 1 ]

If the columns are concatenated, we get (8) with j = 1

To get (9) for all channels j = 1, ..., d^in, we compute the outer sum:

vec((12)) + [ 0 1 ... d^in − 1 ]   (13)
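For the running example (h = 2, a^in = 3, d^in = 1), the outer sum is [1; 2] + [0 3], so (12) = [1 4; 2 5]; concatenating its columns gives [1; 2; 4; 5], matching the first column found above, and (13) adds nothing since d^in = 1.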


Then we have the first column of φ(Z^{in,i})

Next, we obtain the other columns of φ(Z^{in,i})

Among the linear indices in (11), the second term corresponds to the indices of the first column, while the first term is the following column offset:

(a + b a^in) s d^in,  ∀a = 0, 1, ..., a^out − 1,  b = 0, 1, ..., b^out − 1.

This is the outer sum of the following two arrays:

[ 0         ]
[ ⋮         ] × s d^in   and   [ 0 ... b^out − 1 ] × a^in s d^in   (14)
[ a^out − 1 ]

Finally, we compute the outer sum of the column offset and the linear indices in the first column of φ(Z^{in,i}):

vec((14))^T + vec((13))   (15)

In the end we store

vec((15)) ∈ R^{hh d^in a^out b^out × 1}

It is a vector collecting the column index of the non-zero in each row of P^m_φ

Note that each row of the 0/1 matrix P^m_φ contains exactly one non-zero element

See the example in (7)

The obtained linear indices are independent of the values of Z^{in,i}

Thus the above procedure only needs to be run once in the beginning

A Simple Code

function idx = find_index_phiZ(a, b, d, h, s)
% a x b image with d channels, h x h filter, stride s
% Linear indices (within channel 1) of the first sub-image; see (12)
first_channel_idx = ([0:h-1]*d + 1)' + [0:h-1]*a*d;
% Add the channel offsets 0, ..., d-1; see (13)
first_col_idx = first_channel_idx(:) + [0:d-1];
a_out = floor((a - h)/s) + 1;
b_out = floor((b - h)/s) + 1;
% Column offsets (a + b*a^in)*s*d^in of (14)
column_offset = ([0:a_out-1]' + [0:b_out-1]*a)*s*d;
% Outer sum (15): one linear index per element of phi(Z)
idx = column_offset(:)' + first_col_idx(:);
idx = idx(:);
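On the running example (a 3 × 2 single-channel image, 2 × 2 filter, stride 1), the function reproduces the mapping P obtained earlier (the session is ours):

octave:17> find_index_phiZ(3, 2, 1, 2, 1)'
ans =

   1   2   4   5   2   3   5   6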


Discussion

The code is simple and short

We assume that the Matlab operations used here are efficient, and hence so is our resulting code

But is that really the case?

We will do experiments to check this

Some works have tried to do similar things (e.g., https://github.com/wiseodd/hipsternet), though we have not seen complete documentation and evaluation

Evaluation of (v^i)^T P^m_φ

In the backward process, the following operation is applied:

(v^i)^T P^m_φ,   (16)

where

v^i = vec((W^m)^T (∂ξ_i/∂S^{m,i}))

Consider the same example used for φ(Z^{in,i})

We have

P^m_φ = [ 1                ]
        [    1             ]
        [          1       ]
        [             1    ]
        [    1             ]
        [       1          ]
        [             1    ]
        [                1 ]

Thus

(P^m_φ)^T v^i = [ v1  v2 + v5  v6  v3  v4 + v7  v8 ]^T,   (17)

which is a kind of “inverse” operation of φ(pad(Z^{m,i}))

We accumulate elements of φ(pad(Z^{m,i})) back to their original positions in pad(Z^{m,i}).

In MATLAB, given the indices

[ 1 2 4 5 2 3 5 6 ]^T   (18)

and the vector v, the function accumarray can directly generate the vector (17).

Example:

octave:18> [v a]

ans =

1 0.406445

2 0.067872

4 0.036638

5 0.279801

2 0.490535

3 0.369743

5 0.429186

6 0.054324


octave:19> accumarray(v,a)

ans =

0.406445

0.558407

0.369743

0.036638

0.708987

0.054324


To do the calculation over a batch of instances, we aim to have

[ (P^m_φ)^T v^1 ]
[ ⋮             ]^T.   (19)
[ (P^m_φ)^T v^l ]

We can apply MATLAB's accumarray on the vector

[ v^1 ]
[ ⋮   ],   (20)
[ v^l ]

by giving the following indices as the input:

[ (18)                                                                 ]
[ (18) + a^m_pad b^m_pad d^m 1_{h^m h^m d^m a^m_conv b^m_conv}         ]
[ (18) + 2 a^m_pad b^m_pad d^m 1_{h^m h^m d^m a^m_conv b^m_conv}       ]   (21)
[ ⋮                                                                    ]
[ (18) + (l − 1) a^m_pad b^m_pad d^m 1_{h^m h^m d^m a^m_conv b^m_conv} ]

where

a^m_pad b^m_pad d^m is the size of pad(Z^{m,i})

and

h^m h^m d^m a^m_conv b^m_conv is the size of φ(pad(Z^{m,i})) and of v^i.

That is, by using the offset (i − 1) a^m_pad b^m_pad d^m, accumarray accumulates v^i to the following positions:

(i − 1) a^m_pad b^m_pad d^m + 1, ..., i a^m_pad b^m_pad d^m.   (22)

(21) can be easily obtained by the following outer sum (the '+' again uses implicit expansion):

vec( (18) + [ 0 ... l − 1 ] a^m_pad b^m_pad d^m )

To obtain

[ v^1 ]
[ ⋮   ]
[ v^l ]

we note that it is the same as

vec( (W^m)^T [ ∂ξ_1/∂S^{m,1} ... ∂ξ_l/∂S^{m,l} ] ).   (23)

Thus we do a matrix-matrix multiplication

From (23), we can see why ∂ξ_i/∂vec(S^{m,i})^T over a batch of instances is stored in the form

[ ∂ξ_1/∂S^{m,1} ... ∂ξ_l/∂S^{m,l} ] ∈ R^{d^{m+1} × a^m_conv b^m_conv l}.

A Simple Code

a_prev = model.ht_pad(m);    % a^m_pad
b_prev = model.wd_pad(m);    % b^m_pad
d_prev = model.ch_input(m);  % d^m
% Shift the stored index vector by (i-1)*d_prev*a_prev*b_prev for instance i; see (21)
idx = net.idx_phiZm(:) + [0:num_v-1]*d_prev*a_prev*b_prev;
% Accumulate all v^i at once; see (19) and (22)
vTP = accumarray(idx(:), V(:), [d_prev*a_prev*b_prev*num_v 1])';

Here we assume

V = [ v^1 ⋯ v^l ]

and num_v is the number of columns
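As a minimal check of this code path with num_v = 1, reusing the values from the accumarray example above (idx_phiZm plays the role of net.idx_phiZm; d_prev*a_prev*b_prev = 6 here):

idx_phiZm = [1 2 4 5 2 3 5 6]';
a = [0.406445; 0.067872; 0.036638; 0.279801; ...
     0.490535; 0.369743; 0.429186; 0.054324];
num_v = 1;
idx = idx_phiZm(:) + [0:num_v-1]*6;
vTP = accumarray(idx(:), a(:), [6*num_v 1])'   % the row vector of (17)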

Discussion

Efficient Implementation

If a package provides efficient implementations of the following operations:

matrix-matrix products
matrix expansion for φ(pad(Z^{m,i}))
outer sums
accumarray

then we can easily have a good CNN implementation

A comparison between MATLAB and Octave would show their respective strengths and weaknesses

Discussion

When working on instances together, it's difficult to decide the best storage settings

Further, the storage settings affect the implementations

Do you think our setting is already the best?

How do we easily check the running time of different storage settings? Is our code flexible enough for such experiments?
