Implementation
Chih-Jen Lin
National Taiwan University
Last updated: June 18, 2019
Chih-Jen Lin (National Taiwan Univ.) 1 / 85
Outline
1 Introduction
2 Storage
3 Generation of φ(pad(Zm,i))
4 Evaluation of (v i)TPmφ
5 Discussion
Chih-Jen Lin (National Taiwan Univ.) 2 / 85
Introduction
Outline
1 Introduction
2 Storage
3 Generation of φ(pad(Zm,i))
4 Evaluation of (v i)TPmφ
5 Discussion
Chih-Jen Lin (National Taiwan Univ.) 3 / 85
Introduction
Introduction I
After checking formulations for gradient calculation, we would like to get into implementation details
Take the following operation as an example
$$\frac{\partial \xi_i}{\partial W^m} = \frac{\partial \xi_i}{\partial S^{m,i}}\,\phi(\mathrm{pad}(Z^{m,i}))^T$$
It’s a matrix-matrix product
We all know that a three-level for loop does the job
Does that mean we can then write an efficient implementation?
The answer is no
Chih-Jen Lin (National Taiwan Univ.) 4 / 85
Introduction
Introduction II
To explain this, we check some details of matrix-matrix products
We also introduce optimized BLAS (Basic Linear Algebra Subprograms)
Chih-Jen Lin (National Taiwan Univ.) 5 / 85
Introduction
Matrix Multiplication I
We know that C = AB is a mathematical operation with
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Chih-Jen Lin (National Taiwan Univ.) 6 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms I
Let’s test the matrix multiplication
A C program:
#define n 2000
double a[n][n], b[n][n], c[n][n];
int main()
{
  int i, j, k;
  for (i=0;i<n;i++)
    for (j=0;j<n;j++) {
Chih-Jen Lin (National Taiwan Univ.) 7 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms II
      a[i][j]=1; b[i][j]=1;
    }
  for (i=0;i<n;i++)
    for (j=0;j<n;j++) {
      c[i][j]=0;
      for (k=0;k<n;k++)
        c[i][j] += a[i][k]*b[k][j];
    }
}
Chih-Jen Lin (National Taiwan Univ.) 8 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms III
The result
cjlin@linux6:~$ gcc -O3 mat.c
cjlin@linux6:~$ time ./a.out
real 0m59.251s
user 0m58.994s
sys 0m0.096s
Let’s try another way:
Chih-Jen Lin (National Taiwan Univ.) 9 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms IV
#define n 2000
double a[n][n], b[n][n], c[n][n];
int main()
{
  int i, j, k;
  for (i=0;i<n;i++)
    for (j=0;j<n;j++) {
      a[i][j]=1; b[i][j]=1;
      c[i][j]=0;
Chih-Jen Lin (National Taiwan Univ.) 10 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms V
    }
  for (j=0;j<n;j++) {
    for (k=0;k<n;k++)
      for (i=0;i<n;i++)
        c[i][j] += a[i][k]*b[k][j];
  }
}
The result
Chih-Jen Lin (National Taiwan Univ.) 11 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms VI
cjlin@linux6:~$ gcc -O3 mat1.c
cjlin@linux6:~$ time ./a.out
real 2m13.199s
user 2m12.810s
sys 0m0.060s
We see that the first approach is faster. Why?
In the second program, each of
c[i][j], a[i][k], b[k][j]
is accessed column by column
Chih-Jen Lin (National Taiwan Univ.) 12 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms VII
C is row-oriented rather than column-oriented
Now we sense that memory access can be an issue
Let’s try a Matlab program
n = 2000;
A = randn(n,n); B = randn(n,n);
t = cputime; C = A*B; t = cputime -t
To remove the effect of multi-threading, use
matlab -singleCompThread
How to measure the time is an issue
Chih-Jen Lin (National Taiwan Univ.) 13 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms VIII
Elapsed time versus CPU time
cjlin@linux6:~$ matlab -singleCompThread
>> n = 2000;
>> A = randn(n,n); B = randn(n,n);
>> tic; C = A*B; toc
Elapsed time is 1.139780 seconds.
>> t = cputime; C = A*B; t = cputime -t
t =
1.1200
If using multiple cores,
Chih-Jen Lin (National Taiwan Univ.) 14 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms IX
cjlin@linux6:~$ matlab
>> tic; C = A*B; toc
Elapsed time is 0.227179 seconds.
>> t = cputime; C = A*B; t = cputime -t
t =
1.6800
Matlab is much faster than the code written by ourselves. Why?
Optimized BLAS: data locality is exploited
Use the highest (fastest) level of memory as much as possible
Chih-Jen Lin (National Taiwan Univ.) 15 / 85
Introduction
Optimized BLAS: an Example by Using Block Algorithms X
Block algorithms: transferring sub-matrices between different levels of storage
Localize operations to achieve good performance
Chih-Jen Lin (National Taiwan Univ.) 16 / 85
Introduction
Memory Hierarchy I
CPU
↓ Registers
↓ Cache
↓ Main Memory
↓ Secondary storage (Disk)
Chih-Jen Lin (National Taiwan Univ.) 17 / 85
Introduction
Memory Hierarchy II
↑: increasing in speed
↓: increasing in capacity
When I studied computer architecture, I didn't quite understand why this design is so useful
But from optimized BLAS I realized that it is extremely powerful
Chih-Jen Lin (National Taiwan Univ.) 18 / 85
Introduction
Memory Management I
Page fault: operand not available in main memory
It is transported from secondary memory
(usually) overwriting the least recently used page
The extra I/O increases the total time
An example: C = AB + C, n = 1,024
Assumption: a page holds 65,536 doubles = 64 columns
16 pages for each matrix
48 pages for three matrices
Chih-Jen Lin (National Taiwan Univ.) 19 / 85
Introduction
Memory Management II
Assumption: available memory is 16 pages; matrices are accessed column by column
$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
column-oriented storage: 1 3 2 4
row-oriented storage: 1 2 3 4
Accessing each row of A causes 16 page faults: 1024/64 = 16
Assumption: each time a contiguous segment of data is brought into one page
Approach 1: inner product
Chih-Jen Lin (National Taiwan Univ.) 20 / 85
Introduction
Memory Management III
for i = 1:n
  for j = 1:n
    for k = 1:n
      c(i,j) = a(i,k)*b(k,j) + c(i,j);
    end
  end
end
We use a MATLAB-like syntax here
At each (i,j): accessing the row a(i, 1:n) causes 16 page faults
Chih-Jen Lin (National Taiwan Univ.) 21 / 85
Introduction
Memory Management IV
Total: 1024² × 16 page faults
at least 16 million page faults
Approach 2:
for j = 1:n
  for k = 1:n
    for i = 1:n
      c(i,j) = a(i,k)*b(k,j) + c(i,j);
    end
  end
end
Chih-Jen Lin (National Taiwan Univ.) 22 / 85
Introduction
Memory Management V
For each j, we access all columns of A
A needs 16 pages, but B and C take space as well
So A must be read again for every j
For each j: 16 page faults for A
In total 1024 × 16 page faults for A
B, C: 16 page faults each
Approach 3: block algorithms (nb = 256)
Chih-Jen Lin (National Taiwan Univ.) 23 / 85
Introduction
Memory Management VI
for j = 1:nb:n
  for k = 1:nb:n
    for jj = j:j+nb-1
      for kk = k:k+nb-1
        c(:,jj) = a(:,kk)*b(kk,jj) + c(:,jj);
      end
    end
  end
end
In MATLAB, 1:256:1024 means 1, 257, 513, 769
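As a quick sanity check (not the code used for the timings above, and with variable names of our own), Approach 3 can be written directly in MATLAB/Octave and compared with the built-in product:
n = 1024; nb = 256;
a = randn(n,n); b = randn(n,n); c = zeros(n,n);
for j = 1:nb:n
  for k = 1:nb:n
    for jj = j:j+nb-1
      for kk = k:k+nb-1
        c(:,jj) = a(:,kk)*b(kk,jj) + c(:,jj);
      end
    end
  end
end
norm(c - a*b, 'fro')   % only rounding error: the block algorithm reproduces a*b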
Chih-Jen Lin (National Taiwan Univ.) 24 / 85
Introduction
Memory Management VII
Note that we calculate
$$\begin{bmatrix} A_{11} & \cdots & A_{14} \\ \vdots & & \vdots \\ A_{41} & \cdots & A_{44} \end{bmatrix} \begin{bmatrix} B_{11} & \cdots & B_{14} \\ \vdots & & \vdots \\ B_{41} & \cdots & B_{44} \end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + \cdots + A_{14}B_{41} & \cdots \\ \vdots & \ddots \end{bmatrix}$$
Chih-Jen Lin (National Taiwan Univ.) 25 / 85
Introduction
Memory Management VIII
Each block: 256 × 256
$$C_{11} = A_{11}B_{11} + \cdots + A_{14}B_{41}$$
$$C_{21} = A_{21}B_{11} + \cdots + A_{24}B_{41}$$
$$C_{31} = A_{31}B_{11} + \cdots + A_{34}B_{41}$$
$$C_{41} = A_{41}B_{11} + \cdots + A_{44}B_{41}$$
For each (j, k), B_{k,j} is used to add A_{:,k}B_{k,j} to C_{:,j}
Chih-Jen Lin (National Taiwan Univ.) 26 / 85
Introduction
Memory Management IX
Example: when j = 1, k = 1,
$$C_{11} \leftarrow C_{11} + A_{11}B_{11},\ \ldots,\ C_{41} \leftarrow C_{41} + A_{41}B_{11}$$
Use Approach 2 for A_{:,1}B_{11}
A_{:,1}: 256 columns, 1024 × 256/65,536 = 4 pages. A_{:,1}, ..., A_{:,4}: 4 × 4 = 16 page faults in calculating C_{:,1}
For A: 16 × 4 page faults
B: 16 page faults, C: 16 page faults
Chih-Jen Lin (National Taiwan Univ.) 27 / 85
Introduction
Optimized BLAS Implementations
OpenBLAS
http://www.openblas.net/
It is an optimized BLAS library based on GotoBLAS2 (see the story in the next slide)
Intel MKL (Math Kernel Library)
https://software.intel.com/en-us/mkl
Chih-Jen Lin (National Taiwan Univ.) 28 / 85
Introduction
Some Past Stories about Optimized BLAS
BLAS by Kazushige Goto
https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2
See the NY Times article: "Writing the fastest code, by hand, for fun: a human computer keeps speeding up chips"
http://www.nytimes.com/2005/11/28/technology/28super.html?pagewanted=all
Chih-Jen Lin (National Taiwan Univ.) 29 / 85
Introduction
Discussion I
This discussion roughly explains why GPUs are used for deep learning
Somehow we can do fast matrix-matrix operations on GPUs
Note that we did not touch multi-core implementations, though parallelization is possible
Anyway, the conclusion is that for some operations, using code written by experts is more efficient than our own implementation
How about other operations besides matrix-matrix products?
Chih-Jen Lin (National Taiwan Univ.) 30 / 85
Introduction
Discussion II
If they can also be done by calling others' efficient implementations, then a simple and efficient CNN implementation can be done
In the past few months my group has been building such a code
See the repository at https://github.com/cjlin1/simplenn
The purpose of that code is neither for industry use nor for quickly trying various architectures
Instead, it aims to be a simple end-to-end code for
education, and
Chih-Jen Lin (National Taiwan Univ.) 31 / 85
Introduction
Discussion III
experiments on optimization algorithms
We will explain the details and use it in our subsequent projects
Chih-Jen Lin (National Taiwan Univ.) 32 / 85
Storage
Outline
1 Introduction
2 Storage
3 Generation of φ(pad(Zm,i))
4 Evaluation of (v i)TPmφ
5 Discussion
Chih-Jen Lin (National Taiwan Univ.) 33 / 85
Storage
Storage I
In the earlier discussion, we checked each data instance individually.
However, for practical implementations, all (or some) instances must be considered together for memory and computational efficiency.
Recall we do mini-batch stochastic gradient
We assume l is the number of data instances used to calculate the gradient (or the sub-gradient)
Chih-Jen Lin (National Taiwan Univ.) 34 / 85
Storage
Storage II
In our implementation, we store Z^{m,i}, ∀i = 1, ..., l as the following matrix:
$$\begin{bmatrix} Z^{m,1} & Z^{m,2} & \ldots & Z^{m,l} \end{bmatrix} \in R^{d^m \times a^m b^m l}. \qquad (1)$$
Similarly, we store
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T},\ \forall i$$
as
$$\begin{bmatrix} \frac{\partial \xi_1}{\partial S^{m,1}} & \ldots & \frac{\partial \xi_l}{\partial S^{m,l}} \end{bmatrix} \in R^{d^{m+1} \times a^m_{\mathrm{conv}} b^m_{\mathrm{conv}} l}. \qquad (2)$$
Chih-Jen Lin (National Taiwan Univ.) 35 / 85
Storage
Storage III
We will explain our decision.
Note that (1)-(2) are only the main setting to store these matrices, because for some operations they may need to be re-shaped.
For an easy description we may follow the past discussion and let
Z^{in,i} and Z^{out,i}
be the input and output images of a layer, respectively.
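A minimal sketch (toy sizes of our own choosing) of how the storage (1) relates to the per-instance columns used later; a reshape is enough to move between the two views:
d = 3; a = 4; b = 2;                   % toy d^m, a^m, b^m
Z1 = rand(d, a*b); Z2 = rand(d, a*b);  % two instances (l = 2)
Zbatch = [Z1 Z2];                      % storage (1): d^m x (a^m b^m l)
Zvec = reshape(Zbatch, d*a*b, 2);      % column i is vec(Z^{m,i})
norm(Zvec(:,2) - Z2(:))                % 0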
Chih-Jen Lin (National Taiwan Univ.) 36 / 85
Storage
Operations of a Convolutional Layer I
Recall that we conduct the following operations
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T} = \left(\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \mathrm{vec}(I[Z^{m+1,i}])^T\right) P^{m,i}_{\mathrm{pool}} \qquad (3)$$
$$\frac{\partial \xi_i}{\partial W^m} = \frac{\partial \xi_i}{\partial S^{m,i}}\,\phi(\mathrm{pad}(Z^{m,i}))^T \qquad (4)$$
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} = \mathrm{vec}\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\mathrm{pad}} \qquad (5)$$
Chih-Jen Lin (National Taiwan Univ.) 37 / 85
Storage
Operations of a Convolutional Layer II
Based on the way discussed to store variables, we will discuss two operations in detail
Generation of φ(pad(Z^{m,i}))
vector × P^m_φ
Chih-Jen Lin (National Taiwan Univ.) 38 / 85
Generation of φ(pad(Zm,i ))
Outline
1 Introduction
2 Storage
3 Generation of φ(pad(Zm,i))
4 Evaluation of (v i)TPmφ
5 Discussion
Chih-Jen Lin (National Taiwan Univ.) 39 / 85
Generation of φ(pad(Zm,i ))
im2col in Existing Packages I
Due to the wide use of CNN, a subroutine for φ(pad(Z^{m,i})) has been available in some packages
For example, MATLAB has a built-in function im2col that can generate φ(pad(Z^{m,i})) for
s = 1 and s = h (width of the filter)
But this function cannot handle general s
Can we do a reasonably efficient implementation by ourselves?
Chih-Jen Lin (National Taiwan Univ.) 40 / 85
Generation of φ(pad(Zm,i ))
im2col in Existing Packages II
For an easy description we consider
pad(Zm,i) = Z in,i → Z out,i = φ(Z in,i).
Chih-Jen Lin (National Taiwan Univ.) 41 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example I
Consider the following column-oriented linear indices (i.e., counting elements in a column-oriented way) of Z^{in,i}:
$$\begin{bmatrix} 1 & d^{\mathrm{in}}+1 & \ldots & (b^{\mathrm{in}}a^{\mathrm{in}}-1)d^{\mathrm{in}}+1 \\ 2 & d^{\mathrm{in}}+2 & \ldots & (b^{\mathrm{in}}a^{\mathrm{in}}-1)d^{\mathrm{in}}+2 \\ \vdots & \vdots & \ddots & \vdots \\ d^{\mathrm{in}} & 2d^{\mathrm{in}} & \ldots & (b^{\mathrm{in}}a^{\mathrm{in}})d^{\mathrm{in}} \end{bmatrix} \in R^{d^{\mathrm{in}} \times a^{\mathrm{in}}b^{\mathrm{in}}}. \qquad (6)$$
Chih-Jen Lin (National Taiwan Univ.) 42 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example II
Every element in
$$\phi(Z^{\mathrm{in},i}) \in R^{hhd^{\mathrm{in}} \times a^{\mathrm{out}}b^{\mathrm{out}}}$$
is extracted from Z^{in,i}
The task is to find the mapping between each element in φ(Z^{in,i}) and a linear index of Z^{in,i}.
Consider an example with
a^{in} = 3, b^{in} = 2, d^{in} = 1.
Because d^{in} = 1, we omit the channel subscript.
Chih-Jen Lin (National Taiwan Univ.) 43 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example III
In addition, we omit the instance index i, so the image is
$$\begin{bmatrix} z_{11} & z_{12} \\ z_{21} & z_{22} \\ z_{31} & z_{32} \end{bmatrix}.$$
If
h = 2, s = 1,
two sub-images are
$$\begin{bmatrix} z_{11} & z_{12} \\ z_{21} & z_{22} \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} z_{21} & z_{22} \\ z_{31} & z_{32} \end{bmatrix}$$
Chih-Jen Lin (National Taiwan Univ.) 44 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example IV
By our earlier way of representing images,
$$Z^{\mathrm{in},i} = \begin{bmatrix} z^i_{1,1,1} & z^i_{2,1,1} & \ldots & z^i_{a^{\mathrm{in}},b^{\mathrm{in}},1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{1,1,d^{\mathrm{in}}} & z^i_{2,1,d^{\mathrm{in}}} & \ldots & z^i_{a^{\mathrm{in}},b^{\mathrm{in}},d^{\mathrm{in}}} \end{bmatrix},$$
the one we have is
$$Z^{\mathrm{in}} = \begin{bmatrix} z_{11} & z_{21} & z_{31} & z_{12} & z_{22} & z_{32} \end{bmatrix}.$$
The linear indices from (6) are
$$\begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 \end{bmatrix}.$$
Chih-Jen Lin (National Taiwan Univ.) 45 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example V
Recall that
$$\phi(Z^{\mathrm{in},i}) = \begin{bmatrix} z^i_{1,1,1} & z^i_{1+s,1,1} & \cdots & z^i_{1+(a^{\mathrm{out}}-1)s,\,1+(b^{\mathrm{out}}-1)s,\,1} \\ z^i_{2,1,1} & z^i_{2+s,1,1} & \cdots & z^i_{2+(a^{\mathrm{out}}-1)s,\,1+(b^{\mathrm{out}}-1)s,\,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{h,h,1} & z^i_{h+s,h,1} & \cdots & z^i_{h+(a^{\mathrm{out}}-1)s,\,h+(b^{\mathrm{out}}-1)s,\,1} \\ \vdots & \vdots & & \vdots \\ z^i_{h,h,d^{\mathrm{in}}} & z^i_{h+s,h,d^{\mathrm{in}}} & \cdots & z^i_{h+(a^{\mathrm{out}}-1)s,\,h+(b^{\mathrm{out}}-1)s,\,d^{\mathrm{in}}} \end{bmatrix}$$
Chih-Jen Lin (National Taiwan Univ.) 46 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example VI
Therefore,
$$\phi(Z^{\mathrm{in}}) = \begin{bmatrix} z_{11} & z_{21} \\ z_{21} & z_{31} \\ z_{12} & z_{22} \\ z_{22} & z_{32} \end{bmatrix}.$$
Linear indices of Z^{m,i} used to get the elements of φ(Z^{m,i}):
Z^{m,i}: [1 2 3 4 5 6]^T
φ(Z^{m,i}): [1 2 4 5 2 3 5 6]^T.
Example of using Matlab/Octave
Chih-Jen Lin (National Taiwan Univ.) 47 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example VII
octave:8> reshape((1:6)’, 3, 2)
ans =
1 4
2 5
3 6
octave:9>
octave:9> im2col(reshape((1:6)’, 3, 2),
[2,2], "sliding")
ans =
Chih-Jen Lin (National Taiwan Univ.) 48 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example VIII
1 2
2 3
4 5
5 6
To handle all instances together, we store
Z^{in,1}, ..., Z^{in,l}
as
$$\begin{bmatrix} \mathrm{vec}(Z^{\mathrm{in},1}) & \ldots & \mathrm{vec}(Z^{\mathrm{in},l}) \end{bmatrix}$$
Chih-Jen Lin (National Taiwan Univ.) 49 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example IX
Denote it as a MATLAB matrix
Z
Then
$$\begin{bmatrix} \mathrm{vec}(\phi(Z^{m,1})) & \ldots & \mathrm{vec}(\phi(Z^{m,l})) \end{bmatrix}$$
is simply
Z(P,:)
in MATLAB, where we store the mapping by
$$P = \begin{bmatrix} 1 & 2 & 4 & 5 & 2 & 3 & 5 & 6 \end{bmatrix}^T$$
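A small sketch of this one-line extraction on the running example, with two hypothetical instances stored column by column:
Z = [(1:6)' (7:12)'];   % columns are vec(Z^{in,1}) and vec(Z^{in,2})
P = [1 2 4 5 2 3 5 6]';
phiZ = Z(P,:)           % column i is vec(phi(Z^{in,i}))
% the first column is 1 2 4 5 2 3 5 6, matching the im2col output above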
Chih-Jen Lin (National Taiwan Univ.) 50 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example X
All instances are handled in one line
Moreover, we hope Matlab's implementation of this operation is efficient
But how to obtain P?
Note that
$$\begin{bmatrix} 1 & 2 & 4 & 5 & 2 & 3 & 5 & 6 \end{bmatrix}^T$$
also corresponds to the column indices of the non-zero elements in P^m_φ.
Chih-Jen Lin (National Taiwan Univ.) 51 / 85
Generation of φ(pad(Zm,i ))
Linear Indices and an Example XI
$$\begin{bmatrix} z_{11} \\ z_{21} \\ z_{12} \\ z_{22} \\ z_{21} \\ z_{31} \\ z_{22} \\ z_{32} \end{bmatrix} = \begin{bmatrix} 1 & & & & & \\ & 1 & & & & \\ & & & 1 & & \\ & & & & 1 & \\ & 1 & & & & \\ & & 1 & & & \\ & & & & 1 & \\ & & & & & 1 \end{bmatrix} \begin{bmatrix} z_{11} \\ z_{21} \\ z_{31} \\ z_{12} \\ z_{22} \\ z_{32} \end{bmatrix} \qquad (7)$$
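A sketch of (7) with an explicit sparse matrix (our own construction, only to check the mapping):
P = [1 2 4 5 2 3 5 6]';
Pphi = sparse((1:8)', P, 1, 8, 6);   % row r has its single 1 in column P(r)
z = (1:6)';                          % stands for vec(Z^in) in the example
full(Pphi*z)'                        % 1 2 4 5 2 3 5 6, i.e., vec(phi(Z^in))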
Chih-Jen Lin (National Taiwan Univ.) 52 / 85
Generation of φ(pad(Zm,i ))
A General Setting I
We begin by checking how linear indices of Z^{in,i} can be mapped to the first column of φ(Z^{in,i}).
For simplicity, we consider only channel j.
From
$$Z^{\mathrm{in},i} = \begin{bmatrix} z^i_{1,1,1} & z^i_{2,1,1} & \ldots & z^i_{a^{\mathrm{in}},b^{\mathrm{in}},1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{1,1,j} & z^i_{2,1,j} & \ldots & z^i_{a^{\mathrm{in}},b^{\mathrm{in}},j} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{1,1,d^{\mathrm{in}}} & z^i_{2,1,d^{\mathrm{in}}} & \ldots & z^i_{a^{\mathrm{in}},b^{\mathrm{in}},d^{\mathrm{in}}} \end{bmatrix},$$
Chih-Jen Lin (National Taiwan Univ.) 53 / 85
Generation of φ(pad(Zm,i ))
A General Setting II
we have
$$\begin{array}{cc} \text{linear index in } Z^{\mathrm{in}} & \text{value} \\ j & z^{\mathrm{in}}_{1,1,j} \\ d^{\mathrm{in}}+j & z^{\mathrm{in}}_{2,1,j} \\ \vdots & \vdots \\ (h-1)d^{\mathrm{in}}+j & z^{\mathrm{in}}_{h,1,j} \\ a^{\mathrm{in}}d^{\mathrm{in}}+j & z^{\mathrm{in}}_{1,2,j} \\ \vdots & \vdots \\ ((h-1)+a^{\mathrm{in}})d^{\mathrm{in}}+j & z^{\mathrm{in}}_{h,2,j} \\ \vdots & \vdots \\ ((h-1)+(h-1)a^{\mathrm{in}})d^{\mathrm{in}}+j & z^{\mathrm{in}}_{h,h,j} \end{array}$$
Chih-Jen Lin (National Taiwan Univ.) 54 / 85
Generation of φ(pad(Zm,i ))
A General Setting III
We rewrite the linear indices in the earlier table as
$$\begin{bmatrix} 0 + 0a^{\mathrm{in}} \\ \vdots \\ (h-1) + 0a^{\mathrm{in}} \\ 0 + 1a^{\mathrm{in}} \\ \vdots \\ (h-1) + 1a^{\mathrm{in}} \\ \vdots \\ 0 + (h-1)a^{\mathrm{in}} \\ \vdots \\ (h-1) + (h-1)a^{\mathrm{in}} \end{bmatrix} d^{\mathrm{in}} + j. \qquad (8)$$
Chih-Jen Lin (National Taiwan Univ.) 55 / 85
Generation of φ(pad(Zm,i ))
A General Setting IV
Every linear index in (8) can be represented as
$$(p + qa^{\mathrm{in}})d^{\mathrm{in}} + j, \qquad (9)$$
where
p, q ∈ {0, ..., h − 1}
Then (p + 1, q + 1) corresponds to the pixel position in the convolutional filter
Next we consider other columns in φ(Z^{in,i}) by still fixing the channel to be j.
Chih-Jen Lin (National Taiwan Univ.) 56 / 85
Generation of φ(pad(Zm,i ))
A General Setting V
From
$$\phi(Z^{\mathrm{in},i}) = \begin{bmatrix} z^i_{1,1,1} & z^i_{1+s,1,1} & \cdots & z^i_{1+(a^{\mathrm{out}}-1)s,\,1+(b^{\mathrm{out}}-1)s,\,1} \\ z^i_{2,1,1} & z^i_{2+s,1,1} & \cdots & z^i_{2+(a^{\mathrm{out}}-1)s,\,1+(b^{\mathrm{out}}-1)s,\,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{h,h,1} & z^i_{h+s,h,1} & \cdots & z^i_{h+(a^{\mathrm{out}}-1)s,\,h+(b^{\mathrm{out}}-1)s,\,1} \\ \vdots & \vdots & & \vdots \\ z^i_{h,h,d^{\mathrm{in}}} & z^i_{h+s,h,d^{\mathrm{in}}} & \cdots & z^i_{h+(a^{\mathrm{out}}-1)s,\,h+(b^{\mathrm{out}}-1)s,\,d^{\mathrm{in}}} \end{bmatrix}$$
Chih-Jen Lin (National Taiwan Univ.) 57 / 85
Generation of φ(pad(Zm,i ))
A General Setting VI
each column contains the following elements from the jth channel of Z^{in,i}:
$$z^{\mathrm{in},i}_{1+p+as,\,1+q+bs,\,j}, \quad a = 0, 1, \ldots, a^{\mathrm{out}}-1, \quad b = 0, 1, \ldots, b^{\mathrm{out}}-1, \qquad (10)$$
where
(1 + as, 1 + bs)
is the top-left position of a sub-image in channel j of Z^{in,i}.
Chih-Jen Lin (National Taiwan Univ.) 58 / 85
Generation of φ(pad(Zm,i ))
A General Setting VII
From (6), the linear index of each element in (10) is
$$\underbrace{((1 + p + as - 1) + (1 + q + bs - 1)a^{\mathrm{in}})}_{\text{column index in } Z^{\mathrm{in},i}}\, d^{\mathrm{in}} + j = (a + ba^{\mathrm{in}})sd^{\mathrm{in}} + \underbrace{(p + qa^{\mathrm{in}})d^{\mathrm{in}} + j}_{\text{see (9)}}. \qquad (11)$$
Now we know, for each element of φ(Z^{in,i}), what the corresponding linear index in Z^{in,i} is.
Next we discuss the implementation details
Chih-Jen Lin (National Taiwan Univ.) 59 / 85
Generation of φ(pad(Zm,i ))
A General Setting VIII
First, we compute the elements in (8) with j = 1 by applying Matlab's '+' operator, which has the implicit expansion behavior, to compute the outer sum of the following two arrays:
$$\begin{bmatrix} 1 \\ d^{\mathrm{in}} + 1 \\ \vdots \\ (h-1)d^{\mathrm{in}} + 1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 0 & a^{\mathrm{in}}d^{\mathrm{in}} & \ldots & (h-1)a^{\mathrm{in}}d^{\mathrm{in}} \end{bmatrix}.$$
Chih-Jen Lin (National Taiwan Univ.) 60 / 85
Generation of φ(pad(Zm,i ))
A General Setting IX
The result is the following matrix
$$\begin{bmatrix} 1 & a^{\mathrm{in}}d^{\mathrm{in}}+1 & \ldots & (h-1)a^{\mathrm{in}}d^{\mathrm{in}}+1 \\ d^{\mathrm{in}}+1 & (1+a^{\mathrm{in}})d^{\mathrm{in}}+1 & \ldots & (1+(h-1)a^{\mathrm{in}})d^{\mathrm{in}}+1 \\ \vdots & \vdots & \ddots & \vdots \\ (h-1)d^{\mathrm{in}}+1 & ((h-1)+a^{\mathrm{in}})d^{\mathrm{in}}+1 & \ldots & ((h-1)+(h-1)a^{\mathrm{in}})d^{\mathrm{in}}+1 \end{bmatrix} \qquad (12)$$
If columns are concatenated, we get (8) with j = 1
To get (9) for all channels j = 1, ..., d^{in}, we compute the outer sum:
$$\mathrm{vec}((12)) + \begin{bmatrix} 0 & 1 & \ldots & d^{\mathrm{in}}-1 \end{bmatrix} \qquad (13)$$
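For the small example (a^in = 3, b^in = 2, d^in = 1, h = 2), steps (12)-(13) can be traced directly; this is only a check with our own variable names:
a = 3; d = 1; h = 2;
col = ([0:h-1]*d + 1)';   % [1; 2]
row = [0:h-1]*a*d;        % [0 3]
M = col + row             % (12): [1 4; 2 5]
idx1 = M(:) + [0:d-1]     % (13): 1 2 4 5, the first column of phi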
Chih-Jen Lin (National Taiwan Univ.) 61 / 85
Generation of φ(pad(Zm,i ))
A General Setting X
Then we have the first column of φ(Z^{in,i})
Next, we obtain the other columns of φ(Z^{in,i})
In the linear indices in (11), the second term corresponds to the indices of the first column, while the first term is the following column offset
$$(a + ba^{\mathrm{in}})sd^{\mathrm{in}}, \quad \forall a = 0, 1, \ldots, a^{\mathrm{out}}-1, \quad b = 0, 1, \ldots, b^{\mathrm{out}}-1.$$
Chih-Jen Lin (National Taiwan Univ.) 62 / 85
Generation of φ(pad(Zm,i ))
A General Setting XI
This is the outer sum of the following two arrays:
$$\begin{bmatrix} 0 \\ \vdots \\ a^{\mathrm{out}}-1 \end{bmatrix} \times sd^{\mathrm{in}} \quad \text{and} \quad \begin{bmatrix} 0 & \ldots & b^{\mathrm{out}}-1 \end{bmatrix} \times a^{\mathrm{in}}sd^{\mathrm{in}} \qquad (14)$$
Finally, we compute the outer sum of the column offset and the linear indices in the first column of φ(Z^{in,i}):
$$\mathrm{vec}((14))^T + \mathrm{vec}((13)) \qquad (15)$$
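Continuing the same small example, steps (14)-(15) give the whole vector P; again only a sketch with our own names:
a = 3; b = 2; d = 1; h = 2; s = 1;
first_col = [1; 2; 4; 5];                     % vec((13)) from the previous sketch
a_out = floor((a-h)/s) + 1;                   % 2
b_out = floor((b-h)/s) + 1;                   % 1
offset = ([0:a_out-1]' + [0:b_out-1]*a)*s*d;  % (14): [0; 1]
idx = offset(:)' + first_col(:);              % (15)
idx(:)'                                       % 1 2 4 5 2 3 5 6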
Chih-Jen Lin (National Taiwan Univ.) 63 / 85
Generation of φ(pad(Zm,i ))
A General Setting XII
In the end we store
$$\mathrm{vec}((15)) \in R^{hhd^{\mathrm{in}}a^{\mathrm{out}}b^{\mathrm{out}} \times 1}$$
It is a vector collecting
the column index of the non-zero element in each row of P^m_φ
Note that each row of the 0/1 matrix P^m_φ contains exactly one non-zero element.
See the example in (7)
The obtained linear indices are independent of the values of Z^{in,i}.
Chih-Jen Lin (National Taiwan Univ.) 64 / 85
Generation of φ(pad(Zm,i ))
A General Setting XIII
Thus the above procedure only needs to be run once in the beginning.
Chih-Jen Lin (National Taiwan Univ.) 65 / 85
Generation of φ(pad(Zm,i ))
A Simple Code I
function idx = find_index_phiZ(a,b,d,h,s)
% Linear indices of Z^in that form phi(Z^in); a, b, d: input image
% height, width, #channels; h: filter size; s: stride
first_channel_idx = ([0:h-1]*d+1)' + [0:h-1]*a*d;    % (12)
first_col_idx = first_channel_idx(:) + [0:d-1];      % (13)
a_out = floor((a - h)/s) + 1;
b_out = floor((b - h)/s) + 1;
column_offset = ([0:a_out-1]' + [0:b_out-1]*a)*s*d;  % (14)
idx = column_offset(:)' + first_col_idx(:);          % (15)
idx = idx(:);
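Assuming the function above is saved as find_index_phiZ.m, the small example gives the expected mapping:
idx = find_index_phiZ(3, 2, 1, 2, 1);   % a^in = 3, b^in = 2, d^in = 1, h = 2, s = 1
idx'                                    % 1 2 4 5 2 3 5 6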
Chih-Jen Lin (National Taiwan Univ.) 66 / 85
Generation of φ(pad(Zm,i ))
Discussion
The code is simple and short
We assume that the Matlab operations used here are efficient, and so is our resulting code
But is that really the case?
We will do experiments to check this
Some works have tried to do similar things (e.g., https://github.com/wiseodd/hipsternet), though we don't see complete documents and evaluations
Chih-Jen Lin (National Taiwan Univ.) 67 / 85
Evaluation of (v i )TPmφ
Outline
1 Introduction
2 Storage
3 Generation of φ(pad(Zm,i))
4 Evaluation of (v i)TPmφ
5 Discussion
Chih-Jen Lin (National Taiwan Univ.) 68 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ I
In the backward process, the following operation is applied:
$$(v^i)^T P^m_\phi, \qquad (16)$$
where
$$v^i = \mathrm{vec}\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)$$
Consider the same example used for φ(Z^{in,i})
Chih-Jen Lin (National Taiwan Univ.) 69 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ II
We have
$$P^m_\phi = \begin{bmatrix} 1 & & & & & \\ & 1 & & & & \\ & & & 1 & & \\ & & & & 1 & \\ & 1 & & & & \\ & & 1 & & & \\ & & & & 1 & \\ & & & & & 1 \end{bmatrix}$$
Chih-Jen Lin (National Taiwan Univ.) 70 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ III
Thus
$$(P^m_\phi)^T v^i = \begin{bmatrix} v_1 & v_2+v_5 & v_6 & v_3 & v_4+v_7 & v_8 \end{bmatrix}^T, \qquad (17)$$
which is a kind of "inverse" operation of φ(pad(Z^{m,i}))
We accumulate elements in φ(pad(Z^{m,i})) back to their original positions in pad(Z^{m,i}).
Chih-Jen Lin (National Taiwan Univ.) 71 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ IV
In MATLAB, given the indices
$$\begin{bmatrix} 1 & 2 & 4 & 5 & 2 & 3 & 5 & 6 \end{bmatrix}^T \qquad (18)$$
and the vector v, the function accumarray can directly generate the vector (17).
Example:
Chih-Jen Lin (National Taiwan Univ.) 72 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ V
octave:18> [v a]
ans =
1 0.406445
2 0.067872
4 0.036638
5 0.279801
2 0.490535
3 0.369743
5 0.429186
6 0.054324
Chih-Jen Lin (National Taiwan Univ.) 73 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ VI
octave:19> accumarray(v,a)
ans =
0.406445
0.558407
0.369743
0.036638
0.708987
0.054324
Chih-Jen Lin (National Taiwan Univ.) 74 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ VII
To do the calculation over a batch of instances, we aim to have
$$\begin{bmatrix} (P^m_\phi)^T v^1 \\ \vdots \\ (P^m_\phi)^T v^l \end{bmatrix}^T. \qquad (19)$$
We can apply MATLAB's accumarray on the vector
$$\begin{bmatrix} v^1 \\ \vdots \\ v^l \end{bmatrix}, \qquad (20)$$
Chih-Jen Lin (National Taiwan Univ.) 75 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ VIII
by giving the following indices as the input:
$$\begin{bmatrix} (18) \\ (18) + a^m_{\mathrm{pad}}b^m_{\mathrm{pad}}d^m\,\mathbf{1}_{h^m h^m d^m a^m_{\mathrm{conv}}b^m_{\mathrm{conv}}} \\ (18) + 2a^m_{\mathrm{pad}}b^m_{\mathrm{pad}}d^m\,\mathbf{1}_{h^m h^m d^m a^m_{\mathrm{conv}}b^m_{\mathrm{conv}}} \\ \vdots \\ (18) + (l-1)a^m_{\mathrm{pad}}b^m_{\mathrm{pad}}d^m\,\mathbf{1}_{h^m h^m d^m a^m_{\mathrm{conv}}b^m_{\mathrm{conv}}} \end{bmatrix}, \qquad (21)$$
where
a^m_{pad} b^m_{pad} d^m is the size of pad(Z^{m,i})
Chih-Jen Lin (National Taiwan Univ.) 76 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ IX
and
h^m h^m d^m a^m_{conv} b^m_{conv} is the size of φ(pad(Z^{m,i})) and v^i.
That is, by using the offset (i − 1)a^m_{pad} b^m_{pad} d^m, accumarray accumulates v^i to the following positions:
$$(i-1)a^m_{\mathrm{pad}}b^m_{\mathrm{pad}}d^m + 1,\ \ldots,\ ia^m_{\mathrm{pad}}b^m_{\mathrm{pad}}d^m. \qquad (22)$$
Chih-Jen Lin (National Taiwan Univ.) 77 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ X
(21) can be easily obtained by the following outer sum
$$\mathrm{vec}\left((18) + \begin{bmatrix} 0 & \ldots & l-1 \end{bmatrix} a^m_{\mathrm{pad}}b^m_{\mathrm{pad}}d^m\right)$$
To obtain
$$\begin{bmatrix} v^1 \\ \vdots \\ v^l \end{bmatrix}$$
we note that it is the same as
$$\mathrm{vec}\left((W^m)^T \begin{bmatrix} \frac{\partial \xi_1}{\partial S^{m,1}} & \ldots & \frac{\partial \xi_l}{\partial S^{m,l}} \end{bmatrix}\right). \qquad (23)$$
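A toy check (sizes of our own choosing) of why (23) produces all v^i with one matrix-matrix product:
k = 4; dout = 3; ab = 5;       % toy hhd^m, d^{m+1}, a^m_conv b^m_conv
W = rand(dout, k);             % stands for W^m
dS1 = rand(dout, ab); dS2 = rand(dout, ab);
v1 = reshape(W'*dS1, [], 1);   % v^1
v2 = reshape(W'*dS2, [], 1);   % v^2
V = W'*[dS1 dS2];              % one product over the stored batch (2)
norm(V(:) - [v1; v2])          % 0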
Chih-Jen Lin (National Taiwan Univ.) 78 / 85
Evaluation of (v i )TPmφ
(v i)TPmφ XI
Thus we do a matrix-matrix multiplication
From (23), we can see why ∂ξ_i/∂vec(S^{m,i})^T over a batch of instances are stored in the form of
$$\begin{bmatrix} \frac{\partial \xi_1}{\partial S^{m,1}} & \ldots & \frac{\partial \xi_l}{\partial S^{m,l}} \end{bmatrix} \in R^{d^{m+1} \times a^m_{\mathrm{conv}}b^m_{\mathrm{conv}}l}.$$
Chih-Jen Lin (National Taiwan Univ.) 79 / 85
Evaluation of (v i )TPmφ
A Simple Code I
a_prev = model.ht_pad(m);    % a^m_pad
b_prev = model.wd_pad(m);    % b^m_pad
d_prev = model.ch_input(m);  % d^m
% the indices in (21): one column of offsets per instance
idx = net.idx_phiZm(:) + [0:num_v-1]*d_prev*a_prev*b_prev;
% accumulate V(:) into positions idx(:), as in (19)
vTP = accumarray(idx(:), V(:), [d_prev*a_prev*b_prev*num_v 1])';
Chih-Jen Lin (National Taiwan Univ.) 80 / 85
Evaluation of (v i )TPmφ
A Simple Code II
Here we assume
$$V = \begin{bmatrix} v^1 & \cdots & v^l \end{bmatrix}$$
and num_v is the number of columns
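A self-contained sketch of the batch accumarray on the small example (l = 2 hypothetical instances), mirroring the code above without the model/net structures:
P = [1 2 4 5 2 3 5 6]';   % (18)
l = 2; npad = 6;          % pad(Z^{m,i}) has 6 entries here
V = rand(8, l);           % columns are v^1 and v^2
idx = P + [0:l-1]*npad;   % the indices in (21)
vTP = accumarray(idx(:), V(:), [npad*l 1])';
norm(vTP(1:6)' - accumarray(P, V(:,1)))   % 0: segment i equals (P^m_phi)' * v^i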
Chih-Jen Lin (National Taiwan Univ.) 81 / 85
Discussion
Outline
1 Introduction
2 Storage
3 Generation of φ(pad(Zm,i))
4 Evaluation of (v i)TPmφ
5 Discussion
Chih-Jen Lin (National Taiwan Univ.) 82 / 85
Discussion
Efficient Implementation I
If a package provides efficient implementations of the following operations
matrix-matrix products
matrix expansion for φ(pad(Z^{m,i}))
outer sums (implicit expansion)
accumarray
then we can easily have a good CNN implementation
A comparison between MATLAB and Octave shows their respective strengths and weaknesses
Chih-Jen Lin (National Taiwan Univ.) 83 / 85
Discussion
Discussion I
To work on instances together, it's difficult to decide the best storage settings
Further, storage settings affect the implementations
Do you think our setting is already the best?
How do we easily check the running time under different storage settings? Is our code flexible enough for such experiments?
Chih-Jen Lin (National Taiwan Univ.) 84 / 85