ACCELERATED INEXACT SOFT-IMPUTE FOR FAST LARGE-SCALE MATRIX COMPLETION

QUANMING YAO AND JAMES T. KWOK
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY

MATRIX COMPLETION

$$\min_X \; \tfrac{1}{2}\|P_\Omega(X - O)\|_F^2 + \lambda\|X\|_*,$$

where [P_Ω(A)]_ij = A_ij if Ω_ij = 1, and 0 otherwise; and ‖X‖_* is the nuclear norm of X. An important application is recommendation systems.

It assumes that the underlying fully observed matrix X is low-rank, i.e., X = UV. Missing elements are to be inferred from the known observations.
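To make the objective above concrete, here is a minimal numerical sketch (the name `mc_objective` and the toy data are ours, not from the poster): it evaluates ½‖P_Ω(X − O)‖_F² + λ‖X‖_* for a boolean observation mask.

```python
import numpy as np

def mc_objective(X, O, mask, lam):
    """Evaluate 0.5*||P_Omega(X - O)||_F^2 + lam*||X||_*.

    `mask` is a boolean array playing the role of Omega; the nuclear
    norm ||X||_* is the sum of the singular values of X.
    """
    resid = np.where(mask, X - O, 0.0)                  # P_Omega(X - O)
    nuclear = np.linalg.svd(X, compute_uv=False).sum()  # ||X||_*
    return 0.5 * np.sum(resid ** 2) + lam * nuclear

# toy instance: rank-1 ground truth, roughly half the entries observed
rng = np.random.default_rng(0)
O = np.outer(rng.standard_normal(5), rng.standard_normal(4))
mask = rng.random(O.shape) < 0.5
val = mc_objective(O, O, mask, lam=0.1)  # X = O: only the nuclear-norm term remains
```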

PROXIMAL GRADIENT DESCENT

min_x f(x) + λg(x), where (i) both f(·) and g(·) are convex; (ii) f(·) is smooth, while g(·) can be nonsmooth. Let z_t = x_t − ∇f(x_t); it generates x_{t+1} as

$$\arg\min_x \langle x - x_t, \nabla f(x_t)\rangle + \tfrac{1}{2}\|x - x_t\|^2 + \lambda g(x) = \arg\min_x \tfrac{1}{2}\|x - z_t\|^2 + \lambda g(x) \equiv \operatorname{prox}_{\lambda g}(z_t).$$

• this is called the proximal step, often with closed-form solutions.

• convergence rate is O(1/T), but can be accelerated to O(1/T²) by using

y_t = (1 + θ_t)x_t − θ_t x_{t−1},  z_t = y_t − ∇f(y_t).
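The accelerated scheme above can be sketched on a small ℓ₁-regularized least-squares instance, where the proximal step is the closed-form soft-thresholding operator; the function names and test problem are illustrative, not from the poster.

```python
import numpy as np

def soft_threshold(z, t):
    """Closed-form proximal step for t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def accel_prox_grad(A, b, lam, iters=200):
    """Accelerated proximal gradient for min_x 0.5*||Ax - b||^2 + lam*||x||_1,
    using the extrapolation y_t = x_t + theta_t (x_t - x_{t-1})."""
    L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of grad f
    x_prev = x = np.zeros(A.shape[1])
    c = 1.0
    for _ in range(iters):
        theta = (c - 1.0) / (c + 2.0)            # momentum weight
        y = x + theta * (x - x_prev)             # extrapolation
        z = y - A.T @ (A @ y - b) / L            # gradient step at y
        x_prev, x = x, soft_threshold(z, lam / L)  # proximal step
        c += 1.0
    return x

# sparse-recovery toy problem
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.5, 1.0]
x_hat = accel_prox_grad(A, A @ x_true, lam=0.1)
```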

Lemma 1 (SVT, Cai et al. 2010) Let the SVD of a matrix Z be UΣV^⊤. The solution to min_X ½‖X − Z‖²_F + λ‖X‖_* is a low-rank matrix given by

$$\mathrm{SVT}_\lambda(Z) \equiv U(\Sigma - \lambda I)_+ V^\top,$$

where [(A)_+]_ij = max(A_ij, 0).

For g(X) = ‖X‖_*, it is known as Singular Value Thresholding (SVT).
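A minimal sketch of SVT following Lemma 1, computed via a full SVD (the poster's point later on is precisely that this full SVD is what one avoids):

```python
import numpy as np

def svt(Z, lam):
    """Singular value thresholding (Lemma 1): shrink every singular value
    of Z by lam and drop the ones that hit zero, giving a low-rank matrix."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - lam, 0.0)        # (Sigma - lam*I)_+
    keep = s > 0                        # columns that survive thresholding
    return U[:, keep] * s[keep] @ Vt[keep]

rng = np.random.default_rng(2)
Z = rng.standard_normal((8, 6))
X = svt(Z, lam=1.0)
```

The singular values of the output are exactly those of Z shrunk by λ, which is a quick correctness check.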

SOFT-IMPUTE [MAZUMDER ET AL., 2010]

At iteration t, it generates X_{t+1} = SVT_λ(Z_t), where

$$Z_t = P_\Omega(O) + P_\Omega^\perp(X_t) = P_\Omega(O - X_t) + X_t.$$

"Sparse + Low-Rank" structure in Zt:

• let SVD of Xt be UtΣtVt>.

• for any u ∈ Rn, Ztu (same for Z>t v) can becomputed as

$$Z_t u = \underbrace{P_\Omega(O - X_t)u}_{O(\|\Omega\|_1)} + \underbrace{U_t\Sigma_t(V_t^\top u)}_{O((m+n)k)}. \tag{1}$$

• a rank-k SVD on Z_t ∈ ℝ^{m×n} takes O(‖Ω‖₁k + (m + n)k²) instead of O(mnk) time.

• Z_t = X_t − ∇f(X_t) = P_Ω^⊥(X_t) + P_Ω(O), and so

Soft-Impute = Proximal Gradients

• acceleration is possible, but previous results suggested it is useless, as the sparse + low-rank structure in Z_t is lost.
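The "sparse + low-rank" matrix-vector product in (1) can be sketched as below; `zt_matvec` is an illustrative name, and `S` stands for the sparse part P_Ω(O − X_t).

```python
import numpy as np
import scipy.sparse as sp

def zt_matvec(S, U, s, V, u):
    """Z_t u for Z_t = S + U diag(s) V^T, without ever forming Z_t.
    The sparse term costs O(||Omega||_1); the low-rank term O((m + n) k)."""
    return S @ u + U @ (s * (V.T @ u))

rng = np.random.default_rng(3)
m, n, k = 50, 40, 3
U = rng.standard_normal((m, k))
s = rng.random(k)
V = rng.standard_normal((n, k))
S = sp.random(m, n, density=0.05, random_state=4, format="csr")  # sparse part
u = rng.standard_normal(n)
fast = zt_matvec(S, U, s, V, u)
dense = (S.toarray() + (U * s) @ V.T) @ u   # same product the O(mn) way
```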

Proposed AIS-Impute can be downloaded at https://github.com/quanmingyao/AIS-impute.

PROPOSED ALGORITHM (AIS-IMPUTE)

Contributions

• efficient acceleration is possible with sparse + low-rank structure;

• speeding up approximate SVT using the power method.

Fast convergence + Low iteration complexity

Sparse + Low-Rank

Z_t with acceleration:

$$Z_t = \underbrace{P_\Omega(O - Y_t)}_{\text{sparse}} + \underbrace{(1 + \theta_t)X_t - \theta_t X_{t-1}}_{\text{sum of two low-rank matrices}}.$$

Similar to (1), for any u ∈ ℝⁿ, Z_t u can be computed as

$$\underbrace{P_\Omega(O - Y_t)u}_{O(\|\Omega\|_1)} + \underbrace{(1 + \theta_t)U_t\Sigma_t V_t^\top u}_{O((m+n)k)} - \underbrace{\theta_t U_{t-1}\Sigma_{t-1} V_{t-1}^\top u}_{O((m+n)k)}.$$

• again, this takes O(‖Ω‖₁k + (m + n)k²) time, but the convergence rate is improved to O(1/T²).
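Such a fast matvec plugs directly into an iterative SVD solver. A hedged sketch using SciPy's `LinearOperator` with `svds` (all variable names illustrative): the rank-k SVD of the accelerated Z_t is computed without forming Z_t densely.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

# Accelerated iterate Z_t = S + (1 + theta) X_t - theta X_{t-1}, where S is
# sparse and X_t = U1 diag(s1) V1^T, X_{t-1} = U0 diag(s0) V0^T are low-rank.
rng = np.random.default_rng(5)
m, n, k, theta = 60, 50, 4, 0.5
U1, V1, s1 = rng.standard_normal((m, k)), rng.standard_normal((n, k)), rng.random(k)
U0, V0, s0 = rng.standard_normal((m, k)), rng.standard_normal((n, k)), rng.random(k)
S = sp.random(m, n, density=0.05, random_state=6, format="csr")

def mv(u):    # Z_t u in O(||Omega||_1 + (m + n) k) time
    u = np.ravel(u)
    return (S @ u + (1 + theta) * (U1 @ (s1 * (V1.T @ u)))
            - theta * (U0 @ (s0 * (V0.T @ u))))

def rmv(v):   # Z_t^T v, same structure
    v = np.ravel(v)
    return (S.T @ v + (1 + theta) * (V1 @ (s1 * (U1.T @ v)))
            - theta * (V0 @ (s0 * (U0.T @ v))))

Zt = LinearOperator((m, n), matvec=mv, rmatvec=rmv)
U, sig, Vt = svds(Zt, k=3)   # rank-3 SVD, Z_t never materialized
```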

Approximate SVT

A1 PowerMethod(Z̃, R, ε̃).
Require: Z̃ ∈ ℝ^{m×n}, initial R ∈ ℝ^{n×k} for warm-start, tolerance ε̃;
1: Initialize Y_1 ← Z̃R;
2: for τ = 1, 2, . . . do
3: Q_{τ+1} = QR(Y_τ); Y_{τ+1} = Z̃(Z̃^⊤ Q_{τ+1});
4: if ‖Q_{τ+1}Q_{τ+1}^⊤ − Q_τ Q_τ^⊤‖_F ≤ ε̃, break;
5: end for
6: return Q_{τ+1}.

Since only singular values larger than λ are needed, we use the power method [Halko et al., 2011] to approximate SVT. The error is controlled by ε̃.
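A1 can be sketched as below, assuming R ∈ ℝ^{n×k}; for a well-separated spectrum, the projector QQ^⊤ converges to the projector onto the top-k left singular subspace.

```python
import numpy as np

def power_method(Z, R, tol=1e-9, max_iter=100):
    """Block power iteration (sketch of A1): approximate the span of the
    top-k left singular vectors of Z, warm-started from R of shape (n, k)."""
    Q = np.linalg.qr(Z @ R)[0]                 # Y_1 = Z R, then orthonormalize
    for _ in range(max_iter):
        Q_new = np.linalg.qr(Z @ (Z.T @ Q))[0]
        # stop once the projector Q Q^T stops changing
        if np.linalg.norm(Q_new @ Q_new.T - Q @ Q.T, "fro") <= tol:
            return Q_new
        Q = Q_new
    return Q

# test matrix with a clear gap after the third singular value
rng = np.random.default_rng(7)
U0 = np.linalg.qr(rng.standard_normal((30, 5)))[0]
V0 = np.linalg.qr(rng.standard_normal((20, 5)))[0]
Z = U0 @ np.diag([10.0, 9.0, 8.0, 1.0, 0.5]) @ V0.T
Q = power_method(Z, rng.standard_normal((20, 3)))
```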

A2 approx-SVT(Z̃_t, R, λ, ε̃).
Require: Z̃_t ∈ ℝ^{m×n}, R ∈ ℝ^{n×k}, λ and ε̃;
1: Q = PowerMethod(Z̃_t, R, ε̃);
2: [U, Σ, V] = SVD(Q^⊤ Z̃_t);
3: U = {u_i | σ_i > λ}; V = {v_i | σ_i > λ};
4: Σ = (Σ − λI)_+;
5: return QU, Σ and V.
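A compact sketch of A2, with the power method inlined for self-containedness and a fixed iteration count standing in for the ε̃-based stopping rule:

```python
import numpy as np

def approx_svt(Z, R, lam, iters=20):
    """Approximate SVT (sketch of A2): find the dominant subspace Q by block
    power iteration, then SVD the small matrix Q^T Z and threshold."""
    Q = np.linalg.qr(Z @ R)[0]                 # warm-start from R
    for _ in range(iters):                     # fixed-count power iterations
        Q = np.linalg.qr(Z @ (Z.T @ Q))[0]
    U, s, Vt = np.linalg.svd(Q.T @ Z, full_matrices=False)
    keep = s > lam                             # only sigma_i > lam survive
    return Q @ U[:, keep], s[keep] - lam, Vt[keep].T

# rank-4 test matrix with singular values 5, 4, 3, 0.2
rng = np.random.default_rng(8)
U0 = np.linalg.qr(rng.standard_normal((25, 4)))[0]
V0 = np.linalg.qr(rng.standard_normal((20, 4)))[0]
Z = U0 @ np.diag([5.0, 4.0, 3.0, 0.2]) @ V0.T
QU, sig, V = approx_svt(Z, rng.standard_normal((20, 4)), lam=1.0)
X = QU * sig @ V.T    # approximates SVT_1(Z)
```

With λ = 1, the surviving singular values of SVT₁(Z) should be 4, 3 and 2, while 0.2 is thresholded away.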

A3 AIS-Impute (proposed algorithm).
Require: partially observed matrix O, parameter λ, decay parameter ν ∈ (0, 1), threshold ε;
1: [U_0, λ_0, V_0] = rank-1 SVD(P_Ω(O));
2: c = 1, ε_0 = ‖P_Ω(O)‖_F, X_0 = X_1 = λ_0 U_0 V_0^⊤;
3: for t = 1, 2, . . . do
4: ε_t = ν^t ε_0; θ_t = (c − 1)/(c + 2);
5: λ_t = ν^t(λ_0 − λ) + λ;
6: Y_t = X_t + θ_t(X_t − X_{t−1});
7: Z̃_t = Y_t + P_Ω(O − Y_t);
8: V_{t−1} = V_{t−1} − V_t(V_t^⊤ V_{t−1}), remove zero columns;
9: R_t = QR([V_t, V_{t−1}]);
10: [U_{t+1}, Σ_{t+1}, V_{t+1}] = approx-SVT(Z̃_t, R_t, λ_t, ε_t);
11: if F(U_{t+1}Σ_{t+1}V_{t+1}^⊤) > F(U_tΣ_tV_t^⊤), c = 1; else c = c + 1;
12: end for
13: return X_{t+1} = U_{t+1}Σ_{t+1}V_{t+1}^⊤.

• core steps are 7–10 (approximate SVT); the power method is warm-started with the last two iterations (V_t and V_{t−1}).

• step 5: continuation strategy [Toh and Yun, 2010; Mazumder et al., 2010];

• step 11: restarts if F (X) starts to increase.
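A simplified end-to-end sketch of A3, using an exact SVT in place of approx-SVT so that only the acceleration (step 6), continuation (step 5), and restart (step 11) logic remain; all names are illustrative.

```python
import numpy as np

def ais_impute(O, mask, lam, nu=0.7, iters=100):
    """Simplified sketch of A3: Nesterov acceleration (step 6), continuation
    on lambda (step 5), and objective-based restart (step 11), but with an
    exact SVT standing in for approx-SVT."""
    def svt(Z, t):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - t, 0.0)
        return U * s @ Vt

    def F(X):   # objective 0.5*||P_Omega(X - O)||_F^2 + lam*||X||_*
        r = np.where(mask, X - O, 0.0)
        return 0.5 * np.sum(r ** 2) + lam * np.linalg.svd(X, compute_uv=False).sum()

    lam0 = np.linalg.norm(np.where(mask, O, 0.0), 2)   # top singular value
    X_prev = X = np.zeros_like(O)
    c = 1.0
    for t in range(1, iters + 1):
        theta = (c - 1.0) / (c + 2.0)
        lam_t = nu ** t * (lam0 - lam) + lam           # continuation
        Y = X + theta * (X - X_prev)                   # acceleration
        Z = Y + np.where(mask, O - Y, 0.0)
        X_prev, X = X, svt(Z, lam_t)
        c = 1.0 if F(X) > F(X_prev) else c + 1.0       # restart
    return X

# rank-2 ground truth, about 60% of entries observed
rng = np.random.default_rng(9)
O_full = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 30))
mask = rng.random(O_full.shape) < 0.6
X_hat = ais_impute(O_full, mask, lam=0.5)
```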

CONVERGENCE ANALYSIS

Approximation quality of approximate SVT:

Proposition 2 Let $h_{\lambda\|\cdot\|_*}(X; \tilde Z_t) \equiv \tfrac{1}{2}\|X - \tilde Z_t\|^2 + \lambda\|X\|_*$. Assume k̂ ≥ k and ε̃ ≥ α_t η_t^j √(1 + η_t²). Then,

$$h_{\lambda\|\cdot\|_*}(\hat X_t; \tilde Z_t) \le h_{\lambda\|\cdot\|_*}(\mathrm{SVT}_\lambda(\tilde Z_t); \tilde Z_t) + \frac{\eta_t}{1-\eta_t}\,\tilde\varepsilon\,\beta_t\gamma_t.$$

Theorem 3 (Convergence of AIS-Impute) Assume the conditions in Proposition 2; Algorithm 3 converges to the optimal solution with a rate of O(1/T²).

SYNTHETIC DATA

Methods compared: (i) accelerated proximal gradient algorithm ("APG") [Ji and Ye, 2009]; (ii) Soft-Impute [Mazumder et al., 2010]; (iii) the proposed accelerated inexact Soft-Impute algorithm ("AIS-Impute"). Results clearly indicate the "fast rate + low iteration complexity" of AIS-Impute.

(a) objective vs #iterations. (b) objective vs time.

MOVIELENS DATA

Besides proximal algorithms, we also compare with (i) active subspace selection ("active") [Hsieh and Olsen, 2014]; (ii) the Frank-Wolfe algorithm ("boost") [Zhang et al., 2012]; (iii) a recent variant of Soft-Impute [Hastie et al., 2014]; and (iv) a trust-region method ("TR") [Mishra et al., 2013].

(c) MovieLens-100K. (d) MovieLens-1M. (e) MovieLens-10M.

All algorithms are equally good at recovering the missing entries, but AIS-Impute is the fastest.