ECS 171: Machine Learning
Lecture 16: Matrix completion, word embedding
Cho-Jui Hsieh, UC Davis
March 11, 2018
Outline
Matrix Completion
PCA
Word Embedding
Recommender Systems
Matrix Factorization Approach: A ≈ WH^T
\min_{W \in \mathbb{R}^{m \times k},\ H \in \mathbb{R}^{n \times k}} \sum_{(i,j) \in \Omega} (A_{ij} - w_i^T h_j)^2 + \lambda(\|W\|_F^2 + \|H\|_F^2),

where \Omega = \{(i, j) \mid A_{ij} \text{ is observed}\}. The regularization terms avoid over-fitting.
Matrix factorization maps users/items to a latent feature space \mathbb{R}^k:
the i-th user ⇒ the i-th row of W, w_i^T
the j-th item ⇒ the j-th row of H, h_j^T
w_i^T h_j: measures the interaction between the i-th user and the j-th item.
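As a concrete check of this objective, here is a minimal NumPy sketch (toy data; the helper name `mf_objective` is mine) that evaluates the regularized loss over the observed entries Ω:

```python
import numpy as np

def mf_objective(A_obs, W, H, lam):
    """Regularized matrix-factorization objective.

    A_obs: dict mapping observed (i, j) -> A_ij (the set Omega).
    W: m x k user factors; H: n x k item factors; lam: regularization weight.
    """
    loss = sum((a - W[i] @ H[j]) ** 2 for (i, j), a in A_obs.items())
    return loss + lam * (np.sum(W ** 2) + np.sum(H ** 2))

# Toy example: 2 users, 2 items, rank k = 1
A_obs = {(0, 0): 1.0, (1, 1): 2.0}
W = np.array([[1.0], [2.0]])
H = np.array([[1.0], [1.0]])
# residuals: (1 - 1)^2 + (2 - 2)^2 = 0; penalty: 0.1 * (5 + 2)
print(mf_objective(A_obs, W, H, lam=0.1))  # 0.7
```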
Latent Feature Space
Properties of the Objective Function
Nonconvex problem (why?)
Example: f(x, y) = \frac{1}{2}(xy - 1)^2
\nabla f(0, 0) = 0, but (0, 0) is clearly not a global optimum.
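To see the stationarity claim concretely, a small sketch that evaluates the hand-derived gradient of this example:

```python
# f(x, y) = 0.5 * (x*y - 1)**2, so grad f = ((xy - 1) * y, (xy - 1) * x)
def grad_f(x, y):
    r = x * y - 1.0
    return (r * y, r * x)

print(grad_f(0.0, 0.0))            # (0, 0) is a stationary point
print(0.5 * (0.0 * 0.0 - 1) ** 2)  # f(0, 0) = 0.5
print(0.5 * (1.0 * 1.0 - 1) ** 2)  # f(1, 1) = 0 < f(0, 0): not a global optimum
```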
ALS: Alternating Least Squares
Objective function:
\min_{W, H} \frac{1}{2} \sum_{(i,j) \in \Omega} (A_{ij} - (WH^T)_{ij})^2 + \frac{\lambda}{2}\|W\|_F^2 + \frac{\lambda}{2}\|H\|_F^2 := f(W, H)
Iteratively fix either H or W and optimize the other:
Input: partially observed matrix A, initial values of W, H
For t = 1, 2, ...
Fix W and update H: H ← \arg\min_H f(W, H)
Fix H and update W: W ← \arg\min_W f(W, H)
ALS: Alternating Least Squares
Define \Omega_j := \{i \mid (i, j) \in \Omega\}; w_i: the i-th row of W; h_j: the j-th row of H.
The subproblem:

\arg\min_H \frac{1}{2} \sum_{(i,j) \in \Omega} (A_{ij} - (WH^T)_{ij})^2 + \frac{\lambda}{2}\|H\|_F^2 = \sum_{j=1}^{n} \Big( \underbrace{\frac{1}{2} \sum_{i \in \Omega_j} (A_{ij} - w_i^T h_j)^2 + \frac{\lambda}{2}\|h_j\|^2}_{\text{ridge regression problem}} \Big)

n ridge regression problems, each with k variables ⇒ O(|\Omega| k^2 + nk^3)
Easy to parallelize (n independent ridge regression subproblems)
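The H-update follows from the normal equations of each ridge subproblem: h_j = (\sum_{i \in \Omega_j} w_i w_i^T + \lambda I)^{-1} \sum_{i \in \Omega_j} A_{ij} w_i. A sketch assuming NumPy, with `np.nan` marking unobserved entries (the function name `update_H` is mine):

```python
import numpy as np

def update_H(A, W, lam):
    """Fix W and solve argmin_H f(W, H): one ridge regression per item j.

    A: m x n matrix with np.nan at unobserved entries; W: m x k.
    Returns the n x k matrix H.
    """
    m, k = W.shape
    n = A.shape[1]
    H = np.zeros((n, k))
    for j in range(n):
        obs = ~np.isnan(A[:, j])   # Omega_j: users who rated item j
        Wj = W[obs]                # rows w_i for i in Omega_j
        # h_j = (Wj^T Wj + lam I)^{-1} Wj^T a_j  (normal equations)
        H[j] = np.linalg.solve(Wj.T @ Wj + lam * np.eye(k), Wj.T @ A[obs, j])
    return H

# Fully observed rank-1 toy example: with W fixed at the true factor and
# lam -> 0, the update recovers H exactly.
W = np.array([[1.0], [2.0]])
H_true = np.array([[3.0], [4.0]])
A = W @ H_true.T
print(update_H(A, W, lam=1e-9))  # close to [[3], [4]]
```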
Principal Component Analysis
Principal Component Analysis (PCA)
Data matrix can be big.
Example: bag-of-word model.
Each document is represented by a d-dimensional vector x, where x_i is the number of occurrences of word i.
number of features = number of potential words ≈ 10,000
Feature generation for documents
Bag of n-gram features (n = 2):
10,000 words ⇒ 10,000^2 potential features
Data Matrix (document)
Use the bag-of-word matrix or the normalized version (TF-IDF) for a dataset (denoted by D):
tfidf(doc,word,D) = tf (doc,word) · idf (word,D)
tf(doc, word): term frequency
(word count in the document) / (total number of terms in the document)
idf(word, D): inverse document frequency
log((number of documents) / (number of documents containing this word))
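A minimal sketch of these two formulas in plain Python (the helper name `tfidf` is mine; documents are represented as token lists):

```python
import math

def tfidf(doc_tokens, word, docs):
    """tf-idf as defined above: tf(doc, word) * idf(word, D).

    doc_tokens: token list for one document; docs: list of token lists (D).
    """
    tf = doc_tokens.count(word) / len(doc_tokens)
    n_with_word = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / n_with_word)
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tfidf(docs[0], "cat", docs))  # (1/3) * log(3/2)
print(tfidf(docs[0], "the", docs))  # 0.0: "the" appears in every document
```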
PCA: Motivation
Data can have huge dimensionality:
Reuters text collection (rcv1): 677,399 documents, 47,236 features (words)
PubMed abstract collection: 8,200,000 documents, 141,043 features (words)
Can we find a low-dimensional representation for each document?
Enable many learning algorithms to run efficiently
Sometimes achieve better prediction performance (de-noising)
Visualize the data
PCA: Motivation
Orthogonal projection of data onto lower-dimensional linear space that:
Maximize variance of projected data (preserve as much information as possible)
Minimize reconstruction error
PCA: Formulation
Given the data x_1, \cdots, x_n \in \mathbb{R}^d, compute the principal vector w by:

w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - w^T \bar{x})^2

where \bar{x} = \sum_i x_i / n is the mean.
First, shift the data so that \bar{x}_i = x_i - \bar{x}; then

w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T \bar{x}_i)^2 = \arg\max_{\|w\|=1} \frac{1}{n} w^T \bar{X} \bar{X}^T w

where each column of \bar{X} is \bar{x}_i.
The first principal component w is the leading eigenvector of \bar{X} \bar{X}^T
(the eigenvector corresponding to the largest eigenvalue)
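A minimal NumPy sketch of this computation on toy 2-D data. Note the slides put the centered samples in the columns of \bar{X}, so with samples as rows the matrix (1/n)\bar{X}\bar{X}^T becomes `Xc.T @ Xc / n`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: n = 200 points in R^2, stretched along the direction (1, 1)
X = rng.normal(size=(200, 1)) @ np.array([[1.0, 1.0]]) \
    + 0.1 * rng.normal(size=(200, 2))

Xc = X - X.mean(axis=0)               # shift data to zero mean
C = Xc.T @ Xc / len(Xc)               # (1/n) Xbar Xbar^T  (d x d)
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
w = eigvecs[:, -1]                    # leading eigenvector = 1st PC
print(w)  # close to +-(1, 1)/sqrt(2)
```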
PCA: Formulation
2nd principal component w_2:
Perpendicular to w_1
Again, largest variance
Eigenvector corresponding to the second largest eigenvalue
Top k principal components: w_1, \cdots, w_k
Top k eigenvectors
The k-dimensional subspace with largest variance:

W = \arg\max_{W \in \mathbb{R}^{d \times k},\ W^T W = I} \sum_{r=1}^{k} \frac{1}{n} w_r^T \bar{X} \bar{X}^T w_r
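The top-k problem is solved by one eigendecomposition of (1/n)\bar{X}\bar{X}^T; a sketch assuming NumPy with samples as rows (the helper name `top_k_pcs` is mine):

```python
import numpy as np

def top_k_pcs(X, k):
    """Top-k principal components: leading k eigenvectors of (1/n) Xbar Xbar^T.

    X: n x d data matrix, one sample per row. Returns a d x k orthonormal W.
    """
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalue order
    return eigvecs[:, ::-1][:, :k]        # largest-eigenvalue vectors first

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
W = top_k_pcs(X, 2)
print(W.shape)   # (5, 2)
print(W.T @ W)   # ~ identity, i.e. W^T W = I
```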
PCA: illustration
Word2vec: Learning Word Representations
Word2vec: Motivation
Goal: understand the meaning of a word
Given a large text corpus, how to learn low-dimensional features to represent a word?
Skip-gram model:
For each word w_i, define the "contexts" of the word as the words surrounding it in an L-sized window:

\cdots, w_{i-L-1}, \underbrace{w_{i-L}, \cdots, w_{i-1}}_{\text{contexts of } w_i}, w_i, \underbrace{w_{i+1}, \cdots, w_{i+L}}_{\text{contexts of } w_i}, w_{i+L+1}, \cdots
Get a collection of (word, context) pairs, denoted by D.
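Collecting D can be sketched in a few lines of plain Python (the helper name `skipgram_pairs` is mine; the window truncates at the ends of the token list):

```python
def skipgram_pairs(tokens, L):
    """Collect D: all (word, context) pairs within an L-sized window."""
    D = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if j != i:
                D.append((w, tokens[j]))
    return D

print(skipgram_pairs(["the", "quick", "brown", "fox"], L=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```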
Skip-gram model
(Figure from http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Use bag-of-word model
Idea 1: Use the bag-of-word model to “describe” each word
Assume we have context words c_1, \cdots, c_d in the corpus; compute
\#(w, c_i) := number of times the pair (w, c_i) appears in D
For each word w, form a d-dimensional (sparse) vector to describe w:
(\#(w, c_1), \cdots, \#(w, c_d))
PMI/PPMI Representation
Similar to TF-IDF: need to consider the frequency of each word and each context.
Instead of using the co-occurrence count \#(w, c), we can define pointwise mutual information:

PMI(w, c) = \log \frac{P(w, c)}{P(w) P(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)},

\#(w) = \sum_c \#(w, c): number of times word w occurred in D
\#(c) = \sum_w \#(w, c): number of times context c occurred in D
|D|: number of pairs in D
Positive PMI (PPMI) usually achieves better performance:
PPMI(w, c) = \max(PMI(w, c), 0)
M_{PPMI}: an n-by-d word feature matrix; each row is a word and each column is a context.
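A small pure-Python sketch of building M_{PPMI} from the pair collection D (the helper name `ppmi_matrix` is mine; a real implementation would use sparse matrices):

```python
import math
from collections import Counter

def ppmi_matrix(pairs, words, contexts):
    """PPMI(w, c) = max(log(#(w,c) * |D| / (#(w) * #(c))), 0) over pairs D."""
    count = Counter(pairs)                  # #(w, c)
    w_count = Counter(w for w, _ in pairs)  # #(w)
    c_count = Counter(c for _, c in pairs)  # #(c)
    D = len(pairs)                          # |D|
    M = [[0.0] * len(contexts) for _ in words]
    for a, w in enumerate(words):
        for b, c in enumerate(contexts):
            if count[(w, c)] > 0:
                pmi = math.log(count[(w, c)] * D / (w_count[w] * c_count[c]))
                M[a][b] = max(pmi, 0.0)     # positive PMI
    return M

pairs = [("cat", "sat"), ("cat", "sat"), ("dog", "ran")]
M = ppmi_matrix(pairs, ["cat", "dog"], ["sat", "ran"])
print(M[0][0])  # log(2 * 3 / (2 * 2)) = log(1.5)
```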
PPMI Matrix
Low-dimensional embedding (Word2vec)
Advantages of extracting low-dimensional dense representations:
Improve computational efficiency for end applications
Better visualization
Better performance (?)
Perform PCA/SVD on the sparse feature matrix:
M_{PPMI} \approx U_k \Sigma_k V_k^T

Then W_{SVD} = U_k \Sigma_k is the representation of each word (each row is a k-dimensional feature for a word).
This is one variant of the word2vec algorithm.
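The truncated-SVD step can be sketched with NumPy (the helper name `svd_embedding` is mine):

```python
import numpy as np

def svd_embedding(M_ppmi, k):
    """W_SVD = U_k Sigma_k from the rank-k truncated SVD of M_PPMI."""
    U, s, Vt = np.linalg.svd(M_ppmi, full_matrices=False)
    return U[:, :k] * s[:k]   # each row: a k-dim feature for one word

# Toy 3-word, 3-context PPMI matrix
M = np.array([[0.4, 0.0, 1.1],
              [0.0, 1.1, 0.0],
              [0.4, 0.0, 0.9]])
W = svd_embedding(M, k=2)
print(W.shape)  # (3, 2): one 2-dimensional embedding per word
```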
Generalized Low-rank Embedding
The SVD basis minimizes

\min_{W, V} \|M_{PPMI} - W V^T\|_F^2

Extensions (GloVe, Google word2vec, \ldots):
Use a different loss function (instead of \|\cdot\|_F)
Negative sampling (less weight on the 0s in M_{PPMI})
Adding bias terms:

M_{PPMI} \approx W V^T + b_w e^T + e b_c^T
Details and comparisons:
“Improving Distributional Similarity with Lessons Learned from Word Embeddings”, Levy et al., ACL 2015.
“GloVe: Global Vectors for Word Representation”, Pennington et al., EMNLP 2014.
Results
The low-dimensional embeddings are (often) meaningful:
(Figure from https://www.tensorflow.org/tutorials/word2vec)
Conclusions
Questions?