ECS 171: Machine Learning
Lecture 16: Matrix completion, word embedding
Cho-Jui Hsieh, UC Davis
March 11, 2018
Outline
Matrix Completion
PCA
Word Embedding
Recommender Systems
Matrix Factorization Approach: A ≈ WH^T
\min_{W \in \mathbb{R}^{m \times k},\ H \in \mathbb{R}^{n \times k}} \sum_{(i,j) \in \Omega} (A_{ij} - w_i^T h_j)^2 + \lambda(\|W\|_F^2 + \|H\|_F^2),

where \Omega = \{(i, j) \mid A_{ij} \text{ is observed}\}. The regularization terms avoid over-fitting.
Matrix factorization maps users/items to a latent feature space \mathbb{R}^k:
the i-th user ⇒ the i-th row of W, w_i^T
the j-th item ⇒ the j-th row of H, h_j^T
w_i^T h_j: measures the interaction between the i-th user and the j-th item.
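As a concrete check of this objective, here is a minimal NumPy sketch (toy data; the helper name `mf_objective` is mine) that evaluates the regularized loss over the observed entries Ω:

```python
import numpy as np

def mf_objective(A_obs, W, H, lam):
    """Regularized matrix-factorization objective.

    A_obs: dict mapping observed (i, j) -> A_ij (the set Omega).
    W: m x k user factors; H: n x k item factors; lam: regularization weight.
    """
    loss = sum((a - W[i] @ H[j]) ** 2 for (i, j), a in A_obs.items())
    return loss + lam * (np.sum(W ** 2) + np.sum(H ** 2))

# Toy example: 2 users, 2 items, rank k = 1
A_obs = {(0, 0): 1.0, (1, 1): 2.0}
W = np.array([[1.0], [2.0]])
H = np.array([[1.0], [1.0]])
# residuals: (1 - 1)^2 + (2 - 2)^2 = 0; penalty: 0.1 * (5 + 2)
print(mf_objective(A_obs, W, H, lam=0.1))  # 0.7
```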
Latent Feature Space
Properties of the Objective Function
Nonconvex problem (why?)
Example: f(x, y) = \frac{1}{2}(xy - 1)^2
\nabla f(0, 0) = 0, but (0, 0) is clearly not a global optimum.
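To see the stationarity claim concretely, a small sketch that evaluates the hand-derived gradient of this example:

```python
# f(x, y) = 0.5 * (x*y - 1)**2, so grad f = ((xy - 1) * y, (xy - 1) * x)
def grad_f(x, y):
    r = x * y - 1.0
    return (r * y, r * x)

print(grad_f(0.0, 0.0))            # (0, 0) is a stationary point
print(0.5 * (0.0 * 0.0 - 1) ** 2)  # f(0, 0) = 0.5
print(0.5 * (1.0 * 1.0 - 1) ** 2)  # f(1, 1) = 0 < f(0, 0): not a global optimum
```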
ALS: Alternating Least Squares
Objective function:
\min_{W, H} \frac{1}{2} \sum_{(i,j) \in \Omega} (A_{ij} - (WH^T)_{ij})^2 + \frac{\lambda}{2}\|W\|_F^2 + \frac{\lambda}{2}\|H\|_F^2 := f(W, H)
Iteratively fix either H or W and optimize the other:
Input: partially observed matrix A, initial values of W, H
For t = 1, 2, ...
Fix W and update H: H ← \arg\min_H f(W, H)
Fix H and update W: W ← \arg\min_W f(W, H)
ALS: Alternating Least Squares
Define \Omega_j := \{i \mid (i, j) \in \Omega\}; w_i: the i-th row of W; h_j: the j-th row of H.
The subproblem:

\arg\min_H \frac{1}{2} \sum_{(i,j) \in \Omega} (A_{ij} - (WH^T)_{ij})^2 + \frac{\lambda}{2}\|H\|_F^2 = \sum_{j=1}^{n} \Big( \underbrace{\frac{1}{2} \sum_{i \in \Omega_j} (A_{ij} - w_i^T h_j)^2 + \frac{\lambda}{2}\|h_j\|^2}_{\text{ridge regression problem}} \Big)

n ridge regression problems, each with k variables ⇒ O(|\Omega| k^2 + nk^3)
Easy to parallelize (n independent ridge regression subproblems)
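The H-update follows from the normal equations of each ridge subproblem: h_j = (\sum_{i \in \Omega_j} w_i w_i^T + \lambda I)^{-1} \sum_{i \in \Omega_j} A_{ij} w_i. A sketch assuming NumPy, with `np.nan` marking unobserved entries (the function name `update_H` is mine):

```python
import numpy as np

def update_H(A, W, lam):
    """Fix W and solve argmin_H f(W, H): one ridge regression per item j.

    A: m x n matrix with np.nan at unobserved entries; W: m x k.
    Returns the n x k matrix H.
    """
    m, k = W.shape
    n = A.shape[1]
    H = np.zeros((n, k))
    for j in range(n):
        obs = ~np.isnan(A[:, j])   # Omega_j: users who rated item j
        Wj = W[obs]                # rows w_i for i in Omega_j
        # h_j = (Wj^T Wj + lam I)^{-1} Wj^T a_j  (normal equations)
        H[j] = np.linalg.solve(Wj.T @ Wj + lam * np.eye(k), Wj.T @ A[obs, j])
    return H

# Fully observed rank-1 toy example: with W fixed at the true factor and
# lam -> 0, the update recovers H exactly.
W = np.array([[1.0], [2.0]])
H_true = np.array([[3.0], [4.0]])
A = W @ H_true.T
print(update_H(A, W, lam=1e-9))  # close to [[3], [4]]
```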
Principal Component Analysis
Principal Component Analysis (PCA)
Data matrix can be big.
Example: bag-of-word model.
Each document is represented by a d-dimensional vector x, where x_i is the number of occurrences of word i.
number of features = number of potential words ≈ 10,000
Feature generation for documents
Bag of n-gram features (n = 2):
10,000 words ⇒ 10,000^2 potential features
Data Matrix (document)
Use the bag-of-word matrix or the normalized version (TF-IDF) for a dataset (denoted by D):
tfidf(doc,word,D) = tf (doc,word) · idf (word,D)
tf(doc, word): term frequency
(word count in the document) / (total number of terms in the document)
idf(word, D): inverse document frequency
log((number of documents) / (number of documents containing this word))
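A minimal sketch of these two formulas in plain Python (the helper name `tfidf` is mine; documents are represented as token lists):

```python
import math

def tfidf(doc_tokens, word, docs):
    """tf-idf as defined above: tf(doc, word) * idf(word, D).

    doc_tokens: token list for one document; docs: list of token lists (D).
    """
    tf = doc_tokens.count(word) / len(doc_tokens)
    n_with_word = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / n_with_word)
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tfidf(docs[0], "cat", docs))  # (1/3) * log(3/2)
print(tfidf(docs[0], "the", docs))  # 0.0: "the" appears in every document
```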
PCA: Motivation
Data can have huge dimensionality:
Reuters text collection (rcv1): 677,399 documents, 47,236 features (words)
PubMed abstract collection: 8,200,000 documents, 141,043 features (words)
Can we find a low-dimensional representation for each document?
Enable many learning algorithms to run efficiently
Sometimes achieve better prediction performance (de-noising)
Visualize the data
PCA: Motivation
Orthogonal projection of data onto lower-dimensional linear space that:
Maximize variance of projected data (preserve as much information as possible)
Minimize reconstruction error
PCA: Formulation
Given the data x_1, \cdots, x_n \in \mathbb{R}^d, compute the principal vector w by:

w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - w^T \bar{x})^2

where \bar{x} = \sum_i x_i / n is the mean.
First, shift the data so that \bar{x}_i = x_i - \bar{x}; then

w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T \bar{x}_i)^2 = \arg\max_{\|w\|=1} \frac{1}{n} w^T \bar{X} \bar{X}^T w

where each column of \bar{X} is \bar{x}_i.
The first principal component w is the leading eigenvector of \bar{X} \bar{X}^T
(the eigenvector corresponding to the largest eigenvalue)
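A minimal NumPy sketch of this computation on toy 2-D data. Note the slides put the centered samples in the columns of \bar{X}, so with samples as rows the matrix (1/n)\bar{X}\bar{X}^T becomes `Xc.T @ Xc / n`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: n = 200 points in R^2, stretched along the direction (1, 1)
X = rng.normal(size=(200, 1)) @ np.array([[1.0, 1.0]]) \
    + 0.1 * rng.normal(size=(200, 2))

Xc = X - X.mean(axis=0)               # shift data to zero mean
C = Xc.T @ Xc / len(Xc)               # (1/n) Xbar Xbar^T  (d x d)
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
w = eigvecs[:, -1]                    # leading eigenvector = 1st PC
print(w)  # close to +-(1, 1)/sqrt(2)
```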
PCA: Formulation
2nd principal component w_2:
Perpendicular to w_1
Again, largest variance
Eigenvector corresponding to the second largest eigenvalue
Top k principal components: w_1, \cdots, w_k
Top k eigenvectors
The k-dimensional subspace with largest variance:

W = \arg\max_{W \in \mathbb{R}^{d \times k},\ W^T W = I} \sum_{r=1}^{k} \frac{1}{n} w_r^T \bar{X} \bar{X}^T w_r
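The top-k problem is solved by one eigendecomposition of (1/n)\bar{X}\bar{X}^T; a sketch assuming NumPy with samples as rows (the helper name `top_k_pcs` is mine):

```python
import numpy as np

def top_k_pcs(X, k):
    """Top-k principal components: leading k eigenvectors of (1/n) Xbar Xbar^T.

    X: n x d data matrix, one sample per row. Returns a d x k orthonormal W.
    """
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalue order
    return eigvecs[:, ::-1][:, :k]        # largest-eigenvalue vectors first

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
W = top_k_pcs(X, 2)
print(W.shape)   # (5, 2)
print(W.T @ W)   # ~ identity, i.e. W^T W = I
```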
PCA: illustration
Word2vec: Learning Word Representations
Word2vec: Motivation
Goal: understand the meaning of a word
Given a large text corpus, how to learn low-dimensional features to represent a word?
Skip-gram model:
For each word w_i, define the "contexts" of the word as the words surrounding it in an L-sized window:

\cdots, w_{i-L-1}, \underbrace{w_{i-L}, \cdots, w_{i-1}}_{\text{contexts of } w_i}, w_i, \underbrace{w_{i+1}, \cdots, w_{i+L}}_{\text{contexts of } w_i}, w_{i+L+1}, \cdots
Get a collection of (word, context) pairs, denoted by D.
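Collecting D can be sketched in a few lines of plain Python (the helper name `skipgram_pairs` is mine; the window truncates at the ends of the token list):

```python
def skipgram_pairs(tokens, L):
    """Collect D: all (word, context) pairs within an L-sized window."""
    D = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if j != i:
                D.append((w, tokens[j]))
    return D

print(skipgram_pairs(["the", "quick", "brown", "fox"], L=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```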
Skip-gram model
(Figure from http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Use bag-of-word model
Idea 1: Use the bag-of-word model to “describe” each word
Assume we have context words c_1, \cdots, c_d in the corpus; compute
\#(w, c_i) := number of times the pair (w, c_i) appears in D
For each word w, form a d-dimensional (sparse) vector to describe w:
(\#(w, c_1), \cdots, \#(w, c_d))
PMI/PPMI Representation
Similar to TF-IDF: need to consider the frequency of each word and each context.
Instead of using the co-occurrence count \#(w, c), we can define pointwise mutual information:

PMI(w, c) = \log \frac{P(w, c)}{P(w) P(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)},

\#(w) = \sum_c \#(w, c): number of times word w occurred in D
\#(c) = \sum_w \#(w, c): number of times context c occurred in D
|D|: number of pairs in D
Positive PMI (PPMI) usually achieves better performance:
PPMI(w, c) = \max(PMI(w, c), 0)
M_{PPMI}: an n-by-d word feature matrix; each row is a word and each column is a context.
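A small pure-Python sketch of building M_{PPMI} from the pair collection D (the helper name `ppmi_matrix` is mine; a real implementation would use sparse matrices):

```python
import math
from collections import Counter

def ppmi_matrix(pairs, words, contexts):
    """PPMI(w, c) = max(log(#(w,c) * |D| / (#(w) * #(c))), 0) over pairs D."""
    count = Counter(pairs)                  # #(w, c)
    w_count = Counter(w for w, _ in pairs)  # #(w)
    c_count = Counter(c for _, c in pairs)  # #(c)
    D = len(pairs)                          # |D|
    M = [[0.0] * len(contexts) for _ in words]
    for a, w in enumerate(words):
        for b, c in enumerate(contexts):
            if count[(w, c)] > 0:
                pmi = math.log(count[(w, c)] * D / (w_count[w] * c_count[c]))
                M[a][b] = max(pmi, 0.0)     # positive PMI
    return M

pairs = [("cat", "sat"), ("cat", "sat"), ("dog", "ran")]
M = ppmi_matrix(pairs, ["cat", "dog"], ["sat", "ran"])
print(M[0][0])  # log(2 * 3 / (2 * 2)) = log(1.5)
```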
PPMI Matrix
Low-dimensional embedding (Word2vec)
Advantages of extracting low-dimensional dense representations:
Improve computational efficiency for end applications
Better visualization
Better performance (?)
Perform PCA/SVD on the sparse feature matrix:
M_{PPMI} \approx U_k \Sigma_k V_k^T

Then W_{SVD} = U_k \Sigma_k is the representation of each word (each row is a k-dimensional feature for a word).
This is one variant of the word2vec algorithm.
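The truncated-SVD step can be sketched with NumPy (the helper name `svd_embedding` is mine):

```python
import numpy as np

def svd_embedding(M_ppmi, k):
    """W_SVD = U_k Sigma_k from the rank-k truncated SVD of M_PPMI."""
    U, s, Vt = np.linalg.svd(M_ppmi, full_matrices=False)
    return U[:, :k] * s[:k]   # each row: a k-dim feature for one word

# Toy 3-word, 3-context PPMI matrix
M = np.array([[0.4, 0.0, 1.1],
              [0.0, 1.1, 0.0],
              [0.4, 0.0, 0.9]])
W = svd_embedding(M, k=2)
print(W.shape)  # (3, 2): one 2-dimensional embedding per word
```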
Generalized Low-rank Embedding
The SVD basis minimizes

\min_{W, V} \|M_{PPMI} - W V^T\|_F^2

Extensions (GloVe, Google word2vec, \ldots):
Use a different loss function (instead of \|\cdot\|_F)
Negative sampling (less weight on the 0s in M_{PPMI})
Adding bias terms:

M_{PPMI} \approx W V^T + b_w e^T + e b_c^T
Details and comparisons:
“Improving Distributional Similarity with Lessons Learned from Word Embeddings”, Levy et al., ACL 2015.
“GloVe: Global Vectors for Word Representation”, Pennington et al., EMNLP 2014.
Results
The low-dimensional embeddings are (often) meaningful:
(Figure from https://www.tensorflow.org/tutorials/word2vec)
Conclusions
Questions?