Statistical Methods for NLP - Columbia University


Statistical Methods for NLP

Text Categorization, Support Vector Machines

Sameer Maskey

Announcement

- Reading Assignments
  - Will be posted online tonight
- Homework 1
  - Assigned and available from the course website
  - Due in 2 weeks (Feb 16, 4pm)
  - 2 programming assignments

Project Proposals

- Reminder to think about projects
- Proposals due in 3 weeks (Feb 23)

Topics for Today

- Naïve Bayes Classifier for Text
- Smoothing
- Support Vector Machines
- Paper review session

Naïve Bayes Classifier for Text

$P(y_k, X_1, X_2, \ldots, X_N) = P(y_k) \prod_i P(X_i \mid y_k)$

$P(y_k)$ is the prior probability of the class and $P(X_i \mid y_k)$ is the conditional probability of a feature given the class. Here N is the number of words in the document, not to be confused with the total vocabulary size.

Naïve Bayes Classifier for Text

$P(y = y_k \mid X_1, X_2, \ldots, X_N) = \frac{P(y = y_k)\, P(X_1, X_2, \ldots, X_N \mid y = y_k)}{\sum_j P(y = y_j)\, P(X_1, X_2, \ldots, X_N \mid y = y_j)} = \frac{P(y = y_k) \prod_i P(X_i \mid y = y_k)}{\sum_j P(y = y_j) \prod_i P(X_i \mid y = y_j)}$

Decision rule: $y \leftarrow \arg\max_{y_k} P(y = y_k) \prod_i P(X_i \mid y = y_k)$
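As a concrete illustration of this decision rule, here is a minimal sketch in Python/NumPy. It assumes the parameters have already been estimated and stored as log-probability arrays; the names `log_priors` and `log_likelihoods` and the word-id encoding of a document are illustrative assumptions, not from the lecture. Working in log space turns the product over words into a sum.

```python
import numpy as np

# Hypothetical pre-estimated parameters for K classes and a vocabulary of V words.
# log_priors:      shape (K,)   -- log P(y = y_k)
# log_likelihoods: shape (K, V) -- log P(X_i = word_v | y = y_k)
def classify(word_ids, log_priors, log_likelihoods):
    """Return argmax_k [ log P(y_k) + sum_i log P(X_i | y_k) ] for a document
    given as a list of word ids; the product becomes a sum in log space."""
    scores = log_priors + log_likelihoods[:, word_ids].sum(axis=1)
    return int(np.argmax(scores))
```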

Naïve Bayes Classifier for Text

- Given the training data, what are the parameters to be estimated?

Prior P(y):
  Diabetes: 0.8
  Hepatitis: 0.2

P(X | y = Diabetes):
  the: 0.001
  diabetic: 0.02
  blood: 0.0015
  sugar: 0.02
  weight: 0.018

P(X | y = Hepatitis):
  the: 0.001
  diabetic: 0.0001
  water: 0.0118
  fever: 0.01
  weight: 0.008

$y \leftarrow \arg\max_{y_k} P(y = y_k) \prod_i P(X_i \mid y = y_k)$

Estimating Parameters

- Maximum Likelihood Estimates
  - Relative frequency counts
- For a new document
  - Find which class gives the higher posterior probability
  - Log ratio, thresholding
  - Classify accordingly (a sketch follows below)
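A hedged sketch of relative-frequency (MLE) estimation and the log-ratio decision for a two-class problem. The data layout (token lists plus class labels) and the small probability floor used for unseen words are assumptions made here for illustration; the zero-count problem that the floor papers over is exactly what the smoothing slides below address.

```python
import math
from collections import Counter

def estimate_mle(docs, labels):
    """Relative-frequency (MLE) estimates.
    docs: list of token lists; labels: parallel list of class names."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    cond = {c: {w: n / sum(wc.values()) for w, n in wc.items()}
            for c, wc in word_counts.items()}
    return priors, cond

def log_ratio(tokens, priors, cond, c1, c2):
    """log [P(c1) prod_i P(w_i|c1)] - log [P(c2) prod_i P(w_i|c2)].
    Classify as c1 if the ratio exceeds a threshold (e.g. 0).
    Unseen words get a tiny floor so the log is defined (see smoothing)."""
    s = math.log(priors[c1]) - math.log(priors[c2])
    for w in tokens:
        s += math.log(cond[c1].get(w, 1e-12)) - math.log(cond[c2].get(w, 1e-12))
    return s
```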

Smoothing

- MLE for Naïve Bayes (relative frequency counts) may not generalize well
  - Zero counts
- Smoothing
  - With less evidence, believe the prior more
  - With more evidence, believe the data more

Laplace Smoothing

- Assume we have one more count for each element
  - Zero counts become 1

$P_{smooth}(w) = \frac{c(w) + 1}{\sum_{w'} \{c(w') + 1\}} = \frac{c(w) + 1}{N + V}$

where N is the total number of word tokens and V is the vocabulary size.
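A minimal sketch of this add-one estimate, assuming counts are kept in a `Counter` and the vocabulary is given as a set (both assumptions for illustration):

```python
from collections import Counter

def laplace_smooth(counts: Counter, vocab):
    """P_smooth(w) = (c(w) + 1) / (N + V), with N = total tokens, V = |vocab|.
    Words with zero counts get probability 1 / (N + V) instead of 0."""
    N = sum(counts.values())
    V = len(vocab)
    return {w: (counts.get(w, 0) + 1) / (N + V) for w in vocab}
```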

Back to Discriminative Classification

[Figure: a linear decision boundary with weight vector w and bias b]

$f(x) = w^T x + b$

Linear Classification

- If we have linearly separable data, we can find w such that

$y_i (w^T x_i + b) > 0 \quad \forall i$
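A one-line check of this separability condition, as a sketch in NumPy (labels assumed to be coded as -1/+1):

```python
import numpy as np

def separates(w, b, X, y):
    """True if the hyperplane (w, b) satisfies y_i (w^T x_i + b) > 0 for all i.
    X: (n, d) data matrix; y: labels in {-1, +1}."""
    return bool(np.all(y * (X @ w + b) > 0))
```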

Margin

- Let us have hyperplanes H+ and H- such that

$w^T x_i + b \geq +1 \quad \text{if } y_i = +1$
$w^T x_i + b \leq -1 \quad \text{if } y_i = -1$

$y_i (w^T x_i + b) - 1 \geq 0 \quad \forall i$

- The total margin is the sum of d+ and d-, the distances from the separating hyperplane to the nearest positive and negative examples

Maximizing Margin

- Distance between H and H+ is $\frac{1}{\|w\|}$
- Distance between H+ and H- is $\frac{2}{\|w\|}$
- In order to maximize the margin, we need to minimize the denominator, i.e. minimize $\frac{1}{2}\|w\|^2$

Maximizing Margin with Constraints

- We can combine the two inequalities to get $y_i (w^T x_i + b) - 1 \geq 0 \quad \forall i$
- Problem formulation
  - Minimize $\frac{\|w\|^2}{2}$
  - Subject to $y_i (w^T x_i + b) - 1 \geq 0 \quad \forall i$

Solving with Lagrange Multipliers

- Solve by introducing Lagrange multipliers for the constraints
- For given $\alpha_i$, minimize

$J(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{n} \alpha_i \{ y_i (w^T x_i + b) - 1 \}$

- The partial derivatives are

$\frac{\partial}{\partial w} J(w, b, \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i$
$\frac{\partial}{\partial b} J(w, b, \alpha) = -\sum_{i=1}^{n} \alpha_i y_i$

Setting these to zero gives $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$, which are substituted back to obtain the dual.

Dual Problem

- Solve the dual problem instead
- Maximize

$J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$

- Subject to the constraints

$\alpha_i \geq 0 \quad \forall i$
$\sum_{i=1}^{n} \alpha_i y_i = 0$
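A small sketch that evaluates this dual objective and its constraints for a candidate alpha vector (NumPy; the function names and the tolerance are illustrative assumptions):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """J(alpha) = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)."""
    K = X @ X.T                                  # Gram matrix of dot products x_i . x_j
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

def feasible(alpha, y, tol=1e-8):
    """Constraints: alpha_i >= 0 for all i and sum_i alpha_i y_i = 0."""
    return bool(np.all(alpha >= -tol) and abs(alpha @ y) < tol)
```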

Quadratic Programming Problem

� Minimize f(x) such that g(x) = k

� Where f(x) is quadratic and g(x) are linear constraints

� Constrained optimization problem

� Saw the example before

SVM Solution

- The solution is a linear combination of weighted training examples

$w = \sum_{i=1}^{n} \alpha_i y_i x_i$

- Sparse solution. Why? The weights are zero for non-support vectors
- Prediction for a new example x:

$f(x) = \sum_{i \in SV} \alpha_i y_i (x_i \cdot x) + b$
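A sketch of both formulas in NumPy, assuming the alphas and the bias b are already known (how they are obtained is the topic of the next slide); the support-vector tolerance is an assumption:

```python
import numpy as np

def recover_w(alpha, X, y):
    """w = sum_i alpha_i y_i x_i; only support vectors (alpha_i > 0) contribute."""
    return (alpha * y) @ X

def svm_predict(x, alpha, X, y, b, tol=1e-8):
    """f(x) = sum_{i in SV} alpha_i y_i (x_i . x) + b, using only the support vectors."""
    sv = alpha > tol
    return float((alpha[sv] * y[sv]) @ (X[sv] @ x) + b)
```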

Sequential Minimal Optimization (SMO) Algorithm

- The weights are just linear combinations of the training vectors, weighted by the alphas
- We still have not answered how we obtain the alphas
- Coordinate ascent:

do until converged:
    select a pair alpha(i) and alpha(j)
    reoptimize the dual objective J(alpha) with respect to alpha(i) and alpha(j),
        holding all other alphas constant
done
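The sketch below fleshes out that loop in the style of the commonly taught "simplified SMO" (random choice of the second index, no pair-selection heuristics); it is not necessarily the exact algorithm from the lecture, and it already uses the box constraint C from the soft-margin formulation covered later. Everything is an illustrative assumption except the overall structure: pick a pair, reoptimize it analytically, hold the rest fixed.

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Toy SMO-style coordinate ascent on the dual (linear kernel).
    Returns (alpha, b)."""
    n = X.shape[0]
    K = X @ X.T                            # Gram matrix
    alpha, b, passes = np.zeros(n), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]          # prediction error on x_i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # Box bounds keeping alpha_i, alpha_j feasible under sum_i alpha_i y_i = 0
                if y[i] != y[j]:
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # Update the bias from the KKT conditions
                b1 = (b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i]
                      - y[j] * (alpha[j] - aj_old) * K[i, j])
                b2 = (b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j]
                      - y[j] * (alpha[j] - aj_old) * K[j, j])
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```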

Not Linearly Separable

[Figure: data that is not linearly separable in the original space becomes separable after a transformation h(·)]

Non Linear SVMs

- Map data to a higher dimension where linear separation is possible
- We can get a longer feature vector by adding dimensions, e.g.

$x \rightarrow (x^2, x)$

$\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2, \sqrt{2}\, x_1, \sqrt{2}\, x_2, 1)$

Kernels

Given a feature mapping $\phi(x)$, define $K(x, z) = \phi(x)^T \phi(z)$

For the mapping above:

$\phi(x)^T \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (x \cdot z + 1)^2$

So we may not need to explicitly transform the data: the kernel can be evaluated directly in the original space.
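A quick numeric check of that identity, as a sketch in NumPy (the sample points are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2-d input)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
explicit = phi(x) @ phi(z)            # dot product in the transformed space
implicit = (x @ z + 1.0) ** 2         # kernel evaluated in the original space
assert np.isclose(explicit, implicit)  # same value, no explicit transform needed
```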

Example of Kernel Functions

- Linear kernel: $K(x, z) = x \cdot z$
- Polynomial kernel: $K(x, z) = (x \cdot z + 1)^p$
- Gaussian kernel: $K(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
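The same three kernels written out as functions (a sketch; default values of p and sigma are arbitrary choices, not from the lecture):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, p=2):
    return (x @ z + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```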

Non-separable case

- Some data sets may not be linearly separable
- Introduce slack variables $\xi_i$
  - Also helps regularization
  - Less sensitive to outliers
- Minimize

$\frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i$

- Subject to

$y_i (w^T x_i + b) \geq 1 - \xi_i \quad \forall i$
$\xi_i \geq 0 \quad \forall i$
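The lecture does not tie this to any particular library, but one common way to train such a soft-margin SVM in practice is scikit-learn's SVC; the sketch below (toy data, arbitrary parameter values) is only meant to show where C, the slack-penalty weight from the objective above, appears.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data with some overlap (not linearly separable).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 1.0, rng.randn(50, 2) - 1.0])
y = np.array([1] * 50 + [-1] * 50)

# C trades off margin width against the total slack sum_i xi_i:
# small C tolerates more violations (more regularization), large C fits harder.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)
clf.fit(X, y)
print(clf.support_.shape[0], "support vectors")
```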

Summary

[Figure: features X are used to predict CLASS1 vs. CLASS2]

- Linear Classification Methods
  - Fisher's Linear Discriminant
  - Perceptron
  - Support Vector Machines

References

- Tutorials on www.svms.org/tutorial