Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and...

27
Statistical Methods for NLP Statistical Inference Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(27)

Transcript of Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and...

Page 1: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Methods for NLP

Statistical Inference

Joakim Nivre

Uppsala UniversityDepartment of Linguistics and Philology

[email protected]

Statistical Methods for NLP 1(27)

Page 2: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Stochastic Variables

I A stochastic variable (or random variable) X is a functionfrom a sample space Ω (the domain of X ) to a value spaceΩX (the range of X ).

I Examples:1. The part-of-speech of an arbitrary word from a corpus is a

stochastic variable X with Ω = w |w is a corpus token andΩX = noun, verb,adjective, . . ..

2. The sum of two dice is a stochastic variable Y withΩ = (x , y)|1 ≤ x , y ≤ 6 and ΩY = z|2 ≤ z ≤ 12.

Statistical Methods for NLP 2(27)

Page 3: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Types of Variables

I If ΩX is a subset of the set of real numbers, then X is saidto to be numerical; otherwise it is categorical.

I If ΩX is finite or countably infinite, then X is said to bediscrete.

I Examples:1. The part-of-speech X of an arbitrary word from a corpus is

a discrete, categorical variable, since ΩX is finite and notnumerical.

2. The sum Y of two dice is a discrete numerical variable,since ΩY is finite and numerical.

Statistical Methods for NLP 3(27)

Page 4: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Frequency Functions

I The probability P(X = x) of variable X assuming value x isgiven by the frequency function fX :

fX (x) = P(X = x)

I For discrete variables, this is equivalent to summing theprobability of all outcomes in Ω that are mapped to x by X :

fX (x) = P(u ∈ Ω|X (u) = x) =∑

u:X(u)=x

P(u)

I Example:I The probability of sampling a noun from a corpus is the

sum of the probabilities of sampling each noun.

Statistical Methods for NLP 4(27)

Page 5: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Expectation

I Let X be a discrete numerical variable with value spaceΩX . The expectation of X , E [X ], is defined as follows:

E [X ] =∑

x∈ΩX

x · fX (x)

I Example: The expectation of the sum Y of two dice:

E [Y ] =12∑

y=2

y · fY (y) =25236

= 7

Statistical Methods for NLP 5(27)

Page 6: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Variance

I Let X be a discrete stochastic variable with expectation µ.The variance of X , Var [X ], is defined as follows:

Var [X ] = E [(X − µ)2] =∑

x∈ΩX

(x − µ)2 · fX (x)

I Example: The variance of the sum Y of two dice:

Var [Y ] =12∑

y=2

(y − 7)2 · fY (y) =21036≈ 5.83

I If X is a variable with variance σ2, then σ =√σ2 is the

standard deviation of X .

Statistical Methods for NLP 6(27)

Page 7: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Entropy

I Let X be a discrete stochastic variable. The entropy of X ,H[X ], is defined as follows:

H[X ] = E [− log2 fX ] = −∑

x∈ΩX

fX (x) log2 fX (x)

I Example: The entropy of the sum Y of two dice:

H[Y ] = −12∑

y=2

fY (y) log2 fY (y) ≈ 3.27

Statistical Methods for NLP 7(27)

Page 8: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

More on Entropy

I The entropy of a variable X can be interpreted as theexpected amount of information (measured in bits) whenlearning the value of X :

IX (x) = − log2 fX (x)

I Given a finite value space ΩX of size n, entropy ismaximized if fX (x) = 1

n for all x ∈ ΩX .I Example: Entropy of the outcome Z of an 11-sided die

(2–12):

H[Z ] = −12∑

z=2

111

log21

11≈ 3.46

Statistical Methods for NLP 8(27)

Page 9: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Joint and Conditional Probability

I Let X and Y be stochastic variables with sample spacesΩ1 and Ω2 and value spaces ΩX and ΩY , respectively.

1. The joint probability of X and Y is given by their jointprobability function f(X ,Y ):

f(X ,Y )(x , y) = P(X = x ,Y = y)= P((u, v) ∈ Ω1 × Ω2|X (u) = x ,Y (v) = y)

2. The conditional probability of X given Y is given by theconditional probability function fX |Y :

fX |Y (x |y) = P(X = x |Y = y) =P(X = x ,Y = y)

P(Y = y)

Statistical Methods for NLP 9(27)

Page 10: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Independence

I Stochastic variables X and Y (defined on the sameunderlying sample space) are independent if and only ifthe following holds for all x ∈ ΩX and y ∈ ΩY :

P(X = x ,Y = y) = P(X = x)P(Y = y)

I Corollary: If X and Y are independent variables then thefollowing conditions hold (for all x ∈ ΩX and y ∈ ΩY ):

1. P(X = x |Y = y) = P(X = x)2. P(Y = y |X = x) = P(Y = y)

Statistical Methods for NLP 10(27)

Page 11: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Part-of-Speech Bigrams 1

I Let (X1,X2) be the parts-of-speech of an arbitrary bigramand let the following probabilities be given:

1. P(X1 = noun) = P(X2 = noun) = 0.22. P(X1 = adj) = P(X2 = adj) = 0.053. P(X1 = det|X2 = noun) = 0.34. P(X1 = det|X2 = adj) = 0.65. P(X1 = det|X2 6∈ noun,adj) = 0

I Question: What is P(X2 = noun|X1 = det)?

Statistical Methods for NLP 11(27)

Page 12: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Part-of-Speech Bigrams 2

I Using Bayes’ law:

P(X2 = noun) · P(X1 = det|X2 = noun)

P(X1 = det)

I Using the law of total probability:

P(X2 = noun) · P(X1 = det|X2 = noun)

P(X1 = d|X2 = n) · P(X2 = n) + P(X1 = d|X2 = a) · P(X2 = a)

I Putting in the numbers:

0.2 · 0.30.3 · 0.2 + 0.6 · 0.05

= 0.67

Statistical Methods for NLP 12(27)

Page 13: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Stochastic Variables

Part-of-Speech Bigrams 3

I Consider:1. P(X1 = det) = 0.092. P(X2 = noun) = 0.23. P(X1 = d,X2 = n) = P(X1 = d) · P(X2 = n|X1 = d) = 0.064. P(X1 = det) · P(X2 = noun) = 0.2 · 0.09 = 0.0185. 0.018 6= 0.06

I Conclusion: X1 and X2 are not independent variables.

Statistical Methods for NLP 13(27)

Page 14: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Statistical Inference

I Statistical inference is the science of making predictions orinferences from finite sets of observations (samples) to(potentially infinite) sets of new observations (populations).

I Two main kinds of statistical inference:1. Estimation: Use samples and sample variables to predict

population variables.2. Hypothesis testing: Use samples and sample variables to

test hypotheses about population variables.I Note: In statistical modeling, we often talk about models

instead of populations.

Statistical Methods for NLP 14(27)

Page 15: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Sampling

I Let X be a stochastic variable.1. A vector (X1, . . . ,Xn) of independent variables Xi with the

same distribution as X is said to be a random sample of X.2. A value vector (x1, . . . , xn) such that X1 = x1, . . . ,Xn = xn in

a particular experiment is called a statistical material.I Example:

I Consider a corpus C consisting of words (w1, . . . ,wn).I Can we regard C as a statistical material resulting from a

sample (W1, . . . ,Wn) of the word variable W?I Why (not)?

Statistical Methods for NLP 15(27)

Page 16: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Sample Variables

I Given a random sample of a variable X , we can define newstochastic variables that are functions of the sample, calledsample variables:

1. The sample mean: X n =1n

n∑i=1

Xi

2. The sample variance: sn2 =

1(n − 1)

n∑i=1

(Xi − X n)2

I These variables are called sample variables to distinguishthem from the expectation µ and (true) variance σ2 of X ,which are called population variables or model parameters.

Statistical Methods for NLP 16(27)

Page 17: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Estimation

I Two kinds of estimation:1. Point estimation: Use sample variable f (X1, . . . ,Xn) to

estimate parameter φ.2. Interval estimation: Use sample variables f1(X1, . . . ,Xn) and

f2(X1, . . . ,Xn) to construct an interval such thatP(f1(X1, . . . ,Xn) < φ < f2(X1, . . . ,Xn)) = p, where p is theconfidence level adopted.

Statistical Methods for NLP 17(27)

Page 18: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Maximum Likelihood Estimation (MLE)

I Given a statistical material x1, . . . , xn and a set ofparameters θ, the likelihood function L is:

L(x1, . . . , xn, θ) =n∏

i=1

Pθ(xi)

where Pθ(xi) is the probability that the variable Xi assumesthe value xi given a set of values for the parameters in θ.

I Maximum likelihood estimation means choosing θ so thatthe likelihood function is maximized:

maxθ

L(x1, . . . , xn, θ)

Statistical Methods for NLP 18(27)

Page 19: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

MLE: Example 1

I Given a random sample (X1, . . . ,Xn) of a numericalvariable X , the sample mean X n is a maximum likelihoodestimate of the expectation E [X ].

I The average sentence length X in a certain type of textcan be estimated with the mean sentence length in arepresentative sample:

E [X ] = X n

Statistical Methods for NLP 19(27)

Page 20: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

MLE: Example 2

I Given a random sample (X1, . . . ,Xn) of a categoricalvariable X , the relative frequency of the value x , fn(x), is amaximum likelihood estimate of the probability P(X = x).

I The probability of an arbitrary word from a text being anoun can be estimated with the relative frequency of nounsin a suitable corpus of texts:

P(noun) = fn(noun)

Statistical Methods for NLP 20(27)

Page 21: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

The Rationale of MLE

I We want to choose the most probable model given thedata:

P(θ|x1, . . . , xn) =P(x1, . . . , xn|θ)P(θ)

P(x1, . . . , xn)

argmaxθ

P(θ|x1, . . . , xn) = argmaxθ

P(x1, . . . , xn|θ)P(θ)

I If we assume a uniform distribution for P(θ), then

argmaxθ

P(θ|x1, . . . , xn) = argmaxθ

P(x1, . . . , xn|θ)

I The status of P(θ) is controversial in statistical theory(Bayesians vs. Frequentists)

Statistical Methods for NLP 21(27)

Page 22: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

MLE and Smoothing

I MLE is usually a good solution to the estimation problem ifthe statistical material is large enough.

I For language data, MLE is often suboptimal because ofsparse data and requires smoothing (or regularization).

I Example:I Additive smoothing:

Padd(X = x) =fn(x) + m

n + m · |ΩX |

where m is a constant (usually m ≤ 1).

I Note: Padd(X = x) 6= 0.

Statistical Methods for NLP 22(27)

Page 23: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Interval Estimation

I In general, we can derive a 95% confidence interval for ourmaximum likelihood estimate µ of a mean as follows:

µ± 1.96

√σ2

n

I Examples:

1. Sentence length: E [X ] = X n ± 1.96√

σ2

n

2. Noun probability: P(noun) = fn(noun)± 1.96√

σ2

n

I Note:I The interval grows with increasing variance σ2.I The interval shrinks with the sample size n.

Statistical Methods for NLP 23(27)

Page 24: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

More on Interval Estimation

I Where does the number 1.96 come from?

I Assumptions:1. Parameter has a normal distribution – okay for large n.2. True variance σ2 is known – usually not the case.

Statistical Methods for NLP 24(27)

Page 25: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Hypothesis Testing

I When are differences in sample variables f (x1, . . . , xm) andf (y1, . . . , yn) significant?

I Do they reflect differences in population variables ξ and υ?I Null hypothesis H0: ξ = υ

I Test procedure:I Choose a test statistic f (X ,Y ) whose distribution is known

when H0 is true.I Calculate the probability p of f (x , y) given H0.I If p < α, reject H0 at significance level α.

Statistical Methods for NLP 25(27)

Page 26: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Example: Z-test

I Data:I Mean sentence length in 50 novels from 1950s: X 1 = 19.3.I Mean sentence length in 50 novels from 2000s: X 2 = 16.4.I X is normally distributed with variance σ2 = 134.2.

I Test statistic:

Z =X 1 − X 2√

σ2

n1+n2

=19.3− 16.9√

134.250+50

= 2.28

I Probability calculation:I Z = 2.28 corresponds to p = P(Z = 2.28|H0) = 0.0226.I Reject H0 at α = 0.05 (but not α = 0.01).I Note: This does not mean that P(H1) > 0.95.

Statistical Methods for NLP 26(27)

Page 27: Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Statistical Inference

Tips and Tricks

I What to do in real life?I When variance is not known – use sample variance and a

t-test (instead of Z -test):

t =X 1 − X 2√

s21

n1+

s22

n2

I When distribution is not normal – use a non-parametric test.I The special case of proportions:

I If X is binary with P(X = 1) = p, then Var [X ] = p(1− p):

p = fn(1)±√

p(1−p)n Z = p1−p2r

p1(1−p1)

n1+

p2(1−p2)

n2

Statistical Methods for NLP 27(27)