Download - Statistical Methods for Corpus Analysis Xiaofei Lu APLNG 596D July 14, 2009.


Statistical Methods for Corpus Analysis

Xiaofei Lu

APLNG 596D

July 14, 2009

Overview

- Describing data
- Comparing groups
- Describing relationships

Basic concepts

- Probability experiments: jargon
  - Experiment: a situation for which the outcomes occur randomly
  - Sample space (Ω): the set of all possible outcomes
  - Outcome (ω): a point in the sample space
  - Event: a subset of the sample space

Example 1

- Experiment: toss a fair die
- 6 outcomes: 1, 2, 3, 4, 5, 6
- Sample space = Ω = {1, 2, 3, 4, 5, 6}
- An event is any subset of the sample space
  - “An even number is rolled”: A = {2, 4, 6}, P(A) = 1/2
  - “A 3 is rolled”: B = {3}, P(B) = 1/6

Example 2

- Experiment: toss 2 fair dice
- Outcomes: ordered pairs (x, y); x and y are results of the 1st and 2nd toss respectively
- Sample space = Ω = set of such ordered pairs = {(x, y) | x = 1, 2, …, or 6 and y = 1, 2, …, or 6}
- An event is any subset of Ω, e.g., “sum is 7”
  - A = {(x, y) | x + y = 7} = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}
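The two-dice example can be checked by brute-force enumeration. The following is a minimal sketch, assuming all 36 ordered pairs are equally likely; the variable names are illustrative and not part of the slides.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (x, y) from two tosses of a fair die.
omega = list(product(range(1, 7), repeat=2))

# Event A: "sum is 7".
event_a = [(x, y) for (x, y) in omega if x + y == 7]

# With equally likely outcomes, P(A) = |A| / |Omega|.
p_a = Fraction(len(event_a), len(omega))

print(event_a)
print(p_a)
```

Using `Fraction` keeps the probability exact (1/6) instead of a rounded float.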

More jargon

- A = “sum is 7”; B = “first toss is an odd number”
- Union of two events
  - The event C that either A or B occurs or both occur, C = A∪B
- Intersection of two events
  - The event C that both A and B occur, C = A∩B
- Complement of an event
  - The event that A does not occur
- Disjoint events
  - Two events with no common outcome

Independence

- Two events A and B are independent if knowing that one has occurred gives no information about whether the other has occurred
  - P(A∩B) = P(A)P(B)
- Outcomes of two successive tosses of an unbiased coin
  - P(2 heads) = P(A=1 ∩ B=1) = P(A=1) × P(B=1) = 1/4

Random variable

- Random variable (X)
  - Essentially a random number
  - Formally a function from Ω to the real numbers
- Discrete random variable
  - A random variable that can take on only a finite or a countably infinite number of possible values

Example 3

- Experiment: toss a biased coin 3 times
- Bias: P(Heads) = 0.6
- Ω = {hhh, hht, htt, hth, ttt, tth, thh, tht}
- X = total number of heads in the 3 tosses
- X is a r.v., a function from Ω to the real numbers with possible values (x) 0, 1, 2, 3

Example 3 (cont.)

- P(X=0) = 0.064
- P(X=1) = 0.288
- P(X=2) = 0.432
- P(X=3) = 0.216

ω     X(ω)   P{ω}
HHH   3      (0.6)^3
HHT   2      (0.6)^2 (0.4)
HTH   2      (0.6)^2 (0.4)
HTT   1      (0.6) (0.4)^2
THH   2      (0.6)^2 (0.4)
THT   1      (0.6) (0.4)^2
TTH   1      (0.6) (0.4)^2
TTT   0      (0.4)^3
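The probabilities for X can be reproduced by summing the outcome probabilities that share the same number of heads. This is a small sketch, assuming the three tosses are independent; `P_HEADS` and `pmf` are names chosen here for illustration.

```python
from itertools import product

P_HEADS = 0.6  # bias given in Example 3

# Probability mass function of X = number of heads in 3 tosses.
pmf = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
for outcome in product("HT", repeat=3):
    heads = outcome.count("H")
    # Independent tosses: multiply the per-toss probabilities.
    prob = P_HEADS ** heads * (1 - P_HEADS) ** (3 - heads)
    pmf[heads] += prob

for x in sorted(pmf):
    print(x, round(pmf[x], 3))
```

The four values sum to 1, as any probability mass function must.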

Random variable (cont.)

- Continuous random variable
  - A random variable that can take an uncountably infinite number of possible values, e.g., height
  - Defined over an interval of values, e.g., (0,2], and represented by the area under a curve
  - The probability of observing any single value is equal to 0

Probability distribution

- Describes the possible values of a random variable and their probabilities
  - Probability mass function (discrete)
  - Probability density function (continuous)

Descriptive vs. inferential statistics

- Descriptive statistics
  - Summarize important properties of observed data
  - Measures of central tendency
  - Measures of variability
- Inferential statistics
  - The use of statistics to make inferences concerning some unknown aspect of a population
  - Hypothesis testing

Measures of central tendency

- The most typical score for a data set
- The mode
  - The most frequently obtained score in a data set, e.g., 4 in (2, 4, 4, 7, 8)
- The median
  - Central score in a sample with an odd number of items, e.g., 4 in (2, 4, 4, 7, 8)
  - Average of the two central scores in a sample with an even number of items, e.g., (4+7)/2 = 5.5 in (2, 4, 4, 7, 8, 100)

Measures of central tendency (cont.)

- The mean
  - The average of all scores in a data set, e.g., 5 for (2, 4, 4, 7, 8)
- Disadvantage of the mean
  - Affected by extreme values, e.g., 23.4 for (2, 4, 4, 7, 100)
  - What is a more suitable measure in such cases?
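The three measures can be computed with Python's standard `statistics` module. The sketch below uses the example data sets from the slides; it also suggests one answer to the question above, since the median barely moves when an extreme value is present.

```python
import statistics

scores = [2, 4, 4, 7, 8]
scores_even = [2, 4, 4, 7, 8, 100]
scores_outlier = [2, 4, 4, 7, 100]

mode = statistics.mode(scores)                # most frequent score
median_odd = statistics.median(scores)        # central score
median_even = statistics.median(scores_even)  # average of two central scores
mean = statistics.mean(scores)

# The mean is pulled up by the extreme value; the median is not.
mean_outlier = statistics.mean(scores_outlier)
median_outlier = statistics.median(scores_outlier)

print(mode, median_odd, median_even, mean)
print(mean_outlier, median_outlier)
```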

Measures of variability

- Statistical dispersion in a r.v. or probability distribution
- Range
  - Highest value minus lowest value: 8 - 2 = 6 for (2, 4, 4, 7, 8)
  - Affected by extreme scores: 100 - 2 = 98 for (2, 4, 4, 7, 100)
- Inter-quartile range (IQR): difference between
  - The value ¼ of the way from the top, and
  - The value ¼ of the way from the bottom
- Semi inter-quartile range: ½ of the IQR

Measures of variability (cont.)

- The variance
  - Considers distance of every data item from mean
  - Population variance

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

  - Sample variance: (n-1) indicates degrees of freedom

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Measures of variability (cont.)

- The standard deviation
  - The most common measure of statistical dispersion
  - Standard deviation of a random variable

$$\sigma = \sqrt{E[(X - \mu)^2]}$$

  - Sample standard deviation: N-1 indicates d.o.f.

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
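The difference between the population and sample versions is only the divisor. A brief sketch with the standard `statistics` module, reusing the data set (2, 4, 4, 7, 8) from the earlier slides as an assumed example:

```python
import statistics

scores = [2, 4, 4, 7, 8]

value_range = max(scores) - min(scores)

# Population versions divide the sum of squared deviations by N.
pop_variance = statistics.pvariance(scores)
pop_sd = statistics.pstdev(scores)

# Sample versions divide by N - 1 (the degrees of freedom).
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print(value_range, pop_variance, sample_variance)
print(round(pop_sd, 3), round(sample_sd, 3))
```

Here the sum of squared deviations from the mean (5) is 24, so the population variance is 24/5 and the sample variance is 24/4.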

Shape of a distribution

- Asymmetrical distribution
  - Positively (or right) skewed distribution
  - Negatively (or left) skewed distribution
- Symmetrical distribution
  - Normal distribution (single modal peak)
    - mode = median = mean
    - Assumed by many statistical tests in corpus linguistics
  - Bimodal distribution

Normal distribution

- A statistical distribution N(μ, σ) with the following probability density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

- Parameters: mean μ and standard deviation σ (variance σ²)
- e is a mathematical constant
- Density is bell-shaped, symmetric about μ
- Standard normal distribution: μ = 0, σ = 1

Central limit theorem

- The theorem
  - When samples are repeatedly drawn from a population, the means of the samples will be approximately normally distributed around the population mean, provided the samples are large enough
  - This occurs even if the distribution of the data in the population is not normal
- This makes the normal distribution important
  - The distribution of IQ scores
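A quick simulation can make the theorem concrete. The sketch below is an illustration only: the exponential population (mean 1, strongly skewed), the sample size of 50, and the 2000 repetitions are all assumptions chosen here, not values from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed (exponential) population with mean 1."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Repeatedly draw samples of size 50 and record each sample mean.
sample_means = [sample_mean(50) for _ in range(2000)]

grand_mean = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)

print(round(grand_mean, 2), round(spread, 2))
```

The sample means cluster around the population mean of 1, and their spread should be close to the population standard deviation divided by the square root of the sample size (about 0.14 here), even though the population itself is far from normal.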

Properties of the normal curve

- Shape of curve defined by μ and σ
- Important property
  - For any normal curve, if we draw a vertical line through it at any number of standard deviations away from the mean, the proportions of the area under the curve are always the same

The z score

- A measure of how far a given value is from the mean, expressed as a number of s.d.’s

$$z = \frac{x - \mu}{\sigma}$$

- How probable a z score is for any test
  - Measured by proportion of the total area under the tail of the curve which lies beyond a given z value
  - Consult the z score table

Example 4

- Mean frequency of “there” in a 1000-word sample written by a given author is 10, σ = 4
- A sample contains 17 occurrences of “there”
  - z score = (17-10)/4 = 1.75
  - The area beyond the z score of 1.75 is 0.0401, or 4.01% of the total area under the curve
  - The probability of seeing a sample with more than 17 occurrences of “there” is 4.01% or less
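Instead of consulting a z table, the upper-tail area can be computed from the complementary error function. This is a minimal sketch assuming the frequencies are normally distributed; the helper names `z_score` and `upper_tail_area` are made up for this illustration.

```python
import math

def z_score(x, mu, sigma):
    """Distance of x from the mean, in standard deviations."""
    return (x - mu) / sigma

def upper_tail_area(z):
    """Area under the standard normal curve beyond z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = z_score(17, 10, 4)
p = upper_tail_area(z)

print(z, round(p, 4))
```

For a two-tailed test the same area would be doubled.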

Hypothesis testing

- Using descriptive statistics as evidence for or against experimental hypotheses
- The null hypothesis H0
  - There is no difference between the sample value and the population from which it was drawn
- The alternative hypothesis H1
  - There is a significant difference between the sample value and the population from which it was drawn
- Goal: to reject H0 with a certain level of significance (e.g., 5%)

Hypothesis testing (cont.)

- Use of statistical tests
  - Estimates the probability that the claims are wrong
  - Enables us to claim statistical significance for our results and have confidence in our claims
- One and two-tailed tests
  - One-tailed: likely direction of difference known
  - Two-tailed: nature of the difference not specified
    - If using z-score, proportions in Appendix 1 must be doubled


Comparing Groups

Xiaofei Lu

APLNG 596D

Outline

- Basic concepts
- Parametric comparisons of two groups
- Non-parametric comparisons of two groups
- Comparisons between three or more groups

Basic concepts

- Types of scales of measurement
- Independent and dependent variables
- Parametric and non-parametric tests
- Population mean
- Between-groups and repeated measures design
- One-sample and two-sample studies

Types of scales of measurement

- Ratio scale: units on the scale are the same, and there is a true zero point
  - Measurement in meters
- Interval scale: units on the scale are the same, but the zero point is arbitrary
  - Centigrade scale of temperature
- Ordinal scale: records order only
  - Ranks in a contest
- Nominal scale: categorical data
  - Part-of-speech categories

Independent and dependent variables

- Independent variables: what do I change?
- Dependent variables: what do I observe?
- Controlled variables: what do I keep the same?

Two examples

- Effect of education on income
  - Independent variable: academic degree of the individual
  - Dependent variable: level of income of the individual measured in monetary units
- Effect of sentence complexity on recall
  - Independent variable: sentence complexity
  - Dependent variable: amount of sentence correctly recalled

Parametric tests

- Dependent variables are ratio-/interval-scored
- Observations should be independent
- Often assumes normal distribution of data
  - Mean an appropriate measure of central tendency
  - Standard deviation an appropriate measure of variability
- Works with any distribution with parameters

Non-parametric tests

- Do not assume normal distribution of data
  - Best for small samples that are not normally distributed
- Work with rank-ordered scales and frequencies

Population mean

- Sampling distribution of means
  - A distribution made up of group means
  - Describes a symmetric curve
  - Group means within a population closer to each other than individual scores to group mean
- Population mean
  - The average of a group of means

Experimental design

- Between-groups design
  - Data comes from two different groups
- Repeated measures design
  - Data is the result of two or more measures taken from the same group

One-sample and two-sample studies

- One-sample studies
  - Compare group mean with population mean
  - Determine whether group mean differs from population mean
- Two-sample studies
  - Compare means from two different groups (experimental and control group)
  - Determine whether these means differ for reasons other than pure chance

Parametric comparison of two groups

- The t test for independent samples
- The matched pairs t test

The t test for independent samples

- Tests difference between two groups
- Normally-distributed interval data
- Mean and standard deviation good measures of central tendency and variability
- Especially useful for small samples (N<30)

One-sample t test

- H0: no significant difference between group mean and population mean
- Computing the t statistic (in SPSS)

$$t_{obs} = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

  - Standard error of the means: s/√n
  - s: standard deviation of the sample group
  - n: sample size

Corpus linguistics example

- A balanced corpus
  - Mean verbs per sentence: 2.5; s.d. = 1.2
- A 100-sentence specialized subcorpus
  - Mean verbs per sentence: 3.5; s.d. = 1.6
- t statistic: (3.5-2.5)/(1.6/10) = 6.25
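The same calculation can be written as a small function. This is a sketch of the one-sample t statistic under the slide's setup (the sample standard deviation divided by the square root of the sample size as the standard error); `one_sample_t` is a name chosen here, not a library call.

```python
import math

def one_sample_t(sample_mean, population_mean, sample_sd, n):
    """Observed t: difference of means over the standard error of the mean."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Subcorpus: 100 sentences, mean 3.5 verbs, s.d. 1.6; balanced corpus mean 2.5.
t_obs = one_sample_t(3.5, 2.5, 1.6, 100)
degrees_of_freedom = 100 - 1

print(round(t_obs, 2), degrees_of_freedom)
```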

Corpus linguistics example (cont.)

- Consult the t table
  - Two-tailed test (non-directional)
  - Degrees of freedom: (n-1) = (100-1) = 99
    - Use next lower value in the table: 90
  - Significance level: go with 0.05 or 0.01
  - Critical value: 1.987 (for 0.05) or 2.632 (for 0.01)
- Observed value of t (6.25) is greater than 2.632
- Can reject H0 at the 1 percent significance level

Two-sample t test

- H0: difference between 2 groups expected for any 2 means in a population due to chance
- Show that the difference falls in the extreme left or right tail of the t distribution
- Standard error of differences between the means (subscripts e and c: experimental and control group)

$$t = \frac{\bar{x}_e - \bar{x}_c}{\sqrt{\dfrac{s_e^2}{n_e} + \dfrac{s_c^2}{n_c}}}$$

- The sign of t depends only on the order of subtraction; a two-tailed test compares the absolute value of t with the critical value

Corpus linguistics example

Number of errors of a specific type in each of 15 equal-length essays

Control group: 8 essays produced by students learning by traditional methods

Experimental group: 7 essays produced by students learning by a novel method

Corpus linguistics example (cont.)

- t = (6-3)/sqrt((2.27*2.27/7)+(2.21*2.21/8)) = 2.584
- Degrees of freedom = (8-1)+(7-1) = 13
- Critical value of t for a two-tailed test at the 5 percent significance level for 13 d.o.f. is 2.16
- Observed t is greater than 2.16; difference is significant

               n   Mean   Standard deviation
Control        8   6      2.21
Experimental   7   3      2.27
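The calculation from the summary table can be sketched as follows. The function name `two_sample_t` is an assumption for this illustration, and the means are subtracted in the same order as on the slide (control minus experimental). The unrounded result is about 2.585, so the last digit may differ slightly from the slide's 2.584, presumably because of rounding in the inputs.

```python
import math

def two_sample_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """t for independent samples: mean difference over its standard error."""
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error

# Control: n=8, mean=6, s.d.=2.21; experimental: n=7, mean=3, s.d.=2.27.
t_obs = two_sample_t(6, 2.21, 8, 3, 2.27, 7)
degrees_of_freedom = (8 - 1) + (7 - 1)

CRITICAL_T = 2.16  # two-tailed, 5 percent level, 13 d.o.f. (from the slide)
significant = abs(t_obs) > CRITICAL_T

print(round(t_obs, 3), degrees_of_freedom, significant)
```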

Some caveats

The matched pairs t test should be used for repeated measures designs (correlated samples)

A non-parametric test should be used if data is very skewed and not normally-distributed

A parametric test for comparing 3 or more groups should be used to cross-compare groups

The matched pairs t test

- Comparing paired or correlated samples
  - Not independent but closer to each other than random samples
- A feature observed under 2 different conditions
  - Same students tested before and after taking class
- Pairs of subjects matched according to any characteristic
  - Studying husbands and wives rather than random samples

The matched pairs t test (cont.)

$$t = \frac{\sum d_i}{\sqrt{\dfrac{N \sum d_i^2 - \left(\sum d_i\right)^2}{N - 1}}}$$

- d_i denotes the difference between the ith pair
- N denotes the number of pairs of observations

Corpus linguistics example

- Lengths of the vowels produced by 10 speakers in two different consonant environments
- t = -2.95; d.o.f. = 9
- Critical value of t for a two-tailed test at the 2 percent significance level for 9 d.o.f. is 2.821
- The absolute value of t (2.95) is greater than 2.821; difference is significant at the 2 percent level

ID   E1   E2   d
1    22   26   -4
2    18   22   -4
3    26   27   -1
4    17   15    2
5    19   24   -5
6    23   27   -4
7    15   17   -2
8    16   20   -4
9    19   17    2
10   25   30   -5
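The t value for the vowel-length data can be recomputed directly from the paired differences. This is a sketch using the matched pairs formula with the sum of differences in the numerator; the variable names are illustrative.

```python
import math

e1 = [22, 18, 26, 17, 19, 23, 15, 16, 19, 25]
e2 = [26, 22, 27, 15, 24, 27, 17, 20, 17, 30]

# Difference for each speaker (environment 1 minus environment 2).
d = [a - b for a, b in zip(e1, e2)]
n = len(d)

sum_d = sum(d)
sum_d_squared = sum(x * x for x in d)

# t = sum(d) / sqrt((N * sum(d^2) - (sum d)^2) / (N - 1))
t_obs = sum_d / math.sqrt((n * sum_d_squared - sum_d ** 2) / (n - 1))
degrees_of_freedom = n - 1

print(sum_d, sum_d_squared, round(t_obs, 2), degrees_of_freedom)
```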

Non-parametric comparisons of two groups

- Used in two-sample studies where the assumptions of the t test do not hold
- Between-group design (independent samples)
  - The Wilcoxon rank sums test
- Repeated measures design (correlated/paired samples)
  - The Wilcoxon matched pairs signed rank test

The Wilcoxon rank sums test

- Also known as the Mann-Whitney U test
- Useful for comparing ordinal rating scales
  - Combine and rank scores for two groups
  - Calculate the sum of ranks in the smaller group (R1)
  - Calculate the sum of ranks in the larger group (R2)
  - U = the smaller of U1 and U2

$$U_1 = N_1 N_2 + \frac{N_1 (N_1 + 1)}{2} - R_1$$

$$U_2 = N_1 N_2 + \frac{N_2 (N_2 + 1)}{2} - R_2$$

The Wilcoxon rank sums test (cont.)

- If N1 ≥ 20 and N2 ≥ 20, can compute z score
- Let N = N1 + N2

$$z = \frac{2 R_1 - N_1 (N + 1)}{\sqrt{\dfrac{N_1 N_2 (N + 1)}{3}}}$$

Corpus linguistics example

- Questionnaire distributed to 2 student groups
  - Group 1: Computer-taught
  - Group 2: Classroom-taught
  - Question: ‘How hard/useful did you find the task?’
  - Answer: Likert scale (1-5), 1=very hard; 5=very easy
- Data processing
  - Aggregate scores found for each subject
  - Combined scores from 2 groups ranked
  - Tied scores given the average of the ranks they occupy

Corpus linguistics example (cont.)

- H0: no difference between 2 groups

G2   Rank   G1   Rank
14   1      10   6
12   2.5    10   6
12   2.5    10   6
11   4      8    10.5
9    8.5    7    12
9    8.5    6    13
8    10.5   -    -
R2 = 37.5   R1 = 53.5

$$U_1 = 6 \times 7 + \frac{6(6+1)}{2} - 53.5 = 9.5$$

$$U_2 = 6 \times 7 + \frac{7(7+1)}{2} - 37.5 = 32.5$$

- U = the smaller of U1 and U2 = 9.5
- Calculate level of significance
- Cannot reject H0
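The rank sums and U values for the questionnaire data can be reproduced in a few lines. This is a sketch, assuming (as in the table) that scores are ranked from highest to lowest and tied scores share the average of their ranks; `average_ranks` is a helper written for this illustration.

```python
def average_ranks(values):
    """Map each score to its rank (1 = highest), averaging tied ranks."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1      # first position held by v
        count = ordered.count(v)          # number of tied scores
        ranks[v] = first + (count - 1) / 2
    return ranks

g1 = [10, 10, 10, 8, 7, 6]        # smaller group
g2 = [14, 12, 12, 11, 9, 9, 8]    # larger group

ranks = average_ranks(g1 + g2)
r1 = sum(ranks[v] for v in g1)
r2 = sum(ranks[v] for v in g2)

n1, n2 = len(g1), len(g2)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)

print(r1, r2, u1, u2, u)
```

A useful sanity check is that U1 + U2 always equals N1 × N2 (here 42).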

The Wilcoxon matched pairs signed ranks test

- Used on interval level of measurement
- Ranks differences between pairs of observations
- Considers both direction and degree of difference
- Procedure
  - Obtain matched pairs of scores
  - Calculate difference for each pair
  - Rank differences according to absolute magnitude
  - Find the sum of the negative ranks and the sum of the positive ranks

The Wilcoxon matched pairs signed ranks test (cont.)

- Consult a significance table
  - W = smaller of the sum of negative/positive ranks
  - N = number of pairs with a difference
  - W should be smaller than or equal to critical value for the difference to be significant
- If N ≥ 25, can compute z score

$$z = \frac{W - N(N+1)/4}{\sqrt{\dfrac{N(N+1)(2N+1)}{24}}}$$

Corpus linguistics example

- # of errors in translating 2 passages into French
- W = 6.5
  - Sum of positive ranks: 6.5
  - Sum of negative ranks: 38.5
- N = 9
- Critical value is 5 (2-tailed, p=0.05, N=9)
- W = 6.5 > 5, so H0 cannot be rejected

Subj   A    B    A-B   Rank
1      8    10   -2    -4.5
2      7    6    +1    +2
3      4    4    0     -
4      2    5    -3    -7.5
5      4    7    -3    -7.5
6      10   11   -1    -2
7      17   15   +2    +4.5
8      3    6    -3    -7.5
9      2    3    -1    -2
10     11   14   -3    -7.5
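The signed-rank procedure for the translation data can be sketched as below. It assumes the convention shown in the table: pairs with zero difference are dropped, and tied absolute differences receive the average of their ranks. The helper `average_rank` is written for this illustration.

```python
a = [8, 7, 4, 2, 4, 10, 17, 3, 2, 11]
b = [10, 6, 4, 5, 7, 11, 15, 6, 3, 14]

# Keep only pairs that actually differ.
diffs = [x - y for x, y in zip(a, b) if x != y]
n = len(diffs)

# Rank the absolute differences (1 = smallest), averaging ties.
abs_sorted = sorted(abs(d) for d in diffs)

def average_rank(value):
    first = abs_sorted.index(value) + 1
    count = abs_sorted.count(value)
    return first + (count - 1) / 2

positive_sum = sum(average_rank(abs(d)) for d in diffs if d > 0)
negative_sum = sum(average_rank(abs(d)) for d in diffs if d < 0)
w = min(positive_sum, negative_sum)

CRITICAL_W = 5  # two-tailed, p=0.05, N=9 (from the slide)
significant = w <= CRITICAL_W

print(n, positive_sum, negative_sum, w, significant)
```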

Comparisons between three or more groups

- Analysis of variance (ANOVA)
  - A method of testing for significant differences between means of more than 2 samples
  - H0: samples taken from populations with same mean; no significant difference between samples
  - Samples not from same population if
    - Between-groups variance significantly greater than within-groups variance

ANOVA

- Between-groups variance
  - Sum of squared difference between each sample mean and overall mean weighted by sample size
  - Normalized by degrees of freedom (what is it?)
- Within-groups variance
  - Sum of squared difference between each score in each sample and the corresponding sample mean
  - Normalized by degrees of freedom (what is it?)

ANOVA (cont.)

- The F ratio

$$F = \frac{\text{Between-groups variance}}{\text{Within-groups variance}}$$

- Consult an ANOVA significance table
  - Degrees of freedom in numerator
    - # of groups - 1
  - Degrees of freedom in denominator
    - # of data items in all groups - # of groups
  - F value smaller than critical value: H0 cannot be rejected

Corpus linguistics example

- # of words 3 poets fit into a heroic couplet
- 3 samples with 5 couplets each
- Overall mean: 240/15 = 16
- Bgv = (5*0+5*1+5*1)/(3-1) = 5
- Wgv = [(1+4+1+1+1)+(1+1+1+1+4)+(0+1+4+1+0)]/(15-3) = 1.833
- F = 5/1.83 = 2.73
- Critical value is 3.89 (Df 2&12, p≤0.10, 2-tailed); H0 cannot be rejected

       S1   S2   S3
1      17   16   17
2      18   14   18
3      15   14   15
4      15   14   18
5      15   17   17
mean   16   15   17
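The between-groups and within-groups variances for the three poets can be recomputed from the raw counts. This is a sketch of a one-way ANOVA done by hand; the dictionary layout and variable names are choices made for this illustration.

```python
samples = {
    "S1": [17, 18, 15, 15, 15],
    "S2": [16, 14, 14, 14, 17],
    "S3": [17, 18, 15, 18, 17],
}

all_scores = [x for scores in samples.values() for x in scores]
overall_mean = sum(all_scores) / len(all_scores)
k = len(samples)            # number of groups
n_total = len(all_scores)   # number of data items in all groups

def mean(values):
    return sum(values) / len(values)

# Between groups: squared distance of each sample mean from the overall
# mean, weighted by sample size, divided by (k - 1).
ss_between = sum(len(s) * (mean(s) - overall_mean) ** 2 for s in samples.values())
bgv = ss_between / (k - 1)

# Within groups: squared distance of each score from its own sample mean,
# divided by (n_total - k).
ss_within = sum((x - mean(s)) ** 2 for s in samples.values() for x in s)
wgv = ss_within / (n_total - k)

f_ratio = bgv / wgv
print(overall_mean, bgv, round(wgv, 3), round(f_ratio, 2))
```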


Describing Relationships

Xiaofei Lu

APLNG 596D

Outline

- The chi-square test
- Correlation

The chi-square test

- Dealing with nominal data
  - Facts that can be sorted into categories
  - Measured as frequencies
- Significant differences between frequencies?
- Chi-square test: a non-parametric test of relationship between frequencies
  - Compare observed frequencies with those expected on the basis of some theoretical model

The chi-square test (cont.)

- Observed value (O) and expected value (E)
  - O: Actual frequency in a cell
  - E: Expected frequency in a cell
- Computing the chi-square

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

General caveats

- Use of chi-square test is inappropriate if
  - Any expected frequency is below 1; or
  - E<5 in more than 20% of the cells
- Yates’ correction factor
  - Applicable if df = 1
  - If O>E, O=O-0.5; if O<E, O=O+0.5

One-way design

- Compare relation of frequencies for one variable
- Df = (# of cells) - 1
- E = (sum of frequencies in all cells)/(# of cells)

Example 1: one-way design

- Toss a coin 100 times; H0: Coin is fair

$$\chi^2 = \frac{(53 - 50)^2}{50} + \frac{(47 - 50)^2}{50} = 0.36$$

- Consult a chi-square table
  - Critical value is 3.84 (2-tailed, df=1, p=0.05)
  - 0.36 < 3.84; cannot reject H0

           Heads   Tails   Total
Observed   53      47      100
Expected   50      50      100
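The coin example follows the one-way recipe: equal expected frequencies in every cell, then the sum of (O − E)²/E. A minimal sketch, with variable names chosen for this illustration:

```python
observed = {"heads": 53, "tails": 47}

# One-way design: E = (sum of frequencies in all cells) / (# of cells).
total = sum(observed.values())
expected = total / len(observed)

chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
degrees_of_freedom = len(observed) - 1

CRITICAL_VALUE = 3.84  # df=1, p=0.05 (from the slide)
reject_h0 = chi_square > CRITICAL_VALUE

print(round(chi_square, 2), degrees_of_freedom, reject_h0)
```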

Two-way design

- Compare relation of frequencies for two variables
- Df = (# of columns - 1) × (# of rows - 1)

$$E = \frac{\text{row total} \times \text{column total}}{\text{grand total of items}}$$

- Contingency table
  - Tests whether two characteristics are independent or associated
  - Classifies experiment outcomes according to two criteria

Two-by-two contingency table

                   A occurs   A does not occur
B occurs           a          c
B does not occur   b          d

- Shortcut for a two-by-two contingency table

$$\chi^2 = \frac{N\left(|ad - bc| - N/2\right)^2}{(a+b)(c+d)(a+c)(b+d)}, \qquad N = a + b + c + d$$

  - Replace N/2 with 0 to remove Yates’s correction
  - Can also compute chi square using normal method

Example 2: 2-by-2 contingency table

                               Male   Female
Believe in CMC romance         36     14
Don’t believe in CMC romance   30     25

$$N = 36 + 14 + 30 + 25 = 105$$

$$\chi^2 = \frac{105\,(36 \times 25 - 14 \times 30)^2}{(36+30)(14+25)(36+14)(30+25)} = 3.418$$
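The shortcut formula can be wrapped in a function with Yates's correction as an option. This is a sketch only: `chi_square_2x2` is a name invented here, and the cell labels follow the two-by-two layout from the slides (a and b in the first column, c and d in the second). The worked value of 3.418 corresponds to the version without the correction.

```python
def chi_square_2x2(a, b, c, d, yates=True):
    """Shortcut chi-square for a 2x2 table, optionally with Yates's correction."""
    n = a + b + c + d
    correction = n / 2 if yates else 0.0
    numerator = n * (abs(a * d - b * c) - correction) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Male column: a=36 (believe), b=30 (don't); female column: c=14, d=25.
chi2_plain = chi_square_2x2(36, 30, 14, 25, yates=False)
chi2_yates = chi_square_2x2(36, 30, 14, 25, yates=True)

print(round(chi2_plain, 3))
print(chi2_yates < chi2_plain)  # the correction always shrinks the statistic
```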

What is correlation?

- Degree to which two variables are related
  - Positive: high values of X associated with high values of Y
  - Negative: high values of X associated with low values of Y
- Correlation coefficient: -1 to +1
  - +1: perfect positive correlation
  - 0: no correlation
  - -1: perfect negative correlation

Pearson’s correlation coefficient

- Assumptions: X and Y are
  - Interval or ratio-type data (continuous)
  - Independent
  - Normally distributed
  - In a linear relationship
- Useful terms to know
  - Correlation is covariance of standardized variables

Pearson’s correlation coefficient (cont.)

- Computing the coefficient

$$r_{xy} = \frac{Cov(X, Y)}{s_x s_y} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n \sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$

- Standard error of estimation

$$s_{yx} = s_y \sqrt{1 - r_{xy}^2}$$

- Partitioning the sums of squares
  - r_xy² gives the proportion of variance in one variable accounted for by the other variable (not due to chance)
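The computational formula for r translates directly into code. In the sketch below, the function name `pearson_r` and the five data points are hypothetical, chosen only to exercise the formula; they are not the salespeople data from the next example, which the slides do not list.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (the computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical illustration data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2  # proportion of variance accounted for

print(round(r, 3), round(r_squared, 2))
```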

Example 3

- Correlation between X and Y
  - X: Number of salespeople
  - Y: Total number of sales
  - r=0.921, N=5
- Significance of the correlation coefficient
  - Significance table (critical value=0.878, N=5, p=0.05, 2-tailed)
  - The t test (if N≥6)

Spearman’s rank correlation coefficient

- Used with ordinal data that can be ranked
  - X ordinal & Y continuous: convert Y to ranked data
- Computing Spearman’s rank correlation coefficient

$$\rho = 1 - \frac{6 \sum d_i^2}{N (N^2 - 1)}, \qquad d_i = \text{Rank of } x_i - \text{Rank of } y_i$$
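Spearman's formula needs only the rank differences. The sketch below assumes the data have already been converted to ranks and contain no ties (with ties the shortcut formula is only approximate); the function name `spearman_rho` and the rank lists are hypothetical and are not the data behind the next example.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho from two lists of ranks (no ties assumed)."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical ranks for five items (not from the slides).
ranks_x = [1, 2, 3, 4, 5]
ranks_y = [2, 1, 4, 3, 5]

rho = spearman_rho(ranks_x, ranks_y)
print(rho)
```

Identical rankings give ρ = +1 and exactly reversed rankings give ρ = -1, which is a quick way to check an implementation.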

Example 4

- Correlation between X and Y
  - X: rating of product quality (1-4, 4 best)
  - Y: perceived reputation of company (1-3, 3 best)
  - ρ=0.830, N=7
- Significance of the correlation coefficient
  - Table (critical value=0.786, N=7, p=0.05, 2-tailed)
  - The t test (if N≥30)

Resources

- Resources to help you learn and use SPSS
- What statistical analysis should I use?
- Statistical analyses using SPSS