Statistical Methods for Corpus Analysis Xiaofei Lu APLNG 596D July 14, 2009.

  • Statistical Methods for Corpus Analysis. Xiaofei Lu, APLNG 596D, July 14, 2009

  • *Overview
    Describing data
    Comparing groups
    Describing relationships

  • *Basic concepts
    Probability experiments jargon
      Experiment: a situation for which the outcomes occur randomly
      Sample space (Ω): the set of all possible outcomes
      Outcome (w): a point in the sample space
      Event: a subset of the sample space

  • *Example 1
    Experiment: toss a fair die
    6 outcomes: 1, 2, 3, 4, 5, 6
    Sample space = Ω = {1, 2, 3, 4, 5, 6}
    An event is any subset of the sample space
      "An even number is rolled": A = {2, 4, 6}, P(A) = 1/2
      "A 3 is rolled": B = {3}, P(B) = 1/6

  • *Example 2
    Experiment: toss 2 fair dice
    Outcomes: ordered pairs (x, y); x and y are the results of the 1st and 2nd toss respectively
    Sample space = Ω = set of such ordered pairs = {(x, y) | x = 1, 2, …, or 6 and y = 1, 2, …, or 6}
    An event is any subset of Ω, e.g., "sum is 7"
      A = {(x, y) | x + y = 7} = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}

  • *More jargon
    A = "sum is 7"; B = "first toss is an odd number"
    Union of two events
      The event C that either A or B occurs or both occur, C = A ∪ B
    Intersection of two events
      The event C that both A and B occur, C = A ∩ B
    Complement of an event
      The event that A does not occur
    Disjoint events
      Two events with no common outcome

  • *Independence
    Two events A and B are independent if knowing that one has occurred gives no information about whether the other has occurred
      P(A ∩ B) = P(A)P(B)
    Example: outcomes of two successive tosses of an unbiased coin
      P(2 heads) = P(A=1 ∩ B=1) = P(A=1)P(B=1) = 1/4

  • *Random variable
    Random variable (X)
      Essentially a random number
      Formally a function from Ω to the real numbers
    Discrete random variable
      A random variable that can take on only a finite or a countably infinite number of possible values

  • *Example 3
    Experiment: toss a biased coin 3 times
    Bias: P(Heads) = 0.6
    Ω = {hhh, hht, htt, hth, ttt, tth, thh, tht}
    X = total number of heads in the 3 tosses
    X is a r.v., a function from Ω to the real numbers with possible values (x) 0, 1, 2, 3

  • *Example 3 (cont.)
    P(x=0) = 0.064
    P(x=1) = 0.288
    P(x=2) = 0.432
    P(x=3) = 0.216

    w      X(w)   P{w}
    HHH    3      (0.6)^3
    HHT    2      (0.6)^2 (0.4)
    HTH    2      (0.6)^2 (0.4)
    HTT    1      (0.6) (0.4)^2
    THH    2      (0.6)^2 (0.4)
    THT    1      (0.6) (0.4)^2
    TTH    1      (0.6) (0.4)^2
    TTT    0      (0.4)^3
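The probabilities in Example 3 can be checked by enumerating the sample space. The following is a minimal Python sketch, assuming the three tosses are independent; the name `heads_pmf` is illustrative and does not come from the slides.

```python
from collections import defaultdict
from itertools import product

P_HEADS = 0.6  # bias given in Example 3


def heads_pmf(n_tosses=3, p=P_HEADS):
    """Return {x: P(X = x)}, where X counts heads in n_tosses independent tosses."""
    pmf = defaultdict(float)
    for outcome in product("HT", repeat=n_tosses):  # every point w in the sample space
        heads = outcome.count("H")                  # X(w)
        pmf[heads] += p ** heads * (1 - p) ** (n_tosses - heads)  # P{w}
    return dict(pmf)


pmf = heads_pmf()
for x in sorted(pmf):
    print(x, round(pmf[x], 3))
```

Summing P{w} over the outcomes that share the same X(w) gives the probability mass function listed on the slide.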

  • *Random variable (cont.)
    Continuous random variable
      A random variable that can take an uncountably infinite number of possible values, e.g., height
      Defined over an interval of values, e.g., (0, 2], and represented by the area under a curve
      The probability of observing any single value is equal to 0

  • *Probability distribution
    Describes the possible values of a random variable and their probabilities
      Probability mass function (discrete)
      Probability density function (continuous)

  • *Descriptive vs. inferential statistics
    Descriptive statistics
      Summarize important properties of observed data
      Measures of central tendency
      Measures of variability
    Inferential statistics
      The use of statistics to make inferences concerning some unknown aspect of a population
      Hypothesis testing

  • *Measures of central tendency
    The most typical score for a data set
    The mode
      The most frequently obtained score in a data set, e.g., (2, 4, 4, 7, 8)
    The median
      Central score in a sample with an odd number of items, e.g., (2, 4, 4, 7, 8)
      Average of the two central scores in a sample with an even number of items, e.g., (2, 4, 4, 7, 8, 100)

  • *Measures of central tendency (cont.)
    The mean
      The average of all scores in a data set, e.g., (2, 4, 4, 7, 8)
    Disadvantage of the mean
      Affected by extreme values, e.g., (2, 4, 4, 7, 100)
      What is a more suitable measure in such cases?
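The data sets used on these two slides can be run through Python's standard `statistics` module. This is a small sketch for checking the numbers by hand; the variable names are illustrative.

```python
import statistics

scores = [2, 4, 4, 7, 8]
even_sized = [2, 4, 4, 7, 8, 100]    # even number of items
with_outlier = [2, 4, 4, 7, 100]     # one extreme value

mode_value = statistics.mode(scores)            # most frequent score
median_odd = statistics.median(scores)          # central score
median_even = statistics.median(even_sized)     # average of the two central scores
mean_plain = statistics.mean(scores)
mean_outlier = statistics.mean(with_outlier)    # pulled upward by 100
median_outlier = statistics.median(with_outlier)  # unchanged by the extreme value

print(mode_value, median_odd, median_even)
print(mean_plain, mean_outlier, median_outlier)
```

Comparing `mean_outlier` with `median_outlier` suggests one answer to the question on the slide: the median is much less sensitive to a single extreme score.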

  • *Measures of variability
    Statistical dispersion in a r.v. or probability distribution
    Range
      Highest value minus lowest value: (2, 4, 4, 7, 8)
      Affected by extreme scores: (2, 4, 4, 7, 100)
    Inter-quartile range (IQR): difference between
      The value ¼ of the way from the top, and
      The value ¼ of the way from the bottom
    Semi inter-quartile range: ½ of the IQR

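  • *Measures of variability (cont.)
    The variance
      Considers distance of every data item from mean
      Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

    Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$; (n-1) indicates the degrees of freedom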

  • *Measures of variability (cont.)
    The standard deviation
      The most common measure of statistical dispersion
      Standard deviation of a random variable: $\sigma = \sqrt{E[(X - \mu)^2]}$

    Sample standard deviation: $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$; N-1 indicates d.o.f.
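The difference between the population and sample versions comes down to dividing by N or by N-1. A minimal sketch using the standard `statistics` module on the data set (2, 4, 4, 7, 8) from the earlier slides; the variable names are illustrative.

```python
import statistics

scores = [2, 4, 4, 7, 8]

value_range = max(scores) - min(scores)               # highest minus lowest
population_variance = statistics.pvariance(scores)    # divides by N
sample_variance = statistics.variance(scores)         # divides by N - 1
sample_sd = statistics.stdev(scores)                  # square root of the sample variance

print(value_range, population_variance, sample_variance, round(sample_sd, 3))
```

The sum of squared deviations from the mean (5) is 24, so the two variances are 24/5 and 24/4 respectively.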

  • *Shape of a distribution
    Asymmetrical distribution
      Positively (or right) skewed distribution
      Negatively (or left) skewed distribution
    Symmetrical distribution
      Normal distribution (single modal peak)
        mode = median = mean
        Assumed by many statistical tests in corpus linguistics
      Bimodal distribution

  • *Normal distribution
    A statistical distribution N(μ, σ²) with the following probability density function:
    $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

    Parameters: mean μ and variance σ²
    e is a mathematical constant
    Density is bell-shaped, symmetric about μ
    Standard normal distribution: μ = 0, σ = 1

  • *Central limit theorem
    The theorem
      When sufficiently large samples are repeatedly drawn from a population, the means of the samples will be approximately normally distributed around the population mean
      This occurs even if the distribution of the data in the population is not normal
    This makes the normal distribution important
      The distribution of IQ scores
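A small simulation can make the theorem concrete. The sketch below assumes a deliberately non-normal population (exponential with mean 1); the sample size, number of samples, and seed are arbitrary choices for illustration, not values from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50    # items per sample (assumed)
N_SAMPLES = 2000    # number of repeated samples (assumed)

# Population: exponential distribution with mean 1 and s.d. 1 (strongly skewed).
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(round(mean_of_means, 3))  # close to the population mean of 1
print(round(sd_of_means, 3))    # close to 1 / sqrt(SAMPLE_SIZE)
```

Even though individual draws are skewed, the sample means cluster symmetrically around the population mean, with a spread near the population s.d. divided by the square root of the sample size.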

  • *Properties of the normal curve
    Shape of curve defined by μ and σ
    Important property
      For any normal curve, if we draw a vertical line through it at any number of standard deviations away from the mean, the proportions of the area under the curve on either side of the line are always the same

  • *The z score
    A measure of how far a given value is from the mean, expressed as a number of s.d.s: $z = \frac{x - \mu}{\sigma}$

    How probable a z score is for any test
      Measured by the proportion of the total area under the tail of the curve which lies beyond a given z value
      Consult the z score table

  • *Example 4
    Mean frequency of "there" in a 1000-word sample written by a given author is 10, σ = 4
    A sample contains 17 occurrences of "there"
    z score = (17-10)/4 = 1.75
    The area beyond the z score of 1.75 is 0.0401, or 4.01% of the total area under the curve
    The probability of seeing a sample with more than 17 occurrences of "there" is 4.01% or less
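Instead of a printed z table, the tail area in Example 4 can be computed with `statistics.NormalDist` from the Python standard library. A minimal sketch, assuming the frequencies are normally distributed as the example does:

```python
from statistics import NormalDist

mean_freq = 10   # mean frequency of "there" per 1000-word sample
sd = 4           # standard deviation
observed = 17    # occurrences in the sample under study

z = (observed - mean_freq) / sd
upper_tail = 1 - NormalDist().cdf(z)  # area beyond z under the standard normal curve

print(z, round(upper_tail, 4))
```

For a two-tailed question the tail area would be doubled.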

  • *Hypothesis testing
    Using descriptive statistics as evidence for or against experimental hypotheses
    The null hypothesis H0
      There is no difference between the sample value and the population from which it was drawn
    The alternative hypothesis H1
      There is a significant difference between the sample value and the population from which it was drawn
    Goal: to reject H0 with a certain level of significance (e.g., 5%)

  • *Hypothesis testing (cont.)
    Use of statistical tests
      Estimates the probability that the claims are wrong
      Enables us to claim statistical significance for our results and have confidence in our claims
    One- and two-tailed tests
      One-tailed: likely direction of difference known
      Two-tailed: nature of the difference not specified
      If using z-score, proportions in Appendix 1 must be doubled

  • Comparing Groups. Xiaofei Lu, APLNG 596D

  • *Outline
    Basic concepts
    Parametric comparisons of two groups
    Non-parametric comparisons of two groups
    Comparisons between three or more groups

  • *Basic concepts
    Types of scales of measurement
    Independent and dependent variables
    Parametric and non-parametric tests
    Population mean
    Between-groups and repeated measures design
    One-sample and two-sample studies

  • *Types of scales of measurement
    Ratio scale: units on the scale are the same
      Measurement in meters
    Interval scale: the zero point is arbitrary
      Centigrade scale of temperature
    Ordinal scale: records order only
      Ranks in a contest
    Nominal scale: categorical data
      Part-of-speech categories

  • *Independent and dependent variables
    Independent variables: what do I change?
    Dependent variables: what do I observe?
    Controlled variables: what do I keep the same?

  • *Two examples
    Effect of education on income
      Independent variable: academic degree of the individual
      Dependent variable: level of income of the individual measured in monetary units
    Effect of sentence complexity on recall
      Independent variable: sentence complexity
      Dependent variable: amount of sentence correctly recalled

  • *Parametric tests
    Dependent variables are ratio-/interval-scored
    Observations should be independent
    Often assume normal distribution of data
      Mean an appropriate measure of central tendency
      Standard deviation an appropriate measure of variability
    Work with any distribution with parameters

  • *Non-parametric tests
    Do not assume normal distribution of data
    Best for small samples with no normal distribution
    Work with rank-ordered scales and frequencies

  • *Population mean
    Sampling distribution of means
      A distribution made up of group means
      Describes a symmetric curve
      Group means within a population are closer to each other than individual scores are to the group mean
    Population mean
      The average of a group of means

  • *Experimental design
    Between-groups design
      Data comes from two different groups
    Repeated measures design
      Data is the result of two or more measures taken from the same group

  • *One-sample and two-sample studies
    One-sample studies
      Compare group mean with population mean
      Determine whether group mean differs from population mean
    Two-sample studies
      Compare means from two different groups (experimental and control group)
      Determine whether these means differ for reasons other than pure chance

  • *Parametric comparison of two groups
    The t test for independent samples
    The matched pairs t test

  • *The t test for independent samples
    Tests difference between two groups
    Normally-distributed interval data
    Mean and standard deviation good measures of central tendency and variability
    Especially useful for small samples (N

  • *One-sample t test
    H0: no significant difference between group mean and population mean
    Computing the t statistic (in SPSS): $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

    Standard error of the mean: $s / \sqrt{n}$
      s: standard deviation of the sample group
      n: sample size

  • *Corpus linguistics example
    A balanced corpus
      Mean verbs per sentence: 2.5; s.d. = 1.2
    A 100-sentence specialized subcorpus
      Mean verbs per sentence: 3.5; s.d. = 1.6
    t statistic: (3.5-2.5)/(1.6/10) = 6.25

  • *Corpus linguistics example (cont.)
    Consult the t table
      Two-tailed test (non-directional)
      Degrees of freedom: (n-1) = (100-1) = 99; use next lower value, 90
      Significance level: go with 0.05 or 0.01
      Critical value: 1.987 (for 0.05) or 2.632 (for 0.01)
    Observed value of t (6.25) is greater than 2.632
    Can reject H0 at the 1 percent significance level
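The one-sample example can be reproduced in a few lines of Python. This is a sketch with an illustrative function name (`one_sample_t`); the critical values are simply copied from the slide rather than computed.

```python
import math


def one_sample_t(sample_mean, population_mean, sample_sd, n):
    """t = (sample mean - population mean) / standard error of the mean."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


# Subcorpus: 100 sentences, mean 3.5 verbs per sentence, s.d. 1.6.
# Balanced corpus mean: 2.5 verbs per sentence.
t = one_sample_t(sample_mean=3.5, population_mean=2.5, sample_sd=1.6, n=100)

CRITICAL_05 = 1.987  # two-tailed, 90 d.o.f., from the slide
CRITICAL_01 = 2.632  # two-tailed, 90 d.o.f., from the slide

reject_at_01 = abs(t) > CRITICAL_01
print(round(t, 2), reject_at_01)
```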

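  • *Two-sample t test
    H0: the difference between the 2 groups is no larger than expected for any 2 means in a population due to chance
    Show that the difference falls in the extreme left or right tail of the t distribution: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$

    Standard error of the difference between the means: $\sqrt{s_1^2/n_1 + s_2^2/n_2}$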

  • *Corpus linguistics example
    Number of errors of a specific type in each of 15 equal-length essays
      Control group: 8 essays produced by students learning by traditional methods
      Experimental group: 7 essays produced by students learning by a novel method

  • *Corpus linguistics example (cont.)

                    n    Mean    Standard deviation
    Control         8    6       2.21
    Experimental    7    3       2.27

    t = (6-3)/sqrt((2.27*2.27/7)+(2.21*2.21/8)) = 2.584
    Degrees of freedom = (8-1)+(7-1) = 13
    Critical value of t for a two-tailed test at the 5 percent significance level for 13 d.o.f. is 2.16
    Observed t is greater than 2.16; difference is significant
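The two-sample calculation can be sketched the same way. The function name `two_sample_t` is illustrative. Note that with the rounded standard deviations shown in the table the result comes out at about 2.585, marginally different from the slide's 2.584, presumably because the slide used unrounded values.

```python
import math


def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t = difference of means / standard error of the difference."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Control: n = 8, mean 6, s.d. 2.21; experimental: n = 7, mean 3, s.d. 2.27.
t = two_sample_t(6, 2.21, 8, 3, 2.27, 7)
degrees_of_freedom = (8 - 1) + (7 - 1)

CRITICAL_05 = 2.16  # two-tailed, 13 d.o.f., from the slide
significant = abs(t) > CRITICAL_05
print(round(t, 3), degrees_of_freedom, significant)
```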

  • *Some caveats
    The matched pairs t test should be used for repeated measures designs (correlated samples)
    A non-parametric test should be used if data is very skewed and not normally distributed
    A parametric test for comparing 3 or more groups should be used to cross-compare groups

  • *The matched pairs t test
    Comparing paired or correlated samples
      Not independent but closer to each other than random samples
    A feature observed under 2 different conditions
      Same students tested before and after taking class
    Pairs of subjects matched according to any characteristic
      Studying husbands and wives rather than random samples

  • *The matched pairs t test (cont.)
    $t = \frac{\sum d_i}{\sqrt{\left(N \sum d_i^2 - (\sum d_i)^2\right) / (N - 1)}}$

    d_i denotes the difference between the ith pair
    N denotes the number of pairs of observations

  • *Corpus linguistics example
    Lengths of the vowels produced by 10 speakers in two different consonant environments
    t = -2.95; d.o.f. = 9
    Critical value of t for a two-tailed test at the 2 percent significance level for 9 d.o.f. is 2.821
    |t| = 2.95 is greater than 2.821; difference is significant at the 2 percent level

    ID    E1    E2    d
    1     22    26    -4
    2     18    22    -4
    3     26    27    -1
    4     17    15    2
    5     19    24    -5
    6     23    27    -4
    7     15    17    -2
    8     16    20    -4
    9     19    17    2
    10    25    30    -5
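The matched pairs statistic for the vowel-length data can be recomputed from the pairwise differences. A minimal sketch, using the common computational form t = Σd / sqrt((NΣd² − (Σd)²)/(N − 1)); the variable names are illustrative.

```python
import math

# Vowel lengths for the 10 speakers in the two consonant environments.
e1 = [22, 18, 26, 17, 19, 23, 15, 16, 19, 25]
e2 = [26, 22, 27, 15, 24, 27, 17, 20, 17, 30]

d = [a - b for a, b in zip(e1, e2)]   # difference for each pair
n = len(d)                            # number of pairs
sum_d = sum(d)
sum_d_squared = sum(x * x for x in d)

t = sum_d / math.sqrt((n * sum_d_squared - sum_d ** 2) / (n - 1))
degrees_of_freedom = n - 1

print(round(t, 2), degrees_of_freedom)
```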

  • *Non-parametric comparisons of two groups
    Used in two-sample studies where the assumptions of the t test do not hold
    Between-group design (independent samples)
      The Wilcoxon rank sums test
    Repeated measures design (correlated/paired samples)
      The Wilcoxon matched pairs signed ranks test

  • *The Wilcoxon rank sums test
    Also known as the Mann-Whitney U test
    Useful for comparing ordinal rating scales
    Combine and rank scores for two groups
    Calculate the sum of ranks in the smaller group (R1)
    Calculate the sum of ranks in the larger group (R2)
    $U_1 = N_1 N_2 + \frac{N_1 (N_1 + 1)}{2} - R_1$, $U_2 = N_1 N_2 + \frac{N_2 (N_2 + 1)}{2} - R_2$
    U = the smaller of U1 and U2

  • *The Wilcoxon rank sums test (cont.)
    If N1 ≥ 20 and N2 ≥ 20, can compute z score
    Let N = N1 + N2: $z = \frac{R_1 - N_1 (N + 1)/2}{\sqrt{N_1 N_2 (N + 1)/12}}$

  • *Corpus linguistics example
    Questionnaire distributed to 2 student groups
      Group 1: Computer-taught
      Group 2: Classroom-taught
    Question: How hard/useful did you find the task?
    Answer: Likert scale (1-5), 1 = very hard; 5 = very easy
    Data processing
      Aggregate scores found for each subject
      Combined scores from 2 groups ranked
      Average scores given to tied ranks

  • *Corpus linguistics example (cont.)
    H0: no difference between 2 groups

    Calculate level of significance
    Cannot reject H0

    G2    Rank    G1    Rank
    14    1       10    6
    12    2.5     10    6
    12    2.5     10    6
    11    4       8     10.5
    9     8.5     7     12
    9     8.5     6     13
    8     10.5    -     -
    R2 = 37.5     R1 = 53.5
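The ranking step, including average ranks for ties, can be sketched as follows. The code assumes, as in the table, that rank 1 goes to the highest score, and it uses the standard U formulas (U1 = N1·N2 + N1(N1+1)/2 − R1, and likewise for U2); the helper name `average_ranks_descending` is illustrative.

```python
def average_ranks_descending(values):
    """Map each distinct value to its average rank, with rank 1 = highest score."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    return {value: sum(p) / len(p) for value, p in positions.items()}


group1 = [10, 10, 10, 8, 7, 6]        # G1 (smaller group)
group2 = [14, 12, 12, 11, 9, 9, 8]    # G2 (larger group)

ranks = average_ranks_descending(group1 + group2)
r1 = sum(ranks[score] for score in group1)
r2 = sum(ranks[score] for score in group2)

n1, n2 = len(group1), len(group2)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)   # compared against the critical value in a U table

print(r1, r2, u)
```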

  • *The Wilcoxon matched pairs signed ranks test
    Used on interval level of measurement
    Ranks differences between pairs of observations
    Considers both direction and degree of difference
    Procedure
      Obtain matched pairs of scores
      Calculate difference for each pair
      Rank differences according to absolute magnitude
      Find the sum of negative and positive ranks

  • *The Wilcoxon matched pairs signed ranks test (cont.)
    Consult a significance table
      W = smaller of the sums of negative and positive ranks
      N = number of pairs with a difference
      To reject H0, W should be smaller than or equal to the critical value
    If N ≥ 25, can compute z score

  • *Corpus linguistics example
    # of errors in translating 2 passages into French
    W = 6.5
      Sum of positive ranks: 6.5
      Sum of negative ranks: 38.5
    N = 9
    Critical value is 5 (2-tailed, p=0.05, N=9)
    W = 6.5 > 5; cannot reject H0

    Subj    A     B     A-B    Rank
    1       8     10    -2     -4.5
    2       7     6     +1     +2
    3       4     4     0
    4       2     5     -3     -7.5
    5       4     7     -3     -7.5
    6       10    11    -1     -2
    7       17    15    +2     +4.5
    8       3     6     -3     -7.5
    9       2     3     -1     -2
    10      11    14    -3     -7.5
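The signed ranks procedure for this table can be written out step by step. A minimal sketch: pairs with zero difference are dropped, tied absolute differences share their average rank, and the variable names are illustrative.

```python
a = [8, 7, 4, 2, 4, 10, 17, 3, 2, 11]    # errors on passage A
b = [10, 6, 4, 5, 7, 11, 15, 6, 3, 14]   # errors on passage B

# Step 1: differences, dropping pairs with no difference.
diffs = [x - y for x, y in zip(a, b) if x != y]
n = len(diffs)

# Step 2: rank by absolute magnitude (rank 1 = smallest), averaging tied ranks.
positions = {}
for position, magnitude in enumerate(sorted(abs(d) for d in diffs), start=1):
    positions.setdefault(magnitude, []).append(position)
rank_of = {magnitude: sum(p) / len(p) for magnitude, p in positions.items()}

# Step 3: sum the ranks of positive and negative differences separately.
positive_sum = sum(rank_of[abs(d)] for d in diffs if d > 0)
negative_sum = sum(rank_of[abs(d)] for d in diffs if d < 0)
w = min(positive_sum, negative_sum)

CRITICAL_VALUE = 5  # 2-tailed, p = 0.05, N = 9, from the slide
reject_h0 = w <= CRITICAL_VALUE
print(n, positive_sum, negative_sum, w, reject_h0)
```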

  • *Comparisons between three or more groups
    Analysis of variance (ANOVA)
      A method of testing for significant differences between means of more than 2 samples
      H0: samples taken from populations with same mean; no significant difference between samples
    Samples not from same population if
      Between-groups variance significantly greater than within-groups variance

  • *ANOVA
    Between-groups variance
      Sum of squared differences between each sample mean and the overall mean, weighted by sample size
      Normalized by degrees of freedom (what is it?)
    Within-groups variance
      Sum of squared differences between each score in each sample and the corresponding sample mean
      Normalized by degrees of freedom (what is it?)

  • *ANOVA (cont.)
    $F = \frac{\text{between-groups variance}}{\text{within-groups variance}}$

    Consult an ANOVA significance table
      Degrees of freedom in numerator: # of groups - 1
      Degrees of freedom in denominator: # of data items in all groups - # of groups
    F value smaller than critical value: cannot reject H0

  • *Corpus linguistics example
    # of words 3 poets fit into a heroic couplet
    3 samples with 5 couplets each
    Overall mean: 240/15 = 16
    Bgv = (5*0+5*1+5*1)/(3-1) = 5
    Wgv = [(1+4+1+1+1)+(1+1+1+1+4)+(0+1+4+1+0)]/(15-3) = 1.833
    F = 5/1.83 = 2.73
    Critical value is 3.89 (Df 2&12, p ≤ 0.10, 2-tailed); cannot reject H0

            S1    S2    S3
    1       17    16    17
    2       18    14    18
    3       15    14    15
    4       15    14    18
    5       15    17    17
    mean    16    15    17
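The between-groups and within-groups variances in this example can be computed directly from the table. A minimal sketch of a one-way ANOVA F ratio in plain Python; the variable names are illustrative.

```python
samples = [
    [17, 18, 15, 15, 15],   # S1
    [16, 14, 14, 14, 17],   # S2
    [17, 18, 15, 18, 17],   # S3
]

all_scores = [score for sample in samples for score in sample]
grand_mean = sum(all_scores) / len(all_scores)
sample_means = [sum(sample) / len(sample) for sample in samples]

# Between-groups variance: squared distance of each sample mean from the
# overall mean, weighted by sample size, divided by (# of groups - 1).
between_ss = sum(
    len(sample) * (mean - grand_mean) ** 2
    for sample, mean in zip(samples, sample_means)
)
df_between = len(samples) - 1
between_variance = between_ss / df_between

# Within-groups variance: squared distance of each score from its own sample
# mean, divided by (# of data items - # of groups).
within_ss = sum(
    (score - mean) ** 2
    for sample, mean in zip(samples, sample_means)
    for score in sample
)
df_within = len(all_scores) - len(samples)
within_variance = within_ss / df_within

f_ratio = between_variance / within_variance
print(between_variance, round(within_variance, 3), round(f_ratio, 2))
```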

  • Describing Relationships. Xiaofei Lu, APLNG 596D

  • *Outline
    The chi-square test
    Correlation

  • *The chi-square test
    Dealing with nominal data
      Facts that can be sorted into categories
      Measured as frequencies
    Significant differences between frequencies?
    Chi-square test: a non-parametric test of relationship between frequencies
      Compare observed frequencies with those expected on the basis of some theoretical model

  • *The chi-square test (cont.)
    Observed value (O) and expected value (E)
      O: Actual frequency in a cell
      E: Expected frequency in a cell
    Computing the chi-square: $\chi^2 = \sum \frac{(O - E)^2}{E}$

  • *General caveats
    Use of chi-square test is inappropriate if
      Any expected frequency is below 1; or
      E …
    If O > E, O = O - 0.5; if O …

  • *One-way design
    Compare relation of frequencies for one variable
    Df = (# of cells) - 1
    E = (sum of frequencies in all cells)/(# of cells)

  • *Example 1: one-way design
    Toss a coin 100 times; H0: Coin is fair

                Heads    Tails    Total
    Observed    53       47       100
    Expected    50       50       100

    Consult a chi-square table: critical value is 3.84 (2-tailed, df=1, p=0.05)
    0.36 < 3.84; cannot reject H0
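The 0.36 in the coin example comes from summing (O − E)²/E over the two cells. A minimal sketch with an illustrative function name; the critical value is copied from the slide.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [53, 47]                                            # heads, tails
expected = [sum(observed) / len(observed)] * len(observed)     # one-way design: equal split
degrees_of_freedom = len(observed) - 1

chi2 = chi_square(observed, expected)

CRITICAL_VALUE = 3.84  # df = 1, p = 0.05, from the slide
reject_h0 = chi2 > CRITICAL_VALUE
print(round(chi2, 2), degrees_of_freedom, reject_h0)
```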

  • *Two-way design
    Compare relation of frequencies for two variables
    Df = (# of columns - 1) × (# of rows - 1)

    Contingency table
      Tests whether two characteristics are independent or associated
      Classifies experiment outcomes according to two criteria

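  • *Two-by-two contingency table

                        A occurs    A does not occur
    B occurs            a           c
    B does not occur    b           d

    Shortcut for a two-by-two contingency table, with N = a + b + c + d:
    $\chi^2 = \frac{N \left(|ad - bc| - N/2\right)^2}{(a + b)(c + d)(a + c)(b + d)}$

    Replace N/2 with 0 to remove Yates's correction
    Can also compute chi square using the normal method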

  • *Example 2: 2-by-2 contingency table

                                   Male    Female
    Believe in CMC romance         36      14
    Don't believe in CMC romance   30      25
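The slides do not state a result for this table, so the following is only a sketch of how the two-by-two shortcut could be applied to it, with and without Yates's correction. The function name `chi_square_2x2` and the `yates` flag are illustrative; the cell labels follow the a/c (first row) and b/d (second row) layout of the contingency table above.

```python
def chi_square_2x2(a, b, c, d, yates=True):
    """Shortcut chi-square for a 2x2 table; subtracts N/2 when yates is True."""
    n = a + b + c + d
    correction = n / 2 if yates else 0
    numerator = n * (abs(a * d - b * c) - correction) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# First row of the table: a, c; second row: b, d.
a, c = 36, 14   # believe in CMC romance: male, female
b, d = 30, 25   # don't believe in CMC romance: male, female

chi2_corrected = chi_square_2x2(a, b, c, d, yates=True)
chi2_uncorrected = chi_square_2x2(a, b, c, d, yates=False)

print(round(chi2_corrected, 3), round(chi2_uncorrected, 3))
```

Either value would then be compared with the critical value for df = (2-1)(2-1) = 1.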

  • *What is correlation?
    Degree to which two variables are related
      Positive: high values of X associated with high values of Y
      Negative: high values of X associated with low values of Y
    Correlation coefficient: -1 to +1
      +1: perfect positive correlation
      0: no correlation
      -1: perfect negative correlation

  • *Pearson's correlation coefficient
    Assumptions: X and Y are
      Interval or ratio-type data (continuous)
      Independent
      Normally distributed
      In a linear relationship
    Useful terms to know
      Correlation is covariance of standardized variables

  • *Pearson's correlation coefficient (cont.)
    Computing the coefficient: $r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left(N \sum X^2 - (\sum X)^2\right)\left(N \sum Y^2 - (\sum Y)^2\right)}}$

    Standard error of estimation

    Partitioning the sums of squares

  • *Example 3
    Correlation between X and Y
      X: Number of salespeople
      Y: Total number of sales
      r = 0.921, N = 5
    Significance of the correlation coefficient
      Significance table (critical value = 0.878, N = 5, p = 0.05, 2-tailed)
      The t test (if N ≥ 6)
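The raw sales data behind Example 3 are not reproduced in the slides, so the sketch below runs on made-up numbers purely to show the computation. It assumes the usual raw-score formula for Pearson's r and the usual t statistic for testing r, t = r·sqrt((N − 2)/(1 − r²)); the function names and data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from raw scores."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


def t_for_r(r, n):
    """t statistic for a correlation coefficient, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical data, NOT the data from Example 3.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)

# Table lookup as on the slide: critical value 0.878 for N = 5, p = 0.05, 2-tailed.
significant_by_table = abs(r) > 0.878

# Hypothetical larger study: r = 0.8 from N = 10 pairs.
t = t_for_r(0.8, 10)

print(round(r, 4), significant_by_table, round(t, 3))
```

The slide's own value, r = 0.921 with N = 5, exceeds the tabled critical value of 0.878.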

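  • *Spearman's rank correlation coefficient
    Used with ordinal data that can be ranked
    X ordinal & Y continuous: convert Y to ranked data
    Computing Spearman's rank correlation coefficient: $\rho = 1 - \frac{6 \sum d^2}{N (N^2 - 1)}$, where d is the difference between the two ranks of each pair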

  • *Example 4
    Correlation between X and Y
      X: rating of product quality (1-4, 4 best)
      Y: perceived reputation of company (1-3, 3 best)
      ρ = 0.830, N = 7
    Significance of the correlation coefficient
      Table (critical value = 0.786, N = 7, p = 0.05, 2-tailed)
      The t test (if N ≥ 30)

  • *Resources
    Resources to help you learn and use SPSS
    What statistical analysis should I use?
    Statistical analyses using SPSS