Corpora and Statistical Methods Lecture 11

Albert Gatt
Corpora and Statistical Methods
Lecture 11

Part 1: Probabilistic Context-Free Grammars and beyond

Context-free grammars: reminder
Many NLP parsing applications rely on the CFG formalism.

Definition: a CFG is a 4-tuple (N, Σ, P, S):
N = a set of non-terminal symbols (e.g. NP, VP)
Σ = a set of terminals (e.g. words); N and Σ are disjoint
P = a set of productions of the form A → α, where A ∈ N and α ∈ (N ∪ Σ)* (any string of terminals and non-terminals)
S = a designated start symbol (usually, sentence)
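As a concrete illustration (not part of the original slides), the 4-tuple could be encoded in Python roughly as follows, using the rule fragment from the CFG example below; the encoding itself is an assumption for exposition:

    # A minimal, hypothetical encoding of the CFG 4-tuple (N, Sigma, P, S).
    # Rules are (LHS, RHS) pairs, with the RHS a tuple of symbols.
    nonterminals = {"S", "NP", "VP", "Aux", "Det", "Nom", "Proper-Noun"}
    terminals = {"that", "the", "a"}            # plus the rest of the vocabulary
    assert nonterminals.isdisjoint(terminals)   # N and Sigma must be disjoint

    productions = [
        ("S",   ("NP", "VP")),
        ("S",   ("Aux", "NP", "VP")),
        ("NP",  ("Det", "Nom")),
        ("NP",  ("Proper-Noun",)),
        ("Det", ("that",)),
        ("Det", ("the",)),
        ("Det", ("a",)),
    ]
    start_symbol = "S"

    cfg = (nonterminals, terminals, productions, start_symbol)   # the 4-tuple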

CFG Example
S → NP VP
S → Aux NP VP
NP → Det Nom
NP → Proper-Noun
Det → that | the | a

Probabilistic CFGs
A CFG where each production has an associated probability.
A PCFG is a 5-tuple: (N, Σ, P, S, D):
D: P → [0, 1], a function assigning each rule in P a probability
Usually, probabilities are obtained from a corpus; the most widely used corpus is the Penn Treebank.

The Penn Treebank
English sentences annotated with syntax trees
Built at the University of Pennsylvania
40,000 sentences, about a million words
Text from the Wall Street Journal

Other treebanks exist for other languages (e.g. NEGRA for German)
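As a worked illustration (not from the slides), a toy PCFG can be defined and parsed with NLTK, assuming NLTK 3.x is available; the grammar and probabilities below are invented for the example, not estimated from the Penn Treebank:

    import nltk

    # Each rule carries a probability in brackets; for every LHS the
    # probabilities must sum to 1 (NP: 0.6 + 0.4).
    grammar = nltk.PCFG.fromstring("""
        S   -> NP VP      [1.0]
        NP  -> Det N      [0.6]
        NP  -> 'Vinken'   [0.4]
        VP  -> V NP       [1.0]
        Det -> 'the'      [1.0]
        N   -> 'board'    [1.0]
        V   -> 'joined'   [1.0]
    """)

    parser = nltk.ViterbiParser(grammar)   # returns the most probable parse
    for tree in parser.parse("Vinken joined the board".split()):
        print(tree)          # the parse tree
        print(tree.prob())   # its probability: product of the rule probabilities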

Example tree

Building a tree: rules
S → NP VP
NP → NNP NNP
NNP → Mr
NNP → Vinken
(Example tree, reconstructed from the slide figure: (S (NP (NNP Mr) (NNP Vinken)) (VP (VBZ is) (NP (NN chairman) (PP (IN of) (NP (NNP Elsevier)))))))

Characteristics of PCFGs
In a PCFG, the probability P(A → α) expresses the likelihood that the non-terminal A will expand as α,
e.g. the likelihood that S → NP VP (as opposed to S → VP, or S → NP VP PP, or …)

This can be interpreted as a conditional probability: the probability of the expansion, given the LHS non-terminal:
P(A → α) = P(A → α | A)

Therefore, for any non-terminal A, the probabilities of every rule of the form A → α must sum to 1.
If this is the case, we say the PCFG is consistent.
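A small sketch (toy grammar with invented probabilities, not from the lecture) of checking this consistency condition:

    from collections import defaultdict

    # rule -> probability; rules are (LHS, RHS) pairs
    rule_probs = {
        ("S",  ("NP", "VP")):        0.8,
        ("S",  ("Aux", "NP", "VP")): 0.2,
        ("NP", ("Det", "Nom")):      0.7,
        ("NP", ("Proper-Noun",)):    0.3,
    }

    totals = defaultdict(float)
    for (lhs, _rhs), p in rule_probs.items():
        totals[lhs] += p

    # consistent iff every LHS's rule probabilities sum to 1
    consistent = all(abs(total - 1.0) < 1e-9 for total in totals.values())
    print(consistent)   # True for this toy grammar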

Uses of probabilities in parsing
Disambiguation: given n legal parses of a string, which is the most likely?
e.g. PP-attachment ambiguity can be resolved this way

Speed: parsing is a search problem
search through the space of possible applicable derivations
the search space can be pruned by focusing on the most likely sub-parses of a parse

The parser can be used as a model to determine the probability of a sentence, given a parse;
typical use in speech recognition, where an input utterance can be heard as several possible sentences.

Using PCFG probabilities
A PCFG assigns a probability to every parse-tree t of a string W,
e.g. every possible parse (derivation) of a sentence recognised by the grammar.

Notation:
G = a PCFG
s = a sentence
t = a particular tree under our grammar
t consists of several nodes n; each node is generated by applying some rule r

Probability of a tree vs. a sentence

Both are simply the multiplication of the probability of every rule (node) that gives rise to t (i.e. the derivation of t):
P(t, s) = the product, over all nodes n in t, of P(r(n))

This is both the joint probability of t and s, and the probability of t alone. Why?

P(t,s) = P(t), because P(t,s) = P(t) P(s|t), and P(s|t) must be 1, since the tree t is a parse of all the words of s.
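A minimal sketch (rule probabilities invented, not from the lecture) of this product over a derivation, using a few of the "Mr Vinken" rules from earlier:

    from math import prod

    # (rule, probability) pairs making up the derivation of a (partial) tree t
    derivation = [
        (("S",   ("NP", "VP")),   0.8),
        (("NP",  ("NNP", "NNP")), 0.1),
        (("NNP", ("Mr",)),        0.01),
        (("NNP", ("Vinken",)),    0.001),
    ]

    p_tree = prod(p for _rule, p in derivation)
    print(p_tree)   # P(t); also P(t, s), since P(s | t) = 1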

Picking the best parse in a PCFG
A sentence will usually have several parses;
we usually want them ranked, or only want the n-best parses.

We need to focus on P(t|s,G): the probability of a parse, given our sentence and our grammar.

Definition of the best parse for s: the tree t that maximises P(t|s,G); since P(s) is fixed for a given sentence, this is the same as maximising P(t,s) = P(t).

Picking the best parse in a PCFG
Problem: t can have multiple derivations, e.g. expand left-corner nodes first, expand right-corner nodes first, etc.
So P(t|s,G) should be estimated by summing over all possible derivations.

Fortunately, derivation order makes no difference to the final probabilities.
We can assume a canonical derivation d of t: P(t) =def P(d)

Probability of a sentence
Simply the sum of the probabilities of all parses of that sentence, i.e. all those trees which yield s:
P(s) = Σ P(t), summing over all trees t which yield s
(since s is only a sentence if it is recognised by G, i.e. if there is some t for s under G)
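A small sketch (parse labels and probabilities invented, not from the lecture) showing both uses at once: summing parse probabilities gives P(s), and taking the argmax gives the best parse:

    parses = {            # tree identifier -> P(t, s) under the grammar
        "t1 (PP attached to VP)": 3.2e-7,
        "t2 (PP attached to NP)": 1.1e-7,
    }

    p_sentence = sum(parses.values())          # P(s) = sum over all parses of s
    best_parse = max(parses, key=parses.get)   # argmax over t of P(t | s, G)
    print(p_sentence, best_parse)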

Flaws I: Structural independence
The probability of a rule r expanding node n depends only on n, independent of other non-terminals.

Example: P(NP → Pro) is independent of where the NP is in the sentence,
but we know that NP → Pro is much more likely in subject position.
Francis et al. (1999), using the Switchboard corpus: 91% of subjects are pronouns; only 34% of objects are pronouns.

Flaws II: lexical independence
Vanilla PCFGs ignore lexical material, e.g. P(VP → V NP PP) is independent of the head of NP or PP, or the lexical head V.

Examples:
Prepositional phrase attachment preferences depend on lexical items; cf.:
dump [sacks into a bin]
dump [sacks] [into a bin] (preferred parse)
Coordination ambiguity:
[dogs in houses] and [cats]
[dogs] [in houses and cats]

Weakening the independence assumptions in PCFGs

Lexicalised PCFGs
Attempt to weaken the lexical independence assumption.

Most common technique: mark each phrasal head (N, V, etc.) with the lexical material.
This is based on the idea that the most crucial lexical dependencies are between head and dependent.
E.g.: Charniak 1997, Collins 1999

Lexicalised PCFGs: Matt walks
Makes probabilities partly dependent on lexical content.

P(VP → VBD | VP) becomes: P(VP → VBD | VP, h(VP) = walk)

NB: normally, we can't assume that all heads of a phrase of category C are equally probable.
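A sketch of the kind of head-conditioned estimate this implies; the counts and the relative-frequency estimator below are invented for illustration, not taken from the lecture or any treebank:

    from collections import defaultdict

    # counts[(lhs, head)][rhs] = times the rule was seen with that head word
    counts = defaultdict(lambda: defaultdict(int))
    counts[("VP", "walk")][("VBD",)]             = 60
    counts[("VP", "walk")][("VBD", "PP")]        = 40
    counts[("VP", "dump")][("VBD", "NP", "PP")]  = 90
    counts[("VP", "dump")][("VBD", "NP")]        = 10

    def lex_rule_prob(lhs, head, rhs):
        """P(lhs -> rhs | lhs, head(lhs)), estimated by relative frequency."""
        total = sum(counts[(lhs, head)].values())
        return counts[(lhs, head)][rhs] / total if total else 0.0

    print(lex_rule_prob("VP", "walk", ("VBD",)))               # 0.6
    print(lex_rule_prob("VP", "dump", ("VBD", "NP", "PP")))    # 0.9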

(Lexicalised tree for "Matt walks", reconstructed from the slide figure: (S(walks) (NP(Matt) (NNP(Matt) Matt)) (VP(walk) (VBD(walk) walks))))

Practical problems for lexicalised PCFGs
Data sparseness: we don't necessarily see all heads of all phrasal categories often enough in the training data.

Flawed assumptions: lexical dependencies occur elsewhere, not just between head and complement.
E.g. "I got the easier problem of the two to solve": "of the two" and "to solve" become more likely because of the pre-head modifier "easier".

Structural context
The simple way: calculate P(t|s,G) based on rules in the canonical derivation d of t.

This assumes that P(t) is independent of the derivation.

could condition on more structural context

But then we could lose the notion of a canonical derivation, i.e. P(t) could really depend on the derivation!

Structural context: probability of a derivation history
How to calculate P(t) based on a derivation d?
Observation:

P(d) = P(r1, …, rm), the probability that a sequence of m rewrite rules in a derivation yields s;
we can use the chain rule for multiplication: P(r1, …, rm) = P(r1) × P(r2 | r1) × … × P(rm | r1, …, rm-1).

Approach 2: parent annotation
Annotate each node with its parent in the parse tree.

E.g. if NP has parent S, then rename NP to NP^S
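A minimal sketch (hypothetical helper, not from the slides) of this renaming on a nested-list tree for "Matt walks":

    def parent_annotate(tree, parent=None):
        """Return a copy of the tree with each non-terminal renamed label^parent."""
        if isinstance(tree, str):          # a terminal (word): leave unchanged
            return tree
        label, *children = tree
        new_label = f"{label}^{parent}" if parent else label
        return [new_label] + [parent_annotate(child, label) for child in children]

    tree = ["S", ["NP", ["NNP", "Matt"]], ["VP", ["VBD", "walks"]]]
    print(parent_annotate(tree))
    # ['S', ['NP^S', ['NNP^NP', 'Matt']], ['VP^S', ['VBD^VP', 'walks']]]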

Can partly account for dependencies such as subject-of (NP^S is a subject, NP^VP is an object).
(Parent-annotated tree for "Matt walks", reconstructed from the slide figure: (S(walks) (NP^S (NNP^NP Matt)) (VP^S (VBD^VP walks))))

The main point
Many different parsing approaches differ on what they condition their probabilities on.

Other grammar formalisms

Phrase structure vs. Dependency grammar
PCFGs are in the tradition of phrase-structure grammars.

Dependency grammar describes syntax in terms of dependencies between words:
no non-terminals or phrasal nodes
only lexical nodes with links between them
links are labelled, with labels drawn from a finite list

Dependency Grammar

(Example dependency structure, reconstructed from the slide figure, for "I gave him my address": GAVE is the main node, with labelled links subj: to I, dat: to him, obj: to address, and attr: from address to MY.)

Dependency grammar
Often used now in probabilistic parsing.
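A sketch (hypothetical representation, not from the slides) of that dependency structure as a set of labelled (head, label, dependent) arcs:

    dependency_parse = {
        ("gave",    "subj", "I"),
        ("gave",    "dat",  "him"),
        ("gave",    "obj",  "address"),
        ("address", "attr", "my"),
    }

    # No phrasal nodes: every node is a word, so disambiguation decisions can
    # look at the lexical material directly.
    heads = {dep: (head, label) for head, label, dep in dependency_parse}
    print(heads["address"])   # ('gave', 'obj')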

Advantages:
directly encode lexical dependencies
therefore, disambiguation decisions take lexical material into account directly
dependencies are a way of decomposing PSRs and their probability estimates
estimating the probability of dependencies between 2 words is less likely to lead to data sparseness problems

Summary
We've taken a tour of PCFGs:
crucial notion: what the probability of a rule is conditioned on
flaws in PCFGs: independence assumptions
several proposals to go beyond these flaws
dependency grammars are an alternative formalism