NLP Unit-II
The outcome of the parsing process is a parse tree: the sentence symbol is the root; intermediate nodes such as noun_phrase and verb_phrase have
children and are therefore called non-terminals; and the leaves of the tree ‘Tom’, ‘ate’, ‘an’, ‘apple’ are called terminals.
Parse Tree:
A treebank can be defined as a linguistically annotated corpus that includes some kind of syntactic analysis over and above part-of-speech tagging.
A sentence is parsed by relating each word to other words in the sentence which depend on it.
The syntactic parsing of a sentence consists of finding the correct syntactic structure of that sentence in the given formalism/grammar.
Dependency grammar (DG) and phrase structure grammar (PSG) are two such formalisms.
PSG breaks sentence into constituents (phrases), which are then broken into smaller constituents.
DG: syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies.
Learning: scoring possible dependency graphs for a given sentence, usually by factoring the graphs into their component arcs
Parsing: searching for the highest scoring graph for a given sentence
Syntax-Introduction
• Parsing uncovers the hidden structure of linguistic input.
• From tagging to full parsing, algorithms that can handle such ambiguity have to be carefully chosen.
• Here we explore the syntactic analysis methods from tagging to full parsing and the use of supervised machine learning to deal with ambiguity.
2.1 Parsing Natural Languages
• In a text-to-speech application, input sentences are converted to spoken output that should sound as if it were spoken by a native speaker of the language.
• Parsing provides a structural description that identifies such breaks in the intonation.
• A text-to-speech system needs to know that the first instance of the word lives is a verb and the second instance is a noun.
• This is an instance of the POS tagging problem.
• A more elegant way to approach this task is to first parse the sentence into its constituents, recursively partitioning the words of the sentence into phrases such as verb phrases and noun phrases.
The output of the parser for the input sentence “Tom ate an apple” is shown in Fig.
• To produce such an output, a parser requires some additional knowledge beyond the input sentence itself.
• We can write down the rules of the syntax of a sentence as a CFG.
• Here we have a CFG which represents a simple grammar of transitive verbs in English (verbs that have a subject and an object noun phrase (NP), plus modifiers of verb phrases (VP) in the form of prepositional phrases (PP)).
• The above CFG can produce the syntax analysis of a sentence like: John bought a shirt with pockets
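• As a concrete illustration, here is a minimal sketch using NLTK (assumed to be installed); the grammar below is an illustrative reconstruction of such a transitive-verb CFG with PP modifiers, not the exact grammar from the slides, and it produces both attachments of the prepositional phrase.

import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
VP  -> V NP | VP PP
NP  -> Det N | NP PP | N | 'John'
PP  -> P NP
Det -> 'a'
N   -> 'shirt' | 'pockets'
V   -> 'bought'
P   -> 'with'
""")

parser = nltk.ChartParser(grammar)

# Both parses are produced: PP attached to the NP ("a shirt with pockets")
# and PP attached to the VP ("bought ... with pockets").
for tree in parser.parse("John bought a shirt with pockets".split()):
    tree.pretty_print()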
Treebanks: A Data-Driven Approach to Syntax
• A simple list of rules does not consider interactions between different components in the grammar.
• It is difficult to exhaustively list lexical properties of words. This is a typical knowledge acquisition problem.
• One more problem is that the rules interact with each other in combinatorially explosive ways.
• For the input ‘natural language processing’ the recursive rules produce two ambiguous parses.
• For CFGs it can be proved that the number of parses obtained by using the recursive rule n times is the Catalan number of n, C(n) = (1/(n+1)) · (2n choose n), whose values are 1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, ….
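• A small sketch (in Python) that computes this sequence from the closed form above:

from math import comb

def catalan(n: int) -> int:
    # Number of binary bracketings obtained by applying the recursive rule n times.
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(10)])
# [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]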
• For the input “natural language processing book” only one out of the five parses obtained using the above CFG is correct:
• This is the second knowledge acquisition problem: we need to know not only the rules but also which analysis is the most plausible for a given input sentence.
• The construction of a treebank is a data-driven approach to syntax analysis that allows us to address both knowledge acquisition bottlenecks in one stroke.
• A treebank is a collection of sentences where each sentence is provided with a complete syntax analysis.
• The syntax analysis for each sentence has been judged by a human expert.
• A set of annotation guidelines is written before the annotation process to ensure a consistent scheme of annotation throughout the treebank.
• No set of syntactic rules is provided by a treebank.
• No exhaustive set of rules is assumed to exist, even though assumptions about syntax are implicit in a treebank.
• The consistency of syntax analysis in a treebank is measured in terms of inter-annotator agreement, by having approximately 10% of the material annotated by more than one annotator.
• Treebanks provide annotations of syntactic structure for a large sample of sentences.
• Treebanks solve the first knowledge acquisition problem of finding the grammar underlying the syntax analysis because the analysis is directly given instead of a
grammar.
• Each sentence in a treebank has been given its most plausible syntactic analysis.
• Supervised learning algorithms can be used to learn a scoring function over all possible syntax analyses.
• For new, unseen data the parser uses the scoring function to return the syntax analysis that has the highest score.
• Two main approaches to syntax analysis that are used to construct treebanks are:
• Dependency graphs
• Phrase structure trees
• These two representations are very closely connected to each other and under some assumptions one can be converted to the other.
• Dependency analysis is used for free word order languages like Indian languages.
• Phrase structure analysis is used to provide additional information about long distance dependencies for languages like English and French.
NLP is the capability of computer software to understand natural language.
Each language has its own structure (e.g., SVO or SOV), called its grammar; the grammar has a certain set of rules that determine what is allowed and what is not allowed.
English is an SVO language: Subject Verb Object.
Example: I eat mango (I = subject, eat = verb, mango = object).
Each production rule has the form α → β, where α and β are strings over VN ∪ ∑ and α contains at least one non-terminal symbol.
P = Production rules for Terminals as well as Non-terminals.
S -> NP VP
VP -> V NP
V -> hit
NP -> D N
D -> the
N -> John | ball
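• A minimal sketch of this toy grammar in NLTK (assumed installed); NP -> N is added here so that the bare noun “John” can form a noun phrase on its own, which the rule list above leaves implicit.

import nltk

toy_grammar = nltk.CFG.fromstring("""
S  -> NP VP
VP -> V NP
NP -> D N | N
V  -> 'hit'
D  -> 'the'
N  -> 'John' | 'ball'
""")

# Prints the single parse: (S (NP (N John)) (VP (V hit) (NP (D the) (N ball))))
for tree in nltk.ChartParser(toy_grammar).parse("John hit the ball".split()):
    print(tree)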
Treebanks: A Data-Driven Approach to Syntax
• In the discussion to follow we examine three main components for building a parser:
• The representation of syntactic structure: this involves the use of a varying amount of linguistic knowledge to build a treebank.
• The training and decoding algorithms: these deal with the potentially exponential search space.
• Methods to model ambiguity: these provide a way to rank parses and recover the most likely parse.
2.3 Representation of Syntactic Structure
• In dependency graphs the head of a phrase is connected with the dependents in that phrase using directed connections.
• The main difference between dependency graphs and phrase structure trees is that dependency analysis makes minimal assumptions about syntactic structure.
• Dependency graphs treat the words in the input sentence as the only vertices in the graph which are linked together by directed arcs representing syntactic dependencies.
• We restrict ourselves to dependency trees, in which each word depends on exactly one parent: either another word or a dummy root symbol.
• In dependency trees the index 0 is used to indicate the root symbol, and the directed arcs are drawn from the head word to the dependent word.
• In the figure in the previous slide [fakulte,N3,7] is the seventh word in the sentence with POS tag N3 and it has dative case.
• Here is a textual representation of a labeled dependency tree:
• A projective dependency tree is one where we put the words in linear order based on the sentence, with the root symbol in the first position.
• The dependency arcs can then be drawn above the words without any crossing dependencies.
• Example:
• Let us look at an example where a noun phrase modifier is extraposed to the right, which requires a crossing dependency.
• English treebanks contain very few cases that need such a non-projective analysis.
• In languages like Czech, Turkish and Telugu, the number of non-projective dependencies is much higher.
• Dependency graphs in treebanks do not explicitly distinguish between projective and non-projective dependency tree analyses.
• Parsing algorithms are sometimes forced to distinguish between projective and non-projective dependencies.
• Let us try to setup dependency links in a CFG.
• In the CFG the terminal symbols are x0,x1,x2,x3.
• The asterisk picks out a single symbol in the right hand side of each rule that specifies the dependency link.
• We can look at the asterisk as either a separate annotation on the non-terminal or simply as a new nonterminal in the probabilistic context-free
grammar(PCFG).
• If we can convert a dependency tree into an equivalent CFG then the dependency tree must be projective.
• In a CFG converted from a dependency tree we have only the following three types of rules:
• One type of rule to introduce the terminal symbol
• Two rules where Y is dependent on X or vice-versa.
• The head word of X or Y can be traced by following the asterisk symbol.
• When we convert this dependency tree to a CFG with * notation it can only capture the fact that X3 depends on X2 or X1 depends on X3.
• There is no CFG that can capture the non-projective dependency.
• Projective dependency can be defined as follows:
• For each word in the sentence its descendants form a contiguous substring of the sentence.
• Non-Projectivity can be defined as follows:
• A non-projective dependency means that there is a word in the sentence such that its descendants do not form a contiguous substring of the sentence.
• Suppose a non-projective dependency could be captured: there would have to be a non-terminal Z such that Z derives the spans (xi,xk) and (xk+p,xj) for some p>0.
• That means there must be a rule Z -> P Q where P derives (xi,xk) and Q derives (xk+p,xj).
• By the definition of a CFG this is possible only if p=0, because P and Q must derive contiguous substrings.
• Hence a non-projective dependency cannot be converted to a CFG.
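• A small sketch that checks the contiguity definition above; the input is assumed to be a list of head indices, where heads[i] is the head of word i+1 and 0 denotes the artificial root.

def descendants(heads, i):
    # Word i plus all words reachable from it via dependency arcs.
    result = {i}
    changed = True
    while changed:
        changed = False
        for w, h in enumerate(heads, start=1):
            if h in result and w not in result:
                result.add(w)
                changed = True
    return result

def is_projective(heads):
    for i in range(1, len(heads) + 1):
        span = descendants(heads, i)
        if max(span) - min(span) + 1 != len(span):   # a gap means non-projectivity
            return False
    return True

# "John saw Mary": saw (word 2) is the root's child and heads John and Mary.
print(is_projective([2, 0, 2]))      # True
# Crossing arcs 1->3 and 4->2: word 4's descendants {2, 4} are not contiguous.
print(is_projective([0, 4, 1, 1]))   # False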
2.3.2 Syntax Analysis using Phrase Structure Trees
• A phrase structure syntax analysis can be viewed as implicitly having a predicate argument structure associated with it.
• Now let us look at an example sentence:
• Mr. Baker seems especially sensitive
• The same sentence gets the following dependency tree analysis.
• We can find some similarity between phrase structure trees and dependency trees.
• We now look at some examples of phrase structure analysis in treebanks and see how null elements are used to localize certain predicate-argument dependencies in
the tree structure.
• In this example we can see that an NP dominates a trace *T* which is a null element (same as ϵ).
• The empty trace has an index and is associated with the WHNP (WH-noun phrase) constituent with the same index.
• This co-indexing allows us to infer the predicate-argument structure shown in the tree.
• In this example “the ball” is actually not the logical subject of the predicate.
• It has been displaced due to the passive construction.
• “Chris”, the actual subject, is marked as LGS (logical subject in passives), enabling the recovery of the predicate-argument structure for the sentence.
• In this example we can see that different syntactic phenomena are combined in the corpus.
• Both analyses are combined to provide the predicate-argument structure in such cases.
• Here we see a pair of examples to show how null elements are used to annotate the presence of a subject for a predicate even if it is not explicit in the sentence.
• In the first case the analysis marks the missing subject for “take back” as the object of the verb persuaded.
• In the second case the missing subject for “take back ” is the subject of the verb promised.
• The dependency analysis for “persuaded” and “promised” do not make such a distinction.
• The dependency analysis for the two sentences is as follows:
• Most statistical parsers trained using Phrase Structure treebanks ignore these differences.
• Here we look at one example from the Chinese treebank which uses IP instead of S.
• This is a move from transformational grammar-based phrase structure to government-binding (GB) based phrase structure.
• It is very difficult to take a CFG-based parser initially developed for English parsing and adapt it to Chinese parsing by training it on a Chinese phrase structure treebank.
2.4 Parsing Algorithms
• The analysis will be consistent with the treebank used to train the parser.
• Treebanks do not need to come with an explicit grammar.
• Here for better understanding we assume the existence of a CFG.
• Let us look at an example of a CFG that is used to derive strings such as “a and b or c” from the start symbol N:
• An interesting property of rightmost derivation is revealed if we arrange the derivation in reverse order.
• The above sequence corresponds to the construction of the above parse tree from left to right, one symbol at a time.
• There can be many parse trees for the given input string.
• Another rightmost derivation for the given input string is as follows:
2.4.1 Shift-Reduce Parsing
• To build a parser, we need an algorithm that can perform the steps in the above rightmost derivation for any grammar and for any input string.
• Every CFG turns out to have an equivalent class of automata, called pushdown automata (just as regular expressions can be converted to finite-state automata).
• There is a general parsing algorithm that works for any given CFG and input string.
• The algorithm is called shift-reduce parsing and uses two data structures:
1. a buffer for input symbols and
2. a stack for storing CFG symbols.
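• A greedy sketch of shift-reduce parsing for the small grammar used above (N -> N 'and' N | N 'or' N | 'a' | 'b' | 'c'); a real shift-reduce parser needs a control strategy or backtracking, but the simple "reduce whenever possible" policy below happens to work for this grammar and input.

RULES = [
    ("N", ("N", "and", "N")),
    ("N", ("N", "or", "N")),
    ("N", ("a",)),
    ("N", ("b",)),
    ("N", ("c",)),
]

def shift_reduce(tokens):
    stack, buffer = [], list(tokens)
    while True:
        # Try to reduce: match a rule's right-hand side against the top of the stack.
        for lhs, rhs in RULES:
            if len(stack) >= len(rhs) and tuple(s[0] for s in stack[-len(rhs):]) == rhs:
                children = stack[len(stack) - len(rhs):]
                del stack[len(stack) - len(rhs):]
                stack.append((lhs, children))
                break
        else:
            if buffer:           # no reduction possible: shift the next input symbol
                stack.append((buffer.pop(0), []))
            else:
                break            # nothing left to shift or reduce
    return stack

# Ends with a single N covering the whole input, built by successive reductions.
print(shift_reduce("a and b or c".split()))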
• In the worst case, parsing with a CFG may require backtracking; general parsing algorithms run in worst-case O(n³) time, where n is the length of the input.
• Variants of this algorithm are used in statistical parsers that attempt to search the space of possible parse trees without the limitation of left-to-right parsing.
• The example CFG G is rewritten as a new CFG Gc in which each rule contains at most two non-terminals on the right-hand side.
• We can specialize the CFG Gc to a particular input string by creating a new CFG that represents all possible parse trees that are valid in grammar Gc for this particular input sentence.
• For the input “a and b or c” the new CFG Cf that represents the forest of parse trees can be constructed.
• Let the input string be annotated with span positions: 0 a 1 and 2 b 3 or 4 c 5 (the integers mark the positions between tokens).
• Here a parsing algorithm is defined as taking a CFG and an input string as input and producing a specialized CFG that represents all legal parses of the input.
• A parser has to create all the valid specialized rules, from the start-symbol non-terminal that spans the entire string down to the leaf nodes that are the input tokens.
• Now let us look at the steps the parser has to take to construct a specialized CFG.
• Let us consider the rules that generate only lexical items:
• These rules can be constructed by checking for the existence of a rule of the type N -> x for any input token x and creating a specialized rule for that token x.
• We now can create new specialized rules based on previously created rules in the following way:
• Let Y[i,k] and Z[k,j] be the left-hand sides of previously created specialized rules.
• If the CFG has a rule of the type X -> Y Z,
• we infer that there must be a specialized rule X[i,j] -> Y[i,k] Z[k,j].
• Each non-terminal span is assigned a score s, written X[i,j] : s.
• The highest-scoring entry for each non-terminal span is retained: X[i,j] = max_s X[i,j] : s.
• The previous algorithm is called the CYK algorithm.
• It considers every span of every length and splits up that span in every possible way to see if a CFG rule can be used to derive the span.
• Picking the most likely trees using supervised learning takes no more than O(n³) time.
• For a given span i,j and for each non-terminal X we only keep the highest scoring way to reach X[i,j].
• Starting from the start symbol that spans the entire string S[0,n] gives us the best parsing tree.
• In a probabilistic setting where the score can be interpreted as log probabilities, this is called a Viterbi-best parse.
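• A compact sketch of Viterbi CYK over a toy PCFG in Chomsky normal form; the grammar and its probabilities are illustrative, not taken from the slides.

import math
from collections import defaultdict

# Binary rules (X, Y, Z, log prob) for X -> Y Z, and lexical rules (X, word, log prob).
BINARY = [("S", "NP", "VP", math.log(1.0)),
          ("VP", "V", "NP", math.log(1.0)),
          ("NP", "D", "N", math.log(0.5))]
LEXICAL = [("NP", "John", math.log(0.5)),
           ("V", "hit", math.log(1.0)),
           ("D", "the", math.log(1.0)),
           ("N", "ball", math.log(1.0))]

def viterbi_cyk(words):
    n = len(words)
    chart = defaultdict(dict)          # chart[(i, j)][X] = best log prob of X =>* words[i:j]
    for i, w in enumerate(words):      # length-1 spans come from the lexical rules
        for X, token, lp in LEXICAL:
            if token == w:
                chart[(i, i + 1)][X] = max(chart[(i, i + 1)].get(X, -math.inf), lp)
    for length in range(2, n + 1):     # every span of every length ...
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # ... split in every possible way
                for X, Y, Z, lp in BINARY:
                    if Y in chart[(i, k)] and Z in chart[(k, j)]:
                        score = lp + chart[(i, k)][Y] + chart[(k, j)][Z]
                        if score > chart[(i, j)].get(X, -math.inf):
                            chart[(i, j)][X] = score   # keep only the best way to reach X[i,j]
    return chart[(0, n)].get("S", -math.inf)

print(viterbi_cyk("John hit the ball".split()))   # log probability of the Viterbi-best parse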
• Each cell contains the log probability of deriving the string w[i,j] starting with non-terminal X, which can be written as Pr(X =>* w[i,j]) (the inside, or generating, probability).
• The utility of a non-terminal X at a particular span [i,j] depends on reaching the start symbol S, which is captured by the outside probability Pr(S =>* w[0,i-1] X w[j+1,N]) (the reachability probability).
• There are ways to speed up the parser by throwing away less likely parts of the search space.
• Three such techniques are:
1. Beam thresholding
2. Global thresholding
3. Coarse to fine parsing
• We compare the score of X[i,j] with the score of the current highest scoring entry Y[i,j] and throw away any rule starting with X[i,j] if its score is not good enough. This
technique is called beam thresholding.
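• A one-cell sketch of beam thresholding: entries whose score falls more than a fixed beam below the best entry for the same span [i,j] are discarded (the beam width and scores are illustrative).

def beam_prune(cell, beam=5.0):
    # cell maps non-terminal -> log score for a single span [i, j].
    best = max(cell.values())
    return {X: s for X, s in cell.items() if s >= best - beam}

print(beam_prune({"NP": -2.0, "VP": -3.5, "ADJP": -9.0}))   # ADJP is pruned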
• If we want to speed up the parser a little more, we can eliminate rules that do not have neighboring rules to combine with; this technique is called global thresholding. In the figure, nodes A, B and C will not be thresholded because each is part of a sequence from the beginning to the end of the chart, while nodes X, Y and Z will be thresholded because none of them is part of such a sequence.
• We parse first with a coarser non-terminal to prune the finer grained non-terminal. (using a VP non-terminal instead of VP^S)
• We now use the score associated with the coarse VP[i,j] to prune the finer grained non-terminals for the same span.
• This approach is called coarse to fine parsing.
• It is useful because the outside probability of the coarse step can be used in addition to the inside probability for more effective pruning.
• The parser can be sped up even further by using A* search rather than the exhaustive search used in the previous algorithm.
• A good choice of heuristic can be very helpful to provide faster times for parsing.
• The worst case complexity remains the same as CYK algorithm.
• In the case of projective dependency parsing, the CYK algorithm can be used by creating a CFG that produces dependency parses.
• Used in this way, the algorithm has a worst-case complexity of O(n⁵) for dependency parsing.
• For dependency parsing instead of using non-terminals augmented with words we can represent sets of different dependency trees for each span of the input string.
• The idea is to collect left and right dependencies of each head independently and combine them at a later stage.
• Here we have the notion of split head where the head word is split into two:
• One for the left dependents
• One for the right dependents
• In addition to the head word for each span in each item we store the following:
• Whether the head is gathering left or right dependencies.
• Whether the item is complete. (cannot be extended with more dependents)
• In the algorithm the spans are stored in a chart data-structure C e.g. C[i,j] refers to the dependency analysis of span i,j.
• The spans that are grown towards the left are referred to as C<-, and the spans that are grown towards the right are referred to as C->.
• For C<-[i][j] the head must be j and for C->[i][j] the head must be i.
• We assume that the unique root node is the leftmost token.
• The score of the best tree for the entire sentence is in C->[1][n].
• This algorithm can be extended to provide k-best parses with a complexity of O(n³ k log k).
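• A compact, score-only sketch of the O(n³) split-head (Eisner-style) algorithm described above; arc_score[h][m] is an assumed precomputed score for attaching word m to head h, and index 0 is used here for the artificial root (the slides instead place the root at the leftmost token).

NEG_INF = float("-inf")

def eisner_best_score(arc_score):
    n = len(arc_score) - 1                     # words 1..n, root at index 0
    # C (complete) and I (incomplete) spans [i, j]; direction 0 = head is j, 1 = head is i.
    C = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][i][0] = C[i][i][1] = 0.0
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Incomplete spans: add the arc j -> i (left) or i -> j (right).
            best = max(C[i][k][1] + C[k + 1][j][0] for k in range(i, j))
            I[i][j][0] = best + arc_score[j][i]
            I[i][j][1] = best + arc_score[i][j]
            # Complete spans: absorb a finished incomplete span.
            C[i][j][0] = max(C[i][k][0] + I[k][j][0] for k in range(i, j))
            C[i][j][1] = max(I[i][k][1] + C[k][j][1] for k in range(i + 1, j + 1))
    return C[0][n][1]                          # best projective tree headed at the root

# Toy scores for (root, John, saw, Mary); the best tree is root->saw, saw->John, saw->Mary.
scores = [[NEG_INF, 1.0, 9.0, 1.0],
          [NEG_INF, NEG_INF, 2.0, 2.0],
          [NEG_INF, 8.0, NEG_INF, 8.0],
          [NEG_INF, 2.0, 2.0, NEG_INF]]
print(eisner_best_score(scores))               # 9 + 8 + 8 = 25.0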
• Finding the optimum branching in a directed graph is closely related to the problem of finding a minimum spanning tree in an undirected graph.
• A prerequisite is that each potential dependency link between words should have a score.
• In NLP minimum spanning tree (MST) is used to refer to the optimum branching problem in directed graphs.
• The scores we have can be used to find the MST, which is the highest scoring dependency tree.
• Nonprojective dependencies can be recovered because the linear order of the words in the input is not considered.
• Let us look at an example sentence “John saw Mary” with precomputed scores.
• The first step is to find highest scoring incoming edges.
• If the step results in a tree we report this as the parse because it is the MST.
• In our example the highest scoring edges result in a cycle.
• We contract the cycle into a single node and recalculate the edge weights.
• The resultant graph is as follows:
• We now run the MST algorithm on the previous graph.
• The resultant graph is as follows:
• The MST that is the highest score dependency parse tree is as follows:
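• A sketch of just the first step of the MST procedure described above: pick the highest-scoring incoming edge for every word and check for a cycle; the contraction and rescoring step that follows when a cycle is found is omitted, and the scores below are illustrative.

scores = {            # scores[dependent][head] = arc score; 0 denotes the root node
    "John": {0: 9, "saw": 30, "Mary": 11},
    "saw":  {0: 10, "John": 20, "Mary": 0},
    "Mary": {0: 9, "saw": 30, "John": 3},
}

def best_incoming(scores):
    # Greedily select the highest-scoring head for every word.
    return {dep: max(heads, key=heads.get) for dep, heads in scores.items()}

def find_cycle(heads):
    # Return the set of nodes forming a cycle among the chosen arcs, if any.
    for start in heads:
        seen, node = [], start
        while node in heads and node not in seen:
            seen.append(node)
            node = heads[node]
        if node in seen:
            return set(seen[seen.index(node):])
    return None

chosen = best_incoming(scores)
print(chosen)               # {'John': 'saw', 'saw': 'John', 'Mary': 'saw'}
print(find_cycle(chosen))   # {'John', 'saw'}: a cycle, so contraction is needed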
2.5 Models for Ambiguity Resolution in Parsing
• Here we discuss the design features and the ways to resolve ambiguity in parsing.
• The different issues discussed here are as follows:
• Probabilistic Context-Free Grammars
• Generative models for parsing
• Discriminative Models for parsing
• Let us look at the example we already used to describe ambiguity. The sentence “John bought a shirt with pockets”
• We want to provide a grammar so that the second tree is preferred over the first.
• The original grammar for the parse trees is:
• We can add scores or probabilities to the rules in this CFG in order to provide a score or probability for each derivation.
• The probability of a derivation is the sum of scores or product of probabilities of all the CFG rules used in the derivation.
• The scores are viewed as log probabilities and we use the term Probabilistic Context-Free Grammar (PCFG).
• We assign probabilities to the CFG rules such that for a rule N -> α the probability is P(N -> α | N).
• Each rule probability is conditioned on the left-hand side of the rule.
• The probability mass is distributed among all the expansions of each non-terminal, so the probabilities of all rules with the same left-hand side sum to 1.
• The rule probabilities can be derived from a tree bank.
• Consider a treebank with three trees t1,t2 and t3.
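• A sketch of relative-frequency estimation of the rule probabilities from a toy treebank of NLTK trees; the three trees below are illustrative, not the t1, t2 and t3 from the slides.

from collections import Counter
import nltk

treebank = [
    nltk.Tree.fromstring("(S (NP John) (VP (V bought) (NP (D a) (N shirt))))"),
    nltk.Tree.fromstring("(S (NP John) (VP (VP (V bought) (NP (D a) (N shirt))) (PP (P with) (NP (N pockets)))))"),
    nltk.Tree.fromstring("(S (NP Mary) (VP (V slept)))"),
]

rule_counts, lhs_counts = Counter(), Counter()
for tree in treebank:
    for prod in tree.productions():      # every CFG rule used in the tree
        rule_counts[prod] += 1
        lhs_counts[prod.lhs()] += 1

for prod, count in rule_counts.items():
    print(prod, count / lhs_counts[prod.lhs()])   # estimate of P(N -> alpha | N)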
• Appropriate changes are needed, such as replacing the scoring function for each X[i,j] with the product of the inside and outside probabilities rather than just the inside probability.
• This technique is called max-rule parsing and can produce trees that are not valid PCFG parse trees.
• For input sequence x, the output parse tree y is defined by the sequence of steps in the derivation.
• The probability of each derivation can be written as a product over its individual steps:
• The conditioning context in the probability is called the history and corresponds to a partially built parse tree.
• We group the histories into equivalence classes using a function Φ.
• Using Φ, each history Hi = d1, d2, ..., di-1 for all x, y is mapped to some fixed finite set of feature functions of the history: φ1(Hi), ..., φk(Hi).
• In terms of these k feature functions:
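• The two formulas referred to above can be reconstructed in their standard history-based form (a sketch of what is most likely intended):

P(y \mid x) = \prod_{i=1}^{n} P(d_i \mid d_1, \ldots, d_{i-1}) = \prod_{i=1}^{n} P(d_i \mid H_i)

P(d_i \mid H_i) \approx P(d_i \mid \phi_1(H_i), \ldots, \phi_k(H_i))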
• The definition of PCFGs means that various other rule probabilities must be adjusted to obtain the right scoring for parses.
• The independence assumptions in a PCFG, which are dictated by the underlying CFG, can lead to bad models.
• Such ambiguities can be modeled using arbitrary features of the parse tree.
• A conditional random field defines the conditional probability as a linear score for each candidate y and a global normalization term:
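• In its standard form (a sketch, with λj the weight of feature fj and GEN(x) the set of candidate parses for x):

P(y \mid x) = \frac{\exp\left(\sum_{j} \lambda_j f_j(x, y)\right)}{Z(x)}, \qquad Z(x) = \sum_{y' \in GEN(x)} \exp\left(\sum_{j} \lambda_j f_j(x, y')\right)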
• Experimental results in parsing have shown that the simpler global linear model provides the same accuracy compared to the normalized models.
• A perceptron was originally introduced as a single-layered neural network
• During training the perceptron adjusts a weight parameter vector that can be applied to the input vector to get the corresponding output.
• The perceptron ensures that the current weight vector is able to correctly classify the current training examples.
• In the original perceptron learning algorithm:
• The incremental weight updating can suffer from overfitting.
• The algorithm is not capable of dealing with training data that is not linearly separable.
• A variant of the original perceptron learning algorithm is the voted perceptron algorithm.
• In the voted perceptron algorithm instead of storing and updating parameter values inside one weight vector, its learning process keeps track of all intermediate
weight vectors and these intermediate vectors are used in the classification phase to vote for the answer.
• The voted perceptron keeps a count ci to record the number of times a particular weight vector wi survives during training, giving the pair (wi, ci).
• For a training example, if the selected top candidate is different from the truth, a new count ci+1, initialized to 1, is started.
• An updated weight vector wi+1 is produced, forming the new pair (wi+1, ci+1), and the original pair (wi, ci) is stored.
• The averaged perceptron algorithm is an approximation to the voted perceptron.
• It maintains the stability of the voted perceptron but reduces the space and time complexity.
• Instead of using w the averaged weight parameter vector γ over the m training examples is used for future predictions on unseen data.
• In calculating γ an accumulating parameter is used and updated using w for each example.
• After the last iteration, dividing the accumulated parameter by mT produces the final parameter vector γ.
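• A sketch of the averaged perceptron update loop for a generic structured predictor; features(x, y) and candidates(x) are assumed to be supplied by the parser (they are hypothetical placeholders here), and gamma is the averaged weight vector described above.

import numpy as np

def averaged_perceptron(data, features, candidates, dim, T=5):
    w = np.zeros(dim)                # current weight vector
    total = np.zeros(dim)            # accumulating parameter used to compute gamma
    m = len(data)
    for _ in range(T):
        for x, y_true in data:
            # Pick the top-scoring candidate under the current weights.
            y_hat = max(candidates(x), key=lambda y: w @ features(x, y))
            if y_hat != y_true:      # mistake: move the weights toward the truth
                w += features(x, y_true) - features(x, y_hat)
            total += w               # accumulate after every example
    return total / (m * T)           # gamma, the averaged weight vector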
2.6 Multilingual Issues: What is a Token
• Tokenization, Case and Encoding
• Word Segmentation
• Morphology
• The definition of a word token is well defined given a treebank or parser but variable across different treebanks or parsers.
• For example the possessive marker(Rama’s) and copula verbs(There’s) are treated differently in different treebanks.
• In some languages there are issues with upper case and lower case.
• For example, without its upper-case first letter the word Boeing could be mistaken for a verb form (like singing) rather than a name.
• Low-count tokens can be replaced with patterns that retain case information; for example, Patagonia can be replaced by Xxx.
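• A small sketch of such a case-shape replacement (the exact pattern scheme varies from system to system):

import re

def case_pattern(token):
    # Map characters to X / x / 9 and collapse long runs, e.g. "Patagonia" -> "Xxx".
    shape = "".join("X" if c.isupper() else "x" if c.islower()
                    else "9" if c.isdigit() else c for c in token)
    return re.sub(r"(.)\1{2,}", r"\1\1", shape)   # collapse runs of 3+ identical characters to 2

print(case_pattern("Patagonia"))   # Xxx
print(case_pattern("B737-800"))    # X99-99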
• For languages that are not encoded in ASCII, the different encodings need to be managed.
• There are also issues with the sentence terminator(.), which in some corpora will be in ASCII but in others it will be in UTF-8(Unicode Transformation Format).
• Some languages like Chinese are encoded in different formats depending on the place where the text originates.
2.6.2 Word Segmentation
• The written form of many languages like Chinese lack marks identifying words.
• Word segmentation is the process of demarcating blocks in a character sequence such that the produced output is composed of separate tokens and is meaningful.
• Only if we are able to identify each word in the sentence can POS tags be assigned and syntax analysis be done.
• Chinese word segmentation has a large community of researchers and has resulted in three different types.
• The non-terminals in the tree that span a group of characters can be said to specify the word boundaries.
• Feeding only the single best word segmentation to the parser creates a pipeline in which the parser is unable to choose between different plausible segmentations.
• A CFG parser can instead parse an input word lattice (represented by a finite-state automaton).
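• To make the segmentation task concrete, here is a minimal greedy longest-match segmenter; this is a much simpler baseline than the lattice-based approach above, and the vocabulary is illustrative.

def segment(text, vocab, max_len=10):
    words, i = [], 0
    while i < len(text):
        # Take the longest dictionary word starting at position i (single character as fallback).
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in vocab or length == 1:
                words.append(text[i:i + length])
                i += length
                break
    return words

vocab = {"natural", "language", "processing"}
print(segment("naturallanguageprocessing", vocab))   # ['natural', 'language', 'processing']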
2.6.3 Morphology
• In many languages the notion of splitting up tokens using spaces is problematic.
• Each word can contain several components called morphemes such that the meaning of the word can be thought of as the combination of the meanings of the
morphemes.
• A word is a representation of a stem combined with several morphemes.
• An example from Turkish treebank shows that the syntactic dependencies need to be aware of the morphemes within words.
• In the above example morpheme boundaries within a word are shown using the + symbol.
• Morphemes and not words are used as heads and dependents.
• Agglutinative languages like Turkish have the property of entire clauses being combined with morphemes to create complex words.
• In languages like Czech and Russian different morphemes are used to mark grammatical case, gender and so on.
• In such languages each inflected word can be ambiguously segmented into morphemes with different analyses.
• Now the parser has to deal with morphological ambiguity also.
• In such languages each word is tagged with a POS tag that encodes a lot of information about its morphemes.
• The POS tagger has to produce this complex tag, which can be done with proper training.
• This tag set can be used as features for a statistical parser for a highly inflected language.
• In discriminative models for statistical parsing we can include the morphological information because discriminative models allow the inclusion of a large
number of overlapping features.
• The morphological information associated with the words can be used to build better statistical parsers by simply throwing them into the mix.