NLP Unit-II

Unit-II Syntax Analysis

2.1 Parsing Natural Language


2.2 Treebanks: A Data-Driven Approach to Syntax,
2.3 Representation of Syntactic Structure,
2.4 Parsing Algorithms,
2.5 Models for Ambiguity Resolution in Parsing,

2.6 Multilingual Issues


 Parsing in NLP is the process of determining the syntactic structure of a text by analysing its constituent words according to an underlying grammar.
Example Grammar:

 The outcome of the parsing process is a parse tree, in which sentence is the root; intermediate nodes such as noun_phrase and verb_phrase have children and are therefore called non-terminals; and the leaves of the tree, ‘Tom’, ‘ate’, ‘an’, ‘apple’, are called terminals.
Parse Tree:
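 As an illustration, the short Python sketch below parses “Tom ate an apple” with NLTK; the small grammar is an assumption standing in for the example grammar above, not the exact figure.

import nltk

# a small grammar assumed for illustration; rule names mirror the description above
grammar = nltk.CFG.fromstring("""
    sentence -> noun_phrase verb_phrase
    noun_phrase -> 'Tom' | determiner noun
    verb_phrase -> verb noun_phrase
    determiner -> 'an'
    noun -> 'apple'
    verb -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Tom ate an apple".split()):
    # sentence is the root, noun_phrase/verb_phrase etc. are non-terminals,
    # and the leaves 'Tom', 'ate', 'an', 'apple' are the terminals
    tree.pretty_print()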
 A treebank can be defined as a linguistically annotated corpus that includes some kind of syntactic analysis over and above part-of-speech tagging.

 A sentence is parsed by relating each word to other words in the sentence which depend on it.

 The syntactic parsing of a sentence consists of finding the correct syntactic structure of that sentence in the given formalism/grammar.

 Dependency grammar (DG) and phrase structure grammar(PSG) are two such formalisms.

 PSG breaks a sentence into constituents (phrases), which are then broken into smaller constituents.

 PSG describes phrase and clause structure, for example NP, PP, VP, etc.

 DG: syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies.

 Interested in grammatical relations between individual words.

 DG does not propose a recursive structure but rather a network of relations between words.

 These relations can also have labels.


Constituency tree vs Dependency tree
 Dependency structures explicitly represent
- Head-dependent relations (directed arcs)
- Functional categories (arc labels)
- Possibly some structural categories (POS)
 Phrase structure trees explicitly represent
- Phrases (non-terminal nodes)
- Structural categories (non-terminal labels)
- Possibly some functional categories (grammatical functions)

Defining candidate dependency trees for an input sentence

Learning: scoring possible dependency graphs for a given sentence, usually by factoring the graphs into their component arcs

Parsing: searching for the highest scoring graph for a given sentence
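• As a minimal sketch of the arc-factored idea, the score of a candidate dependency tree can be computed as the sum of the scores of its arcs; the scoring function and weights below are illustrative assumptions standing in for a learned model.

def score_arc(head, dependent, label):
    # stand-in for a learned arc scoring function (illustrative weights)
    weights = {("ROOT", "saw", "PRED"): 2.0,
               ("saw", "John", "SBJ"): 4.0,
               ("saw", "Mary", "OBJ"): 3.5}
    return weights.get((head, dependent, label), 0.0)

def score_tree(arcs):
    # arcs: list of (head, dependent, label) triples forming a candidate tree
    return sum(score_arc(h, d, l) for h, d, l in arcs)

candidate = [("ROOT", "saw", "PRED"), ("saw", "John", "SBJ"), ("saw", "Mary", "OBJ")]
print(score_tree(candidate))   # 9.5; parsing searches for the tree that maximizes this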
Syntax-Introduction
• Parsing uncovers the hidden structure of linguistic input.

• In natural language applications the predicate-argument structure of sentences can be useful.


• In NLP the syntactic analysis of the input can vary from:
• Very low level- POS tagging.
• Very high level- recovering a structural analysis that identifies the dependency between predicates and arguments in the sentence.

• The major problem in parsing natural language is the problem of ambiguity.

• From tagging to full parsing, algorithms that can handle such ambiguity have to be carefully chosen.

• Here we explore the syntactic analysis methods from tagging to full parsing and the use of supervised machine learning to deal with ambiguity.
2.1 Parsing Natural Languages

• In a text-to-speech application, input sentences are to be converted to spoken output that should sound as if it were spoken by a native speaker of the language.

• Let us look at the following spoken sentences:


• He wanted to go for a drive-in movie.
• He wanted to go for a drive in the country.
• There is a natural pause between drive and in in the second sentence.
• This gap reflects an underlying hidden structure to the sentence.

• Parsing provides a structural description that identifies such a break in the intonation.

• Let us look at another sentence:


• The cat who lives dangerously had nine lives.

• A text-to-speech system needs to know that the first instance of the word lives is a verb and the second instance a noun.
• This is an instance of POS tagging problem.

• Another application in which parsing plays an important role is text summarization.

• Let us look at examples for summarization:


• Beyond the basic level, the operations of the three products vary widely.
• The above sentence can be summarized as follows:
• The operations of the products vary.
• To do this task we first parse the first sentence.
• Deleting the circled constituents PP, CD and ADVP in the previous diagram results in the short sentence.
• Let us look at another example:
• Open borders imply increasing racial fragmentation in EUROPEAN COUNTRIES.
• In the above example the capitalized phrase can be replaced with other phrases without changing the meaning of the sentence.
• Open borders imply increasing racial fragmentation in the countries of Europe.
• Open borders imply increasing racial fragmentation in European states.
• Open borders imply increasing racial fragmentation in Europe.
• Open borders imply increasing racial fragmentation in European Nations.
• Open borders imply increasing racial fragmentation in the European countries.

• An elegant way to approach this task is to first parse the sentence to find the various constituents, recursively partitioning the words of the sentence into phrases such as verb phrases and noun phrases.
 The output of the parser for the input sentence “Tom ate an apple” is shown in Fig.

• In NLP syntactic parsing is used in many applications like:


• Statistical Machine Translation
• Information extraction from text collections
• Language summarization
• Producing entity grids for language generation
• Error correction in text
• Knowledge acquisition from language
2.2 Treebanks: A Data-Driven Approach to Syntax

• Parsing recovers information that is not explicit in the input sentence.

• A parser requires some knowledge beyond the input sentence itself in order to produce the syntactic analysis as output.
• We can write down the rules of the syntax of a sentence as a CFG.

• Here we have a CFG which represents a simple grammar of transitive verbs in English (verbs that take a subject and an object noun phrase (NP)), plus modifiers of verb phrases (VP) in the form of prepositional phrases (PP).

• The above CFG can produce the syntax analysis of a sentence like: John bought a shirt with pockets
Treebanks: A Data-Driven Approach to Syntax

• Parsing the previous sentence gives us two possible derivations.
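• The two derivations can be reproduced with the hedged sketch below; the grammar is an assumed reconstruction of the transitive-verb CFG described above, not the exact grammar in the figure.

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | D N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    N -> 'John' | 'shirt' | 'pockets'
    D -> 'a'
    V -> 'bought'
    P -> 'with'
""")

parser = nltk.ChartParser(grammar)
trees = list(parser.parse("John bought a shirt with pockets".split()))
print(len(trees))   # 2: the PP attaches either to the noun phrase or to the verb phrase
for tree in trees:
    print(tree)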


• Writing a CFG for the syntactic analysis of natural language is problematic.

• A simple list of rules does not consider interactions between different components in the grammar.

• Listing all possible syntactic constructions in a language is a difficult task.

• It is difficult to exhaustively list lexical properties of words. This is a typical knowledge acquisition problem.

• One more problem is that the rules interact with each other in combinatorially explosive ways.

• Let us look at an example of noun phrases as a binary branching tree.


• N -> N N (recursive rule)
• N -> 'natural' | 'language' | 'processing' | 'book'

• For the input ‘natural language processing’ the recursive rules produce two ambiguous parses.

• For CFGs it can be proved that the number of parses obtained by using the recursive rule n times is the Catalan number of n: 1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, …

• For the input “natural language processing book” only one of the five parses obtained using the above CFG is correct:
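• The growth in ambiguity can be checked with the short sketch below; the tiny grammar simply mirrors the recursive rule N -> N N given above, and the Catalan number is computed from its closed form C(n) = (2n choose n) / (n + 1).

import math
import nltk

grammar = nltk.CFG.fromstring("""
    N -> N N
    N -> 'natural' | 'language' | 'processing' | 'book'
""")
parser = nltk.ChartParser(grammar)

def catalan(n):
    return math.comb(2 * n, n) // (n + 1)

for phrase in ["natural language processing", "natural language processing book"]:
    words = phrase.split()
    parses = list(parser.parse(words))
    print(len(words), "words:", len(parses), "parses; Catalan number:", catalan(len(words) - 1))
# 3 words -> 2 parses, 4 words -> 5 parses (only one of which is the correct analysis)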
Treebanks: A Data-Driven Approach to Syntax

• This is the second knowledge acquisition problem- We need to know not only the rules but also which analysis is most plausible for a given input sentence.

• The construction of a treebank is a data-driven approach to syntax analysis that allows us to address both knowledge acquisition bottlenecks in one stroke.

• A treebank is a collection of sentences where each sentence is provided a complete syntax analysis.

• The syntax analysis for each sentence has been judged by a human expert.

• A set of annotation guidelines is written before the annotation process to ensure a consistent scheme of annotation throughout the tree bank.
• No set of syntactic rules is provided by a treebank.
• No exhaustive set of rules is assumed to exist, even though assumptions about syntax are implicit in a treebank.
• The consistency of syntax analysis in a treebank is measured using inter-annotator agreement, by having approximately 10% of the material annotated by more than one annotator.
• Treebanks provide annotations of syntactic structure for a large sample of sentences.

• A supervised machine learning method can be used to train the parser.

• Treebanks solve the first knowledge acquisition problem of finding the grammar underlying the syntax analysis because the analysis is directly given instead of a
grammar.

• The second problem of knowledge acquisition is also solved by treebanks.

• Each sentence in a treebank has been given its most plausible syntactic analysis.

• Supervised learning algorithms can be used to learn a scoring function over all possible syntax analyses.
• For unseen data the parser uses the scoring function to return the syntax analysis that has the highest score.
• Two main approaches to syntax analysis that are used to construct treebanks are:
• Dependency graphs
• Phrase structure trees
• These two representations are very closely connected to each other and under some assumptions one can be converted to the other.
• Dependency analysis is used for free word order languages like Indian languages.
• Phrase structure analysis is used to provide additional information about long distance dependencies for languages like English and French.
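• As a quick illustration of the data-driven setting, the sketch below looks at NLTK's bundled sample of the Penn Treebank (it assumes the corpus has been fetched with nltk.download('treebank')): every sentence comes with a complete, human-judged phrase structure analysis, and no grammar has to be written by hand.

import nltk
from nltk.corpus import treebank   # requires: nltk.download('treebank')

tree = treebank.parsed_sents()[0]  # first annotated sentence in the sample
print(" ".join(tree.leaves()))     # the raw sentence
tree.pretty_print()                # its gold-standard phrase structure analysis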

 NLP is the capability of computer software to understand natural language.

 There are a variety of languages in the world.

 Each language has its own structure (SVO or SOV), called its grammar; a grammar has a certain set of rules that determine what is allowed and what is not.

English: S V O
I eat mango

Other languages: S O V (or) O S V

 Grammar is defined as the rules for forming well-structured sentences.


 Different Types of Grammar in NLP

1. Context-Free Grammar (CFG)


2. Constituency Grammar (CG) or Phrase structure grammar
3. Dependency Grammar (DG)

Context-Free Grammar (CFG)

 Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P)


 N or VN = set of non-terminal symbols, or variables.

S = Start symbol where S ∈ N


 T or ∑ = set of terminal symbols.

 P = set of production rules for terminals as well as non-terminals. Each rule has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.

• Example: John hit the ball

S -> NP VP
VP -> V NP
V -> hit
NP -> D N | N
D -> the
N -> John | ball
Treebanks: A Data-Driven Approach to Syntax

• In the discussion to follow we examine three main components for building a parser:

• The representation of syntactic structure- it involves the use of a varying amount of linguistic knowledge to build a treebank.

• The training and decoding algorithms- they deal with the potentially exponential search space.

• Methods to model ambiguity- they provide a way to rank parses and recover the most likely parse.
2.3 Representation of Syntactic Structure

• Syntax Analysis using dependency graphs


• Syntax Analysis using phrase structure trees

2.3.1 Syntax Analysis using Dependency Graphs

• In dependency graphs the head of a phrase is connected with the dependents in that phrase using directed connections.

• The head-dependent relationship can be semantic (head-modifier) or syntactic (head-specifier).

• The main difference between dependency graphs and phrase structure trees is that dependency analysis makes minimal assumptions about syntactic structure.

• Dependency graphs treat the words in the input sentence as the only vertices in the graph which are linked together by directed arcs representing syntactic dependencies.

• One typical definition of dependency graph is as follows:


• In dependency syntactic parsing the task is to derive a syntactic structure for an input sentence by identifying the syntactic head of each word in the sentence.
• The nodes are the words of the input sentence and the arcs are the binary relations from head to dependent.
• It is often assumed that all words except one have a syntactic head.
• It means that the graph will be a tree with the single independent node as the root.
• In labeled dependency parsing the parser assigns a specific type to each dependency relation holding between head word and dependent word.

• Here we consider only dependency trees, where each word depends on exactly one parent: either another word or a dummy root symbol.
• In dependency trees the 0 index is used to indicate the root symbol and the directed arcs are drawn from head word to the dependent word.
• In the figure in the previous slide [fakulte,N3,7] is the seventh word in the sentence with POS tag N3 and it has dative case.
• Here is a textual representation of a labeled dependency tree:

• An important notion in dependency analysis is the notion of projectivity.

• A projective dependency tree is one where we put the words in a linear order based on the sentence with the root symbol in the first position.

• The dependency arcs are then drawn above the words without any crossing dependencies.
• Example:
• Let us look at an example where a sentence contains an extraposed noun phrase modifier, which requires a crossing dependency.

• English treebanks have very few cases that need such a non-projective analysis.

• In languages like Czech, Turkish and Telugu the number of non-projective dependencies is much higher.

• Dependency graphs in treebanks do not explicitly distinguish between projective and non-projective dependency tree analyses.

• Parsing algorithms are sometimes forced to distinguish between projective and non-projective dependencies.
• Let us try to setup dependency links in a CFG.
• In the CFG the terminal symbols are x0,x1,x2,x3.
• The asterisk picks out a single symbol in the right hand side of each rule that specifies the dependency link.
• We can look at the asterisk as either a separate annotation on the non-terminal or simply as a new nonterminal in the probabilistic context-free
grammar(PCFG).

• The dependency tree equivalent to the preceding CFG is as follows:

• If we can convert a dependency tree into an equivalent CFG then the dependency tree must be projective.
• In a CFG converted from a dependency tree we have only the following three types of rules:
• One type of rule to introduce the terminal symbol
• Two rules where Y is dependent on X or vice-versa.
• The head word of X or Y can be traced by following the asterisk symbol.

• Now let us look at an example non-projective dependency tree:

• When we convert this dependency tree to a CFG with * notation it can only capture the fact that X3 depends on X2 or X1 depends on X3.
• There is no CFG that can capture the non-projective dependency.
• Projective dependency can be defined as follows:
• For each word in the sentence its descendants form a contiguous substring of the sentence.
• Non-Projectivity can be defined as follows:
• A non-projective dependency means that there is a word in the sentence such that its descendants do not form a contiguous substring of the sentence.
• In a non-projective dependency there is a non-terminal Z such that Z derives the spans (xi, xk) and (xk+p, xj) for some p > 0.

• This means there must be a rule Z -> P Q where P derives (xi, xk) and Q derives (xk+p, xj).
• By the definition of a CFG this is only valid if p = 0, because P and Q must derive contiguous substrings.
• Hence non-projective dependency cannot be converted to a CFG.
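• The contiguity definition above translates directly into a small check; the sketch below is a minimal illustration, assuming the tree is given as an array of head indices with 0 as the artificial root.

def is_projective(heads):
    # heads[d] = index of the head of word d; words are numbered 1..n,
    # index 0 is the artificial root and heads[0] is unused (None)
    n = len(heads) - 1
    children = [[] for _ in range(n + 1)]
    for d in range(1, n + 1):
        children[heads[d]].append(d)

    def descendants(w):
        out = {w}
        for c in children[w]:
            out |= descendants(c)
        return out

    # projective: every word's descendants form a contiguous substring
    for w in range(1, n + 1):
        span = descendants(w)
        if max(span) - min(span) + 1 != len(span):
            return False
    return True

print(is_projective([None, 2, 0, 2]))     # "John saw Mary", all arcs nested -> True
print(is_projective([None, 0, 4, 1, 1]))  # arc from word 4 to word 2 crosses word 3 -> False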
2.3.2 Syntax Analysis using Phrase Structure Trees

• A phrase structure syntax analysis can be viewed as implicitly having a predicate argument structure associated with it.
• Now let us look at an example sentence:
• Mr. Baker seems especially sensitive
• The same sentence gets the following dependency tree analysis.

• We can find some similarity between Phrase Structure Trees and Dependency Trees.

• We now look at some examples of Phrase Structure Analysis in treebanks and see how null elements are used to localize certain predicate-argument dependencies in
the tree structure.

• In this example we can see that an NP dominates a trace *T* which is a null element (same as ϵ).

• The empty trace has an index and is associated with the WHNP (WH-noun phrase) constituent with the same index.

• This co-indexing allows us to infer the predicate-argument structure shown in the tree.
• In this example “the ball” is actually not the logical subject of the predicate.
• It has been displaced due to the passive construction.

• “Chris” the actual subject is marked as LGS (Logical subjects in passives) enabling the recovery of predicate argument structure for the sentence.

• In this example we can see that different syntactic phenomena are combined in the corpus.

• Both analyses are combined to provide the predicate-argument structure in such cases.
• Here we see a pair of examples to show how null elements are used to annotate the presence of a subject for a predicate even if it is not explicit in the sentence.

• In the first case the analysis marks the missing subject for “take back” as the object of the verb persuaded.

• In the second case the missing subject for “take back ” is the subject of the verb promised.
• The dependency analysis for “persuaded” and “promised” do not make such a distinction.
• The dependency analysis for the two sentences is as follows:

• Most statistical parsers trained using Phrase Structure treebanks ignore these differences.

• Here we look at one example from the Chinese treebank which uses IP instead of S.

• This is a move from transformational grammar-based phrase structure to government-binding (GB) based phrase structure.

• It is very difficult to take a CFG-based parser initially developed for English parsing and adapt it to Chinese parsing by training it on a Chinese phrase structure treebank.
2.4 Parsing Algorithms

• Introduction to Parsing Algorithms


• Shift-Reduce Parsing
• Hypergraphs and Chart Parsing
• Minimum Spanning Trees and Dependency Parsing

Introduction to Parsing Algorithms

• Given an input sentence a parser produces an output analysis of that sentence.

• The analysis will be consistent with the tree-bank used to train the parser.
• Tree-banks do not need to have an explicit grammar.
• Here for better understanding we assume the existence of a CFG.

• Let us look at an example of a CFG that is used to derive strings such as “a and b or c” from the start symbol N:

• An important concept of parsing is a derivation.


• In the derivation each line is called a sentential form.
• The method of derivation used here is called rightmost derivation.

• An interesting property of rightmost derivation is revealed if we arrange the derivation in reverse order.

• The above sequence corresponds to the construction of the above parse tree from left to right, one symbol at a time.
• There can be many parse trees for the given input string.
• Another rightmost derivation for the given input string is as follows:
2.4.1 Shift-Reduce Parsing

• To build a parser, we need an algorithm that can perform the steps in the above rightmost derivation for any grammar and for any input string.
• Every CFG turns out to have automata that are equivalent to it, called pushdown automata (just as regular expressions can be converted to finite-state automata).
• We need an algorithm for parsing that is general for any given CFG and input string.
• This algorithm is called shift-reduce parsing and it uses two data structures:
1. a buffer for input symbols and
2. a stack for storing CFG symbols.

• The shift-reduce parsing algorithm is defined as follows:


• For the example “a and b or c” the parsing steps are as follows:
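• Since the table of steps is given as a figure, here is a minimal sketch of a shift-reduce parser for this example; the grammar is the toy grammar N -> 'a' | 'b' | 'c' | N 'and' N | N 'or' N, and a greedy “reduce whenever possible” strategy stands in for the oracle a real parser would consult.

RULES = [
    (("N", "and", "N"), "N"),
    (("N", "or", "N"), "N"),
    (("a",), "N"), (("b",), "N"), (("c",), "N"),
]

def shift_reduce(tokens):
    stack, buffer = [], list(tokens)
    while buffer or len(stack) > 1:
        for rhs, lhs in RULES:              # try to reduce the top of the stack
            if tuple(stack[-len(rhs):]) == rhs:
                stack[-len(rhs):] = [lhs]
                print("reduce ->", stack)
                break
        else:                               # no reduction applies: shift
            if not buffer:
                raise ValueError("no parse")
            stack.append(buffer.pop(0))
            print("shift  ->", stack)
    return stack

shift_reduce("a and b or c".split())        # ends with the single symbol ['N']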
Shift-Reduce Parsing- for dependency parsing
2.4.2 Hyper Graphs and Chart Parsing
• Shift-reduce parsing allows a linear time parser but requires access to an oracle.

• CFGs in the worst case need backtracking and have a worst-case parsing algorithm which runs in O(n³) time, where n is the length of the input.

• Variants of this algorithm are used in statistical parsers that attempt to search the space of possible parse trees without the limitation of left-to-right parsing.

• The example CFG G is rewritten as a new CFG Gc which contains at most two non-terminals on the right-hand side of each rule.

• We can specialize the CFG Gc to a particular input string by creating a new CFG that represents all possible parse trees that are valid in grammar Gc for this particular input sentence.
• For the input “a and b or c” the new CFG Cf that represents the forest of parse trees can be constructed.
• Let the input string be broken up into spans 0 a 1 and 2 b 3 or 4 c 5.
• Here a parsing algorithm is defined as taking as input a CFG and an input string and producing a specialized CFG that represents all legal parses for the input.

• A parser has to create all the valid specialized rules from the start symbol nonterminal that spans the entire string to the leaf nodes that are the input tokens.
• Now let us look at the steps the parser has to take to construct a specialized CFG.
• Let us consider the rules that generate only lexical items:

• These rules can be constructed by checking for the existence of a rule of the type N -> x for any input token x and creating a specialized rule for token x.
• We now can create new specialized rules based on previously created rules in the following way:
• Let Y[i,k] and Z[k,j] be the left-hand sides of previously created specialized rules.
• If the CFG has a rule of the type X -> Y Z,
• We infer that there must be a specialized rule X[i,j] -> Y[i,k] Z[k,j].
• Each non-terminal span is assigned a score s, written X[i,j] : s.
• The highest-scoring entry for each non-terminal span is retained: X[i,j] = max over s of X[i,j] : s.
• The previous algorithm is called the CYK algorithm.

• It considers every span of every length, splits up that span in every possible way to see if a CFG rule can be used to derive the span.

• The algorithm takes O(n³) time for an input of size n.

• Picking the most likely trees by using supervised learning will take no more than O(n³) time.

• For a given span i,j and for each non-terminal X we only keep the highest scoring way to reach X[i,j].

• Starting from the start symbol that spans the entire string S[0,n] gives us the best parsing tree.

• In a probabilistic setting where the score can be interpreted as log probabilities, this is called a Viterbi-best parse.
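• A compact sketch of this Viterbi-style chart computation is shown below; the toy PCFG (in Chomsky normal form) and its probabilities are illustrative assumptions, not the grammar used in the text.

import math
from collections import defaultdict

lexical = {                     # A -> word : probability (assumed toy grammar)
    ("N", "a"): 0.4, ("N", "b"): 0.3, ("N", "c"): 0.3,
    ("C", "and"): 0.5, ("C", "or"): 0.5,
}
binary = {                      # A -> B C : probability
    ("N", "N", "X"): 1.0,       # N -> N X
    ("X", "C", "N"): 1.0,       # X -> C N
}

def viterbi_cky(words):
    n = len(words)
    best = defaultdict(lambda: float("-inf"))    # (i, j, A) -> best log probability
    for i, w in enumerate(words):                # spans of length 1
        for (A, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, A)] = math.log(p)
    for length in range(2, n + 1):               # longer spans, bottom up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):            # every possible split point
                for (A, B, C), p in binary.items():
                    score = best[(i, k, B)] + best[(k, j, C)] + math.log(p)
                    if score > best[(i, j, A)]:
                        best[(i, j, A)] = score
    return best[(0, n, "N")]                     # Viterbi-best log probability

print(viterbi_cky("a and b or c".split()))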

• Each cell contains the log probability of deriving the string w[i,j] starting with nonterminal X, which can be written as Pr(X =>* w[i,j]). (generating)

• The utility of a nonterminal X at a particular span i,j depends on reaching the start symbol S which is captured by the outside probability Pr(S=>* w[0,i-1]Xw[j+1,N]).
(reachable)

• There are ways to speed up the parser by throwing away less likely parts of the search space.
• The three techniques here:
1. Beam thresholding
2. Global thresholding
3. Coarse to fine parsing
• We compare the score of X[i,j] with the score of the current highest scoring entry Y[i,j] and throw away any rule starting with X[i,j] if its score is not good enough. This
technique is called beam thresholding.
• If we want to speed up the parser a little more, we eliminate rules that do not have neighboring rules to combine with. This technique is called global thresholding.
In the above figure nodes A, B and C will not be thresholded, as each is part of a sequence from the beginning to the end of the chart; nodes X, Y and Z will be thresholded, as none of them is part of such a sequence.

• Let us assume we have a very complex set of Non-terminals.

• We parse first with a coarser non-terminal to prune the finer grained non-terminal. (using a VP non-terminal instead of VP^S)

• We now use the score associated with the coarse VP[i,j] to prune the finer grained non-terminals for the same span.
• This approach is called coarse to fine parsing.
• It is useful because the outside probability of the coarse step can be used in addition to the inside probability for more effective pruning.
• The parser can be sped up even further by using A* search rather than the exhaustive search used in the previous algorithm.

• A good choice of heuristic can be very helpful to provide faster times for parsing.
• The worst case complexity remains the same as CYK algorithm.

• In the case of projective dependency parsing the CYK algorithm can be used by creating a CFG that produces dependency parses.
• For dependency parsing this algorithm takes a worst-case time of O(n⁵).

• For dependency parsing instead of using non-terminals augmented with words we can represent sets of different dependency trees for each span of the input string.

• The idea is to collect left and right dependencies of each head independently and combine them at a later stage.
• Here we have the notion of split head where the head word is split into two:
• One for the left dependents
• One for the right dependents
• In addition to the head word for each span in each item we store the following:
• Whether the head is gathering left or right dependencies.
• Whether the item is complete. (cannot be extended with more dependents)

• This provides an O(n³) worst-case dependency parsing algorithm.

• In the algorithm the spans are stored in a chart data-structure C e.g. C[i,j] refers to the dependency analysis of span i,j.

• Incomplete spans are referred to as Ci, complete spans are Cc.

• The spans that are grown towards the left are referred to as C<-, and the spans that are grown towards the right are referred to as C->.

• For C<-[i][j] the head must be j and for C->[i][j] the head must be i.
• We assume the unique root node as the leftmost token.
• The score of the best tree for the entire sentence is in Cc->[1][n].
• This algorithm can be extended to provide k-best parses with a complexity of O(n³ k log k).
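• The split-head scheme above can be written down compactly; the sketch below is an O(n³) Eisner-style score computation under the assumption that arc scores scores[h][d] are given, with position 0 acting as the artificial root (it returns only the best score, without backpointers).

def best_projective_score(scores):
    n = len(scores)                    # positions 0..n-1, position 0 = root
    NEG = float("-inf")
    # [i][j][d]: d = 0 means the head is on the right (j), d = 1 on the left (i)
    complete = [[[0.0 if i == j else NEG, 0.0 if i == j else NEG]
                 for j in range(n)] for i in range(n)]
    incomplete = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # incomplete items: attach a new dependent with a single arc
            best = max(complete[i][r][1] + complete[r + 1][j][0] for r in range(i, j))
            incomplete[i][j][0] = best + scores[j][i]   # arc j -> i
            incomplete[i][j][1] = best + scores[i][j]   # arc i -> j
            # complete items: absorb an already finished dependent
            complete[i][j][0] = max(complete[i][r][0] + incomplete[r][j][0]
                                    for r in range(i, j))
            complete[i][j][1] = max(incomplete[i][r][1] + complete[r][j][1]
                                    for r in range(i + 1, j + 1))
    return complete[0][n - 1][1]       # the root gathers its right dependents

# toy arc scores for "<root> John saw Mary" (illustrative numbers)
toy = [[0, 1, 10, 1],
       [0, 0, 20, 3],
       [0, 30, 0, 30],
       [0, 3, 11, 0]]
print(best_projective_score(toy))      # 70.0: saw heads John and Mary, root heads saw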

2.4.3 Minimum Spanning Trees and Dependency Parsing

• Finding the optimum branching in a directed graph is closely related to the problem of finding a minimum spanning tree in an undirected graph.

• A prerequisite is that each potential dependency link between words should have a score.

• In NLP minimum spanning tree (MST) is used to refer to the optimum branching problem in directed graphs.
• The scores we have can be used to find the MST, which is the highest scoring dependency tree.

• Nonprojective dependencies can be recovered because the linear order of the words in the input is not considered.

• Let us look at an example sentence “John saw Mary” with precomputed scores.
• The first step is to find highest scoring incoming edges.

• If the step results in a tree we report this as the parse because it is the MST.
• In our example the highest scoring edges result in a cycle.

• We contract the cycle into a single node and recalculate the edge weights.
• The resultant graph is as follows:
• We now run the MST algorithm on the previous graph.
• The resultant graph is as follows:

• The MST that is the highest score dependency parse tree is as follows:

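• The first two steps above (choosing the best incoming arc for each word and detecting the resulting cycle) can be sketched as follows; the arc scores for “John saw Mary” below are illustrative numbers, not the precomputed scores of the figure.

scores = {                        # (head, dependent) -> score (illustrative)
    ("root", "saw"): 10, ("root", "John"): 9, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30, ("John", "saw"): 20,
    ("Mary", "saw"): 0,  ("John", "Mary"): 3, ("Mary", "John"): 3,
}
words = ["John", "saw", "Mary"]

# step 1: greedily pick the highest-scoring incoming arc for every word
best_head = {w: max((h for h in ["root"] + words if h != w),
                    key=lambda h: scores.get((h, w), float("-inf")))
             for w in words}
print(best_head)                  # {'John': 'saw', 'saw': 'John', 'Mary': 'saw'}

# step 2: check whether the chosen arcs already form a tree
def has_cycle(heads):
    for w in heads:
        seen, h = {w}, heads[w]
        while h != "root":
            if h in seen:         # returned to a word we already visited
                return True
            seen.add(h)
            h = heads[h]
    return False

print(has_cycle(best_head))       # True: the cycle John <-> saw must be contracted
                                  # into a single node and the procedure repeated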
2.5 Models for Ambiguity Resolution in Parsing
• Here we discuss the design features and the ways to resolve ambiguity in parsing.
• The different issues discussed here are as follows:
• Probabilistic Context-Free Grammars
• Generative models for parsing
• Discriminative Models for parsing

2.5.1 Probabilistic Context-Free Grammars

• Let us look at the example we already used to describe ambiguity. The sentence “John bought a shirt with pockets”
• We want to provide a grammar so that the second tree is preferred over the first.
• The original grammar for the parse trees is:

• We can add scores or probabilities to the rules in this CFG in order to provide a score or probability for each derivation.

• The probability of a derivation is the sum of scores or product of probabilities of all the CFG rules used in the derivation.

• The scores are viewed as log probabilities and we use the term Probabilistic Context-Free Grammar (PCFG).
• We assign probabilities to the CFG rules such that for a rule N → α the probability is P(N → α | N).

• Each rule probability is conditioned on the left-hand side of the rule.
• The probability mass is distributed among all the expansions of each non-terminal.
• The rule probabilities can be derived from a treebank.
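• A minimal sketch of this estimation with NLTK is shown below (it again assumes the Penn Treebank sample has been downloaded); each rule probability is the count of the rule divided by the count of its left-hand side, which is what nltk.induce_pcfg computes.

import nltk
from nltk.corpus import treebank   # requires: nltk.download('treebank')

productions = []
for tree in treebank.parsed_sents()[:200]:      # first 200 annotated sentences
    productions.extend(tree.productions())      # CFG rules read off the trees

pcfg = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)
for rule in pcfg.productions(lhs=nltk.Nonterminal("NP"))[:5]:
    print(rule)                                 # e.g. NP -> DT NN [0.1...]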
• Consider a treebank with three trees t1,t2 and t3.

• Let t1 occur 10 times in the treebank, t2 20 times and t3 50 times.


• The PCFG we obtain from the treebank would be:
• For input a a a a there are two parses using the above PCFG

• The probabilities are: p1 = 0.125 × 0.334 × 0.285 ≈ 0.0119 and p2 = 0.25 × 0.667 × 0.714 ≈ 0.119.


• The most likely parse tree does not even occur in the treebank.
• The reason is the context-free nature of the PCFG, where a non-terminal can be expanded by any rule with that non-terminal on the left-hand side.
• To solve this problem the approach in statistical parsers is to augment the node labels in order to avoid bad independence assumptions.
• In the example already discussed we used three methods to augment the node labels:
• Annotate each non-terminal with the label of its parent.
• Learn the non-terminal splits by using an unsupervised learning algorithm
• Lexicalize the non-terminals
• For a lexicalized non-terminal PCFG, the unfolding history is created head outward:
• First the head is predicted, then the left siblings are produced and then the right siblings are produced.
• An alternative is to find the best tree not with the highest score but with the highest expected number of correct constituents.

• This requires appropriate changes, such as replacing the scoring function for each X[i,j] with the product of the inside and outside probabilities rather than just the inside probability.

• This technique is called max-rule parsing and can produce trees that are not valid PCFG parse trees.

2.5.2 Generative Models for Parsing


• Let each derivation be D = d1, d2, …, dn, a sequence of decisions used to build the parse tree.

• For input sequence x, the output parse tree y is defined by the sequence of steps in the derivation.
• The probability of each derivation can be written as the product of the probabilities of its decisions: P(D) = ∏i P(di | d1, …, di−1).

• The conditioning context in the probability is called the history and corresponds to a partially built parse tree.
• We group the histories into equivalence classes using a function ɸ.

• Using ɸ, each history Hi = d1, d2, …, di−1 is mapped to some fixed finite set of feature functions of the history, ϕ1(Hi), …, ϕk(Hi).
• In terms of these k feature functions, P(di | Hi) is approximated as P(di | ϕ1(Hi), …, ϕk(Hi)).

• The definition of PCFGs means that various other rule probabilities must be adjusted to obtain the right scoring parsers.

• The independence assumptions in a PCFG, which are dictated by the underlying CFG, lead to bad models.

• Such ambiguities can be modeled using arbitrary features of the parse tree.

• Discriminative methods provide such class of models.

2.5.3 Discriminative models for Parsing


• A framework that describes various discriminative approaches to learning for parsing is called a global linear model.
• Let X be the set of inputs and Y the set of possible outputs, which can be sequences of POS tags, parse trees or dependency analyses.
• Each pair x ∈ X, y ∈ Y is mapped to a d-dimensional feature vector ɸ(x,y), with each dimension being a real number, summarizing partial information contained in (x,y).
• A weight parameter vector w ∈ Rd assigns a weight to each feature in ɸ(x,y), representing the importance of that feature.
• The value of ɸ(x,y)·w is the score of (x,y).
• The higher the score, the more plausible it is that y is the output for x.
• The function GEN(x) generates the set of possible outputs y for a given x.
• Having ɸ(x,y), w and GEN(x) specified we would like to choose the highest scoring candidate y* from GEN(x) as the most plausible output. That is,

F(x) = argmax over y ∈ GEN(x) of ɸ(x,y)·w, where F(x) returns the highest-scoring output y* from GEN(x).
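• A tiny sketch of F(x) is given below; GEN, ɸ and w are toy stand-ins (assumptions) whose only purpose is to show the shapes of the objects involved.

import numpy as np

w = np.array([1.0, -0.5, 2.0])                   # weight parameter vector

def GEN(x):
    return ["parse_A", "parse_B"]                # candidate outputs for x (toy)

def phi(x, y):
    # d-dimensional feature vector summarizing partial information in (x, y)
    return np.array([len(x), 1.0 if y == "parse_A" else 0.0, float(y.count("B"))])

def F(x):
    # highest-scoring candidate: argmax over GEN(x) of phi(x, y) . w
    return max(GEN(x), key=lambda y: float(np.dot(phi(x, y), w)))

print(F("John saw Mary"))                        # 'parse_B' for these toy numbers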

• A conditional random field defines the conditional probability using a linear score for each candidate y and a global normalization term: p(y | x) = exp(ɸ(x,y)·w) / Σ over y′ ∈ GEN(x) of exp(ɸ(x,y′)·w).

• A simpler global linear model ignores the normalization term and uses the linear score ɸ(x,y)·w directly.

• Experimental results in parsing have shown that the simpler global linear model provides the same accuracy as the normalized models.
• A perceptron was originally introduced as a single-layered neural network

• During training the perceptron adjusts a weight parameter vector that can be applied to the input vector to get the corresponding output.

• The perceptron ensures that the current weight vector is able to correctly classify the current training examples.
• In the original perceptron learning algorithm:
• The incremental weight updating suffers from overfitting
• The algorithm is not capable of dealing with training data that is linearly inseparable
• A variant of the original perceptron learning algorithm is the voted perceptron algorithm.

• In the voted perceptron algorithm instead of storing and updating parameter values inside one weight vector, its learning process keeps track of all intermediate
weight vectors and these intermediate vectors are used in the classification phase to vote for the answer.

• The voted perceptron keeps a count ci to record the number of times a particular weight parameter vector wi survives during training, storing the pair (wi, ci).

• For a training example, if the selected top candidate is different from the truth, a new count ci+1, initialized to 1, is started.

• An updated weight vector wi+1 (with its count ci+1) is produced and the original weight vector and count (wi, ci) are stored.
• The averaged perceptron algorithm is an approximation to the voted perceptron.

• It maintains the stability of the voted perceptron but reduces the space and time complexity.

• Instead of using w the averaged weight parameter vector γ over the m training examples is used for future predictions on unseen data.

• In calculating γ an accumulating parameter vector is used and updated using w after each example.
• After the last iteration, dividing the accumulated vector by mT produces the final parameter vector γ.
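• The sketch below shows the averaged perceptron idea on a generic binary classification task with feature vectors; the data and feature dimensionality are illustrative assumptions, and a parser would apply the same averaging over its structured updates.

import numpy as np

def averaged_perceptron(X, y, epochs=5):
    # X: (m, d) feature matrix, y: labels in {-1, +1}
    m, d = X.shape
    w = np.zeros(d)                    # current weight parameter vector
    acc = np.zeros(d)                  # accumulating parameter vector
    steps = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:    # mistake: standard perceptron update
                w = w + yi * xi
            acc += w                       # accumulate after every example
            steps += 1
    return acc / steps                     # averaged weight vector gamma

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
gamma = averaged_perceptron(X, y)
print(np.sign(X @ gamma))                  # [ 1.  1. -1. -1.]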
2.6 Multilingual Issues: What is a Token
• Tokenization, Case and Encoding
• Word Segmentation
• Morphology

2.6.1 Tokenization, Case and Encoding

• The definition of a word token is well defined given a treebank or parser but variable across different treebanks or parsers.

• For example the possessive marker(Rama’s) and copula verbs(There’s) are treated differently in different treebanks.
• In some languages there are issues with upper case and lower case.

• For example the word Boeing, written without its initial capital, might look like an -ing verb form such as singing.

• Low-count tokens can be replaced with patterns that retain case information. For example, Patagonia can be replaced by Xxx.

• For languages that are not encoded in ASCII, the different encodings need to be managed.

• There are also issues with the sentence terminator(.), which in some corpora will be in ASCII but in others it will be in UTF-8(Unicode Transformation Format).

• Some languages like Chinese are encoded in different formats depending on the place where the text originates.
2.6.2 Word Segmentation
• The written form of many languages, like Chinese, lacks marks identifying word boundaries.

• Word segmentation is the process of demarcating blocks in a character sequence such that the produced output is composed of separate tokens and is meaningful.

• Only if we are able to identify each word in the sentence can POS tags be assigned and syntax analysis be done.
• Chinese word segmentation has a large community of researchers and has resulted in three different types of approaches.

• One approach to Chinese parsing is to parse the character sequence directly.

• The non-terminals in the tree that span a group of characters can be said to specify the word boundaries.

• Taking only the single best word segmentation creates a pipeline in which the parser is unable to choose between different plausible segmentations.

• A parser for CFGs can instead parse an input word lattice (represented as a finite-state automaton).

2.6.3 Morphology
• In many languages the notion of splitting up tokens using spaces is problematic.

• Each word can contain several components called morphemes such that the meaning of the word can be thought of as the combination of the meanings of the
morphemes.
• A word is a representation of a stem combined with several morphemes.

• An example from Turkish treebank shows that the syntactic dependencies need to be aware of the morphemes within words.
• In the above example morpheme boundaries within a word are shown using the + symbol.
• Morphemes and not words are used as heads and dependents.
• Agglutinative languages like Turkish have the property of entire clauses being combined with morphemes to create complex words.

• In languages like Czech and Russian different morphemes are used to mark grammatical case, gender and so on.

• In such languages each inflected word can be ambiguously segmented into morphemes with different analyses.
• Now the parser has to deal with morphological ambiguity also.

• The disambiguation problem of morphology can be reduced to a POS tagging task.


• Each word can have morphemes that inflect the word along 10 different dimensions.
• For example, a POS tag of V--M-3---- can indicate that the stem is a verb with masculine gender and in the third person.

• The POS tagger has to produce this complex tag which can be done by proper training.
• In such languages each word is tagged with a POS tag that encodes a lot of information about the morphemes.

• This tag set can be used as features for a statistical parser for a highly inflected language.
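• As a small illustration, the sketch below decodes such a positional tag into named fields; the 10 field names and their order are a hypothetical scheme chosen for the example, not the actual treebank definition.

# hypothetical 10-position scheme, one field per character of the tag
FIELDS = ["pos", "subpos", "number", "gender", "case",
          "person", "tense", "voice", "negation", "degree"]

def decode(tag):
    # map each position of the tag to its field; '-' means unspecified
    return {field: value for field, value in zip(FIELDS, tag) if value != "-"}

print(decode("V--M-3----"))   # {'pos': 'V', 'gender': 'M', 'person': '3'}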

• In discriminative models for statistical parsing we can include the morphological information because discriminative models allow the inclusion of a large
number of overlapping features.

• The morphological information associated with the words can be used to build better statistical parsers by simply throwing them into the mix.
