Natural Language Processing
Imagine a gigantic corpus consisting of everything that has been either uttered or written in English over, say, the last 50 years. Would we be justified in calling this corpus "the language of modern English"? Arguably not: "modern English" is not equivalent to the very big set of word sequences in our imaginary corpus. Speakers of English can make judgments about these sequences, and will reject some of them as ungrammatical.
It is easy to compose a new sentence and have speakers agree that it is perfectly good English. For example, sentences have an interesting property: they can be embedded inside larger sentences. Consider the following sentences:
a. Usain Bolt broke the 100m record.
b. The Jamaica Observer reported that Usain Bolt broke the 100m record.
c. Andre said The Jamaica Observer reported that Usain Bolt broke the 100m record.
d. I think Andre said the Jamaica Observer reported that Usain Bolt broke the 100m record.
If we replaced whole sentences with the symbol S, we would see patterns like Andre said S and I
think S. These are templates for taking a sentence and constructing a bigger sentence. There are
other templates we can use, such as S but S and S when S. With a bit of ingenuity we can construct
some really long sentences using these templates.
A sentence built up in this way, however long, can still have a simple structure, for instance one that begins S but S when S. We can see from this
example that language provides us with constructions which seem to allow us to extend sentences
indefinitely. It is also striking that we can understand sentences of arbitrary length that we’ve never
heard before: it’s not hard to concoct an entirely novel sentence, one that has probably never been
used before in the history of the language, yet all speakers of the language will understand it.
The purpose of a grammar is to give an explicit description of a language. But the way in which we think of a grammar is closely intertwined with what we consider to be a language. In the formal framework of "generative grammar," a "language" is considered to be nothing more than an enormous collection of all grammatical sentences, and a grammar is a formal notation that can be used for "generating" the members of this set. Grammars use recursive productions of the form S → S and S.
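To make this concrete, here is a minimal sketch (the toy grammar is our own, not from the text) of how NLTK can enumerate the sentences licensed by a recursive grammar up to a depth bound; because S can contain smaller Ss, the set of sentences the grammar generates is unbounded:
>>> import nltk
>>> from nltk.parse.generate import generate
>>> # Toy recursive grammar: an S may contain smaller Ss.
>>> toy = nltk.CFG.fromstring("""
... S -> S Conj S | 'I' 'think' S | 'Usain' 'Bolt' 'broke' 'the' 'record'
... Conj -> 'but' | 'when'
... """)
>>> for sentence in generate(toy, depth=4, n=3):
...     print(' '.join(sentence))
Raising the depth argument produces progressively longer embedded sentences.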
Ubiquitous Ambiguity:
Consider the well-known Groucho Marx line from Animal Crackers (1930): While hunting in Africa, I shot an elephant in my pajamas. How an elephant got into my pajamas, I'll never know. Let's examine the ambiguity in the first part of this joke: I shot an elephant in my pajamas. First we need to define a simple grammar:
>>> import nltk
>>> groucho_grammar = nltk.CFG.fromstring("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | 'I'
... VP -> V NP | VP PP
... Det -> 'an' | 'my'
... N -> 'elephant' | 'pajamas'
... V -> 'shot'
... P -> 'in'
... """)
>>> sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
>>> parser = nltk.ChartParser(groucho_grammar)
>>> for tree in parser.parse(sent):
...     print(tree)
Output:
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
>>> tree.draw()
[Output: a window opens displaying the parse tree graphically.]
Notice that there's no ambiguity concerning the meaning of any of the words; e.g., the word shot doesn't refer to the act of using a gun in one reading and the act of using a camera in the other. The ambiguity is structural: it lies in where the prepositional phrase in my pajamas attaches.
Beyond n-grams:
One benefit of studying grammar is that it provides a conceptual framework and vocabulary for spelling out our intuitions about which word sequences belong together. Let's take a closer look at the sequence the worst part and clumsy looking. This looks like a coordinate structure, where two phrases are joined by a coordinating conjunction such as and, but, or or. Here's an informal (and simplified) statement of how coordination works syntactically.
Coordinate Structure: if v1 and v2 are both phrases of grammatical category X, then v1 and v2 is also
a phrase of category X.
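For instance, we can encode this rule directly in a grammar, with one coordination production per category. The following fragment is a hypothetical sketch (the words and rules are invented for illustration), in which two VPs joined by and form a larger VP:
>>> import nltk
>>> coord_grammar = nltk.CFG.fromstring("""
... S -> NP VP
... NP -> NP Conj NP | Det N | 'He'
... VP -> VP Conj VP | V NP
... Conj -> 'and'
... Det -> 'the'
... N -> 'bear' | 'trout'
... V -> 'saw' | 'caught'
... """)
>>> parser = nltk.ChartParser(coord_grammar)
>>> for tree in parser.parse('He saw the bear and caught the trout'.split()):
...     print(tree)
The single parse groups saw the bear and caught the trout as one coordinated VP.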
Constituent structure is based on the observation that words combine with other words to form
units. The evidence that a sequence of words forms such a unit is given by substitutability; that is, a
sequence of words in a well-formed sentence can be replaced by a shorter sequence without
rendering the sentence ill-formed. To clarify this idea, consider the following sentence:
Example: The little bear saw the fine fat trout in the brook.
The fact that we can substitute He for The little bear indicates that the latter sequence is a unit. By
contrast, we cannot replace little bear saw in the same way. (We use an asterisk at the start of a
sentence to indicate that it is ungrammatical.)
Suppose we now add grammatical category labels to the words of our example sentence. The labels NP, VP, and PP stand for noun phrase, verb phrase, and prepositional phrase, respectively. If we then strip out the words apart from the topmost row and add an S node, we obtain a standard syntax tree. Each node in this tree (including the words) is called a constituent. The immediate constituents of S are NP and VP.
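As a small illustration, such a constituent tree can be built by hand and inspected with nltk.Tree (the bracketing below is our rendering of the example sentence above):
>>> import nltk
>>> t = nltk.Tree.fromstring("""
... (S (NP (Det The) (Adj little) (N bear))
...    (VP (V saw)
...        (NP (Det the) (Adj fine) (Adj fat) (N trout))
...        (PP (P in) (NP (Det the) (N brook)))))
... """)
>>> t.label()
'S'
>>> [child.label() for child in t]
['NP', 'VP']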
Context-Free Grammar:
Below is a simple context-free grammar (CFG). By convention, the lefthand side of the first production is the start-symbol of the grammar, typically S, and all well-formed trees must have this symbol as their root label. In NLTK, context-free grammars are defined in the nltk.grammar module. We define a grammar and show how to parse a simple sentence admitted by it.
>>> grammar1 = nltk.CFG.fromstring("""
... S -> NP VP
... VP -> V NP | V NP PP
... PP -> P NP
... V -> 'saw' | 'ate' | 'walked'
... NP -> 'John' | 'Mary' | 'Bob' | Det N | Det N PP
... Det -> 'a' | 'an' | 'the' | 'my'
... N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'
... P -> 'in' | 'on' | 'by' | 'with'
... """)
>>> sent = 'Mary saw Bob'.split()
>>> rd_parser = nltk.RecursiveDescentParser(grammar1)
>>> for tree in rd_parser.parse(sent):
...     print(tree)
(S (NP Mary) (VP (V saw) (NP Bob)))
Syntactic Categories
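The most common syntactic categories, with examples, are summarized below (this table follows standard usage):

Symbol   Meaning                 Example
S        sentence                the man walked
NP       noun phrase             a dog
VP       verb phrase             saw a park
PP       prepositional phrase    with a telescope
Det      determiner              the
N        noun                    dog
V        verb                    walked
P        preposition             in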
We can develop a simple grammar of our own and explore it using the recursive descent parser application, nltk.app.rdparser(). If we parse the sentence The dog saw a man in the park using grammar1, we end up with two trees:
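The two trees, obtainable with rd_parser.parse('the dog saw a man in the park'.split()), differ in whether the prepositional phrase in the park attaches to the noun phrase (the man is in the park) or to the verb phrase (the seeing happened in the park):
(S
  (NP (Det the) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man) (PP (P in) (NP (Det the) (N park))))))
(S
  (NP (Det the) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man))
    (PP (P in) (NP (Det the) (N park)))))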
A parser processes input sentences according to the productions of a grammar, and builds
one or more constituent structures that conform to the grammar. A grammar is a
declarative specification of well-formedness—it is actually just a string, not a program. A
parser is a procedural interpretation of the grammar. It searches through the space of trees
licensed by a grammar to find one that has the required sentence along its fringe.
A parser permits a grammar to be evaluated against a collection of test sentences, helping
linguists to discover mistakes in their grammatical analysis. A parser can serve as a model of
psycholinguistic processing, helping to explain the difficulties that humans have with
processing certain syntactic constructions. Many natural language applications involve
parsing at some point.
The recursive descent parser builds a parse tree during this process. With the initial goal (find an S), the S root node is created. As the parser recursively expands its goals using the productions of the grammar, the parse tree is extended downwards (hence the name recursive descent). We can see this in action using the graphical demonstration nltk.app.rdparser(), which steps through six stages of the parser's execution.
During this process, the parser is often forced to choose between several possible
productions. For example, in going from step 3 to step 4, it tries to find
productions with N on the left-hand side. The first of these is N → man. When this
does not work it backtracks, and tries other N productions in order, until it gets
to N → dog, which matches the next word in the input sentence. Much later, as
shown in step 5, it finds a complete parse. This is a tree that covers the entire
sentence, without any dangling edges. Once a parse has been found, we can get
the parser to look for additional parses. Again it will backtrack and explore other
choices of production in case any of them result in a parse.
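NLTK provides this parser as nltk.RecursiveDescentParser. A minimal run over grammar1 (the test sentence is our choice; any sentence the grammar covers will do):
>>> rd_parser = nltk.RecursiveDescentParser(grammar1)
>>> for tree in rd_parser.parse('Mary saw a dog'.split()):
...     print(tree)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))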
Shift-Reduce Parsing:
A simple kind of bottom-up parser is the shift-reduce parser. In common with all bottom-up parsers,
a shift-reduce parser tries to find sequences of words and phrases that correspond to the righthand
side of a grammar production, and replace them with the lefthand side, until the whole sentence is
reduced to an S.
The shift-reduce parser repeatedly pushes the next input word onto a stack; this is the shift operation. If the top n items on the stack match the n items on the righthand side of some production, then they are all popped off the stack, and the item on the lefthand side of the production is pushed onto the stack. This replacement of the top n items with a single item is the reduce operation. The operation may be applied only to the top of the stack; reducing items lower in the stack must be done before later items are pushed onto the stack. The parser finishes when all the input is consumed and there is only one item remaining on the stack: a parse tree with an S node as its root.
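NLTK provides nltk.ShiftReduceParser, a simple implementation of this strategy; note that it performs no backtracking, so it is not guaranteed to find a parse even if one exists. A short sketch using grammar1:
>>> sr_parser = nltk.ShiftReduceParser(grammar1)
>>> for tree in sr_parser.parse('Mary saw a dog'.split()):
...     print(tree)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))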
An advantage of shift-reduce parsers over recursive descent parsers is that they only build structure that corresponds to the words in the input.
One of the problems with the recursive descent parser is that it goes into an infinite loop
when it encounters a left-recursive production. This is because it applies the grammar
productions blindly, without considering the actual input sentence. A left-corner parser is a
hybrid between the bottom-up and top-down approaches.
Grammar grammar1 allows us to produce the following parse of John saw Mary:
(S (NP John) (VP (V saw) (NP Mary)))
Each time a production is considered by the parser, it checks that the next input word is compatible
with at least one of the pre-terminal categories in the left-corner table.
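For illustration, here is what such a left-corner table looks like for grammar1 (our tabulation, obtained by taking the first righthand-side symbol of each production and following it down to pre-terminals; lexical NPs such as 'John' act as their own left corner):

Category   Left-corners (pre-terminals)
S          NP
NP         Det
VP         V
PP         P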