Natural Language Processing
1. Introduction
2. Approaches used for this work
3. Methodology
4. Implementation
5. Challenges
6. References
1. Introduction
Parsing is one of the steps in building a functional NLP application, and its output can serve
as input to many other NLP applications, such as grammar checkers, spell checkers, and spelling
correction. The central task in parsing is to break a sentence down into manageable components
and to understand their context and their relations to each other, so that the correctness of
the sentence can be judged. Sentences are the starting point for analyzing written material or
documents [1]. Syntax refers to the way words are related to each other in a sentence. Sentence
parsing, also called syntactic parsing, is therefore the process of identifying how words can be
put together to form a correct sentence, and of determining what structural role (lexical
category) each word plays in the sentence, which phrases are subparts of which other phrases,
and which words modify which other words. A sentence parser outputs a parse structure that can
be used as a component in many applications, including semantic analysis, machine translation,
and information storage and retrieval of textual data [2]. Today, parsers of different kinds
(e.g., probabilistic, rule-based) have been developed for languages that are widely used
nationally and/or internationally (e.g., English, German, Chinese) [3]. My project work focuses
on the implementation of an Amharic sentence parser that displays the parse tree for a sentence.
There are different methods for sentence parsing, among them Context-Free Grammar (CFG) from the
rule-based approach and Probabilistic Context-Free Grammar (PCFG) from the statistical approach.
My work uses these two approaches, CFG and PCFG [4].
2. Approaches used for this work
A PCFG is a context-free grammar that associates a probability with each of its productions. It
generates the same set of parses for a text as the corresponding context-free grammar does, and
assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the
product of the probabilities of the productions used to generate it [1]. PCFGs produce a model of
a language based on real data, and can therefore cope with the grammatical mistakes that occur in
real-life text. Although PCFGs have many advantages, a critical disadvantage is that context is
not taken into account at all. In fact, a trigram model of a language (a model over sequences of
three words) would probably achieve better results, even though it takes no account of the
internal structure of the language, and may be more applicable to a language like Amharic [3].
3. Methodology
To develop the Amharic parse tree implementation, I took a set of sample grammars, ranging from
simple to complex production rules, assigned probabilities to those rules for the probabilistic
approach, drew the resulting parse trees, and specified their parse structure based on the
grammar.
On the source code side, I used a collection of tools that support the main application for
different purposes [2]. They are listed below.
Python 3.7
NLTK 3.2, a Python-based natural language processing toolkit (www.nltk.org)
KeyMan, a Unicode keyboard for typing Amharic
PyScripter 3.7, an interactive IDE for Python
To set up the implementation in a local environment, Python 3.7 must be installed first. Then
download NLTK 3.2 and install it under the Python directory, because it is used as a library
inside the Python code. Finally, download the NLTK data from within Python itself.
4. Implementation
The first sample implementation of my work is the CFG approach to Amharic sentence parsing. The
grammar and the output of the implementation are as follows. A sentence like
"አበበ የ ሰዉ አጥር ላይ ሆኖ አየ" can be represented using the following grammar.
S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመደ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
The bracketed parse structure that the developed application produces for the above example is:
(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (V አየ)))
Output is: [figure: parse tree of the sentence as drawn by the application]
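The CFG above can be loaded and run with NLTK along the following lines. This is a minimal
sketch rather than the original source code: the choice of nltk.ChartParser and the variable
names are assumptions, since the report does not say which CFG parser class was used. With this
grammar the sentence should have a single parse, matching the bracketed structure given above.

```python
import nltk

# Grammar transcribed from the CFG in this section (one rule per line).
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመደ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
""")

# Assumption: a chart parser; any NLTK CFG parser would accept the same grammar.
parser = nltk.ChartParser(grammar)

# The sentence is already space-separated, so a plain split() tokenizes it.
sentence = "አበበ የ ሰዉ አጥር ላይ ሆኖ አየ".split()

# parse() returns an iterator over all parse trees licensed by the grammar.
trees = list(parser.parse(sentence))
for tree in trees:
    # A wide margin keeps the bracketed structure on a single line.
    print(tree.pformat(margin=200))
```

Calling tree.draw() on a result opens a window with the graphical parse tree, which is one way
to obtain a figure like the one referred to above.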
The second implementation of my work is the PCFG approach to Amharic sentence parsing. The
grammar and the output of the implementation are as follows. An example PCFG is shown below,
and the approach is explained after it.
S -> NP VP [1.0]
VP -> V NP [0.2] | V NP PP [0.3] | NP V [0.1] | NP Adj V [0.4]
PP -> P NP [0.2] | P P [0.8]
V -> "አየ" [0.8] | "በላ" [0.1] | "ተራመደ" [0.1]
NP -> "አበበ" [0.2] | "ከበደ" [0.1] | "ጫላ" [0.1] | Det N [0.1] | Det N N [0.1]
NP -> Det N PP [0.1] | N N [0.1] | Det N N PP [0.2]
Det -> "የ" [0.9] | "ለ" [0.1]
N -> "ሰዉ" [0.4] | "ውሻ" [0.1] | "አጥር" [0.2] | "ድመት" [0.1] | "መናፈሻ" [0.1]
P -> "በ" [0.1] | "ላይ" [0.4] | "በኩል" [0.1] | "ሆኖ" [0.3] | "ከ" [0.1]
Adj -> "ትንሽ" [1.0]
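In a PCFG, the probabilities of all productions that share a left-hand side must sum to 1. The
sketch below checks this for the grammar as printed above. The dictionary is a hand
transcription of the printed probabilities, and the tolerance of 0.01 is an arbitrary choice
for illustration. As printed, the N rules add up to 0.9, so one N probability appears to have
been lost in transcription; which one cannot be recovered from the text.

```python
# Probabilities transcribed from the printed PCFG, grouped by left-hand side.
rule_probs = {
    "S":   [1.0],
    "VP":  [0.2, 0.3, 0.1, 0.4],
    "PP":  [0.2, 0.8],
    "V":   [0.8, 0.1, 0.1],
    "NP":  [0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2],
    "Det": [0.9, 0.1],
    "N":   [0.4, 0.1, 0.2, 0.1, 0.1],
    "P":   [0.1, 0.4, 0.1, 0.3, 0.1],
    "Adj": [1.0],
}

# Sum the probabilities per left-hand side; rounding hides float noise.
totals = {lhs: round(sum(probs), 6) for lhs, probs in rule_probs.items()}

# Collect every left-hand side whose total is not (approximately) 1.
TOLERANCE = 0.01  # illustrative threshold, not taken from the report
off = {lhs: total for lhs, total in totals.items() if abs(total - 1.0) > TOLERANCE}

print(off)
```

A check of this kind is worth running before handing a grammar to a PCFG loader, which can be
expected to reject rule sets that do not sum to 1.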
The parse structure produced by the Viterbi algorithm with the above grammar is shown below,
together with the probability of the parse (the product of the probabilities of its rules).
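import nltk
# grammar holds the PCFG above, e.g. loaded with nltk.PCFG.fromstring(...)
viterbi_parser = nltk.ViterbiParser(grammar)
sent = "አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ".split()
for tree in viterbi_parser.parse(sent):  # parse() yields trees in NLTK 3
    print(tree)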
(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (Adj ትንሽ) (V አየ)))
(p=8.84736e-05)
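The reported probability can be reproduced by hand, since the probability of a parse is the
product of the probabilities of the productions it uses. The sketch below multiplies the
probabilities of the twelve productions that appear in the tree above. The list is a manual
reading of that tree against the PCFG, and the variable names are illustrative.

```python
# Productions used in the Viterbi parse, with probabilities from the PCFG.
productions_used = [
    ("S -> NP VP", 1.0),
    ("NP -> 'አበበ'", 0.2),
    ("VP -> NP Adj V", 0.4),
    ("NP -> Det N N PP", 0.2),
    ("Det -> 'የ'", 0.9),
    ("N -> 'ሰዉ'", 0.4),
    ("N -> 'አጥር'", 0.2),
    ("PP -> P P", 0.8),
    ("P -> 'ላይ'", 0.4),
    ("P -> 'ሆኖ'", 0.3),
    ("Adj -> 'ትንሽ'", 1.0),
    ("V -> 'አየ'", 0.8),
]

# Multiply the rule probabilities together to get the parse probability.
parse_probability = 1.0
for _rule, prob in productions_used:
    parse_probability *= prob

print("p = {:.5e}".format(parse_probability))
```

The result agrees with the p value printed by the parser, which confirms that the figure is a
product of rule probabilities and not a sum.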
5. Challenges
Several challenges arose during the project.
1. This study uses a very small sample prepared for the purpose of the work, owing to lack of
time and the difficulty of finding a well-organized corpus, a machine-editable dictionary,
POS-tagged words, and in particular a POS tagger application for Amharic.
2. The prototype developed in this study is assumed to support Amharic sentences of 10 or more
words, but its real performance could not be established, mainly because of time constraints
and a lack of the linguistic expertise needed to determine grammar rules and rule
probabilities.
3. This report does not cover more advanced topics such as ambiguity resolution, but shows
sample parsing using probabilistic approaches.
6. References
[1] A. Alemu, "Automatic Sentence Parsing for Amharic Text: An Experiment Using
Probabilistic Context Free Grammars", a thesis submitted in partial fulfilment of the
requirement for the degree of Master of Science in Information Science, 2002.
[2] "Natural Language Toolkit", accessed from http://www.nltk.org/.
[3] Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition", 2007.
[4] Abiyot Bayou, "Design and Development of Word Parser for Amharic Language",
Master's thesis, Addis Ababa University, 2000.