algorithms, and is an area that could be modified in the future to further increase the efficiency of the parser by adding the grammar associated with any natural language for processing.
A. Design Approach
To apply the ranked weightage ordering, the list was
maintained in a sorted manner by their probability scores and
the pointer simply moved along the list, as more words were
used to combine with other words. The state being pointed to
by the pointer, which was the state being used to combine with
other words of higher scores, was called the pivot state. By
combining the pivot state with states of higher scores, the
algorithm guaranteed that the resulting state of the combination would be scored equal to or lower than the pivot state. This
allowed for a simple algorithm for maintaining the ordered
list. In the initial phase, we determine the ‘Chandassu
(grammar)’ for a given input (i.e. given Telugu poem). We
will parse the inbound data feed, calculate the complexity of the word, and store it in the Lexicon. A letter code is assigned based on complexity: simple with 'S', medium with 'M', and complex with 'C'. The syntactical analysis is conducted simultaneously for the same feed based on the predefined set of business rules, whereas the syntax for the parsed string can be generated from the rules database. The rules database has been designed using the set of rules derived from 'Telugu' grammar. The notation is similar to the one used to generate the grammar (i.e. chandassu) for a given poem in Telugu literature, and the underlying data structure example model is defined further below.
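As an illustration, the following is a minimal sketch of how such a letter code might be assigned while loading the lexicon; the length thresholds and helper names are hypothetical, since the paper does not specify the complexity metric.

    def complexity_code(word):
        # Hypothetical metric: grade complexity by word length as a
        # stand-in for the paper's (unspecified) complexity calculation.
        if len(word) <= 3:
            return 'S'   # simple
        if len(word) <= 6:
            return 'M'   # medium
        return 'C'       # complex

    lexicon = {}

    def add_to_lexicon(word, chandassu):
        # Store the parsed word with its grammar pattern and letter code.
        lexicon[word] = {'chandassu': chandassu,
                         'complexity': complexity_code(word)}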
B. Hashing & Predictive Analytic Techniques

Upon analyzing and implementing the above-mentioned design approach and the following algorithms for parsing and improving search time, we determine the predictability of words to construct a poem with an appropriate Chandassu. Predictability of the words is achieved with the help of predictive techniques that encompass a variety of statistical techniques to analyze and pick the appropriate word to construct a poem. Using the Sequential & Binary Search techniques, we search for the words with the appropriate chandassu in the lexicon, and they are made available for the user to select the suitable word. Following is the sequential search structure that was embedded in the application design.
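As a minimal sketch of these two lookups (the data layout is an assumption: a lexicon held as a list of (chandassu, word) pairs, kept sorted by chandassu for the binary variant):

    from bisect import bisect_left

    def sequential_search(lexicon, pattern):
        # O(n) scan: collect every word whose chandassu matches the pattern.
        return [word for chandassu, word in lexicon if chandassu == pattern]

    def binary_search(sorted_lexicon, pattern):
        # O(log n) positioning on the sorted lexicon, then a short rightward
        # scan to gather all words sharing the same chandassu.
        i = bisect_left(sorted_lexicon, (pattern,))
        matches = []
        while i < len(sorted_lexicon) and sorted_lexicon[i][0] == pattern:
            matches.append(sorted_lexicon[i][1])
            i += 1
        return matches

For example, binary_search(sorted([('UII', 'evvani')]), 'UII') returns ['evvani'].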
Here, we have used a hash map (i.e. a data structure) that uses a hash function to map identifying values, known as keys (e.g., a poem pattern), to their associated values (e.g., their respective VRUTTA). The hash function transforms the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought. Here, we derive the Chandassu of each word and store it in the Lexicon using a predefined data structure; the data structure contains the word, Chandassu, Unicode, VRUTTAs, and word complexity. As and when a given input matches any of the predefined VRUTTAS, i.e. 'U', 'C', 'S' & 'M', we retrieve it from the Lexicon and populate it on the screen to help the user. The same is represented below, with an example, in the hash map function structure, the word complexity data structure, and the predictive word data structure.

Word Complexity Data Structure
Complexity     Simple   Medium   Complex
Word – UII     S        M        C

Predictive Word Data Structure
Word     Chandassu   Unicode   U   C   S   M   Complexity
evvani   UII         ఎవ్వని      1   0   0   0   M
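A minimal sketch of this lookup, using Python's built-in dict as the hash map (only the evvani/UII row comes from the paper's example; the field layout is assumed):

    # Key: poem pattern (chandassu); value: the matching lexicon entries.
    # Python's dict hashes the key string to a bucket index internally.
    lexicon_index = {
        'UII': [{'word': 'evvani', 'unicode': 'ఎవ్వని', 'complexity': 'M'}],
    }

    def lookup(pattern):
        # Average O(1): hash the pattern to its bucket and return the stored
        # entries, or an empty list for an unknown pattern.
        return lexicon_index.get(pattern, [])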
Pre-Optimization
Once the rules database is formed, we need to update the generated syntax into the words database formed using the lexicon. The next step is to generate the Unicode for the grammar. If an error occurs in the syntax at this step, the error is sent to the error log. The ranking weightage algorithm is essentially embodied by the following pseudo-code:
1. Populate the list with the grammar for every letter in the word.
2. Sort the list by probability scores.
3. Set the pointer at the first word in the list.
4. While the list contains un-combined words:
5. Set the pivot as the next most probable word.
6. Return if the pivot state is a terminal state.
7. Combine the pivot with all adjacent words of higher probability.
8. Insertion-sort all newly created words or states into the list.
9. Return failure.
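A runnable sketch of this pseudo-code follows. The state representation and the combine rule (multiplying probability scores, which guarantees the result never outscores the pivot) are assumptions made for illustration:

    import bisect
    from dataclasses import dataclass

    @dataclass
    class State:
        score: float          # probability score of the partial combination
        words: tuple          # the words covered so far (illustrative field)

    def ranking_weightage(states, is_terminal, combine):
        # Steps 1-3: sort candidates by descending score; pointer starts at 0.
        agenda = sorted(states, key=lambda s: -s.score)
        pointer = 0
        while pointer < len(agenda):              # Step 4
            pivot = agenda[pointer]               # Step 5: next most probable
            if is_terminal(pivot):                # Step 6: first terminal wins
                return pivot
            for other in agenda[:pointer]:        # Step 7: higher-scored states
                new = combine(pivot, other)
                if new is not None:
                    # Step 8: insertion-sort the new state; its score cannot
                    # exceed the pivot's, so it lands at or below the pivot.
                    keys = [-s.score for s in agenda]
                    agenda.insert(bisect.bisect(keys, -new.score), new)
            pointer += 1
        return None                               # Step 9: failure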
With the application of this ordering, the algorithm allowed for early termination of the search: since the newly created words (being of equal or lesser probability) must be inserted below the pivot state due to the cascading effect of the product of the probabilities, any terminal state found later would have a lower probability than the first one found, so the algorithm guarantees the retrieval of the most probable state without having to exhaustively search all possible combinations.
Post Optimization
By using only a single list to maintain all possible derivations of the words, traversals and maintenance of the ordering of the list used up a lot of valuable time. To counteract this, we re-introduce a charting behavior as the second improvement to the algorithm. We implemented a table, called the indexed table, in which all the words or lines of the poem that were in the used section were placed, rather than keeping them in the same list. The table also grouped together the words that occupied the same starting and ending positions, to simplify the decision process in determining which words were adjacent to the pivot state. The ranked weightage list was replaced by a table, which we called the sorted table, that handled the push and pop manipulations to simplify and modularize the algorithm for future use. The third major step involved the use of a critical score, which is the score of the currently most probable terminal state in the sorted table. By not operating on states that would produce a lower probability than the critical score, the algorithm allowed for a large pruning of the search tree, weeding out states with very low probability that would not contribute to the most probable terminal state.
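A minimal sketch of this pruning step, under the assumptions that scores combine by multiplication, that the indexed table is keyed by the starting position of each state, and that states carry score and end fields:

    import heapq

    def expand_pivot(sorted_table, indexed_table, critical_score, combine):
        # sorted_table: heap of (negated score, tiebreak, state), giving the
        # push/pop behavior of the sorted table described above.
        _, _, pivot = heapq.heappop(sorted_table)
        # indexed_table groups used states by starting position, so states
        # adjacent to the pivot (starting where it ends) are found directly.
        for other in indexed_table.get(pivot.end, []):
            # Scores combine as a product, so a result can never exceed either
            # operand; skip combinations that cannot beat the best terminal
            # state found so far (the critical score).
            if pivot.score * other.score <= critical_score:
                continue
            new = combine(pivot, other)
            heapq.heappush(sorted_table, (-new.score, id(new), new))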
The algorithm also provides a pre-processing stage before a combination between states takes place, which contributes a little overhead but manages to cut down the number of unnecessary combinations and avoids the lengthy combination stage of two words. The experimental tree-climb algorithm used here shows an impressive parse time and a huge reduction in the search space and time, but produces slightly less accurate parses compared to the other algorithms, as can be seen in Table 1.
                       Exhaustive   Optimal   Sub-Optimal
Pre-Implementation (%)
  Parse Time              100         15.2        1.7
  Search Time             100          4.9        0.3
  Most probable           100        100         84
Post-Implementation (%)
  Parse Time              100         10.4        0.7
  Search Time             100          2.1        0.1
  Most probable           100        100         66.7
Table 1: Statistics of parsing of the optimal and suboptimal algorithms for both Pre and Post Optimization.
The parse time and the search space are represented as proportions relative to the exhaustive algorithm, and the percentage of cases in which the algorithm retrieved the most probable parse is indicated in the 'Most probable' rows. The optimal algorithm is the combination of all the algorithms that provided benefits to the parsing speed without loss of accuracy, and the suboptimal algorithm is the tree-climb algorithm, which provided the fastest and also a reasonably accurate result among all tested suboptimal algorithms. The optimal search algorithm returns the most probable parse tree, but sometimes varies in the tagging and bracketing of the parse in cases where multiple parses have the same probability. The tree-climb algorithm's performance in the accuracy domain is relatively poor, but some of the loss in accuracy can be recovered by altering the number of states used in the seeding stage. However, because the algorithm loses track of the ranking of the words, it must then exhaustively combine all states to determine the most probable parse.

On comparing pre- and post-optimization, it is fairly easy to see the improvements of the developed algorithms, but for the task of NLP it is probably more important to look at a per-sentence comparison, especially in an environment where human interaction is required. Figure 1 indicates the relationship between the parsing time and the number of words in the sentence for the exhaustive, optimal and suboptimal search algorithms. There is a huge reduction in the parse time with the optimal algorithm, and an even greater reduction with the suboptimal algorithm.

Figure 1: Number of words in the sentence versus parsing time after post optimization for the exhaustive, optimal and suboptimal algorithms.
Context Based Search
The goal of a context-based search implementation (or disambiguation) process is to find the most relevant search result(s), T, given a main source query term, S, with the help of L/R contexts. Intuitively, S and T tend to be a relevant query-answer pair if many contexts are "matched". The target object, T*, with the highest matching score (or probability) will be the most likely target that S refers to in the contexts of <Ls, Rs>.

The degree of matching can be measured in terms of the different "matching strengths" or "matching scores" contributed by the contexts. Normally, an exact string match between two terms in the S/T or L/R contexts, such as "the Big Apple" vs. "the Big Apple", is the strongest match, but it is the least robust, since S/T/L/R might be described in other synonymous forms. A partial or fuzzy match, like "Big Apple" vs. "the Big Apple", provides some flexibility for matching, but it may also introduce noise, such as matching "the Big Apple" against "Big Apple Pie". The most robust and flexible way of matching the S/T and L/R contexts might be to assign a higher matching score to a term pair if the terms are known to be synonyms or highly related terms in the ontology. The idea behind the current work is an extension of the thoughts explained in the above research work areas. In the work presented here, instead of using just a dictionary-based search, the search is conducted using the metadata, a comprehensive rule base, which generates the result.
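A minimal sketch of such a tiered matching score; the weights and the representation of the ontology as a synonym dictionary are illustrative assumptions:

    def match_score(a, b, synonyms):
        if a == b:
            return 1.0          # exact match: strongest, least robust
        if b in synonyms.get(a, set()):
            return 0.8          # ontology match: synonyms or related terms
        if a in b or b in a:
            return 0.6          # partial/fuzzy match: flexible but noisy
        return 0.0

    def context_score(source_ctxs, target_ctxs, synonyms):
        # Sum the matching strength contributed by the L/R contexts; the
        # candidate T* with the highest total is the most likely target.
        return sum(match_score(s, t, synonyms)
                   for s in source_ctxs for t in target_ctxs)

For example, match_score('Big Apple', 'the Big Apple', {}) returns the fuzzy-match weight 0.6, reflecting the flexibility (and the potential noise) of partial matching.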
SYSTEM OUTPUTS
The following snapshots interpret the various steps of the process involved in this application.
Fig 3: Poetic Pattern

Fig 4: Poetic Analysis – Grammar Display
In the age of smart devices, the need for systems to be accessible everywhere and at all times has been considered, and the application is delivered to meet these high-availability and high-accessibility needs.