Information Extraction From Hindi Texts: Kamlesh Dutta, Saroj Kaushik, Nupur Prakash


Information Extraction from Hindi Texts

Kamlesh Dutta*, Saroj Kaushik**, Nupur Prakash***


*National Institute of Technology, Hamirpur (HP) INDIA 177005
[email protected]
** Indian Institute of Technology, New Delhi INDIA
[email protected]
*** Indira Gandhi Institute of Technology, New Delhi, INDIA
[email protected]

Abstract

The paper presents an information extraction system that takes Hindi texts as input and improves the information content retrieved
by using an anaphor/pronoun resolution mechanism. The system consists of three major modules: the Language Parser, the
Resolution System and the Information Extractor. The language parser is based on HPSG (Head-Driven Phrase Structure
Grammar) and provides both syntactic and semantic information to the anaphor resolution system. HPSG was chosen because it
provides a set of constraints on the co-referential structures in the language, which bounds the search for an antecedent to a more
precise location in the discourse. The semantic information included in its parsing may be helpful for removing ambiguity in
anaphor/pronoun resolution. The anaphor resolution system uses a few heuristic rules to resolve intrasentential references, while
centering theory is used for intersentential resolution.
Introduction

Information Extraction (IE) has been used to develop specialised software for automatic summary generation, email processing, answering database queries in natural language [4], and simple question-answering systems. In the Indian context, the successful deployment of information technology requires tools for Indian languages. Lagging behind in a promising field such as IE could prove a major pitfall for the IT vision of our country, all the more so because of the very low level of English literacy in our country. For Hindi, Dave and Bhattacharyya (2001) have used the Universal Networking Language for knowledge extraction from Hindi text, which preserves the predicate till the end. The result of the analysis is a semantic-net-like structure.

The Information Extraction system developed in the present work for Hindi has the following main modules:
• Language Parser
• Anaphor/Pronoun Resolution System
• Information Extractor

Hindi Language Parser

The Language Parser reads the input Hindi text and checks it for errors. The parser is based on Head-Driven Phrase Structure Grammar (Pollard and Sag, 1994). The output of the HPSG parser contains part-of-speech tags, the number, gender and case specification, and semantic information for phrases. This information is arranged as a Case Frame for every sentence, which contains the syntactic and semantic information of the sentence.

The HPSG structure used for representation is ideally suited to anaphor resolution because of its constraint-based nature and because it divides the sentence into a hierarchical arrangement of head and phrase, as shown by Pollard and Sag (1994) and Pollard (1996). This hierarchy is useful for resolving anaphors and pronouns whose antecedents are bounded according to the phrase of occurrence of the anaphors.

Anaphora/Pronoun Resolution

Using the semantic and syntactic information from the Case Frame Structure, the anaphor resolution system tries to link the pronouns and anaphors to their referents. Each noun/pronoun object in the case frame structure has an INDEX field and a REFERENT_INDEX field. In the input, the NPs/anaphors are assigned a unique index number by the parser. During the passes of the resolution system, the REFERENT_INDEX field of each referring object is set to the index of its referent. Some anaphors/pronouns may remain unresolved; these are assigned a REFERENT_INDEX value of zero.

The anaphor resolution system consists of two major jobs:

Anaphor Resolution: According to the Binding theory for HPSG grammar, an anaphor must be bound in its Governing Category (Grosz et al 1995). Heuristic rules have been developed to identify the referent object. Simple heuristics like gender and number may be used to resolve and verify the link if there is more than one prospective object (Pirkola and Jarvelin 1994), as also shown by Sobha and Patnaik (2000) for anaphor resolution in Hindi.

Pronoun Resolution: Centering theory provides the list of the most probable NPs which may be the referents of a pronoun in a given sentence. At any occurrence of a pronoun, the probable list is considered in decreasing order of importance and various heuristics are applied to resolve the pronoun to one of the NPs in the probable list.

Information Extractor

The output of the Anaphora Resolution System is a Mapping Table that links the anaphors/pronouns to their referents. This module then attempts to infer the meaning of the sentences. It finds the logical relations that hold between the objects, the events that occur, and the objects taking part in the events. The information can be represented as Prolog predicates, which can then be used
for various purposes like Machine Translation and Reasoning.

E.g. consider the following discourse:
Shyam is a student. He goes to college. Name of his college is NIT Hamirpur.
This set of sentences would yield the following information:
Student(Shyam).
Goes(Shyam, College).
College(NIT Hamirpur).
Belongs_to(NIT Hamirpur, Shyam).

HPSG is a constraint-based grammar. It mainly defines the syntactic and semantic rules to be followed by any grammatical construct. HPSG structures may be of the following types:
{sign, word, phrase, category, head (= part of speech), list, set, content, case, index, verb_form, etc.}

Consider the verb ‘rakkhi’ in the following sentence:
kitaab mej par rakkhi hai.

The HPSG structure for this verb is as shown in Fig. 1:

Rakkhi
  CATEGORY
    HEAD     Verb, Non_Aux
    VALENCE  SUBJ: <NP[nom, inanimate]::[1]>
             COMP: <NP[locative, place]::[2], PP[locative]::[3]>
  CONTENT
    OBJECT   [1]
    LOCATION [2]

Fig. 1: HPSG Representation of Hindi Verb

From this example we see that in order to develop an HPSG specification for Hindi we must have the following information:
1. A lexicon containing the atomic objects and their properties, i.e. we must store in our knowledge base the proper and common nouns, different types of pronouns, prepositions and conjunctions. The basic characteristics in various contexts should also be stored, e.g. whether a noun is of type animate, inanimate, place, time, event etc.
2. All verbs must be stored in the lexicon with information about the number and type of arguments required by them.
3. All semantic actions and events must be listed with the objects (and their types) which take part in the event with various roles.

Parsing would then consist of:
1. Read the words in the sentences and identify the root word.
2. According to the form of the root word used in the sentence, construct the HPSG word structure for this instance with the help of the structure template stored in the knowledge base.
3. After all individual words have been parsed, identify the arguments for the head of the various constructs according to the constraints and index them accordingly. The head of a phrase will automatically have the words as arguments while the head of the sentence will have the various phrases as its arguments, thus completing the hierarchical structure.

Hindi has a large number of pronouns. They cannot be clearly identified just on the basis of the word. The same word can be used as a pronoun in one place and as a demonstrative adjective in another. This ambiguity is assumed to have been sorted out by the parser before anaphora/pronoun resolution.
E.g. Vah pustak meri hai. (vah is a demonstrative adjective)
     Vah mej par rakkhi hai. (vah is a pronoun)

While certain pronouns are pure anaphors according to Binding theory, i.e. they have their antecedent within their domain, others may be bound outside their own domain.
E.g. words like ‘apna’ and ‘swayam’ are anaphors while ‘vah’, ‘jisne’, ‘usne’ etc. are pronouns.
     Ram ney use bulaya. (the pronoun use refers to some person other than Ram)
     Ram ney apne bhai ko bulaya. (the anaphor apne refers to Ram)

Hindi sentences can be represented according to HPSG requirements. We have chosen to represent the information of the sentences in a case frame structure, which conveys sufficient information for the anaphor resolution and information extraction systems and is also simpler. It has a flatter organisation of participating VPs, NPs, adjectives, adverbs and phrases/subsentences as compared to the deep-rooted hierarchical organisation of HPSG.

The case frame structure of a sentence has the following fields:

TOKEN: The text of the sentence, for the purpose of reproducing the sentence if required.

ID: A unique identifier of a sentence assigned by the parser. Each sentence and each subsentence related to it has a unique ID.

VERB:
The verb phrase contains fields to specify the type and properties of the main verb in the sentence. In some cases where auxiliary verbs are also present, they are clubbed with the main verb because the sentence can always be rearranged in such a manner as to convey the same meaning, preserve the syntactic structure and have a single combined verb.
e.g. Ram khana kha_kar mandir gaya.
     Ram mandir khana kha_kar gaya.
The verb phrase further has the following fields to provide more information:
• TOKEN: specifies the exact word as it occurs in the sentence.
• ROOT: gives the basic form of the verb.
• TYPE: the type of verb, i.e. its transitivity etc.
• TENSE: describes the tense of the event/action in the sentence.
• List of adverbs: a list of all adverbs of the main verb.
These are the features currently incorporated because they are of help in the anaphor resolution system. But it is very
flexible and new information fields can be included without affecting the present logic for anaphor resolution. In a future implementation the verb phrase may contain more semantic knowledge as specified by HPSG.

We can have a knowledge base storing the number and type (animate, inanimate, place, event etc.) of noun phrases which are needed as arguments to every verb. A di-transitive verb, for example, will have two noun phrases. The parser must recognize the verb in the sentence and find its transitivity and the corresponding noun phrases. This information must then be entered into the case frame structure for further analysis.
Such information will especially be required by the Information Extraction system, which must know the transitivity and the arguments of a verb.
A typical verb phrase representation in case frame is as follows:
VERB: [TYPE:[di-transitive], TOKEN:[bulaya], TENSE:[past], SEQUENCE:[3]]

SUBJECT:
The subject is a noun phrase which is the cause, or the initiator, of the event described in the sentence. It is the main focus entity of the sentence. Every sentence must have a subject. If there is no explicit subject that can be recognized by the parser then the sentence has a zero anaphor. The parser must in that case insert a dummy NP as subject.
e.g. Ram ney kitab uthai aur [ _dummy_ ] school chala gaya.
The NP ‘sunderta’ in the following sentence is represented in case frame as follows:
Tajmahal ki sunderta adbhut hai.
SUBJECT: [TYPE:[abstract], TOKEN:[sunderta], ROOT:[sunder], NUMBER:[singular], GENDER:[female], CASE:[nominative], EXTENSION:[], SEQUENCE:[5], INDEX:[3], ADJ: [TYPE:[qualitative], TOKEN:[adbhut], NUMBER:[singular], GENDER:[male]]]

Every noun phrase representation, including that of the subject, contains the following information:
• TOKEN: the exact occurrence of the word in the sentence.
• TYPE: the type of noun (proper, collective, abstract, ...) if it is a noun; otherwise its value is set to indicate that it is a pronoun.
• ROOT: the basic form of the noun or pronoun without any number, gender, or case induced change.
• NUMBER: it can have the value ‘singular’ or ‘plural’ as identified by the parser.
• GENDER: it can have the value ‘male’ or ‘female’. In the case of pronouns or nouns whose gender cannot be inferred from the word or the verb of the sentence, it is set to null.
• CASE: defines the relation of the noun phrase with the main verb of the sentence. It can have one of eight values given in Table 8.
• EXTENSION: the extension of a noun includes those words which are used only for further description of the noun and are not covered by any other part of speech.
• SEQUENCE: Though Hindi is largely a free word order language, the placement of pronouns and nouns is sometimes significant for their relation. E.g. if an NP occurs in the genitive case then it must be followed by the NP it is linked to.
  E.g.
     1        2          3               4             5
  Ram ney  Shyam ko  [uski NP]  [purani pustak NP]  di.
  The sequence value shows the position of occurrence of NPs and VPs in the sentence. Only NPs, pronouns, and the main verb have a sequence value. The sequence value of the connector (a conjunction, or one of a few words like ‘ki’, ‘kyonki’, ‘jabki’ etc.) specifies the position of the beginning of a subsentence.
• INDEX: the index value of an NP is a unique identifier assigned by the parser. It is used to refer to the NP in the argument list of the verb. The pronouns and anaphors are resolved by mapping their INDEX values to those of their antecedents.

Further improvements in this representation of NPs are possible to include more semantic and world knowledge, for example the type of object, i.e. whether it is animate, inanimate, event, property, place etc. Also, the same NP may have different meanings in different sentences depending on the context, e.g. ‘kal’ refers to ‘Tomorrow’ or ‘Yesterday’ when referring to time, while it also means ‘Machine Parts’ when used in the context of machinery. So a knowledge base must store the various meanings of the NP in different contexts. The parser can identify the context and meaning by analyzing the verb and other constraints as specified in the HPSG structure of the VP in the sentence.

Subsentences:
Subsentences in compound or mixed sentences are represented under the CONNECTOR construct. A subsentence is modeled exactly like a simple sentence except that it is embedded at a lower level of the hierarchy than its main sentence. The subsentence has its own ID, different from the parent sentence’s identification number. There may be more than one subsentence in a sentence; these are represented as a sequence of CONNECTOR constructs.
An example of a compound sentence representation using the case frame structure is as follows:

[TOKEN:[Tajmahal ek sunder bhavan hai,], ID:[1],
 VERB: [TYPE:[?], TOKEN:[hai], TENSE:[present], SEQUENCE:[3]],
 SUBJECT: [TYPE:[proper], TOKEN:[Tajmahal], ROOT:[Tajmahal], NUMBER:[singular], GENDER:[male], CASE:[nominative], EXTENSION:[], SEQUENCE:[1], INDEX:[1]],
 OBJECT: [TYPE:[common], TOKEN:[bhavan], ROOT:[bhavan], NUMBER:[singular], GENDER:[?], CASE:[objective], EXTENSION:[], SEQUENCE:[2], INDEX:[2],
   ADJ: [TYPE:[number], TOKEN:[ek], NUMBER:[singular], GENDER:[male]],
   ADJ: [TYPE:[qualitative], TOKEN:[sunder], NUMBER:[singular], GENDER:[male]]],
 CONNECTOR: [TOKEN:[jiski sunderta adbhut hai.], ID:[2],
   VERB: [TYPE:[?], TOKEN:[hai], TENSE:[present], SEQUENCE:[6]],
   SUBJECT: [TYPE:[abstract], TOKEN:
[sunderta], ROOT:[sunder], NUMBER:[singular], GENDER:[female], CASE:[nominative], EXTENSION:[], SEQUENCE:[5], INDEX:[3],
     ADJ: [TYPE:[qualitative], TOKEN:[adbhut], NUMBER:[singular], GENDER:[male]]],
   OBJECT: [TYPE:[pronoun], TOKEN:[jiski], ROOT:[jo], NUMBER:[singular], GENDER:[male], CASE:[genitive], EXTENSION:[], SEQUENCE:[4], INDEX:[4]]]]

Representation of Sentences

The case frame structure used in our representation of Hindi sentences can be directly modeled into corresponding C language structures. The structures used to represent Noun Phrases, Verb Phrases, Adjectives and

In Hindi a verb may have up to eight different types of objects related to it, i.e. the transitivity of the verb can be up to eight. At least one of the objects must be the subject.
E.g.
Ram ney Ravan ko marney ke liye rath sey utarkar teer sey Ravan ke sir par mara.
In the case frame structure the subject is represented as a separate NP and all other argument objects of the verb are listed as a sequence of NPs. This arrangement suggests the use of an array of NP objects along with an integer to specify the actual number of objects.
So a Sentence Object (structure) will contain a VP object, an NP object as subject, an array of NPs for the other objects, and an integer to specify the actual number of objects present.
The sub-sentences or phrases are implemented as a link to another Sentence Object. So each Sentence Object also contains a pointer to its sub-sentence object.
Such a data structure mirrors the case frame structure and preserves the hierarchical arrangement of adjectives with their NP, object and subject with their VP, and subsentences with their parent sentence.

Discourse Specification:

A discourse consists of a number of related sentences.

According to Centering theory (Grosz et al 1995), the utterances in a discourse are connected to (related to) each other semantically by the center concept (or centers). In our implementation, a discourse is represented by a list of sentences, implemented as an array of pointers to Sentence Objects in sequence of their occurrence. The centering information is stored in a data structure that contains a list of recent objects (NPs) and the weights assigned to them. The weights represent the relative importance of the NPs in the discourse. An NP is added to the list as and when it occurs in the discourse. If the NP already exists in the list, the weight of that NP in the sentence is added to its existing weight in the list. After processing every sentence, the existing weight of the objects is decreased by a factor. We have chosen the factor to be 40%. The reason for this is practical rather than theoretical: it has worked well with most commonly occurring discourse in Hindi (Prasad and Strube 2000). An NP is removed from the list if its weight falls below a specified minimum value.

Evaluation

The anaphor resolution approach was tested on 10 short stories and the following accuracy was observed:

Correct resolution: 63%
Correct third person pronoun resolution: 69.2%
Correct definite pronoun resolution: 0%
Correct zero pronoun resolution: 100%
Correct inter-sentential pronoun resolution: 54.5%
Correct intra-sentential pronoun resolution: 87.5%

The results suggest that the use of pronoun resolution improves the information content extracted, which would otherwise be ignored. However, the algorithm also has to be tested on different categories of texts.

Conclusion

The Information Extraction system for Hindi texts developed here uses a heuristic approach to resolve anaphors and pronouns. The rules used are applicable to most occurrences of pronouns in natural Hindi text. This is especially useful in descriptive texts, which have fewer occurrences of first and second person pronouns, which are not covered by the heuristics suggested. HPSG will add more semantic information and semantic constraints into the representation, making the resolution more accurate.
The next step after information extraction in Hindi texts is to extend it to web-related text. A major problem in this regard is the absence of any standard encoding for the Hindi alphabet. Various websites use proprietary font families to display the same text (e.g. Amarujala.com uses the ‘au’ font family while dainikjagran.com uses the ‘jagran’ family of fonts).

References

Dave, S. and Bhattacharyya, P. (2001). Knowledge Extraction from Hindi. Journal of Institution of Electronic and Telecommunication Engineers, vol. 18, no. 4, July 2001.
Grosz, B.J., Joshi, A.K., and Weinstein, S. (1995). Centering: A framework for modelling the local coherence of discourse. Computational Linguistics, 21 (pp. 203-225).
Pirkola, A. and Jarvelin, K. (1994). The Effect of Anaphora and Ellipsis Resolution on Proximity Searching in a Text Database. University of Tampere, Department of Information Studies, RN-1994-1, 25 p.
Pollard, C. (1996). HPSG: An Overview and some work in Progress. Pacific Asia Conference on Languages, Information and Computation. Kyung Hee University, South Korea.
Pollard, C. and Sag, I.A. (1994). Head-Driven Phrase Structure Grammar. University of Chicago Press and Stanford: CSLI Publications.
Prasad, R. and Strube, M. (2000). Discourse Salience and Pronoun Resolution in Hindi. In Penn Working Papers in Linguistics, Vol 6.3 (pp. 189-208).
Sobha, L. and Patnaik, B.N. (2000). Vashisht: An anaphora resolution System for Malayalam and Hindi. In International Conference ACIDCA’2000, Monastir, Tunisia.
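
As an illustration only, the centering list described under Discourse Specification can be sketched in a few lines of Python. The paper's own implementation uses C structures, so this is a sketch under stated assumptions rather than the authors' code. The names `NP`, `CenteringList`, `KEEP` and `MIN_WEIGHT` are hypothetical, as are the per-sentence weights (2.0 for a subject, 1.0 otherwise) and the cut-off of 0.5. "Decreased by a factor of 40%" is read here as multiplying each weight by 0.6. Unresolved pronouns receive 0, matching the REFERENT_INDEX convention in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

KEEP = 0.6        # weights are decreased by 40% after every sentence (from the text)
MIN_WEIGHT = 0.5  # assumed cut-off; the text only says "a specified minimum value"


@dataclass
class NP:
    index: int             # INDEX assigned by the parser
    token: str
    gender: Optional[str]  # 'male', 'female', or None when it cannot be inferred
    number: str            # 'singular' or 'plural'


class CenteringList:
    """Recent NPs and their weights, most important first when queried."""

    def __init__(self) -> None:
        self.nps: Dict[int, NP] = {}
        self.weights: Dict[int, float] = {}

    def add(self, phrase: NP, weight: float) -> None:
        # A repeated NP accumulates weight instead of being listed twice.
        self.nps[phrase.index] = phrase
        self.weights[phrase.index] = self.weights.get(phrase.index, 0.0) + weight

    def end_sentence(self) -> None:
        # Decay every weight, then drop NPs that fall below the minimum.
        for idx in list(self.weights):
            self.weights[idx] *= KEEP
            if self.weights[idx] < MIN_WEIGHT:
                del self.weights[idx]
                del self.nps[idx]

    def candidates(self) -> List[NP]:
        order = sorted(self.weights, key=lambda i: self.weights[i], reverse=True)
        return [self.nps[i] for i in order]

    def resolve(self, gender: Optional[str], number: str) -> int:
        # Return the INDEX of the first candidate agreeing in gender and
        # number, or 0 (the REFERENT_INDEX of an unresolved pronoun).
        for phrase in self.candidates():
            gender_ok = gender is None or phrase.gender is None or phrase.gender == gender
            if gender_ok and phrase.number == number:
                return phrase.index
        return 0


# "Shyam is a student. He goes to college."
centers = CenteringList()
centers.add(NP(1, "Shyam", "male", "singular"), 2.0)    # assumed subject weight
centers.add(NP(2, "student", "male", "singular"), 1.0)  # assumed default weight
centers.end_sentence()
referent = centers.resolve("male", "singular")  # "He" resolves to INDEX 1 (Shyam)
```

Keeping the weights in a dictionary keyed by INDEX makes the "add to the existing weight" rule a single lookup; the heuristics applied in the paper beyond gender and number agreement are not modeled here.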
