Development of Algorithms and Computational Grammar For Urdu
DEVELOPMENT OF ALGORITHMS AND
COMPUTATIONAL GRAMMAR FOR URDU
SYED MUHAMMAD JAFAR RIZVI
THESIS SUBMITTED TO
DEPARTMENT OF COMPUTER AND INFORMATION SCIENCES
IN PARTIAL FULFILLMENT OF REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
PAKISTAN INSTITUTE OF ENGINEERING AND APPLIED SCIENCES
NILORE, ISLAMABAD 45650, PAKISTAN
March 2007
CERTIFICATE
Certified that the work contained in this thesis is carried out by
Mr. Syed Muhammad Jafar Rizvi under my supervision.
Submitted Through
DEDICATION
This document is dedicated to Professor Dr. Atta-ur-Rahman, TI, SI, HI, NI,
Chairman, Higher Education Commission, Pakistan. The research and development
(R&D) activity in our country was suffering due to paucity of funds for the higher
education as well as due to lack of priority and interest. However, presently plenty of
funds have been made available by Dr. Atta-ur-Rahman through HEC. He took
numerous initiatives including indigenous and foreign scholarship schemes, human
resource development, expatriate and foreign faculty hiring, quality assurance, etc.
He sponsored a number of research proposals submitted by various Universities of
Pakistan. The contribution of Dr. Atta-ur-Rahman will always be remembered for
promoting research activities in Pakistan.
This study has been sponsored by HEC under the
indigenous Ph.D. Scholarship Scheme 2001.
ACKNOWLEDGEMENTS
I thank my supervisor Dr. Mutawarra Hussain (DCS, PIEAS) for guiding me as a
Ph.D. student. Whenever he was busy during office hours, he gave me time after
office hours. He helped me in gathering books, papers, software tools, etc. Being in
the first batch of the indigenous Ph.D. program in Pakistan, I had to face difficulties,
and he was always with me in solving them. At many times I felt that I was not able
to perform well or to focus in a particular direction; he always encouraged me
through those difficult times.
I thank Dr. Miriam Butt (Professor, Universität Konstanz, Germany), who has
been helpful to me ever since I began choosing the topic for my Ph.D. work. She
continuously gave feedback on my work through email despite her busy schedule.
She sent me study material which was otherwise not available. Most of my work in
this thesis is based on her papers, books, handouts, email conversations and her
lectures at Lahore.
I thank Dr. Ron Kaplan (Xerox, USA), Dr. Tracy Holloway King (Xerox, USA)
and Dr. Miriam Butt for helping me and for providing linguistic software: the Xerox
Linguistic Environment (XLE), the Xerox Finite State Tool (XFST) and the Two-Level
Rule Compiler (TWOLC). These software tools proved useful during my research work.
I thank Dr. Mohammad Abid Khan (Chairman, Department of Computer
Science, University of Peshawar) for his guidance and study material. The study of
his Ph.D. thesis gave me insight into the topic. Discussions with him on machine
translation were useful in clarifying my point of view.
I thank Dr. Sarmad Hussain (Head CRULP, FAST NUCES, Lahore) for
providing me with reading material and for inviting me to the various study
sessions held at the CRULP, Lahore. His guidance helped me narrow down my
direction.
I thank Dr. Khalid Ibrahim and Dr. Saeed Ahmad Durrani, who took time from
their busy schedules to read my thesis thoroughly and gave valuable
suggestions. I thank Dr. Sikander Majid Mirza (DCS, PIEAS) for discussing my
topic and for his helpful advice. I thank Dr. Abdul Jalil, Dr. Anila Usman, Dr. Muhammad
Arif as well as all other faculty and staff at DCIS and other departments of
PIEAS. I thank all my fellow Ph.D. scholars at PIEAS, who encouraged me during
my Ph.D. studies. I thank my employer for the grant of study leave for the Ph.D. studies.
I thank all my family members, relatives and friends for their good wishes.
ABSTRACT
This work presents linguistics-based grammar modeling of the Urdu language
under the framework of Lexical Functional Grammar (LFG) and, at places, under
Head-driven Phrase Structure Grammar (HPSG). The grammar modeling has been
done by considering two interlinked parts: the morphology and the syntax.
Urdu has a rich verb morphology comprising 60 basic verb forms categorized
into infinitive, perfective, repetitive, subjunctive and imperative forms. The 60 forms
are not enough to represent all the features of Urdu verbs. Various verb features are
composed when verb auxiliaries and/or light verbs combine with these verb forms.
Linguistically, verb auxiliaries are needed to combine at the syntactic level. However,
this work shows that the grammar model is simplified and the complex agreement
requirements can be avoided if auxiliaries are lumped with verb forms at the lexical
level. The work proposes the analysis of perfective, progressive, repetitive and
inceptive aspects as well as the analysis of declarative, permissive, prohibitive,
imperative, capacitive, suggestive, compulsive, presumptive and subjunctive moods.
The structure of a passive is analyzed by assuming a default argument.
This work, based on differences in grammar modeling and conceptualization,
classifies Urdu case markers and post-positions into noun forms, core case markers,
functional case markers, possession markers and post-positions. Noun forms are
modeled morphologically using lexical transducers, possession markers require two
noun phrases, post-positions appear as adjuncts, while core and functional case
markers appear in the argument structure of verbs.
The use of semantic features is proposed to classify core and functional case
markers. In particular, the classification based on semantic features yields a
better taxonomy of the different instrumental cases in Urdu. This classification of
the instrumental case exposed the presence of indirect subjects for Urdu causative
verbs, which further suggests that some causative verbs are tetravalent, because the
argument structure of these verbs has four arguments.
The study of case-markers reveals that the agreement between a noun and a case
marker is difficult to handle. It is argued that the head of phrase should be a noun
because the resultant is a noun phrase, but features of the case marker also transfer to
the resultant phrase, therefore, a modification to head-feature rule is proposed. The
same argument also helped to reaffirm that Urdu case markers are different from Urdu
possession markers, which require a different rule needing two noun phrases as a
TABLE OF CONTENTS
Dedication
Acknowledgements
Abstract
Table of Contents
List of Tables
List of Figures
Symbols and Abbreviations
PART I: INTRODUCTION AND REVIEW
Chapter 1 Research Objectives
1.1 Objectives Statement
1.2 Domain of Investigation
1.3 Organization of Thesis
Chapter 2 Introduction To Machine Translation
2.1 Machine Translation (MT)
2.2 Challenges for Machine Translation
2.2.1 Lexical Ambiguity
2.2.2 Syntactic or Structural Ambiguity
2.2.3 Combined Lexical and Syntactic Ambiguity
2.2.4 Semantic Ambiguity
2.2.5 Reference or Anaphoric Ambiguity
2.3 Historic Landmarks
2.4 Machine Translation Architectures
2.4.1 Direct Words Transfer
2.4.2 Syntactic Transfer
2.4.3 Semantic Transfer
2.4.4 Interlingua
2.5 Machine Translation Phases
2.5.1 Analysis
2.5.2 Generation
2.6 Machine Translation Paradigms
2.6.1 Linguistics Based Approaches to Machine Translation
2.6.2 Non-Linguistics Approaches to Machine Translation
2.6.3 Artificial Intelligence Based Approaches to Machine Translation
2.6.4 Hybrid Paradigms
2.6.5 Other Paradigms
2.7 MT Route Followed in this Thesis
Chapter 3 Grammar Modeling
3.1 Lexical Functional Grammar (LFG)
3.1.1 Lexical Items and A-Structure
3.1.2 C-Structure
LIST OF TABLES
Table 2.1: Some Lexically Ambiguous English Words
Table 2.2: Some Lexically Ambiguous Urdu Words
Table 2.3: Some Polysemic English Words
Table 2.4: Some Polysemic Urdu Words
Table 2.5: Lexical Ambiguity in Urdu due to Absence of Diacritical Marks in Written Urdu
Table 2.6: Lexical Ambiguity in Urdu due to the same Middle Shape of two different Vowels
Table 3.1: Free SOV Phrase Order in Urdu
Table 4.1: Some Intransitive Verbs in Urdu
Table 4.2: Some Transitive Verbs in Urdu
Table 4.3: Some Original and Compound Ditransitive Verbs in Urdu
Table 4.4: Some Divalent Verbs Derived from Univalent Verbs
Table 4.5: Some Divalent Verbs Derived Irregularly from Univalent Verbs
Table 4.6: Some Trivalent Verbs Derived from Univalent Verbs
Table 4.7: Some Trivalent Verbs Derived from Divalent Verbs
Table 4.8: Some Tetravalent Verbs Derived from Divalent Verbs
Table 4.9: Infinitive Forms for Few Urdu Verbs
Table 4.10: Repetitive Forms for Few Urdu Verbs
Table 4.11: Regular Perfective Forms for Few Urdu Verbs
Table 4.12: Irregular Perfective Forms for Few Urdu Verbs
Table 4.13: Subjunctive Forms for Few Urdu Verbs
Table 4.14: Imperative Forms for Few Urdu Verbs
Table 4.15: Sixty Forms of Verb 'Read' in Urdu
Table 4.16: Sixty Forms of Verb 'Read' with Morphological Information
Table 4.17: Tenses in Reichenbachian Concept Relations
Table 4.18: Auxiliaries for Representing Tense in Urdu
Table 8.9: The Dependence of (a) Verb Morphemes (b) Auxiliary for the Object Agreement
Table 8.10: The Pattern of the Present-Perfect Tense using Perfective Auxiliary
Table 8.11: The Attributes Associated with the Aspectual Auxiliary Morphemes for the Agreement with a Nominative Subject
Table 8.12: The Urdu Imperative Verb Forms for the Imperative Mood
Table 9.1: Parsing by Chunking Results
LIST OF FIGURES
Figure 2.1: Machine Translation Architectures
Figure 3.1: Phrase Structure of a Sentence 'Haamed ney ketaab xareedee'
Figure 3.2: Phrase Structure of a Sentence 'Haamed ney naawel xareedee'
Figure 3.3: Phrase Structure using CFG in (32)
Figure 3.4: C-Structure of Sentence 'Haamed ney ketaab xareedee'
Figure 3.5: C-Structure to F-Structure Employing Mapping Function
Figure 3.6: C-Structure Nodes Numbered from Leaves to Top
Figure 3.7: C-Structure Schemata with F-Structure Labels
Figure 3.8: F-Structure derived from C-Structure
Figure 3.9: C-Structure of an Incorrect Sentence 'Haamed ney naawel xareedee'
Figure 3.10: Inconsistent F-Structure of 'Haamed ney naawel xareedee'
Figure 3.11: Incomplete F-Structure of Sentence 'Haamed ney xareedee'
Figure 3.12: Incoherent F-Structure of Sentence 'Haamed jaagaa ketaab'
Figure 3.13: F-Structure Transferred to English from Urdu
Figure 3.14: Correctly Mapped English F-Structure from Urdu
Figure 3.15: F-Structure of Sentences in (64)
Figure 3.16: F-Structure of Sentences in (65)
Figure 3.17: An Instance of AVM Sign in HPSG
Figure 3.18: Part of Inheritance Hierarchy of Signs in HPSG
Figure 4.1: Finite State Network for Urdu Verb Morphological Forms
Figure 4.2: Acyclic Deterministic Finite State Automata Representing Various Morphological Forms of Few Urdu Verbs
Figure 6.1: A Trie for Representing Urdu Words
Figure 6.2: An acyclic DFSA for Urdu Words
Figure 6.3: A Minimal Acyclic DFSA for Urdu Words
Figure 6.4: A Path in a Lexical Transducer for Urdu Noun 'laRkaa'
ACTION
ADJ
ASPECT
BASELANG
CASE
CONJ
GEND
N-CLASS
N-CONCEPT
N-FORM
N-TYPE
NUM
OBJ
PERS
PRED
PCASE
SEM
SPEC
SUBJ
TENSE
V-FORM
V-MOOD
V-VOICE
Values:
accusative         acc
dative             dat
ergative           erg
feminine           fem
first person       1st
genitive           gen
locative           loc
masculine          masc
no, not present    -
nominative         nom
oblique            obl
plural             pl
second person      2nd
singular           sg
third person       3rd
yes, present       +
vocative           voc
Adj
Adv
CM
PM
N
PP
V
VB
VM
CJC
CJS
CJR
IJ
PNS
PNO
PNP
PNR
PNI
NM
QM
FM
TM
TLE
NO
NC
AUX
APA
APrA
ARA
AIA
ACoM
ACaM
ASM
ADM
APeM
APrM
S
PPP
NP
VP
AJP
PART I: INTRODUCTION AND REVIEW
Chapter 1
RESEARCH OBJECTIVES
1.1 Objectives Statement
The objective of this Ph.D. work was to develop a computational grammar for
Urdu by investigating the formation of Urdu words and sentences; to find a
suitable mathematical formalism that can handle the various constructions of Urdu
grammar in a universal manner; to formulate grammar rules under the selected
framework; and to investigate and develop the associated algorithms.
While developing the computational grammar, the main application in view
was Machine Translation (MT) between English and Urdu. However, the
computational grammar thus developed may also be used for other Natural
Language Processing (NLP) applications, such as grammar checking, text
summarization, text categorization, information extraction, speech processing and
knowledge engineering.
1.2 Domain of Investigation
Linguistics-based and statistics-based approaches are the two main routes to the
development of a computational grammar. This study investigates the
linguistics-based grammar theories. Linguistics-based Natural Language
Processing (NLP) employs human knowledge of word and sentence structures to
formulate rules or equations that represent acceptable structures. Statistical
NLP, on the other hand, applies statistical pattern matching and other training
algorithms to the given data in order to learn the structure of the language.
The study investigates Urdu sentences represented as text, i.e., as sequences of
individual characters, as opposed to a sentence represented as a single image; it is
therefore not concerned with image processing or optical character recognition. The
study is divided into two parts: the study of the structure of word formation, i.e.,
morphology, and the study of the structure of sentence formation, i.e., syntax.
The word classes investigated under morphology are verbs, nouns and
adjectives. The Xerox finite-state lexicon compiler LEXC and the Xerox finite-state
tool XFST are used for the morphological analysis of Urdu words. LEXC has its own
language for entering lexical data and morphological information, and it builds a
finite-state network usually referred to as a lexical transducer. The lexical
transducer works in two directions: look-up takes the surface morphological form of
a word and returns its lexical form, while look-down takes a lexical form and
returns the corresponding surface morphological form.
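The two directions of a lexical transducer can be illustrated with a minimal Python sketch. It does not use LEXC or XFST and it replaces the finite-state network with two plain dictionaries, so it only mimics the look-up and look-down behaviour. The tag names (`+Noun`, `+Masc`, `+Sg`, `+Pl`, `+Nom`, `+Obl`) and the inflected transcription `laRkey` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (lexical form, surface form) pairs for the noun 'laRkaa'.
# Tags and the transcription 'laRkey' are illustrative assumptions.
PAIRS = [
    ("laRkaa+Noun+Masc+Sg+Nom", "laRkaa"),
    ("laRkaa+Noun+Masc+Sg+Obl", "laRkey"),
    ("laRkaa+Noun+Masc+Pl+Nom", "laRkey"),
]

def build_tables(pairs):
    """Index the pairs in both directions, as a stand-in for a transducer."""
    up = defaultdict(list)   # surface form -> all lexical analyses
    down = {}                # lexical form -> single surface form
    for lexical, surface in pairs:
        up[surface].append(lexical)
        down[lexical] = surface
    return dict(up), down

UP, DOWN = build_tables(PAIRS)

def look_up(surface):
    """Analysis: return every lexical form recorded for a surface word."""
    return UP.get(surface, [])

def look_down(lexical):
    """Generation: return the surface word for a lexical form, if any."""
    return DOWN.get(lexical)

print(look_up("laRkey"))
print(look_down("laRkaa+Noun+Masc+Sg+Nom"))
```

In this sketch one surface form can map to several analyses, which is why `look_up` returns a list while `look_down` returns a single string.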
For modeling Urdu syntax, sentences from frequently used constructions in
Urdu are investigated. Lexical Functional Grammar (LFG) is used for the
mathematical formulation of Urdu grammar. At places, the formulation is carried out
under Head-driven Phrase Structure Grammar (HPSG). Both of these grammar-modeling
theories are linguistics-based extensions of Context Free Grammar (CFG).
Although they differ in details, both have evolved from a single base and
both associate attributes and values with lexical entries. The well-known CFG
parsing algorithms work together with linguistics-based constraints and rules to
satisfy linguistic criteria. The Xerox Linguistic Environment (XLE) is used for
testing and validating the LFG formulation of Urdu grammar; XLE interfaces with the
morphological tool LEXC and provides the parsing and unification algorithms
required for LFG.
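To make the role of unification concrete, the following is a minimal Python sketch of feature-structure unification over nested dictionaries. It is not XLE's algorithm. The attribute names (PRED, OBJ, GEND, NUM) and values (fem, masc, sg) come from the list of symbols in this thesis, while the lexical entries, the PRED string and the gender assigned to each noun are assumptions made for illustration.

```python
def unify(f, g):
    """Unify two feature structures (nested dicts).

    Returns a new merged structure, or None when two atomic values clash.
    """
    result = dict(f)
    for attr, value in g.items():
        if attr not in result:
            result[attr] = value
        elif isinstance(result[attr], dict) and isinstance(value, dict):
            sub = unify(result[attr], value)
            if sub is None:
                return None          # clash inside an embedded structure
            result[attr] = sub
        elif result[attr] != value:
            return None              # clash between atomic values
    return result

# Hypothetical contributions of the object noun and of the verb form.
ketaab = {"OBJ": {"PRED": "ketaab", "GEND": "fem", "NUM": "sg"}}
naawel = {"OBJ": {"PRED": "naawel", "GEND": "masc", "NUM": "sg"}}
# Assumed: the verb form 'xareedee' requires a feminine singular object.
xareedee = {"PRED": "xareed<SUBJ,OBJ>", "OBJ": {"GEND": "fem", "NUM": "sg"}}

print(unify(ketaab, xareedee))   # a consistent structure
print(unify(naawel, xareedee))   # None: GEND values clash
```

Under these assumptions the second call fails because the GEND values disagree, which is the kind of clash that makes an f-structure inconsistent.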
Hash-table and deterministic finite-state automaton (DFA) minimization
algorithms for the implementation of the Urdu lexicon were explored and programmed.
Work on shallow parsing algorithms that exploit closed word classes in Urdu was also
carried out using a novel ordered context free grammar, in which each CFG production
rule carries two additional attributes, 'order' and 'type'. The algorithm has been
implemented so as to take advantage of the object-oriented paradigm.
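As a rough illustration of lexicon minimization, the Python sketch below stores a few words in a trie and then merges equivalent states bottom-up to obtain a minimal acyclic automaton. The merging strategy (a register keyed by each state's finality and outgoing edges) is one simple technique chosen for this sketch and is not necessarily the algorithm implemented in the thesis. The word list reuses transcriptions from Table 2.2.

```python
class State:
    """One automaton state: a finality flag and labelled outgoing edges."""
    def __init__(self):
        self.final = False
        self.edges = {}

def build_trie(words):
    """Insert each word character by character, sharing common prefixes."""
    root = State()
    for word in words:
        state = root
        for ch in word:
            state = state.edges.setdefault(ch, State())
        state.final = True
    return root

def minimize(state, register):
    """Merge equivalent states bottom-up (modifies the trie in place).

    `register` maps a signature (finality plus edges to already-merged
    targets) to the single state kept for that signature.
    """
    for ch, target in list(state.edges.items()):
        state.edges[ch] = minimize(target, register)
    signature = (state.final,
                 tuple(sorted((ch, id(t)) for ch, t in state.edges.items())))
    return register.setdefault(signature, state)

def count_states(root):
    """Count the distinct states reachable from the root."""
    seen, stack = set(), [root]
    while stack:
        s = stack.pop()
        if id(s) not in seen:
            seen.add(id(s))
            stack.extend(s.edges.values())
    return len(seen)

def accepts(root, word):
    """Return True when the automaton recognizes the whole word."""
    state = root
    for ch in word:
        state = state.edges.get(ch)
        if state is None:
            return False
    return state.final

WORDS = ["saonaa", "gaanaa", "khaanaa", "galaa"]
trie = build_trie(WORDS)
states_before = count_states(trie)
dfsa = minimize(trie, {})
states_after = count_states(dfsa)
print(states_before, states_after)
```

The trie shares only prefixes, whereas the minimized automaton also shares common suffixes such as the final `naa`, so it needs fewer states for the same word list.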
1.3 Organization of Thesis
The thesis is organized in three main parts. Part I (Chapters 1-3) comprises
the introduction, review and preliminary information on grammar modeling that forms
the context for the discussion in later chapters. In Part II (Chapters 4-6), the
work on Urdu morphology is presented. The characteristics and morphology of verbs,
nouns and adjectives in Urdu are investigated. The features necessary to model lexical
categories are identified. The algorithms for computational lexicon representation
are reviewed and implemented. In Part III (Chapters 7-9), the work on Urdu syntax
is presented. The modeling of nominal and verbal structure is carried out under the
framework of LFG by proposing novel ideas. A chunking-based parsing algorithm for
the Urdu language is proposed that utilizes an ordered context free grammar.
In Chapter 1, an objective statement is given, the domain of investigation for the
work carried out is defined and the organization of the thesis is described.
In Chapter 2, an introduction to the field of machine translation is given. The
ambiguities involved at various stages of machine translation are described
with reference to the English and Urdu languages. Data is presented to show that
Urdu has two further sources of lexical ambiguity in addition to the two sources
found in English. Some examples are presented to show that the
attachment of a prepositional phrase, which is the basic cause of syntactic ambiguity
in English, is rarely a cause of ambiguity in Urdu. However, Urdu has
some other sources of syntactic ambiguity, such as the attachment of a participle
adjunct, modifier scope within the noun phrase and conjunction scope. Various
machine translation paradigms are briefly reviewed. Linguistics-based
approaches typically rely on manual investigation of language features, in contrast
to non-linguistic approaches, which employ computational methods to extract
features automatically.
In Chapter 3, a brief review of grammar modeling is presented. Context free
phrase structure grammar modeling is compared with linguistics-based grammar
modeling, and the latter is found to be the better solution. The Lexical Functional
Grammar (LFG) formalism, a popular grammar modeling theory, is briefly reviewed
with examples from Urdu to determine the suitability of the framework for modeling
Urdu grammar. Head-driven Phrase Structure Grammar (HPSG) is another popular
theory for the grammar modeling of natural languages, a newer version of which
appeared in 2004. The chapter presents some basic features of HPSG theory and
explores its use in modeling noun-case agreement, noun-adjective agreement and
possession marking in Urdu. HPSG has the advantage of an object-oriented
architecture based on hierarchical inheritance. However, later chapters show
that grammar modeling using LFG is more language-neutral than modeling using HPSG.
Moreover, LFG covers linguistic variation across world languages in a more natural
manner.
In Chapter 4, Urdu verb morphology and characteristics are investigated.
Urdu, like some other languages, has intransitive, transitive and ditransitive verbs.
Urdu has three stem forms, named the root form, the causative form 1 and the
causative form 2. Each of these three stem forms is further divided into 20 verb
forms under five categories, i.e., infinitive, perfective, repetitive, subjunctive and
imperative verb forms. Hence, three stem forms with 20 forms each give
60 verb forms for a single Urdu verb. A finite-state automaton is presented to
represent these 60 forms. The tags necessary to distinguish person, gender, number,
respect, tense, aspect and mode are also tabulated.
In Chapter 5, Urdu noun morphology and characteristics are investigated. A
noun in Urdu has gender attribute for all nouns, but very few nouns in Urdu have
overt gender morpheme. The nouns have nominative form if they appear without a
Urdu may be separated into two lexical parts: (i) the root or stem of a verb, which
carries the principal meaning of a verb and contains information about the transitivity
and argument-structure; (ii) the inflectional morphemes and auxiliary verbs, which
carry information about tense, mood and aspect. The computational equations are
simpler under this approach; however, the alternative approach of combining verbs
and auxiliaries at the syntactic level has other advantages. The perfect,
progressive, repetitive and inceptive aspects in Urdu are modeled under LFG. The
declarative, permissive, prohibitive, imperative, capacitive and suggestive moods in
Urdu are modeled under LFG by presenting c-structures and f-structures.
In Chapter 9, parsing by chunking is explored, based on morphologically
closed word classes in Urdu and using a novel Ordered Context Free Grammar
(OCFG). The proposed OCFG rules have additional attributes, i.e., 'order' and 'type',
associated with each rule. The order of a rule employs linguistic features of words
to make chunks with neighboring words, e.g., a case marker is chunked with a noun to
make a noun phrase. The final parse is obtained after the chunks of basic phrases
have been made. While chunking and parsing derive the parse tree (i.e., the
c-structure), feature unification may be carried out simultaneously to improve the
proposed method.
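The chunking idea can be sketched in Python under stated assumptions. Each rule below carries an order and a type attribute in addition to its CFG production, but the precise semantics are guesses made for illustration: rules are applied in ascending order, and the type merely labels a rule as chunk-level or sentence-level. The three rules, the tag assignments for the sample sentence and the field name `rule_type` are hypothetical, and the sketch builds a bracketed tree only, with no feature unification.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    order: int        # assumed: lower order is applied first
    rule_type: str    # assumed: "chunk" (basic phrase) or "parse" (sentence)
    lhs: str
    rhs: tuple

# Hypothetical rules using category names from the list of symbols.
RULES = [
    Rule(1, "chunk", "NP", ("N", "CM")),       # noun + case marker
    Rule(2, "chunk", "NP", ("N",)),            # bare noun
    Rule(3, "parse", "S", ("NP", "NP", "V")),  # SOV sentence
]

def apply_rule(nodes, rule):
    """Scan left to right, replacing each match of rule.rhs by one node."""
    out, i, width = [], 0, len(rule.rhs)
    while i < len(nodes):
        window = nodes[i:i + width]
        if tuple(label for label, _ in window) == rule.rhs:
            out.append((rule.lhs, window))
            i += width
        else:
            out.append(nodes[i])
            i += 1
    return out

def parse(tagged, rules):
    """Apply the rules once each, in ascending order, to a tagged sentence."""
    nodes = [(tag, word) for word, tag in tagged]
    for rule in sorted(rules, key=lambda r: r.order):
        nodes = apply_rule(nodes, rule)
    return nodes

# Tag assignments for the sample sentence are assumed for illustration.
sentence = [("Haamed", "N"), ("ney", "CM"), ("ketaab", "N"), ("xareedee", "V")]
tree = parse(sentence, RULES)
print(tree)
```

Applying the noun and case-marker rule before the bare-noun rule keeps `ney` attached to `Haamed`, which is the reason an ordering attribute is useful in this sketch.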
In Chapter 10, the summary and conclusions of the work done in this thesis are
described. The applications of the work done and future directions are discussed.
In Appendix A, a roman script is proposed, which is used for the transcription
of Urdu sentences in this thesis. The characters of this roman script are selected in
such a way that computerized transfer of text from Urdu script to this roman script
is possible and vice versa. Care is also taken that the mapped characters in the two
scripts are phonetically the same or as close as possible. In Appendix B, the
algorithms for lexicon representation used for the lexicon implementation comparison
in Chapter 6 are given. In Appendix C, sample sentences for the chunking-based
parsing described in Chapter 9 are listed. In Appendix D, the constituent-structures
corresponding to the feature-structures given in Chapter 7 are included. In
Appendix E, the Urdu grammar implementation is listed in the coding format of the
Xerox tools. The morphology implementation code is in the format of LEXC. The
morphology-syntax interface code is used by XLE for porting the morphology
information to syntax. The listed syntax rules are coded in XLE format and generate
c-structures and f-structures for the Urdu sentences.
Chapter 2
INTRODUCTION TO MACHINE TRANSLATION
Natural languages are used by humans for communication among themselves, in
contrast to programming languages, which are used for communication between
humans and machines. Natural Language Processing (NLP) is the field that deals with
the computer processing of natural languages; it has been developed mainly by people
working in the field of Artificial Intelligence (AI). Computational Linguistics (CL)
deals with the computational aspects of natural languages, and this discipline has
been developed primarily by linguists. Currently there are many branches of NLP,
such as Machine Translation (MT), speech processing, information retrieval and text
summarization. Although the computational grammar developed in this work can be
utilized for various NLP applications, machine translation is the main application
targeted while developing the grammar.
2.1 Machine Translation (MT)
Machine Translation is the transfer of text from one natural language, known as
source language, to another natural language, known as target language, by means of
a computer program or a machine (Arnold, Balkan et al. 1994; Khan 1995; Hutchins
and Somers 1997; Trujillo 1999).
2.2 Challenges for Machine Translation
Machine Translation is a challenging problem. The challenge for machine
translation is to develop a grammar formulation that handles the different kinds of
ambiguities present in a source and a target language. These ambiguities
sometimes arise because no modeling theory yet supports a sufficiently robust
formulation of the grammar, and sometimes they are naturally present in the
sentences and require knowledge of semantics and pragmatics for their resolution.
Natural languages are multifaceted: where one language expresses a concept in one
way, another language represents the same concept in a different way. Modeling a
natural language under any linguistics-based grammatical theory is still a challenge.
Multiword units like idioms and collocations found in languages are difficult to
handle (Arnold, Balkan et al. 1994; Hutchins and Somers 1997). Anaphora and
cataphora resolution in discourse is a complex problem (Khan 1995). Some basic
ambiguity-related problems are reviewed below, along with examples from the
English and Urdu languages.
2.2.1 Lexical Ambiguity
Ideally, each word in a language should have a unique meaning, but in natural
languages many words have two or more interpretations. When a sentence becomes
ambiguous due to a single word, the ambiguity is called lexical ambiguity.
Lexical ambiguity may arise for two main reasons: (i) one word belongs to two or
more lexical categories; (ii) one word has more than one interpretation.
When a word belongs to more than one lexical category, it takes a different
meaning in each category, and these different meanings make the word ambiguous.
Such words hold membership of several categories in the lexicon. In this case of
lexical ambiguity, performing the syntactic analysis normally resolves the
ambiguity. Table 2.1 shows a few examples of such English words and Table 2.2
shows a few examples of such Urdu words.
Table 2.1: Some Lexically Ambiguous English Words
Word    Category   Meaning or usage      Category    Meaning or usage
fly     noun       an insect             verb        I want to fly
use     noun       the use of a knife    verb        do not use a knife
can     noun       a can of juice        auxiliary   I can write now
novel   noun       book, story           adjective   new, original
today   noun       today is Eid          adverb      we'll go today
Table 2.2: Some Lexically Ambiguous Urdu Words
Word       Meaning   Category   Meaning                      Category
xattaa     mistake   noun       to miss                      verb
saonaa     gold      noun       to sleep                     verb
galaa      throat    noun       a softened/cooked state      adjective
aetefaaq   unity     noun       by coincidence               adverb
gaanaa     song      noun       to sing                      verb
khaanaa    food      noun       to eat                       verb
Lexical ambiguity in which a word has different meanings within the same
lexical category is purely lexical in nature, and it cannot be resolved by
syntactic analysis. This property of words is often termed polysemy. Semantic and
contextual knowledge of the word usage is required to resolve the ambiguity.
Table 2.3 lists some polysemic English words, while Table 2.4 shows some polysemic
Urdu words.
Table 2.3: Some Polysemic English Words
Word      First meaning             Second meaning
bank      a financial institution   a side of a river
table     a tabulated information   a wooden furniture
film      a movie, a picture        a layer, a coating
cricket   a game                    an insect
mouse     a tiny animal             a computer instrument
ground    earth, soil, land         reason, base
Table 2.4: Some Polysemic Urdu Words
Word, meaning         Category     Word, meaning             Category
Jed, opposite         fem. noun    Jed, stubbornness         fem. noun
Sehat, correctness    fem. noun    Sehat, health             fem. noun
taareex, date         fem. noun    taareex, history          fem. noun
kal, tomorrow         masc. noun   kal, yesterday            masc. noun
haar, necklace        masc. noun   haar, defeat              fem. noun
kaan, ear             masc. noun   kaan, mine, excavation    fem. noun
AarJ, width           masc. noun   AarJ, request             fem. noun
Actual
Written
bel, hole of insects
bekree, sale
aen, these
aes, this
jeldee, of skin
Aaalam, world
Noun
Noun
Pronoun
Pronoun
Adjective
Noun
Actual
There is another lexical ambiguity in Urdu due to the two yey vowel shapes,
namely the big yey and the small yey. When either yey appears in its medial
shape within a word, both assume a single shape with two dots (noqtah) below.
This ambiguity between two vowel sounds permits two different words to be
written the same way. Some examples of such ambiguous words are shown in
Table 2.6. These words have different meanings but, as written words, they are
identical. The same shape of these vowels represents a
Word, meaning            Category   Word, meaning
sheyr, a lion            Noun       sheer, milk
keyaa, what              Verb       keeaa, did
bayn, whine              Noun       been, musical instrument
chayn, calmness          Noun       cheen, China
meyraa, mine             Noun       meeraa, a name
xayr, all right, fine    Noun       kheer, a sweet dessert
feys, face               Noun       fees, fee
beys, base               Adj.       bees, twenty
beyk, bake               Noun       bayk, back
The English sentence shown in (1) has a prepositional phrase "with a telescope",
which may be attached either to the verb "saw", giving the phrase "saw something
with a telescope", or to the object noun phrase "an astronomer", giving the noun
phrase "an astronomer with a telescope". Attachment to different syntactic units
results in the following two interpretations:
(1-a)
mayN ney xalaabaaz kao deykhaa jes key paas daorbeen thee
I [[saw]V [[an astronomer]NP [with a telescope]PP]NP]VP
I saw an astronomer who has a telescope; or
(1-b)
mayN ney xalaabaaz kao daorbeen sey deykhaa
I [[saw]V [an astronomer]NP [with a telescope]PP]VP
Using a telescope, I saw an astronomer.
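The two attachments in (1-a) and (1-b) can be made concrete by counting parse trees. The following is a minimal sketch, assuming a toy English grammar in Chomsky normal form and a seven-word lexicon. The rule set, the lexicon and the function name count_parses are illustrative assumptions, not part of the grammar developed in this thesis.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (assumed for illustration only).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),  # PP attached to the verb phrase, reading (1-b)
    ("NP", "NP", "PP"),  # PP attached to the noun phrase, reading (1-a)
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: word -> set of categories.
LEXICON = {
    "I": {"NP"}, "saw": {"V"}, "an": {"Det"}, "a": {"Det"},
    "astronomer": {"N"}, "telescope": {"N"}, "with": {"P"},
}


def count_parses(words, root="S"):
    """Count the distinct parse trees of `words` with a CKY-style chart."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, category) -> number of trees
    for i, word in enumerate(words):
        for category in LEXICON[word]:
            chart[(i, i + 1, category)] += 1
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, root)]


print(count_parses("I saw an astronomer with a telescope".split()))
```

Under these assumptions the sentence receives two trees, one per attachment site, while the same sentence without the prepositional phrase receives one.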
The romanization / transcription system used throughout this thesis for Urdu script is described
in Appendix A.
(2)
Similarly, the English sentence shown in (2) has the two interpretations shown
in (2-a) and (2-b). In (2-a), "student with an umbrella" is taken as a noun
phrase, while in (2-b) "with an umbrella" is attached to the verb "hit" as an
adjunct, which makes the umbrella the instrument used for hitting the student.
(2-a)
aostaad ney ttaaleb Aelam kao, jes key paas chatree thee, maaraa
A teacher hit a student who had an umbrella;
or,
(2-b)
aostaad ney ttaaleb Aelam kao chatree sey maaraa
A teacher hit a student by the use of an umbrella.
(3)
waseem ney karaachee kaa safar karekeT kheylney key leeey moltawee kar deeaa
Waseem cancelled the trip to Karachi because he is to play cricket; or
(3-b)
waseem ney karaachee kaa safar, jao karekeT kheylney key leeey thaa, moltawee kar deeaa
Waseem cancelled the trip which was for playing cricket at Karachi.
(4-a)
mayN bhool gayaa hooN keh ach.chey joos kaa Zaaeyqah kaysaa hay
I forgot [how [good juice] tastes]
(4-b)
mayN bhool gayaa hooN keh ketnaa ach.chaa joos kaa Zaaeyqah hay
I forgot [[how good] juice tastes].
(5)
(5-a)
aesey aakthar khaaney sey tom maotey hao jaa-ao gey
[Eating this] [often] will make you fat
(5-b)
aetnee dafAah khaaney sey tom maotey hao jaa-ao gey
[Eating] [this often] will make you fat.
(6-a)
reshtah daaraoN key ghar jaaney sey baoreeyat hao saktee hay
[Visiting]V [relatives]N [can be boring]; or
(6-b)
ghar aa.ney waaley reshtah daaraoN sey baoreeyat hao saktee hay
[[Visiting]ADJ relatives]NP [can be boring].
(7)
(7-a)
maaAeyyat kao Saaf karnaa noqSaan deh hao saktaa hay
[Cleaning]V [fluids]N [can be dangerous]; or
(7-b)
Saafaaee key maaAeyyat noqSaan deh hao saktey hayN
[[Cleaning]ADJ fluids]NP [can be dangerous].
(9)
[plastic] [cup holder] -or- [plastic cup] [holder]
The syntactic ambiguity due to conjunction scope is shown in sentence (10):
(10)
Small rats and mice can squeeze into holes or cracks in the wall.
[Small [rats and mice]] can squeeze into holes or cracks in the wall.
[[Small rats] and [mice]] can squeeze into holes or cracks in the wall.
Syntactic ambiguity exists in many sentences that are not simple, and it makes
parsing difficult. Different languages have different sources of syntactic
ambiguity. Urdu does not have prepositions; instead, it has postpositions. In
most sentences, the position of the postpositional phrase in the Urdu sentence
determines the syntactic unit to which it attaches, and thus resolves the
syntactic ambiguity. However, the problem may still be seen in the following
Urdu sentences. Sentence (11) is ambiguous but sentence (12) is not, because the
position of the postpositional phrase resolves the ambiguity.
(11)
waseem ney karekeT kheylney key leeey karaachee kaa safar moltawee kar deeaa
Waseem cancelled a trip to Karachi which was scheduled to play cricket.
(12)
waseem ney karaachee kaa safar karekeT kheylney key leeey moltawee kar deeaa
Waseem cancelled a trip to Karachi because he is to play cricket (say, at Lahore).
(13)
waseem ney aakram kao Aaynak lagaatey hooey deykhaa
Waseem while putting glasses on saw Akram, or
Waseem saw Akram who was wearing glasses.
(16)
aos ney khaanaa khaa leeaa keeoN-keh woh teyyaar thaa
He ate the meal because he/it was ready
If we compare sentences (18) and (19), then (18) has only one interpretation:
that we are ready for eating. Sentence (19) is similar to (18) both lexically and
syntactically, but it has two different interpretations. It may mean that the
chickens are cooked and ready for us to eat. The second meaning, parallel to the
meaning of sentence (18), is that the chickens are ready and waiting to eat food:
if we give them food, they will eat it.
(18)
ham khaaney key leeey teyyaar hayN
We are ready to eat
(19)
morgeeaN khaaney key leeey teyyaar hayN
The chickens are ready to eat food, or
The (cooked) chickens are ready (for someone) to eat
Sentence (20) has two interpretations. One is that there is no woman who can
drive a car; the other is that not all women can drive a car, but some can.
(20)
saaree AorateyN gaaRee naheeN chalaa sakteeN
All women cannot drive a car
Sentence (21) has the logical interpretation that each car is in a separate house,
so that there are as many houses as cars, whereas sentence (22) has the logical
interpretation that all the cars are in the same parking, i.e., there are many
cars in one parking. The lexical and syntactic structure of these sentences is the
same, but they require semantic or real world knowledge for interpretation.
(21)
har gaaRee ghar meyN khaRee hay
Each car is parked in a house. (The cars are parked in the houses).
(22)
har gaaRee paarkeng meyN khaRee hay
Each car is parked in a parking. (The cars are parked in a parking).
(23)
Akram was hungry and Ajmal was late from his work. He entered a restaurant.
The sentences in (23) have two pronouns, "his" and "he", each of which needs to
refer to a noun. The pronoun "his" may refer to Ajmal, which is the only masculine
noun in the same clause. However, to resolve "he" we need to refer to the previous
sentence. Both Ajmal and Akram are good candidates for binding, but if we consider
the semantics then only Akram could be referred to by "he" in the second sentence,
because Ajmal was already late from his office and has no reason to enter a
restaurant, whereas Akram was hungry and had a reason to enter one. Still, Ajmal
could be a candidate for binding if he works in a restaurant.
(24)
The sentences in (24) have three pronouns, "he", "they" and "they", which need
binding with nouns. The first pronoun "he" could be bound to Raheem easily, as it
is the only masculine noun. The pronoun "they" refers to two or more nouns, so it
could refer to all three nouns, i.e., Raheem, Maria and the nikah-khwan, or to any
two of them. If we somehow capture the semantic knowledge that marriage is between
a male and a female, then we are left with two combinations for binding with the
pronoun "they": Raheem-Maria and Maria-nikah-khwan. Again, there is the question
of whether the same persons who got married went for the honeymoon. Answering it
requires world knowledge about the relationship between a marriage and a
honeymoon. Thus binding of pronouns, or anaphora resolution, is deeply rooted in
the semantic knowledge base or ontology network available for the particular area
under discussion in a language.
(25)
1939, 1949, 1952, 1954, 1960, 1964, 1966, 1967, 1968, 1969, 1970, 1976, 1982, 1983,
1987, 1988, 1991, 1992, 1993, 1994, 1997
(26)
This is a book
yeh hay aeyk ketaab
Figure 2.1: Machine translation pyramid. Labels: Interlingua; Semantics Level
(Transfer); Syntactic Level (Transfer); POS; Tokens; Direct; Analysis; Generation;
Input Sentence (source language); Output Sentence (target language).
The pyramid figure shows that the difference between the source language and the
target language is smaller at the syntactic level than at the direct word transfer
level. Therefore, the results of machine translation are expected to be better at
the syntactic level than at the direct word transfer level.
2.4.3 Semantic Transfer
If the transfer between the source and target languages is made after semantic
analysis of the source language, so that a knowledge representation structure of
the source language is transferred to a semantic structure of the target language,
then the pyramid diagram shows that the difference between the two languages at
the semantic level is smaller still than at the syntactic level. Weighing the
difference that remains at the semantic level against the effort required to reach
the next, interlingua, level, we may conclude that machine translation at the
semantic level is acceptable for most of our MT requirements.
2.4.4 Interlingua
In an ideal MT process or architecture, the source language is fully translated to
an intermediate language, called interlingua, which is supposed to represent every
meaning of both source and target languages. As we go up the pyramid in Figure 2.1
the gap between source and target languages decreases, while the effort involved in
analysis and generation increases. For most of the MT applications, it is found that
syntactic or semantic transfer approach is acceptable.
2.5 Machine Translation Phases
The machine translation process is divided into two phases: the analysis phase
and the generation phase.
2.5.1 Analysis
The analysis phase of machine translation comprises tokenization, syntactic
analysis and semantic analysis, up to the interlingua, as shown in Figure 2.1. In
this phase, the sentence is tokenized into words. The words are categorized into
lexical categories known as parts of speech (POS). Morphological analysis is
performed to find the various forms of the same word. Syntactic analysis is
performed to find how words group into larger syntactic units, called phrases,
and the grouping of phrases into a sentence is checked for validity. Semantic
analysis of the source language text is performed to extract meaning from the
words and structural units of the text. For the interlingua process, the semantic
structures are converted to interlingua.
2.5.2 Generation
The generation means conversion of a computational representation, i.e.,
interlingua, semantic structure or syntactic structure into a sentence in the target
language. The grammar for this phase is called the generation grammar. The
generation is a reverse of analysis process as shown in Figure 2.1.
2.6 Machine Translation Paradigms
According to the handling or modeling of the problem, machine translation
paradigms are broadly classified into linguistics based, non-linguistics and artificial
intelligence based machine translation approaches. Recently, hybrid approaches,
which are a combination of basic approaches, are becoming popular. Although the
classification presented here does not have a clear boundary, and concepts seem to
overlap, yet the given classification is based on the primary approach involved for
accomplishing the machine translation.
2.6.1 Linguistics Based Approaches to Machine Translation
Approaches that use strong linguistic knowledge to drive the modeling process are
classified as linguistics based approaches to machine translation. These
approaches heavily enforce universal grammatical features in the modeling of
natural language grammars. The emphasis is on modeling analysis, transfer and
generation grammars based on the knowledge that humans possess about a language.
There are many distinct theories for modeling the grammars of various world
languages; each presents its own way of modeling language, and hence a separate
route to machine translation. The modeling of the Urdu language based on grammar
theories will be discussed in the next chapter. Some of the theories that are
stronger and more popular in describing various natural language requirements are
briefly introduced here:
Transformation Based Linguistics Approaches
The transformation based linguistics approaches consider that there is a basic
structure of the sentences in a language, and that this basic structure can be
generated by context free grammar rules and the given lexicon. Other valid
sentences in the language can be transformed to basic structures using
transformational grammar. For example, there are transformations in
transformational linguistics that convert a normal sentence into a question or
into a passive sentence.
Initially presented by Chomsky, the earlier versions of transformational
generative grammar (1960–1990) have changed significantly. Yet the basic nature of
transformational rules, which map base (deep) phrase structures to surface phrase
structures, remains intact. The changes to the framework are recorded
(Chomsky 1993) as follows:
1955–1964
1965–1970
1967–1974
1967–1980
1980–date
It uses conditional probability theory, in particular Bayes' rule, to find the
conditional probabilities that relate the word sequences of a source language
sentence to the corresponding word sequences of a target language sentence
(Manning and Schütze 2003).
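The use of Bayes' rule mentioned above is commonly written in the following form. This is the standard noisy-channel formulation rather than a formula taken from this thesis; the symbols f (source language sentence) and e (candidate target language sentence) are notational assumptions.

```latex
\[
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
\]
```

The denominator P(f) is dropped because it is the same for every candidate e. P(f | e) plays the role of the translation model and P(e) that of the target language model.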
Example Based Machine Translation (EBMT)
The Example Based Machine Translation (EBMT) system employs parallel corpora of
bilingual text to find correspondences between source and target language
sentences and phrases. It builds a database of example patterns of source language
sentences and phrases together with the corresponding sentences and phrases of the
target language. To translate, it searches the database for the source language
sentence pattern; if the pattern is found, it produces the translation using the
corresponding target language pattern available in the database.
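The lookup described above can be sketched as follows. This is a minimal illustration, assuming a one-entry example database with a single open slot X and a one-word bilingual word list. The names EXAMPLES, WORDS and translate, and the Urdu target pattern, are hypothetical and chosen only for this sketch.

```python
# Hypothetical example database: source pattern -> target pattern.
# Upper-case symbols such as "X" are open slots.
EXAMPLES = {
    ("this", "is", "a", "X"): ("yeh", "aeyk", "X", "hay"),
}

# Hypothetical bilingual word list used to fill the open slots.
WORDS = {"book": "ketaab"}


def translate(sentence):
    """Return the target sentence for a matching example pattern, else None."""
    words = tuple(sentence.lower().split())
    for source, target in EXAMPLES.items():
        if len(source) != len(words):
            continue
        slots = {}
        matched = True
        for symbol, word in zip(source, words):
            if symbol.isupper():      # open slot: remember the source word
                slots[symbol] = word
            elif symbol != word:      # fixed word must match exactly
                matched = False
                break
        if matched and all(word in WORDS for word in slots.values()):
            return " ".join(
                WORDS[slots[symbol]] if symbol in slots else symbol
                for symbol in target
            )
    return None


print(translate("This is a book"))
```

A sentence whose pattern or slot word is missing from the database yields None in this sketch, which reflects the dependence of the approach on the coverage of its examples.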
2.6.3 Artificial Intelligence Based Approaches to Machine Translation
The main features of the AI based approach to MT include the application of
semantic parsing (based on semantic categories, e.g. human, liquid, etc.), the
building of semantic (or conceptual) representations of the meanings of texts, and
the use of knowledge databases to assist in the interpretation of texts. Typically
included in the latter are representations of conventional event schemata (e.g.
what happens when going to a restaurant), normal inference patterns, and common
sense expectations. The approach primarily relies on established AI techniques
such as semantic networks, expert systems, neural networks and predicate logic.
For AI researchers, language understanding is the key to building a good MT
system.
Knowledge Based Machine Translation (KBMT)
The system or network used to represent knowledge is the basis of KBMT. The
knowledge is extracted from the input sentences and used during the analysis and
generation phases. During the 1980s, natural language understanding systems were
developed at Carnegie Mellon University with the help of the AI community. The AI
community's effort to find language independent knowledge representations resulted
in an AI based interlingua for knowledge representation. They considered MT to go
beyond purely linguistic information. Many attempts were made in various
universities around the world using this paradigm.
Neural Network Based Machine Translation
Work has been done with neural network technology on machine translation tasks
such as parsing, lexical disambiguation and learning of grammar rules. The
incorporation of neural networks and connectionist approaches into machine
translation systems is a relatively new area of investigation. Most of the work
carries out tests with small vocabularies and handles simple syntax. Handling
large vocabularies and grammars significantly inflates the size of the neural
networks and the training set, as well as the training time. In contrast with the
other approaches described here, no realistic MT systems have been built based
solely on neural network technology. This technology is thus more of a technique
than a system approach (Dorr 2000).
2.6.4 Hybrid Paradigms
The recent trend has been to combine the strengths of the different paradigms
while avoiding the difficulties of each. For example, the recent data oriented
parsing technique (Bod, Scha et al. 2003) employs statistical techniques with
linguistic grammars. Statistical techniques are not good at analyzing long
distance dependencies, while linguistic techniques have formulations for them.
Similarly, example based machine translation has difficulties with complex
sentence constructions (Dorr 2000).
2.6.5 Other Paradigms
The Shake & Bake Machine Translation (Beaven 1992) and Generate & Repair Machine
Translation (Naruedomkul and Cercone 2002) paradigms are similar to each other.
The basic approach is not to spend much effort on analysis of the source language.
After tokenization, the source language text is transferred to the target language
by direct word translation using a bilingual dictionary. In the shake (or generate)
step, the target language words are reordered into a new sequence under the
generation grammar rules of the target language. In the bake (or repair) step, new
words such as prepositions or auxiliaries are added, or word forms are replaced,
until a valid sentence is produced. If the bake (or repair) step does not produce a
valid target language sentence, then shake and bake (generate and repair)
continues until a valid sentence is produced.
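The shake and bake loop described above can be sketched in a few lines. This is a toy illustration under strong assumptions: a three-word bilingual dictionary, a validity check that accepts only the category sequence N CM N V, and a bake step that only tries to insert the case marker ney. The names DICTIONARY, CATEGORY, is_valid and shake_and_bake are hypothetical and do not come from the cited systems.

```python
from itertools import permutations

# Hypothetical bilingual dictionary (English -> romanized Urdu).
DICTIONARY = {"Hamid": "Haamed", "bought": "xareedee", "book": "ketaab"}

# Hypothetical lexical categories of the target words.
CATEGORY = {"Haamed": "N", "ney": "CM", "ketaab": "N", "xareedee": "V"}


def is_valid(words):
    """Toy generation grammar: accept only the sequence N CM N V."""
    return [CATEGORY[w] for w in words] == ["N", "CM", "N", "V"]


def shake_and_bake(source_words):
    # Direct word transfer; source words without a dictionary entry are dropped.
    bag = [DICTIONARY[w] for w in source_words if w in DICTIONARY]
    for order in permutations(bag):          # shake: try a new word order
        for pos in range(len(order) + 1):    # bake: try adding a case marker
            candidate = list(order[:pos]) + ["ney"] + list(order[pos:])
            if is_valid(candidate):
                return " ".join(candidate)
    return None


print(shake_and_bake("Hamid bought the book".split()))
```

The exhaustive search over permutations grows factorially with sentence length, so this sketch is only meant to show the control flow of the paradigm, not a practical generator.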
2.7 MT Route Followed in this Thesis
This thesis does not develop a complete machine translation system; however,
the computational grammar of Urdu developed in this work could be used in
developing an MT system. For developing the computational grammar, the constraint
based linguistic grammar development approach is adopted for modeling the Urdu
language, for the following main reasons:
Statistical language modeling techniques employ various sampling techniques
on large corpora of textual data. When this research work was initiated, Urdu text
corpora were not available. Text corpora are the basic requirement for
non-linguistics based approaches. Recently, Urdu text has become available in
Unicode through the BBC and Jang newspaper websites, and in books written in the
InPage software, which can now be employed for statistical analysis. Still, a lot
of work is needed to build parallel bilingual corpora of Urdu and English before
statistical algorithms can be utilized.
The linguistics-based grammar modeling tries to capture the actual phenomena
in the language as known to humans by studying various constructions in the
language. The structure is studied by comparing different instances of valid and
invalid sentences, which is a manual observation process. Based on various
constructions a grammar rule is developed under the grammar theory. This manual
comparison procedure of finding language structure is difficult to model, but if
modeled it is expected to be more accurate and reliable.
Linguistics-based grammar development takes a long time for the language under
consideration, but the phenomena captured can be reused across the whole range of
natural language processing applications. The statistics-based and example-based
language modeling techniques, on the other hand, employ computational techniques
instead of manual comparison to capture the features necessary for the application
and data at hand; porting the result to other NLP applications and other data
reduces its accuracy significantly.
The constraint based lexicalist approaches handle a wide variety of natural
language phenomena in a uniform manner without altering the surface structure of
the given sentences. These approaches are good for comparing the structure of
words and sentences in different languages. They could be used to build parallel
grammars for different languages, which could be employed to achieve machine
translation. The Lexical Functional Grammar based transfer between source and
target languages at the f-structure level is more reliable because it is close to
the interlingual approach.
An LFG based grammar need not change if the language pair for MT is changed, i.e.,
if we develop an LFG grammar for MT between Urdu and English, the grammar will be
the same if we add another language, say, German.
Chapter 3
GRAMMAR MODELING
The contemporary linguists' approach is that a sentence is acceptable if native
speakers say it sounds good. Thus, if a majority of native speakers accept a
sentence as valid, the sentence is considered good. In the view of formal language
theorists, a sentence is grammatical with respect to the grammar under
consideration if the grammar permits it by generating a parse tree of the
sentence. The grammar should not only accept good sentences but also reject bad
sentences. A grammar is good if it accepts good sentences, rejects bad sentences,
has fewer rules, and generates compact parse trees.
Mathematical modeling of the grammar of a natural language is one of the
solutions for artificial comprehension by a machine. The mathematically simplest
representation of a natural language grammar is the set of all the valid sentences
in the given natural language. As an infinite number of sentences can be generated
for any natural language, this approach is clearly infeasible. Even a large finite
set of valid sentences would not only require huge storage space, but searching it
for valid or invalid sentences would also be expensive in time. Thus, the solution
is feasible neither in space nor in time.
Next, we consider the formal grammar theory proposed by Chomsky. The simplest
formal grammar in the Chomsky hierarchy is the regular grammar (Hopcroft and
Ullman 1979; Martin 1991). A regular grammar can model the morphotactics of words
in the lexicon of a natural language, and thus can handle morphology requirements,
but phrase structure and syntax are beyond the descriptive power of this class of
grammars. This fact is proved in books on formal grammar theory under the heading
of the pumping lemma for regular languages. Therefore, using a regular grammar to
model natural language syntax is like modeling a circle using a single straight
line.
The languages defined by context-free grammar (CFG) rules are one class higher in
the Chomsky hierarchy than the class of regular languages. The descriptive power
of a CFG is like modeling a circle using many straight lines: a CFG can model
natural languages using a large set of rules, but it still only approximates the
actual phenomenon. However, CFG rewriting rules are fully capable of representing
programming languages. Other problems of CFG based modeling of natural languages
will be given at the end of this section, after introducing some linguistic
properties of natural languages. We start with a small fragment of phrase
structure rules for the Urdu grammar based on CFG, shown in (28); the lexicon
entries corresponding to this grammar are shown in (29):
(28)
S  → NP* V
NP → N CM
NP → N
(29)
N
N
N
CM
V
V
Each production in (28) is a rewrite rule. Each symbol on the left hand side of an
arrow (→), called a non-terminal, can be replaced with the symbols on the right
hand side of the arrow. The Kleene star (*) denotes zero or more repetitions. The
symbol S stands for sentence, NP for noun-phrase, CM for case-marker and V for
verb. The verb V in Urdu is usually a form derived from the basic (maSdar) form
using predefined Urdu rules of morphology. It contains information about the
tense, gender and number involved. In Urdu, it may be a complex-predicate
construction (Butt 1995). The words or lexical items listed in (29) are terminals.
Each non-terminal must be replaced with some terminal to generate a sentence in
the represented language. Parsing sentence (30) with a bottom-up technique yields
the phrase structure tree (also called parse tree) shown in Figure 3.1:
(30)
Haamed ney ketaab xareedee
Hamid bought the book.
Figure 3.1: [S [NP [N Haamed] [CM ney]] [NP [N ketaab]] [V xareedee]]
The parse tree assigns the proper grammatical categories to the respective lexical
items. However, the same CFG rules also parse the incorrect sentence (31), as
shown in the parse tree of Figure 3.2:
(31)
*Haamed ney naawel xareedee
Hamid bought the novel.
Figure 3.2: [S [NP [N Haamed] [CM ney]] [NP [N naawel]] [V xareedee]]
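The overgeneration of grammar (28) can be reproduced with a small recognizer. The sketch below is a simple left-to-right procedure written for these three rules only, not the bottom-up parser used in this work. The lexicon is an assumption built from the romanized words of sentences (30) and (31), and the function name parse is hypothetical.

```python
# Assumed lexicon, taken from the romanized words of sentences (30) and (31).
LEXICON = {
    "Haamed": "N", "ketaab": "N", "naawel": "N",
    "ney": "CM", "xareedee": "V",
}


def parse(sentence):
    """Return a bracketed tree for S -> NP* V, NP -> N CM | N, or None."""
    words = sentence.split()
    categories = [LEXICON.get(word) for word in words]
    if None in categories:
        return None  # unknown word
    phrases = []
    i = 0
    while i < len(categories) and categories[i] == "N":
        if i + 1 < len(categories) and categories[i + 1] == "CM":
            phrases.append(f"[NP [N {words[i]}] [CM {words[i + 1]}]]")  # NP -> N CM
            i += 2
        else:
            phrases.append(f"[NP [N {words[i]}]]")                      # NP -> N
            i += 1
    # Exactly one verb must remain at the end of the sentence.
    if i == len(categories) - 1 and categories[i] == "V":
        phrases.append(f"[V {words[i]}]")
        return "[S " + " ".join(phrases) + "]"
    return None


print(parse("Haamed ney ketaab xareedee"))
print(parse("Haamed ney naawel xareedee"))
```

Both calls return a tree, because the categories N and V carry no gender or number information; the grammar has no means of rejecting the second sentence.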
To handle gender and number agreement through CFG, we can change the grammar given
in (28) by introducing more specific categories of verbs and nouns, as given in
(32). This grammar does not cover the full agreement problem in Urdu, only
object-verb agreement, without case marking:
(32)
NP_sg_fem  → N_sg_fem
NP_pl_masc → N_pl_masc
NP_pl_fem  → N_pl_fem
NP → N CM
NP → N
(33)
N
N_sg_fem
N_sg_masc
(34)
CM
V_sg_fem
V_sg_masc
The asterisk symbol (*) is used to represent a grammatically incorrect sentence or a syntactic unit.
The incorrect sentence (31) is corrected in (34), and the parse tree of the correct
sentence, based on the CFG given in (32) and the lexicon given in (33), is shown in
Figure 3.3. No parse tree can be generated for the incorrect sentence under the
modified CFG.
Figure 3.3: [S [NP N CM] [NP_sg_masc N_sg_masc] V_sg_masc]
• Verb form needs to agree with third person subject noun in present tense.
• Verbs have different transitivity and require different number and type of
  complements or modifiers.
• Coordination between phrases requires phrases of the same nature.
Some of the Urdu grammar modeling requirements, in addition to the above-mentioned
English modeling requirements, are as follows:
• Verb form needs to agree sometimes with subject noun and sometimes
  with object noun in various tenses/aspects.
• Nouns in Urdu bear gender; therefore, gender agreement with verb is also
  required, which also has dependency on tense/aspect.
• Noun-case agreement is required for perfective verb forms. The verb
  agrees with highest nominative noun phrase, if there is any nominative
  noun phrase in the sentence, otherwise verb gets default singular-masculine
  agreement.
• Nouns appear in different forms like nominative, oblique and vocative,
  which need agreement.
• Adjectives sometimes require agreement with nouns and sometimes they
  do not.
• Free phrase order may occur in Urdu sentences.
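The perfective agreement requirement in the list above can be stated as a short procedure. The following is a minimal sketch, assuming that noun phrases are given as dictionaries ordered from the highest grammatical function (subject) to the lowest, and that case values are written "nom", "erg" and "acc". The representation and the function name verb_agreement are illustrative assumptions.

```python
def verb_agreement(noun_phrases):
    """Return the (number, gender) the perfective verb should agree with.

    `noun_phrases` is assumed to be ordered from the highest grammatical
    function (subject) to the lowest.
    """
    for phrase in noun_phrases:
        if phrase["case"] == "nom":
            # Agree with the highest nominative noun phrase.
            return (phrase["num"], phrase["gen"])
    # No nominative noun phrase: default singular-masculine agreement.
    return ("sg", "masc")


# Haamed ney ketaab xareedee: ergative subject, nominative feminine object.
sentence_30 = [
    {"word": "Haamed", "case": "erg", "num": "sg", "gen": "masc"},
    {"word": "ketaab", "case": "nom", "num": "sg", "gen": "fem"},
]
print(verb_agreement(sentence_30))
```

For sentence (30) the sketch selects the feminine singular object ketaab, which matches the feminine perfective verb form xareedee.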
In LFG, lexical items are stored along with their syntactic category and
functional schemata. The list in (35) presents a few of the lexical entries used
in the LFG based lexicon:
(35)
Haamed     N    (↑ PRED) = 'Haamed'
                (↑ PERS) = 3rd
                (↑ NUM) = sg
                (↑ GEN) = masc
ketaab     N    (↑ PRED) = 'ketaab'
                (↑ NUM) = sg
                (↑ GEN) = fem
naawel     N    (↑ PRED) = 'naawel'
                (↑ NUM) = sg
                (↑ GEN) = masc
xareedee   V    (↑ PRED) = 'xareednaa <(↑ SUBJ), (↑ OBJ)>'
                (↑ TENSE) = Past
                (↑ OBJ NUM) = sg
                (↑ OBJ GEN) = fem
xareedaa   V    (↑ PRED) = 'xareednaa <(↑ SUBJ), (↑ OBJ)>'
                (↑ TENSE) = Past
                (↑ OBJ NUM) = sg
                (↑ OBJ GEN) = masc
ney        CM   (↑ CASE) = erg
                (SUBJ ↑)
The symbol ↑ refers to the predicate under which the current entry is found. Each
noun and verb entry has information about its number and gender. The verb entry
has the normal predicate form, e.g., xareednaa is the basic maSdar verb form for
the singular masculine perfective form xareedaa as well as for the singular
feminine perfective verb form xareedee. The angle brackets enclose the argument
structure. The argument structure <(↑ SUBJ), (↑ OBJ)> for the predicate xareednaa
indicates that the current predicate requires both subject and object noun phrases
as required arguments.
3.1.2 C-Structure
(36)
S   →   NP              NP             V
        (↑ SUBJ) = ↓    (↑ OBJ) = ↓    ↑ = ↓
NP  →   N        CM
        ↑ = ↓    (SUBJ ↑)
NP  →   N
        ↑ = ↓
The symbol ↑ refers to the f-structure of the mother node, while the symbol ↓
refers to the f-structure of the current node. The resultant c-structure for
sentence (30), reproduced for clarity as sentence (37), is shown in Figure 3.4.
(37)
Haamed ney ketaab xareedee
Hamid bought the book.
Figure 3.4:
S
  NP  (↑ SUBJ) = ↓
    N   ↑ = ↓        Haamed
    CM  (SUBJ ↑)     ney
  NP  (↑ OBJ) = ↓
    N   ↑ = ↓        ketaab
  V   ↑ = ↓          xareedee
3.1.3 F-Structure
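(38)
\[
\begin{bmatrix} a_1 & v_1 \\ a_2 & v_2 \end{bmatrix}
\]
(39)
\[
\begin{bmatrix}
a_1 & v_1 \\
a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \\
a_4 & \begin{bmatrix} a_5 & \begin{bmatrix} a_6 & v_6 \end{bmatrix} \\ a_7 & v_7 \end{bmatrix} \\
a_8 & \left\{ \begin{bmatrix} a_9 & v_9 \end{bmatrix},\ \begin{bmatrix} a_{10} & v_{10} \end{bmatrix} \right\}
\end{bmatrix}
\]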
The attributes a1, a3, a6, a7, a9 and a10 have simple values. The attributes a2,
a4, and a5 have f-structures as values. The attribute a8 has a set of two
f-structures as value. Set values are represented using curly brackets: { and }.
A set can contain one or more values of simple or f-structure type.
A process in which two or more f-structures are combined to form a single
f-structure is called unification. The operator ⊔ is used for the unification
operation. The unified f-structure contains the attributes of each of the
combining f-structures according to the following two rules:
Rule 1: If combining f-structures have different attributes, each attribute will be
added to unified f-structure with the corresponding value.
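(40)
\[
\begin{bmatrix} a_1 & v_1 \\ a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \end{bmatrix}
\sqcup
\begin{bmatrix} a_4 & v_4 \end{bmatrix}
=
\begin{bmatrix} a_1 & v_1 \\ a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \\ a_4 & v_4 \end{bmatrix}
\]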
Rule 2: If combining f-structures have one or more of the same attributes, each
of these attributes will unify only if either (i) they have identical values or (ii)
the attribute is of type set, which can hold different values of the same type.
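(41)
\[
\begin{bmatrix} a_1 & v_1 \\ a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \end{bmatrix}
\sqcup
\begin{bmatrix} a_1 & v_1 \end{bmatrix}
=
\begin{bmatrix} a_1 & v_1 \\ a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \end{bmatrix}
\]
\[
\begin{bmatrix} a_1 & \{ v_1 \} \\ a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \end{bmatrix}
\sqcup
\begin{bmatrix} a_1 & \{ v_4 \} \end{bmatrix}
=
\begin{bmatrix} a_1 & \{ v_1, v_4 \} \\ a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \end{bmatrix}
\]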
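However, the unification shown in (42) results in an inconsistent f-structure,
because the attribute a1 has multiple values.
(42)
\[
\begin{bmatrix} a_1 & v_1 \\ a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \end{bmatrix}
\sqcup
\begin{bmatrix} a_1 & v_4 \end{bmatrix}
=
\begin{bmatrix} a_1 & v_1 \\ a_1 & v_4 \\ a_2 & \begin{bmatrix} a_3 & v_3 \end{bmatrix} \end{bmatrix}
\quad \text{(inconsistent f-structure)}
\]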
If there are nested f-structures, they may have the same attribute in the inner and
outer f-structure, which may have the same or different values. The same attribute a1
in (43) has different values v1 and v3, but is valid because the attribute is a member of
the separate f-structures.
(43)
[ a1 v1, a2 [ a1 v3 ] ] ⊔ [ a1 v1 ] = [ a1 v1, a2 [ a1 v3 ] ]
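The two unification rules and the examples (40) to (43) can be illustrated with a minimal Python sketch. It assumes that f-structures are modelled as dictionaries, simple values as strings, and set values as frozensets of simple values; the names `unify` and `UnificationError` are hypothetical. Recursing into nested f-structures is an assumption that goes beyond the two rules as stated.

```python
class UnificationError(ValueError):
    """Raised when two f-structures give one attribute conflicting values."""


def unify(f, g):
    """Return the unification of f-structures f and g (plain dicts)."""
    result = dict(f)
    for attr, g_val in g.items():
        if attr not in result:
            # Rule 1: an attribute found in only one f-structure is copied.
            result[attr] = g_val
            continue
        f_val = result[attr]
        if isinstance(f_val, dict) and isinstance(g_val, dict):
            # Assumption: nested f-structures are unified recursively.
            result[attr] = unify(f_val, g_val)
        elif isinstance(f_val, frozenset) and isinstance(g_val, frozenset):
            # Rule 2 (ii): set-valued attributes collect all values.
            result[attr] = f_val | g_val
        elif f_val != g_val:
            # Rule 2 (i) is violated: same attribute, different values.
            raise UnificationError(f"{attr}: {f_val!r} conflicts with {g_val!r}")
    return result


f = {"a1": "v1", "a2": {"a3": "v3"}}
print(unify(f, {"a4": "v4"}))   # the case shown in (40)
```

Calling `unify(f, {"a1": "v4"})` raises `UnificationError`, which corresponds to the inconsistent f-structure in (42).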
If there are nested f-structures, the inner and outer f-structures may also share the same attribute with the same value. For example, attribute a1 in (44) has the same value v1 in both places. Usually, such a common value in an f-structure is shown for only one attribute, while for the other attribute it is represented by placing the same boxed number at both places or by drawing an arrow.
(44)
[ a1  v1
  a2  [ a1  v1 ]
  a4  v4 ]
By co-indexing:
[ a1  [1] v1
  a2  [ a1  [1] ]
  a4  v4 ]
By drawing an arrow:
[ a1  v1  <--+
  a2  [ a1 --+ ]
  a4  v4 ]
f2
f4
( A) =
f5 = f3
f3
=
[A
f4 ]
f1
f0
f 2 = f 0 f1
To derive the f-structure from the c-structure, we start from the leaf nodes. Each leaf
node in the c-structure is labeled with a unique number representing the f-structure of the
corresponding node. The leaf nodes get the values of their attributes from lexicon entries. For
the c-structure shown in Figure 3.6, the first N node gets its attribute values from the lexical
entry for Haamed, CM from ney, the second N from ketaab and V from xareedee.
Each up arrow (K) in Figure 3.6 is then replaced with the numbered name of the mother
f-structure, while each down arrow (L) is replaced with the numbered name of the current
node; the result is shown in Figure 3.7. The values of the leaf f-structures f0, f1, f2 and
f3, constructed from the LFG-based lexicon shown in (35), are shown in (45) to (48).
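[Figure 3.6: c-structure of sentence (37) with numbered f-structure labels]
f6: S
  f4: NP  (K SUBJ)=L
    f0: N   (K=L)     Haamed
    f1: CM  (SUBJ K)  ney
  f5: NP  (K OBJ)=L
    f2: N   (K=L)     ketaab
  f3: V   (K=L)       xareedee

[Figure 3.7: the same c-structure with the arrows replaced by f-structure names]
f6: S
  f4: NP  (f6 SUBJ)=f4
    f0: N   (f4=f0)   Haamed
    f1: CM  (SUBJ K)  ney
  f5: NP  (f6 OBJ)=f5
    f2: N   (f5=f2)   ketaab
  f3: V   (f6=f3)     xareedee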
(45)
f0 = [ PRED  'Haamed'
       PERS  3rd
       NUM   sg
       GEND  masc ]
(46)
f1 = [ CASE erg ] and a constraint that f1 is the value of the attribute SUBJ of
(47)
f2 = [ NUM   sg
       GEND  fem ]
(48)
f3 = [ PRED   'xareednaa <(K SUBJ), (K OBJ)>'
       TENSE  past
       OBJ    [ NUM sg, GEND fem ] ]
(49)
f4 = f0 ⊔ f1
f5 = f2
(50)
f6 = f3
(f6 SUBJ) = f4
(f6 OBJ) = f5
(51)
f4 = f0 ⊔ f1 = [ PRED  'Haamed'
                 PERS  3rd
                 NUM   sg
                 GEND  masc
                 CASE  erg ]
(52)
f6 = f3 = [ PRED   'xareednaa <(K SUBJ), (K OBJ)>'
            TENSE  past
            SUBJ   [ PERS 3rd, NUM sg, GEND masc, CASE erg ]
            OBJ    [ NUM sg, GEND fem ] ]
The derived f-structure must fulfill the consistency, completeness and coherence
conditions for well-formed sentences (Bresnan 2001; Dalrymple 2001).
(53)
*Haamed ney naawel xareedee
Hamid bought the novel.

[Figure: annotated c-structure of sentence (53)]
S
  NP  (K SUBJ=L)
    N   (K=L)     Haamed
    CM  (SUBJ K)  ney
  NP  (K OBJ=L)
    N   (K=L)     naawel
  V   (K=L)       xareedee
The attribute GEND of the object of the verb xareedee gets the value fem.
A part of the f-structure of the verb with the gender attribute is shown in (54).
(54)
[ OBJ [ GEND fem ] ]
(55)
[ OBJ [ GEND masc ] ]
The attributes in the f-structures (54) and (55) are clearly inconsistent, because
the single attribute GEND has two different values, masc and fem. These f-structures
therefore cannot unify, as shown in Figure 3.10, which depicts the f-structure of the
sentence. The source sentence (53) is rejected by the consistency condition and
declared grammatically incorrect, because its f-structure has inconsistent values
for the gender attribute. Similarly, other verb agreement requirements with the
object noun or the subject noun, such as agreement for number, person and case,
could be checked.
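The consistency check described above can be sketched in Python as a comparison of two flat attribute-value maps. The function name `conflicting_attributes` and the dictionary encoding are illustrative assumptions; the values come from the partial f-structures in (54) and (55) and from the earlier entry for ketaab.

```python
def conflicting_attributes(required, actual):
    """Return the attributes that both maps define with different values."""
    return sorted(
        attr for attr in required
        if attr in actual and required[attr] != actual[attr]
    )


# Agreement that the verb form xareedee imposes on its object.
verb_object = {"NUM": "sg", "GEND": "fem"}
# Agreement features contributed by the object nouns themselves.
naawel = {"NUM": "sg", "GEND": "masc"}
ketaab = {"NUM": "sg", "GEND": "fem"}

print(conflicting_attributes(verb_object, naawel))  # GEND clashes, as in (53)
print(conflicting_attributes(verb_object, ketaab))  # no clash
```

A non-empty result means that the f-structures cannot unify, so the sentence is rejected by the consistency condition.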
PRED
TENSE
SUBJ
OBJ
past
PERS 3rd
NUM sg
GEND
masc
CASE erg
NUM sg
GEND fem
GEND
masc
a
*
*Haamed ney xareedee
Hamid bought
TENSE
SUBJ
past
PRED
'
Haamed
'
PERS 3rd
NUM sg
GEN
masc
CASE erg
a
*
* Haamed jaagaa ketaab
Hamid woke up book
PRED
TENSE
SUBJ
OBJ
past
NUM sg
GEND
masc
CASE erg
GEND fem
(58)
xareedee
(59)
TENSE past
NUM sg
with constraint OBJ
GEND fem
(60)
xareedee
(61)
PRED
TENSE
f3 =
OBJ
(K
(K
(K
(K
past
NUM
sg
GEND fem
TENSE
SUBJ
OBJ
past
PERS 3rd
NUM sg
GEND
masc
NUM sg
GEND masc
It should be noticed that in English f-structure shown in Figure 3.13 the gender
for the book is superfluous and it should be discarded. A sentence generated using
English f-structure of Figure 3.13 would be:
(62)
from the noun the correctly mapped f-structure is generated, which is shown in Figure
3.14. The generation of English sentence from this correctly mapped f-structure will
result in correct English sentence, which is shown in (63):
PRED
TENSE
SUBJ
OBJ
past
PRED
'
'
Hamid
PERS 3rd
NUM sg
GEND masc
NUM sg
SPEC a
(63)
a a a
(64)
Roman Script                              English
[Haamed ney]S [ketaab]O [xareedee]V       Hamid bought the book.
[Haamed ney]S [xareedee]V [ketaab]O       (same for all six orders)
[ketaab]O [Haamed ney]S [xareedee]V
[ketaab]O [xareedee]V [Haamed ney]S
[xareedee]V [ketaab]O [Haamed ney]S
[xareedee]V [Haamed ney]S [ketaab]O
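The six orderings listed in (64) are exactly the permutations of the three constituents S, O and V. A small Python sketch, using only the standard library and the bracketed labels as plain strings (an illustrative encoding), enumerates them:

```python
from itertools import permutations


def constituent_orders(constituents):
    """Return every linear ordering of the given constituents as a string."""
    return [" ".join(order) for order in permutations(constituents)]


orders = constituent_orders(["[Haamed ney]S", "[ketaab]O", "[xareedee]V"])
for order in orders:
    print(order)
```

The sketch only enumerates orders; it does not judge which of them are natural or marked in Urdu.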
(65)
a
a a
a
a
a
a
a a
a a
a a
a a
a a
a
Hamid called
Hameed.
PRED
TENSE
SUBJ
OBJ
past
PERS 3rd
NUM sg
GEN
masc
CASE erg
NUM sg
GEN fem
CASE
nom
PRED
TENSE
SUBJ
OBJ
past
PERS 3rd
NUM sg
masc
GEN
CASE erg
NUM sg
GEN masc
acc
CASE
nominative, i.e., nom. The entries for case markers are shown in (66) which appear
in proposed Urdu lexicon based on LFG.
(66)
CM   (K CASE) = erg
     (SUBJ K)
CM   (K CASE) = acc
     (OBJ K)
The absence of a case marker means that CASE = nom.
Head-Driven Phrase Structure Grammar (HPSG) is closely related to LFG in various
features, but there are also many differences between LFG and HPSG. HPSG has been
greatly influenced by Generalized Phrase Structure Grammar (GPSG). HPSG was
formulated and proposed in two works of Carl Pollard and Ivan Sag (Pollard and Sag
1987; Pollard and Sag 1994), which remained the reference books in the field until 2004.
In 2004, a revised version of HPSG appeared (Sag, Wasow et al. 2004). The key
features of HPSG 2004 are listed below:
1. Constraint-based generative grammar: constraints are
applied to the phrase structure grammar.
2. Non-derivational, surface-oriented approach: it has no
transformations to change the actual structure of the sentence. It analyzes
the actual order of words of the given sentence.
3. Unification-based approach: the features of the mother in a phrase structure
are related to its daughters through unification, which is achieved by
observing constraints and certain principles.
4. Highly lexicalist: the theory keeps its information in the lexicon, and this
information is even richer than in LFG.
5. Signs: feature structures are known as signs, which are attribute-value matrices (AVMs). Signs are the nodes in phrase structure rules.
6. Inheritance: signs follow an inheritance hierarchy. The sub-classes
inherit attributes and their values from their super classes.
7. Head: each phrase is driven by a sign, known as the head of the phrase.
8. Principles: various principles are applied during unification.
3.4.1 Signs and Inheritance
The attribute-value matrices (AVMs) in HPSG are called signs (Sag, Wasow et
al. 2004). Signs follow the notion of object orientation. Each sign belongs to a specific
type or class. A sign can be derived from another sign through inheritance. The
derived sign inherits all the features of its base classes and can add more features to
the inherited features. Figure 3.17 shows an example of a sign in HPSG; each sign has a
particular type and contains feature-value pairs in the form of a matrix.
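[Figure 3.17: an example of a sign in HPSG, showing the sign type on top, sign features on the left and sign values on the right]
[ word
  PHON  haamed
  HEAD  [ noun
          AGR [ PERS 3rd, NUM sg, GEND masc ] ]
  VAL   [ ] ]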
Figure 3.18 shows a part of the inheritance hierarchy of signs in HPSG. Each sign
is a feature-structure, which contains feature-value pairs. The type expression, derived
directly from feature-structure, contains a HEAD attribute and a VAL (valence)
attribute. Thus, word and phrase inherit the HEAD and VAL features from expression.
The word sign adds the feature ARG-ST (argument structure). The type part-of-speech has
the AGR (agreement) feature, which is inherited by derived classes such as verb and noun.
[Figure 3.18: part of the inheritance hierarchy of signs in HPSG]
feature-structure
  expression  [ HEAD, VAL ]
    word  [ ARG-ST ]
    phrase
  part-of-speech  [ AGR ]
    verb  [ FORM, CASE ]
    noun  [ FORM ]
    case-marker
    adjective
  ...
In HPSG, a feature can take only values of a specified type, unlike LFG, where
any type of value, an f-structure, or even a set of values can be assigned to an attribute.
Features or attributes in HPSG cannot take undefined values: for example, the HEAD
feature takes values of type part-of-speech, and the VAL feature takes values of type
valance-category, which contains the features COMPS (complements) and SPR
(specifier). HPSG is thus strictly typed as compared with LFG.
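The object-oriented flavour of signs and the strict typing of HEAD can be sketched with Python dataclasses. This is only an illustration under stated assumptions: the class names mirror the types expression, word and phrase of Figure 3.18, the set `PARTS_OF_SPEECH` lists the part-of-speech subtypes named there, and the attribute names (`head`, `val`, `arg_st`) are hypothetical encodings of HEAD, VAL and ARG-ST.

```python
from dataclasses import dataclass, field

# Part-of-speech subtypes named in Figure 3.18 (the figure elides others).
PARTS_OF_SPEECH = {"verb", "noun", "case-marker", "adjective"}


@dataclass
class Expression:
    head: str                                  # HEAD: a part-of-speech type
    val: dict = field(default_factory=dict)    # VAL: e.g. SPR and COMPS lists

    def __post_init__(self):
        # Strict typing: HEAD may not take a value outside its declared type.
        if self.head not in PARTS_OF_SPEECH:
            raise TypeError(f"HEAD cannot take the value {self.head!r}")


@dataclass
class Word(Expression):
    arg_st: list = field(default_factory=list)  # ARG-ST is added by word


@dataclass
class Phrase(Expression):
    pass


haamed = Word(head="noun")
print(isinstance(haamed, Expression))  # word inherits HEAD and VAL
```

Constructing `Word(head="banana")` raises `TypeError`, which mirrors the restriction that a feature cannot take an undefined value.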
The lexical entries of HPSG are quite large to display on paper. They contain
phonetic, syntactic and semantic information related to a word. As an introduction,
only bare syntactic information is given here. Lexical entries for three nouns in Urdu are
shown in (67). The first is a proper name, Haamed (Hamid). The AGR (agreement)
feature of HEAD contains information about PERS (person), NUM (number) and
GEND (gender). The noun has no valence requirements. The other two entries are for
the two nouns ketaab (book) and naawel (novel), which have feminine and masculine gender, respectively.
(67)
[ word
  PHON  Haamed
  HEAD  [ noun
          AGR [ agr-cat, PERS 3rd, NUM sg, GEND masc ] ]
  VAL   [ ] ]

[ word
  PHON  ketaab
  HEAD  [ noun
          AGR  [ agr-cat, NUM sg, GEND fem ]
          FORM nom ]
  VAL   [ ] ]

[ word
  PHON  naawel
  HEAD  [ noun
          AGR  [ agr-cat, NUM sg, GEND masc ]
          FORM nom ]
  VAL   [ ] ]
and vocative form. The form feature requires agreement with that of case marker that
will be discussed later in Chapter 7. This feature FORM is not included in the agr-cat,
because it requires separate agreement.
For words having a HEAD of type verb, the HEAD feature contains the agreement
(AGR), FORM and CASE features. The values that the verb FORM feature takes in Urdu
are different from those of English. Similarly, the CASE feature is an addition with respect to
English. The CASE feature is not placed inside AGR because CASE and AGR require
separate agreements, as shown in (68). The ergative CASE must match that of the noun
phrase in SPR (specifier), while the AGR (agreement) features NUM (number) and
GEND (gender) must match those of the COMPS (complements) noun phrase.
(68)
[ word
  PHON  xareedaa
  HEAD  [ verb
          AGR   [1] [ NUM sg, GEND masc ]
          CASE  [2] erg
          FORM  perfect ]
  VAL   [ SPR    < NP [ CASE [2] ] >
          COMPS  < NP [ AGR [1] ] > ] ]

[ word
  PHON  xareedee
  HEAD  [ verb
          AGR   [1] [ NUM sg, GEND fem ]
          CASE  [2] erg
          FORM  perfect ]
  VAL   [ SPR    < NP [ CASE [2] ] >
          COMPS  < NP [ AGR [1] ] > ] ]
Lexical entries of HPSG take a large space to show on paper. In fact, a fully
specified HPSG entry contains even more features. A fully specified entry
is one that shows the values of all the features, even though some of the features take default
values and may not be important for the current discussion. There is a division into SYN
(syntax) and SEM (semantic) features within each expression. The HPSG lexicon
thus contains much information, even more than that of lexical functional grammar.
3.4.3 Phrase Structure Rules
HPSG phrase structure rules are CFG based regenerative rules and thus can
utilize the same CFG parsing algorithms. However, the terminal and non-terminal
symbols used in CFG are not just symbols in HPSG but are AVM based signs, which
contain syntactic and semantic information. There are some generalized rules and
principles on phrase structures in HPSG, which restrict and control the formation of
tree based on linguistic requirements such as agreement, transitivity, etc. Therefore,
HPSG enforces control through feature structures and principles for well formed
sentences.
In head-driven phrase structure grammar, as the name implies, one node in the
phrase acts as the head node, which drives and controls the phrase. The head node
may be any of the daughters in the phrase structure rule and is known as the head daughter.
Urdu is predominantly a head-final language. In HPSG-based rules for Urdu, the head
daughter is usually the last daughter. As shown in (69), the verb (V) is the head of the sentence,
the post-position (P) is the head of the post-positional phrase and the case marker (C) is the head of the case
phrase (KP). The head daughter node is marked with the capital letter H.
(69)
S → NP* H V
PP → N H P
(70)
In any headed phrase, the HEAD value of the mother and the HEAD value of
the head daughter must be identical.
(Sag, Wasow et al. 2004)
The HEAD feature takes a value of type part-of-speech (pos), which contains the
AGR (agreement) feature. Through the Head Feature Principle (HFP), the
agreement requirements of the head daughter are transferred to the mother. By expanding
the symbols in (69) to signs, we get the rules shown in (71). By the HFP, the HEAD value of the mother is
the same as that of the head daughter, which is marked with the letter H.
(71)
[ phrase ] → [ phrase ]* H [ word, HEAD verb ]
[ phrase ] → [ word ] H [ word, HEAD pp ]
It is debatable whether, in Urdu, a noun (N) and a case marker (CM) make a case phrase
(KP) in which CM is the head daughter, as shown in (72), or a noun phrase
(NP) in which the noun is the head of the phrase, as shown in (73).
(72)
KP → N H CM
[ phrase, HEAD cm ] → [ word ] H [ word, HEAD cm ]
(73)
NP → H N CM
[ phrase, HEAD noun ] → H [ word, HEAD noun ] [ word ]
It is shown later in Chapter 7 that the HEAD values of both the noun and the case marker impart
features to the mother's HEAD. Although the noun is marked as the head daughter, the
agreement of the noun is selected by the case marker, and the resultant mother must have the head
value noun, as shown in (73). Based on this consideration, a
modification of the HFP for Urdu is proposed, as shown in (74).
(74)
In any headed phrase, the HEAD value of the mother and the HEAD value of
the head daughter must be identical, unless specified otherwise.
Valence Feature
The VAL (valence) feature is used to show that one grammatical category
requires others for completion. Thus, the transitivity requirements of verbs, the
requirement of a noun for adjectives, and the determiner requirement of nouns are handled
through the valence feature. The VAL feature contains two main features, SPR
(specifier) and COMPS (complements). In an English sentence, the specifier noun
phrase of the verb represents the subject, while the verb complements represent the objects
required by the verb's transitivity. Since English is an SVO language, the
verb separates the subject from the objects. In HPSG, the linear order is taken into account: a
noun phrase that comes before the verb is taken as the subject, and noun phrases
that appear after the verb are taken as objects, also known as complements of the
verb. The values of the SPR and COMPS features are represented as lists, so
they can hold multiple values. One or more values in a valence list signal that such
items are needed for completion, while an empty list signals that nothing further is
required.
(75)
The head complement rule, in the form of regenerative rule, is shown in (76),
which states that if a head daughter requires n complements and all n are identified
as sisters to head daughter, then the complement requirements of the mother are
satisfied.
(76)
VAL
phrase
COMPS
VAL
SPR
COMPS
1, " , n
1 " n
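The condition stated by the head complement rule can be sketched in Python as a check that the sisters of the head daughter account for every item in its COMPS list. The function name `comps_satisfied` and the use of plain category strings such as "NP" are assumptions made for illustration, and the sketch deliberately ignores linear order.

```python
from collections import Counter


def comps_satisfied(head_comps, sisters):
    """True if the sisters match the head's COMPS list item for item."""
    # Counter compares the two lists as multisets, so order is ignored.
    return Counter(head_comps) == Counter(sisters)


print(comps_satisfied(["NP"], ["NP"]))   # a transitive verb with its object
print(comps_satisfied(["NP"], []))       # the object is missing
print(comps_satisfied([], []))           # nothing is required
```

A real HPSG implementation would unify full signs instead of comparing labels; the multiset comparison is just the simplest stand-in for that matching step.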
The head specifier rule requires that item(s) specified in the list of SPR feature
must be identified as sister to head daughter to satisfy the requirement and, thus,
completing the mother phrase.
(77)
phrase
VAL
SPR
SPR
H VAL
COMPS
The subject noun phrase acts as the specifier for a HEAD verb, and the determiner acts as
the specifier for a HEAD noun. In Urdu, where the order of the daughter phrases of a sentence
is relatively free, the SPR and COMPS features do not differ in function; however, to keep
the correspondence with English-based HPSG, SPR is used for the subject and
COMPS for objects and other noun phrases in Urdu.
3.4.4 Specifier Head Agreement Constraint
Verbs and common nouns in English HPSG are specified as shown in (78),
which shows that AGR of verb or common noun must match AGR of its own
(78)
HEAD
VAL
AGR 1
SPR AGR
PART II
MORPHOLOGICAL ANALYSIS
AND
LEXICAL ATTRIBUTES
Chapter 4
URDU VERB CHARACTERISTICS AND
MORPHOLOGY
Words are the building blocks of the grammar of a language. Morphology, also
known as Aelm-e-Sarf, is a branch of linguistics that deals with the internal
structure of words. Morphemes are the smallest building blocks that make up words in a
language. Morphemes express concepts or relationships. A morpheme may be a
meaningful whole word, or it may be a sequence of characters that is
not directly meaningful until it is joined with another morpheme or a word. For
example, 'car', 'table', 'anti', 're', 's' and 'ing' are morphemes. Morphemes that are not
meaningful words normally convey information about syntactic features, like number
(singular, plural), tense (present, past, future) and gender (masculine, feminine). For
example, in the word 'flower' the single morpheme 'flower' is recognized as the morph
'flower', which forms the word 'flower'. In the word 'flowers', however, the word morpheme
and the plural morpheme are recognized as 'flower' and 's' respectively, which
combine to form the word 'flowers'. Allomorphs are the different forms of the same
morpheme. For example, the plural morpheme in English has two allomorphs, 'es' and 's'.
The gerund form in English has three allomorphs: 'ing' (as in 'playing'), 'ing' with e-deletion (as in 'saving'), and gemination (as in 'planning', 'jogging').
Free morphemes are those that can stand on their own as individual words, like
book, knock and soft. Bound morphemes are those that need to be attached to some
host morphemes to be realized as individual word. For example, the following
affixes are bound morphemes, e.g., re, s, ed, ly, which cannot occur as
standalone, but these impart meaningful information in words: reshape, books,
knocked, softly.
If a word has various word forms and these word forms belong to a single
grammatical category then these word forms are referred to as having the same
lexeme. For example, the words flower and flowers refer to the same noun
flower. The words run, runs, running refer to the same verb run. Thus,
flower and run are lexemes for their respective forms. Free morphemes are thus
usually lexemes.
Inflectional morphology is the process of adding inflectional morphemes to a
word. An inflectional morpheme adds some type of grammatical information, e.g.,
case, number, person, gender, mood, mode, tense and aspect. Inflectional morphology
does not change the grammatical category of the word, and thus the inflected words refer
to the same lexeme.
Derivational morphology, in contrast, adds derivational morphemes, which
create a new word from an existing word, sometimes by simply changing the grammatical
category, e.g., changing a noun to a verb. Words generally do not appear in
dictionaries with inflectional morphemes. However, they often do appear with
derivational morphemes. For instance, English dictionaries list the words readable and
readability, which have been derived from the root read. However, most English
dictionaries do not list book as one entry and books as another. Similarly, English
dictionaries do not list jump and jumped as two different entries.
Derivational morphology is thus the creation of new words out of other words
and morphemes. The new words formed normally belong to a different part of speech,
but not always. The words possible, possibly, impossible are made by using
derivational morphology. Similarly happy and happiness, inform, informer and
information are the words formed through derivational morphology.
Root is a lexical content morpheme having no affix. Root cannot be analyzed
into further smaller meaningful parts. Root is common to set of all derived or
inflected forms, when all of the affixes are removed. Root morpheme carries the main
fraction of meaning, e.g., in words: disestablish, establishment, establishments, the
word establish is a root to which various derivational and inflection morphemes are
attached.
A stem is the root or roots of a word, together with any derivational affixes, to
which inflectional affixes can be added. For example, both tie and untie are stems,
to which the inflectional s can be added to form ties and unties.
Compounding is the formation of new words, which is made by combining two
or more words. Each unit that combines in compounding is a lexeme in itself.
Examples are: blackbird, firefighter, hardhat, water-hose, rubber-hose, and fire-hose.
Morphological analysis means finding information associated with the given
word. For example, the word plays is analyzed as noun play in the plural form or
as a verb play which can be used with 3rd person, singular noun in present tense.
Morphological generation is the reverse of analysis, which means given the
information and the root word, generate the inflected or derived word.
4.1 Verb Transitivity and Valency
Verb transitivity is the number of object noun phrases, in addition to the subject
noun phrase, required by a verb in order to make a well-formed sentence. At least one
noun phrase, i.e., the subject, usually accompanies the verb and is not counted
in the verb's transitivity. Similarly, other adverbial and post-positional phrases are
treated as adjuncts and are not counted in transitivity. Verbs requiring zero, one
and two object noun phrases are termed intransitive, transitive and ditransitive
verbs, respectively.
Verb valency is the total number of arguments required by the verb. Thus,
valency counts the subject noun phrase, object noun phrases and other adverbial or post-positional phrases. Only those phrases that are controlled by the
verb are counted in valency; adverbial or post-positional phrases that are not governed by the verb are
treated as adjuncts. Valency is a general term that may also apply to other
grammatical categories, such as the English noun, which requires a determiner, and the Urdu
case marker, which requires a noun.
4.1.1 Intransitive Verb
hans-naa, to laugh
sao-naa, to sleep
khaans-naa, to cough
rao-naa, to weep
baol-naa, to speak
mar-naa, to die
daoR-naa, to run
ger-naa, to fall
jaag-naa, to wake up
chheenk-naa, to sneeze
chhop-naa, to hide
aa-naa, to come
chal-naa, to walk
bhaag-naa, to sprint
belbelaa-naa, to mumble
aoktaa-naa, to exhaust
lekh-naa, to write
chakh-naa, to taste
soongh-naa, to smell
bolaa-naa, to call
taoR-naa, to break
xareed-naa, to buy
beych-naa, to sell
beyl-naa, to squeeze
paRh-naa, to read
chhoo-naa, to touch
pee-naa, to drink
deykh-naa, to see
leyT-naa, to lie
bayTh-naa, to sit
paydaa kar-naa, to give birth
baor kar-naa, to bore
4.1.3 Ditransitive
A ditransitive verb is a verb that takes two objects. The original ditransitive
verbs in Urdu are few; most ditransitive verbs are either morphologically derivable
from intransitive and transitive verbs or are N-V compound verbs (complex predicates).
The valency of ditransitive verbs is three. Table 4.3 shows some original and
compound ditransitive verbs.
Table 4.3: Some Original and Compound Ditransitive Verbs in Urdu
Original Ditransitive        Compound Ditransitive
dey-naa, to give             xareed dey-naa, to buy & give
ley-naa, to take             peysh kar-naa, to present
baat-naa, to tell            bheyj dey-naa, to send
bheyj-naa, to send
to recognize a root is to separate the suffix naa from the dictionary form of a verb
(the infinitive form). The remaining portion of an infinitive form is a root, also called
maadah ( ) verb and the root form can be used to make other forms of verb by
adding suffixes through morphology rules.
4.3.2 Causative Stem Forms
In Urdu and Hindi, it is well known that causative verbs are morphologically
formed by the addition of suffixes to the root form (Abdul-Haq 1991; Bhatt and Embick
2003; Butt 2003). Causative formation normally increases the valency or
transitivity of a verb. The higher-valency transitive and ditransitive verb forms, known
as causative or transitivized verb forms, are derived from lower-valency
verb roots by adding the suffixes aa and waa to the root form of the original verb.
The causative verb forms are called stem forms, because all the morphology that can
be applied to the base or root form can also be applied to stem forms to make other forms
of the verb. The causative stems are in effect new verbs, as they have different,
although related, meanings to the original root verbs. With a causative verb, an agent
causes or forces someone, known as a patient or an intermediate agent, to perform some
action or undergo a change of state. Thus Urdu forms causatives morphologically,
whereas English relies on idiomatic use of verbs like make, get, have,
let or help.
To certain root forms, we can add suffix aa to form causative form 1.
Similarly, to certain root forms we can add suffix waa to form causative form 2,
using the morphology rules shown in (79). There are verb roots to which both
causative forms morphemes could be appended.
(79)
CausativeForm 1 = RootForm + aa
CausativeForm 2 = RootForm + waa.
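The rules in (79) can be sketched in Python as plain string concatenation. The helper names `root_of` and `causative_forms` are hypothetical, and the sketch covers only the regular rules: it does not model which roots actually accept each suffix or the irregular root changes.

```python
def root_of(infinitive):
    """Strip the infinitive suffix naa (with or without a hyphen)."""
    if not infinitive.endswith("naa"):
        raise ValueError(f"{infinitive!r} is not an infinitive in naa")
    return infinitive[:-3].rstrip("-")


def causative_forms(root):
    """Apply both regular rules of (79) to a root form."""
    return {
        "CausativeForm 1": root + "aa",
        "CausativeForm 2": root + "waa",
    }


root = root_of("chal-naa")       # dictionary form of 'to walk'
print(root, causative_forms(root))
```

Because the output is itself a stem, the same function could be followed by any of the form-generation rules given for root forms.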
Table 4.4: Some Divalent Verbs Derived from Univalent Verbs
Univalent Verbs
daoR-naa, to run
chal-naa, to walk
hans-naa, to laugh
ger-naa, to fall
The above-mentioned causative formation rules can be used with many original
verbs in Urdu to form higher-valency causative verbs by adding the suffix aa to the
root form. A few causative verbs are listed in Table 4.4. It may be seen that the derived
divalent verbs, although related in meaning to the verbs from which they are derived,
have different actual meanings and argument structures.
Moreover, the above-mentioned rule for transitivization of verbs is regular for
most of the verb roots. However, the root of verb is changed in some cases, especially
when the root form ends in a vowel or aag. Examples of irregular morphology are
shown in Table 4.5 below; see how root of verb is changed.
Table 4.5: Some Divalent Verbs Derived Irregularly from Univalent Verbs
Univalent Verbs
rao-naa, to weep
sao-naa, to sleep
see-naa, to sew
jaag-naa, to wake up
bhaag-naa, to sprint
Divalent Verbs
To derive ditransitive/trivalent verbs from intransitive/univalent verbs, the suffix
waa is added to the root form of a verb. For some verbs, the verb root takes an irregular form when
the trivalent verb is made by the addition of the suffix waa. Table 4.6 shows both regular and
irregular formation of trivalent verbs derived from univalent verbs.
Table 4.6: Some Trivalent Verbs Derived from Univalent Verbs
Univalent Verbs
ger-naa, to fall
chal-naa, to walk
hans-naa, to laugh
aoTh-naa, to stand up
sao-naa, to sleep
rao-naa, to weep
daoR-naa, to run
jaag-naa, to wake up
bhaag-naa, to sprint
To derive ditransitive/trivalent verbs from transitive verbs, the same suffixes
aa and waa are added to the root form of the verb. Many verbs formed by
adding the suffix waa to the root form of a divalent verb take four arguments and
thus function as tetravalent verbs. Table 4.7 shows regular and irregular formation of
trivalent verbs from divalent verbs. Table 4.8 shows regular and irregular formation of
tetravalent verbs from divalent verbs.
see-naa, to sew
bolaa-naa, to call/invite
pee-naa, to drink
khaa-naa, to eat
deykh-naa, to see
soongh-naa, to smell
chakh-naa, to taste
son-naa, to listen
samjh-naa, to understand
paRh-naa, to read
lekh-naa, to write
son-naa, to listen
pee-naa, to drink
khaa-naa, to eat
deykh-naa, to see
soongh-naa, to smell
chakh-naa, to taste
son-naa, to listen
samjh-naa, to grasp
The dictionary form of the verb in Urdu is the infinitive form, called maSdar,
which contains the suffix naa. The infinitive form acts as a verbal noun and
can be used in place of a noun. The normal infinitive form ends in the masculine suffix
naa. The suffixes for the feminine infinitive form and the oblique infinitive form
are nee and ney, respectively. The infinitive appears in the masculine, the feminine and
the oblique forms as shown in Table 4.9. It is worth noting in Table 4.9 that the
feminine infinitive form does not appear for intransitive verbs, because the feminine form
is used only for object agreement and intransitive verbs do not take an
object. With all root forms or stem forms of the verb, we can use the
following rules to generate infinitive forms of the verb.
(80)
Root     English  Transitivity   Masculine    Feminine     Oblique
hans     laugh    intransitive   hans-naa     x            hans-ney
bol      speak    intransitive   bol-naa      x            bol-ney
sao      sleep    intransitive   sao-naa      x            sao-ney
paRh     read     transitive     paRh-naa     paRh-nee     paRh-ney
xareed   buy      transitive     xareed-naa   xareed-nee   xareed-ney
deykh    look     transitive     deykh-naa    deykh-nee    deykh-ney
dey      give     ditransitive   dey-naa      dey-nee      dey-ney
Although combining words is a topic of syntax that will be covered later, it
is worth noting here that the feminine plural repetitive form is never used in
combination with the feminine plural auxiliary. That is, if a sentence has a feminine
plural subject and agreement requires the feminine plural auxiliary,
then the feminine singular repetitive verb form is used instead of the plural form. Compare
sentences (82) and (83), both of which require agreement with a feminine plural subject:
the sentence that takes the feminine singular verb form with the feminine plural
auxiliary verb is correct, while the one that uses the feminine plural verb form is incorrect.
However, the feminine plural verb form is used without an auxiliary verb when there is
a series of sentences in a narration. An example is shown in (84).
Table 4.10: Repetitive Forms for Few Urdu Verbs
Root     English  Transitivity   Masculine    Feminine     Masculine    Feminine
                                 Singular     Singular     Plural       Plural
hans     laugh    intransitive   hans-taa     hans-tee     hans-tey     hans-teeN
bol      speak    intransitive   bol-taa      bol-tee      bol-tey      bol-teeN
sao      sleep    intransitive   sao-taa      sao-tee      sao-tey      sao-teeN
paRh     read     transitive     paRh-taa     paRh-tee     paRh-tey     paRh-teeN
xareed   buy      transitive     xareed-taa   xareed-tee   xareed-tey   xareed-teeN
deykh    look     transitive     deykh-taa    deykh-tee    deykh-tey    deykh-teeN
dey      give     ditransitive   dey-taa      dey-tee      dey-tey      dey-teeN
(82)
[anjom aor batool] ketaab-eeN xareed-tee th-eeN
[Anjom and Batool].fem.pl Book-fem.pl buy-repeat.fem.sg be.past-fem.pl
Anjom and Batool used to buy books.
(83)
* [anjom aor batool] ketaab-eeN xareed-teeN th-eeN
[Anjom and Batool].fem.pl Book-fem.pl buy-repeat.fem.pl be.past-fem.pl
(84)
[anjom aor batool] ketaab-eeN xareed-teeN aor aonheyN bayg meyN rakh ley-teeN
[Anjom and Batool].fem.pl book-fem.pl buy-repeat.fem.pl and those-pl bagfem.pl put-base take-repeat.fem.pl
Anjom and Batool used to buy books and put them in a bag.
The perfective form is formed by just adding a number and gender agreement
suffix (aa, ee, ey, eeN) to the root or stem form. For verb roots that end in
vowels, the morphology is not regular. The regular perfective verb forms are shown in
Table 4.11 and the irregular perfective verb forms are shown in Table 4.12. With most
of the root forms or stem forms, the following rules can be used to generate perfective
forms.
(85)
PerfectiveForm = StemForm + aa
PerfectiveForm = StemForm + ee
PerfectiveForm = StemForm + ey
PerfectiveForm = StemForm + eeN
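The four rules in (85) can be sketched in Python as a lookup from gender and number to the agreement suffix. The function name `perfective` and the keys "masc", "fem", "sg" and "pl" are illustrative choices; the sketch covers only the regular rules and does not handle vowel-final roots.

```python
# Agreement suffixes of (85), indexed by (gender, number).
PERFECTIVE_SUFFIXES = {
    ("masc", "sg"): "aa",
    ("fem", "sg"): "ee",
    ("masc", "pl"): "ey",
    ("fem", "pl"): "eeN",
}


def perfective(stem, gender, number):
    """Return the regular perfective form, written as stem-suffix."""
    return f"{stem}-{PERFECTIVE_SUFFIXES[(gender, number)]}"


for key in PERFECTIVE_SUFFIXES:
    print(perfective("xareed", *key))
```

The hyphen is kept only to make the morpheme boundary visible, following the convention of the tables in this chapter.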
However, for causative stem forms the rule for singular masculine perfective
form needs to add morpheme yaa instead of regular morpheme aa for root form.
The morpheme yaa is irregular form of the perfective morpheme aa and this
requires special phonological rules (Kaplan and Kay 1994) and may be handled using
Xerox tool TWOLC, which is short for two level rule compiler.
(86)
Root     English  Transitivity   Masculine    Feminine     Masculine    Feminine
                                 Singular     Singular     Plural       Plural
hans     laugh    intransitive   hans-aa      hans-ee      hans-ey      hans-eeN
bol      speak    intransitive   bol-aa       bol-ee       bol-ey       bol-eeN
paRh     read     transitive     paRh-aa      paRh-ee      paRh-ey      paRh-eeN
xareed   buy      transitive     xareed-aa    xareed-ee    xareed-ey    xareed-eeN
deykh    look     transitive     deykh-aa     deykh-ee     deykh-ey     deykh-eeN

Root   New Root  English  Transitivity   Masculine   Feminine   Masculine   Feminine
                                         Singular    Singular   Plural      Plural
sao              sleep    intransitive   sao-yaa     sao-ee     sao-ey      sao-eeN
rao              weep     intransitive   rao-yaa     rao-ee     rao-ey      rao-eeN
jaa    ga        go       intransitive   ga-yaa      ga-ee      ga-ey       ga-eeN
khaa             eat      transitive     khaa-yaa    khaa-ee    khaa-ey     khaa-eeN
kar    kee       do       transitive     kee-aa      kee        kee-ey      kee-N
see              sew      transitive     see-aa      see        see-ey      see-N
dey    dee       give     ditransitive   dee-aa      dee        dee-ey      dee-N
ley    lee       take     ditransitive   lee-aa      lee        lee-ey      lee-N
The subjunctive form is formed by adding a person and number agreement
suffix (ooN, ey, ao or eyN) to the root or stem form. Gender variation has no
effect on the subjunctive form. The subjunctive mood expresses feelings, opinions,
suggestions, desires, hopes and wishes. It is used to describe unclear or imaginary
events and future happenings. The subjunctive form is used with an appropriate future
auxiliary to make the future tense. With most Urdu root and stem forms, the following
rules may be used to generate subjunctive forms:
(87)
SubjunctiveForm = StemForm + ao
SubjunctiveForm = StemForm + ooN
SubjunctiveForm = StemForm + ey
SubjunctiveForm = StemForm + eyN
Table 4.13: Subjunctive Forms for Few Urdu Verbs
Root    English  Form 1      Form 2     Form 3     Form 4
hans    laugh    hans-ooN    hans-ao    hans-ey    hans-eyN
bol     speak    bol-ooN     bol-ao     bol-ey     bol-eyN
paRh    read     paRh-ooN    paRh-ao    paRh-ey    paRh-eyN
xareed  buy      xareed-ooN  xareed-ao  xareed-ey  xareed-eyN
deykh   look     deykh-ooN   deykh-ao   deykh-ey   deykh-eyN
It is worth noting that the perfective morpheme eeN and the subjunctive
morpheme eyN are ambiguous in Urdu script because they are written with the same
characters; their pronunciation differs because of the difference between two
vowel sounds, i.e., between the baRee yey and the chhotee yey:
(88)
a
a a
aonhooN=ney ketaab-eyN xareed-eeN
They.pron.pl=erg Book-fem.pl buy-perf.fem.pl
They bought books.
(89)
a
!a
Come! ketaab-eyN xareed-eyN
Come! Book-fem.pl buy-subj.form4
Come! Let us buy books.
imperative mood, which is normally used for second persons. With most of the Urdu
root and stem forms, the following rules may be used to generate imperative forms:
(90)
ImperativeForm = StemForm
ImperativeForm = StemForm + ao
ImperativeForm = StemForm + eyN
ImperativeForm = StemForm + eeey
Table 4.14: Imperative Forms for Few Urdu Verbs
Root    English  Frank (or Rude)  Formal (or Familiar)  Polite (or Respect)  More Polite (or Request)
hans    laugh    hans             hans-ao               hans-eyN             hans-eeey
bol     speak    bol              bol-ao                bol-eyN              bol-eeey
paRh    read     paRh             paRh-ao               paRh-eyN             paRh-eeey
xareed  buy      xareed           xareed-ao             xareed-eyN           xareed-eeey
deykh   look     deykh            deykh-ao              deykh-eyN            deykh-eeey
[Figure 4.1 diagram. Labels: verb; root1 to root4 (example roots: dokh, beh, xareed, bandh, baol, hans, dey, ley); caus1 (aa); caus2 (waa); stem; infinitive (naa, nee, ney); repetitive (taa, tee, tey, teeN); perfect (aa, ee, ey, eeN); subjunctive (ooN, ey, ao, eeN); imperative (aa, ao, eyN, eeey).]
Figure 4.1: Finite State Network for Urdu Verb Morphological Forms
Table 4.15: Sixty Forms of Verb Read in Urdu
Root Form: paRh
  Infinitive:  paRh-naa, paRh-nee, paRh-ney
  Repetitive:  paRh-taa, paRh-tee, paRh-tey, paRh-teeN
  Perfective:  paRh-aa, paRh-ee, paRh-ey, paRh-eeN
  Subjunctive: paRh-ooN, paRh-ey, paRh-ao, paRh-eyN
  Imperative:  paRh, paRh-ao, paRh-eyN, paRh-eeey
Causative Stem 1: paRh-aa
  Infinitive:  paRh-aa-naa, paRh-aa-nee, paRh-aa-ney
  Repetitive:  paRh-aa-taa, paRh-aa-tee, paRh-aa-tey, paRh-aa-teeN
  Perfective:  paRh-aa-yaa, paRh-aa-ee, paRh-aa-ey, paRh-aa-eeN
  Subjunctive: paRh-aa-ooN, paRh-aa-ey, paRh-aa-ao, paRh-aa-eyN
  Imperative:  paRh-aa, paRh-aa-ao, paRh-aa-eyN, paRh-aa-eeey
Causative Stem 2: paRh-waa
  Infinitive:  paRh-waa-naa, paRh-waa-nee, paRh-waa-ney
  Repetitive:  paRh-waa-taa, paRh-waa-tee, paRh-waa-tey, paRh-waa-teeN
  Perfective:  paRh-waa-yaa, paRh-waa-ee, paRh-waa-ey, paRh-waa-eeN
  Subjunctive: paRh-waa-ooN, paRh-waa-ey, paRh-waa-ao, paRh-waa-eyN
  Imperative:  paRh-waa, paRh-waa-ao, paRh-waa-eyN, paRh-waa-eeey
These sixty forms are shown with complete morphological information in Table
4.16. Most verb forms are unambiguous; however, the subjunctive morpheme eyN
has three different tag sets, and the same morpheme also appears in the
imperative form. Similarly, the root form and the imperative rude form are the same
because the imperative rude form has a null morpheme. Verb forms are considered
different if they lie in different categories, i.e., the infinitive, perfective,
repetitive, subjunctive and imperative forms in the categorization shown in Table
4.16. Therefore, the subjunctive verb that ends with morpheme eyN and has three
different tags is counted as one form, while the imperative form with the same
morpheme is counted as a separate form.
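The counting convention described above (one form per category and suffix, even when two surface strings coincide) can be checked with a small generator. The sketch below is an illustration only: the names SUFFIXES, STEM_MARKERS and verb_forms are assumptions introduced here, the imperative request suffix is spelled eeey as in rule (90), and the yaa substitution is applied to every causative stem.

```python
# Suffix sets per category; the empty imperative suffix is the null
# morpheme of the frank (rude) imperative.
SUFFIXES = {
    "inf": ["naa", "nee", "ney"],
    "repeat": ["taa", "tee", "tey", "teeN"],
    "perf": ["aa", "ee", "ey", "eeN"],
    "subj": ["ooN", "ey", "ao", "eyN"],
    "impr": ["", "ao", "eyN", "eeey"],
}
STEM_MARKERS = {"root": "", "caus1": "aa", "caus2": "waa"}


def verb_forms(root):
    """List (stem_type, category, surface) triples for a regular verb root."""
    forms = []
    for stem_type, marker in STEM_MARKERS.items():
        stem = f"{root}-{marker}" if marker else root
        forms.append((stem_type, "stem", stem))
        for category, suffixes in SUFFIXES.items():
            for suffix in suffixes:
                # Causative stems take yaa instead of aa in the perfective.
                if category == "perf" and suffix == "aa" and marker:
                    suffix = "yaa"
                surface = f"{stem}-{suffix}" if suffix else stem
                forms.append((stem_type, category, surface))
    return forms


if __name__ == "__main__":
    forms = verb_forms("paRh")
    print(len(forms))  # 3 stems x (1 + 3 + 4 + 4 + 4 + 4) forms
```

Each stem contributes 1 + 3 + 4 + 4 + 4 + 4 = 20 entries, which gives the sixty forms for the three stems.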
Table 4.16: Sixty Forms of Verb Read with Morphological Information
Sr. No.  Transliteration  Morphological Information
1        paRh             paRhnaa+V+root
2        paRh-naa         paRhnaa+V+inf+masc
3        paRh-nee         paRhnaa+V+inf+fem
4        paRh-ney         paRhnaa+V+inf+obl
5        paRh-teeN        paRhnaa+V+repeat+fem+pl
6        paRh-tee         paRhnaa+V+repeat+fem+sg
7        paRh-tey         paRhnaa+V+repeat+masc+pl
8        paRh-taa         paRhnaa+V+repeat+masc+sg
9        paRh-eeN         paRhnaa+V+perf+fem+pl
10       paRh-ee          paRhnaa+V+perf+fem+sg
11       paRh-ey          paRhnaa+V+perf+masc+pl
12       paRh-aa          paRhnaa+V+perf+masc+sg
13       paRh-eyN         paRhnaa+V+subj+1st+pl
                          paRhnaa+V+subj+2nd+polite
                          paRhnaa+V+subj+3rd+pl
14       paRh-ooN         paRhnaa+V+subj+1st+sg
15       paRh-ey          paRhnaa+V+subj+3rd+sg
16       paRh-ao          paRhnaa+V+subj+2nd+formal
17       paRh-eeay        paRhnaa+V+subj+2nd+request
18       paRh-eyN         paRhnaa+V+impr+2nd+polite
19       paRh-ao          paRhnaa+V+impr+2nd+formal
20       paRh             paRhnaa+V+impr+2nd+frank
21       paRh-aa          paRhnaa+V+root+caus1
22       paRh-aa-naa      paRhnaa+V+caus1+inf+masc
23       paRh-aa-nee      paRhnaa+V+caus1+inf+fem
24       paRh-aa-ney      paRhnaa+V+caus1+inf+obl
25       paRh-aa-teeN     paRhnaa+V+caus1+repeat+fem+pl
26       paRh-aa-tee      paRhnaa+V+caus1+repeat+fem+sg
27       paRh-aa-tey      paRhnaa+V+caus1+repeat+masc+pl
28       paRh-aa-taa      paRhnaa+V+caus1+repeat+masc+sg
29       paRh-aa-eeN      paRhnaa+V+caus1+perf+fem+pl
30       paRh-aa-ee       paRhnaa+V+caus1+perf+fem+sg
31       paRh-aa-ey       paRhnaa+V+caus1+perf+masc+pl
32       paRh-aa-yaa      paRhnaa+V+caus1+perf+masc+sg
33       paRh-aa-eyN      paRhnaa+V+caus1+subj+1st+pl
                          paRhnaa+V+caus1+subj+2nd+polite
                          paRhnaa+V+caus1+subj+3rd+pl
34       paRh-aa-ooN      paRhnaa+V+caus1+subj+1st+sg
35       paRh-aa-ey       paRhnaa+V+caus1+subj+3rd+sg
36       paRh-aa-ao       paRhnaa+V+caus1+subj+2nd+formal
37       paRh-aa-eeay     paRhnaa+V+caus1+subj+2nd+request
38       paRh-aa-eyN      paRhnaa+V+caus1+impr+2nd+polite
39       paRh-aa-ao       paRhnaa+V+caus1+impr+2nd+formal
40       paRh-aa          paRhnaa+V+caus1+impr+2nd+frank
41       paRh-waa         paRhnaa+V+root+caus2
42       paRh-waa-naa     paRhnaa+V+caus2+inf+masc
43       paRh-waa-nee     paRhnaa+V+caus2+inf+fem
44       paRh-waa-ney     paRhnaa+V+caus2+inf+obl
45       paRh-waa-teeN    paRhnaa+V+caus2+repeat+fem+pl
46       paRh-waa-tee     paRhnaa+V+caus2+repeat+fem+sg
47       paRh-waa-tey     paRhnaa+V+caus2+repeat+masc+pl
48       paRh-waa-taa     paRhnaa+V+caus2+repeat+masc+sg
49       paRh-waa-eeN     paRhnaa+V+caus2+perf+fem+pl
50       paRh-waa-ee      paRhnaa+V+caus2+perf+fem+sg
51       paRh-waa-ey      paRhnaa+V+caus2+perf+masc+pl
52       paRh-waa-yaa     paRhnaa+V+caus2+perf+masc+sg
53       paRh-waa-eyN     paRhnaa+V+caus2+subj+1st+pl
                          paRhnaa+V+caus2+subj+2nd+polite
                          paRhnaa+V+caus2+subj+3rd+pl
54       paRh-waa-ooN     paRhnaa+V+caus2+subj+1st+sg
55       paRh-waa-ey      paRhnaa+V+caus2+subj+3rd+sg
56       paRh-waa-ao      paRhnaa+V+caus2+subj+2nd+formal
57       paRh-waa-eeay    paRhnaa+V+caus2+subj+2nd+request
58       paRh-waa-eyN     paRhnaa+V+caus2+impr+2nd+polite
59       paRh-waa-ao      paRhnaa+V+caus2+impr+2nd+formal
60       paRh-waa         paRhnaa+V+caus2+impr+2nd+frank
Figure 4.2 shows an Acyclic Deterministic Finite State Automaton (ADFSA) for a
few Urdu words having the root forms hans, bol, paRh, deykh and xareed. Each node
represents a state and each arrow represents a transition. The hollow nodes represent
intermediate states, while the filled nodes represent final states. The labels starting
with a dot represent grammatical information, while those without a dot
represent normal characters.
[Figure 4.2 diagram. Character labels: h, b, o, a, n, d, e, t, aa, ee, ey, ao, ooN, eyN, eeN, eeay, waa. Grammatical labels: .inf, .m, .f, .obl, .caus1, .caus2, .repeat, .perf, .sg.m, .sg.f, .pl.m, .pl.f, .subj.1.sg, .subj.1.pl, .subj.2.ru, .subj.2.fo, .subj.2.re, .subj.3.sg, .subj.3.pl, .impr.fr, .impr.fo, .impr.po, .impr.re.]
4.5 Tense
Tense tells about the location in time at which an event occurs or a state
changes. It is mainly divided into three categories: present, past and future. It is a
grammatical category that is marked either on the verb itself or on the accompanying
auxiliary (helping) verbs. Tense relates the time of the event or state denoted by the
verb to the time of utterance.
Tense can be represented in terms of Reichenbachian relations (Butt 2003).
This representation defines three temporal points: the time of utterance or speech (S),
the reference time (R), and the event time (E). These three points generate two
relationships: one between S and R (S/R), which is a contextually determined relation,
and another between R and E (R/E), which is an intrinsic relation. The temporal points
of a relationship may be simultaneous, as S and R are in the present tense, or may be
ordered sequentially, as in the tenses with perfect aspect, where E occurs before R
(E < R) regardless of the S/R relationship. This allows for perfect aspect in the past,
present and future tenses. Table 4.17 lists the tenses in terms of Reichenbachian relations.
Table 4.17: Tenses in Reichenbachian Concept Relations
Tense                   Reichenbachian Relations
Present Tense           E = R and R = S
Present Perfect Tense   E < R and R = S
Past Tense              E = R and R < S
Past Perfect Tense      E < R and R < S
Future Tense            E = R and R > S
Future Perfect Tense    E < R and R > S
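The six rows of Table 4.17 can be read as a small decision procedure over the three temporal points. The following sketch assumes that times are comparable numbers and that the relation whose symbol is missing between E and R (and between R and S in the present tenses) is simultaneity, as the surrounding text describes; the function name classify_tense is an assumption introduced for illustration.

```python
def classify_tense(e, r, s):
    """Name the tense for event time e, reference time r and speech time s.

    Only the combinations listed in Table 4.17 are covered, so an event
    time later than the reference time is rejected.
    """
    if e > r:
        raise ValueError("Table 4.17 lists only E simultaneous with R or E < R")
    if r == s:
        base = "Present"
    elif r < s:
        base = "Past"
    else:
        base = "Future"
    # E before R adds perfect aspect to whichever base tense applies.
    return f"{base} Perfect Tense" if e < r else f"{base} Tense"


if __name__ == "__main__":
    print(classify_tense(1, 1, 1))
    print(classify_tense(0, 1, 2))
```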
Tense
Present
Present
Present
Present
Present
Present
Past
Past
Past
Past
Past
Past
Past
Past
Future
Future
Future
Future
Future
Future
Person
1st
2nd
2nd
2nd
3rd
1st, 3rd
1st, 3rd
1st, 3rd
2nd
2nd
1st, 3rd
1st, 3rd
2nd
2nd
1st, 3rd
1st, 3rd
2nd
2nd
1st, 3rd
2nd
Gender
masc, fem
masc, fem
masc, fem
masc, fem
masc, fem
masc, fem
masc
masc
masc
masc
fem
fem
fem
fem
masc
masc
masc
masc
fem
fem
Number
sg
sg
pl
sg
pl
sg
pl
sg
pl
sg, pl
sg, pl
Honor Form
frank
formal
polite
frank
formal, polite
frank, formal
polite
formal, polite
The auxiliaries for tense have a complex dependence on person, number and
gender, as shown in Table 4.18. The present auxiliaries are the same for masculine and
feminine gender; in other words, they do not depend on gender. For the second
person, Urdu has honorific forms: frank (or rude), formal (or familiar) and polite
(or respect). In the case of the second person, the same auxiliary is used most of the
time for singular and plural, therefore the number is not significant.
In the negative present tense, the present auxiliary is sometimes dropped.
rahaa
rahee
rahey
Aspect
Perfect
Perfect
Perfect
Perfect
Perfect
Perfect
Progressive
Progressive
Progressive
Person
1st, 3rd
1st, 3rd
1st, 3rd
2nd
2nd
2nd
1st, 3rd
1st, 3rd
1st, 3rd
Gender
masc
fem
masc
masc
masc
fem
masc
fem
masc
Num
sg
sg, pl
pl
sg
sg, pl
pl
Honor Form
frank
formal, polite
frank, formal, polite
Perfective aspect in Urdu can be expressed by a perfective verb auxiliary,
by a perfective verb morpheme, or by both. There are other aspect auxiliaries in
Urdu, like chalaa, jaa, rahaa, lagaa, etc., that show duration-related and
repetition-related aspects. The following aspects will be discussed, with some
syntactic details, in Chapter 8.
Perfective aspect
Progressive aspect
Repetitive aspect
Inceptive aspect
4.7 Mood
The mood of a verb describes the relationship of the verb to purpose and
actual happening. Languages mostly differentiate moods by inflecting the
verb form. The mood of a verb expresses a fact (indicative mood), a command
(imperative mood), a question (interrogative mood), a wish (optative mood), or a
condition (subjunctive mood). Mood is shown by using modality; in English, a modal
auxiliary is used to show the mood. The following moods are commonly
expressed in Urdu texts.
Declarative mood
Permissive mood
Prohibitive mood
Imperative mood
Capacitive mood
Suggestive mood
Compulsive mood
Dubitative/Presumptive mood
Subjunctive mood
In this Chapter, various Urdu verb forms and characteristics are described. The
attributes, and the values that these attributes can take to represent verb types and
characteristics, are summarized in Table 4.20. These attribute-values are useful in
describing Urdu verbs for morphological and syntactic analysis.
Table 4.20: AttributeValues for Urdu Verbs
Attribute  Values                                                        Comments
VFORMSTM   root, causative1, causative2                                  verb stem forms
VFORM      infinitive, perfective, repetitive, imperative, subjunctive   verb forms
VFORMINF   absolute, oblique                                             infinitive verb forms
VFORMSUB   S1, S2, S3, S4                                                subjunctive verb forms
VFORMIMP   frank, formal, polite, request                                imperative verb forms
GENDER     masculine, feminine                                           gender attribute
NUMBER     singular, plural                                              number attribute
PERSON     first, second, third                                          person attribute
TENSE      present, past, future                                         tense
ASPECT     perfect, repetitive, progressive, inceptive                   aspect
MOOD       declarative, imperative, subjunctive, capacitive,             mood
           presumptive, compulsive, permissive, prohibitive, suggestive
VOICE      active, passive                                               voice
BASELANG   Arabic, Persian, Hindi, Turkish, English                      base language
The lexical attributes of Urdu verbs and their respective values are associated with
words; however, they are also syntactically important at the sentence level, because
they are used in various syntactic agreement requirements.
Chapter 5
URDU NOUN CHARACTERISTICS AND
MORPHOLOGY
A noun is a word that names something, i.e., a person, an animal, a place, a
thing, a situation, a time, or a concept, etc.
Nouns are initially classified into proper and common (improper) nouns.
A proper noun is the name of a particular person, place or thing, like:
Zafar, Lahore, Kohinoor, etc. A common noun is the general name for any
person, place or thing, like: boy, city, diamond, etc.
Attribute  Values          Comments
NCLASS     proper, common  noun classes
Common nouns are further classified, with respect to the concept they
represent, into state nouns, group nouns, spatial nouns, temporal nouns,
instrumental nouns, etc.
Attribute  Values                                        Comments
NCONCEPT   state, group, spatial, temporal, instrument   concepts represented by common nouns
theme. These are usually used in declarative sentences telling some state or news
about someone. Group nouns represent a group or collection of multiple
nouns; they look plural in number, but their syntactic use in the sentence
is singular. Spatial nouns refer to a location in space. Temporal nouns
refer to a location in time. Instrumental nouns refer to an
instrument. Some examples of common noun categories are shown in Table 5.1.
Another classification of common nouns is into mass and count nouns. A mass noun
is the same for a part or the whole of something, e.g., a small amount of
water is called water and similarly the whole sea contains water. Mass nouns are not
counted. Some examples of mass and count nouns are shown in Table 5.2.
faoj, army
aanjoman, society
meylah, fair/ festival
(c) Few Spatial Nouns in Urdu
ghar, home
parestaan, fairyland
parsooN,
a day after tomorrow/
a day before yesterday
(e) Few Instrument Nouns in Urdu
chaaqoo, knife
qalam, pen
kolhaaRee, axe
qttaar, queue
jhonD, cluster
garoop, group
Count Nouns
makaan, house
palang, bed
ghaRee, watch
baaG, garden
However, mass nouns may adopt plural and oblique forms when we want to refer
to a number of different kinds of a mass noun. For example, in (91) daal carries the
plural morpheme eyN to refer to the different kinds of grams (or pulses) used to make
haleem, a special Asian dish.
(91)
a
a
a aa a a a a
meyN=ney saaree daal-eyN Daal kar haleem pakaaee hay
I=erg all gram-pl put having haleem.sg.fem cook.perf.sg.fem
I, having put all the grams, cooked haleem.
The basic characteristics associated with Urdu nouns are: (1) gender; (2)
number; (3) form; and (4) case, which are briefly discussed below.
5.1.1 Gender
Nouns in Urdu bear masculine or feminine gender (Mustafa 1973; Abdul-Haq
1991; Schmidt 1999). This gender is realistic for animate nouns, which have a natural
gender classification, but for inanimate nouns the gender classification is unrealistic
and artificial, because they do not have natural gender. The tradition of assigning
gender to inanimate nouns has come into Urdu from its ancestor languages. In some
languages the gender of such nouns is neuter, which is a realistic classification. The
gender classification of some Urdu nouns is shown in Table 5.3.
Table 5.3: Gender for Some Urdu Nouns
moZakar, Masculine  maoanas, Feminine
naawel, novel       ketaab, book
qalam, pen          pensel, pencil
makaan, house       daokaan, shop
aadmee, man         Aaorat, woman
There is no general rule in Urdu for finding the gender of inanimate
nouns. Usually huge, heavy, powerful, dominant and bigger things are masculine,
while smaller, weaker and lighter things are feminine. Normally, bigger nouns are
masculine, while smaller nouns are feminine, as shown in Table 5.4.
Table 5.4: Few Smaller and Bigger Nouns
Bigger Noun (moZakar, Masculine): ras-aa, thick rope; jaal, net; pag, big special cap; paggaR, big special cap; ghaR-ee-aal, clock; deygch-aa, big pan
Smaller Noun (maoanas, Feminine): ras-ee, thin rope; gaol-ee, small spherical thing; jaal-ee, small net; ghaR-ee, watch; deygch-ee, small pan
However, Arabic based nouns, ending with suffix h, are mostly singular feminine.
Nouns ending with Persian suffixes pan, pa are masculine. Table 5.5 shows
examples of nouns with masculine gender suffixes.
Table 5.5: Nouns with Masculine Gender Suffixes
laRk-aa, boy
morG-aa, rooster
qarJ-ah, loan
raopey-ah, rupee
boRhaap-aa, old age
laRk-ey, boys
bakr-ey, (male) goats
raop-ey, rupees
bach-pan, childhood
Similarly, Hindi-based nouns ending in the suffixes ee, eeaa are generally
singular feminine, and nouns ending in the suffixes eeaN, eyN are generally plural
feminine. Arabic-based nouns adopted in Urdu that end with the suffixes at, aa are
feminine. Persian-based nouns adopted in Urdu that end with the suffixes gaah, ee, gee,
haT, aawaT are feminine. Various examples of feminine nouns with the above-mentioned
endings are shown in Table 5.6.
Table 5.6: Nouns with Feminine Gender Suffixes
laRk-ee, girl
morG-ee, hen
don-eeaa, world
ceR-eeaa, sparrow
Aebaadat-gaah, place of worship
zenda-gee, life
raok-aawaT, obstacle
laRk-eeaN, girls
morG-eeaN, hens
ketaab-eyN, books
ceR-eeaN, sparrows
daost-ee, friendship
ghabraa-haT, discomfort
moskraa-haT, smile
5.1.2 Number
Urdu nouns, like English nouns, have two values of number: singular and plural.
Unlike Arabic or Sanskrit, Urdu has no category for dual nouns.
5.1.3 Form
laRkeeaN, girls
morGaa, rooster
kamrah, room
Oblique (OBL)
laRkey, boy
laRkee, girl
laRkaoN, boys
laRkeeoN, girls
morGey, rooster
kamrey, room
Vocative (VOC)
laRkey, boy
laRkee, girl
laRkao, boys
laRkeeo, girls
laogao, people
bach.chao, children
5.1.4 Case
The case markers that follow nouns in the form of postpositions cannot be
handled at the lexical level through morphological suffixes and thus need to be
handled at the syntactic level (Butt and King 2002). Table 5.8 lists the case markers
in Urdu along with example sentences.
Table 5.8: Case Markers in Urdu
Case                        Case Marker  Example Sentence
Ergative (agent/subject)    ney          laRkey ney ketaab xareedee (The boy bought a book.)
Dative (indirect object)    kao          mayN ney laRkey kao ketaab dee (I gave the book to the boy.)
Accusative (direct object)  kao          laRkey ney ketaab kao xareedaa (The boy bought the book.)
Instrumental                sey          laRkey ney pensel sey lekhaa (The boy wrote with the pencil.)
Ablative (agent in passive) sey          laRkey sey xat lekhaa gayaa
Locative                    meyN         laRkaa kamrey meyN hay (The boy is in the room.)
Locative                    par          ketaab meyz par hay (The book is on the table.)
-                           -            laRkaa ketaab xareedey gaa (The boy will buy a book.)
This work divides Urdu nouns into five categories based on differences in
morphemes and associated syntactic information.
The category 1 nouns are animate nouns that end with morpheme aa or ah.
More specifically, in daily usage the morphology of this category applies to
animate nouns used for humans, but it is sometimes also used with other animate
nouns in a narration or a story. Although this category has eight morphemes,
i.e., aa (or ah), ey, ooN, ao, ee, eeaN, eeooN, eeao, there are ten noun forms
in total, based on different tags, as shown in Table 5.9 (a).
Table 5.9 (a)
Sr  Morphological Tags  boy          child           masc. goat
1   +masc+sg            laRk-aa      bach.ch-ah      bakr-aa
2   +masc+sg+obl        laRk-ey      bach.ch-ey      bakr-ey
3   +masc+pl            laRk-ey      bach.ch-ey      bakr-ey
4   +masc+pl+obl        laRk-ooN     bach.ch-ooN     bakr-ooN
5   +masc+pl+voc        laRk-ao      bach.ch-ao      bakr-ao
6   +fem+sg             laRk-ee      bach.ch-ee      bakr-ee
7   +fem+sg+obl         laRk-ee      bach.ch-ee      bakr-ee
8   +fem+pl             laRk-ee-aN   bach.ch-ee-aN   bakr-ee-aN
9   +fem+pl+obl         laRk-ee-ooN  bach.ch-ee-ooN  bakr-ee-ooN
10  +fem+pl+voc         laRk-ee-ao   bach.ch-ee-ao   bakr-ee-ao
1
2
3
4
Morphological
Tags
+masc+sg
+masc+sg+obl
+masc+pl
+masc+pl+obl
Mango
Novel
Letter
Plane
Question
aam
naawel
xatt
jahaaz
jahaaz
aam-ooN
naawel-ooN
xatt-ooN
a
jahaaz-ooN
jahaaz-ooN
1
2
3
Morphological
Tags
+fem+sg
+fem+sg+obl
+fem+pl
+fem+pl+obl
Book
Table
Talk
Road
Socks
ketaab
meyz
baat
baat
ketaab-eyN
meyz-eyN
baat-eyN
baat-eyN
joraab
joraab-eyN
Morphological
Tags
+masc+sg
2
3
4
+masc+sg+obl
+masc+pl
+masc+pl+obl
Lock
Food
taal-aa
khaan-aa
taal-ey
khaan-ey
Door
Room
Birdcage
darwaaz-aa
darwaaz-ey
kamar-aa
kamar-ey
penjr-aa
penjr-ey
Key
Car
Bread
chaab-ee
chaab-eeaN
gaaR-ee
gaaR-eeaN
raot-ee
raot-eeaN
1
2
3
Morphological
Tags
+fem+sg
+fem+sg+obl
+fem+pl
+fem+pl+obl
Chair
kors-ee
kors-eeaN
Staircase
seeRh-ee
seeRh-eeaN
The category 2 nouns are inanimate masculine nouns that do not end in
masculine gender morpheme aa or ah. The singular, plural and singular-oblique
forms of these nouns are the same, but their plural-oblique form has morpheme ooN.
Table 5.9 (b) lists some category 2 nouns along with gender, number and obliqueness
information tags.
The category 3 nouns are inanimate feminine nouns that do not end in feminine
gender morpheme. The singular and singular-oblique forms of these nouns are the
same, the plural form has morpheme eyN, their plural-oblique form has morpheme
ooN. Table 5.9 (c) lists some category 3 nouns along with gender, number and
obliqueness information tags.
The category 4 nouns are inanimate masculine nouns that end in masculine
gender morpheme aa or ah. Their singular form has morpheme aa or ah, their
singular-oblique and plural forms have morpheme ey and their plural-oblique form
has morpheme ooN. Table 5.9 (d) lists some category 4 nouns along with gender,
number and obliqueness information tags.
The category 5 nouns are inanimate feminine nouns that end in feminine gender
morpheme ee. Their singular and singular-oblique forms have morpheme ee, the
plural forms have morpheme eeaN and the plural-oblique form has morpheme
eeooN. Table 5.9 (e) lists some category 5 nouns along with gender, number and
obliqueness information tags.
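The descriptions of categories 2 to 5 amount to a small suffix table per category, which can be sketched as follows. This is a minimal illustration under stated assumptions: the function name noun_forms, the hyphenated output and the ending parameter (for choosing between aa and ah in category 4) are hypothetical, and category 1 with its vocative forms is left out.

```python
def noun_forms(stem, category, ending="aa"):
    """Return {(number, form): surface} for inanimate noun categories 2 to 5."""
    if category == 2:    # masculine, no gender morpheme
        sg = sg_obl = pl = stem
        pl_obl = f"{stem}-ooN"
    elif category == 3:  # feminine, no gender morpheme
        sg = sg_obl = stem
        pl = f"{stem}-eyN"
        pl_obl = f"{stem}-ooN"
    elif category == 4:  # masculine ending in aa or ah
        sg = f"{stem}-{ending}"
        sg_obl = pl = f"{stem}-ey"
        pl_obl = f"{stem}-ooN"
    elif category == 5:  # feminine ending in ee
        sg = sg_obl = f"{stem}-ee"
        pl = f"{stem}-eeaN"
        pl_obl = f"{stem}-eeooN"
    else:
        raise ValueError("only categories 2 to 5 are covered by this sketch")
    return {
        ("sg", "nom"): sg,
        ("sg", "obl"): sg_obl,
        ("pl", "nom"): pl,
        ("pl", "obl"): pl_obl,
    }


if __name__ == "__main__":
    print(noun_forms("ketaab", 3))
    print(noun_forms("taal", 4))
```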
Adjectives in Urdu come before the noun that they modify, and they are
required to agree with the noun form in gender, number and obliqueness if they have
a morpheme to represent these features. If adjectives do not have a morpheme to
identify gender, number and obliqueness, then theoretically they are not required to
agree; in practice, however, to handle both categories of adjectives in a uniform
manner, this work assumes that they carry these features without morphologically
visible marking. Therefore, this work divides Urdu adjectives into two categories:
one that has morphology for agreement with the noun, as shown in Table 5.10 (a), and
one that has no morphology for agreement, as shown in Table 5.10 (b). However, both
adjective categories have gender, number and obliqueness features to satisfy the
noun-adjective agreement equations.
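Table 5.10: Adjective Morphology in Urdu
(a) Category 1 Adjective Morphology in Urdu
Sr  Morphological Tags  good       blue     green   fresh    third     harsh
1   +masc+sg            ach.ch-aa  neel-aa  har-aa  taaz-ah  teesr-aa  kaRw-aa
2   +masc+sg+obl        ach.ch-ey  neel-ey  har-ey  taaz-ey  teesr-ey  kaRw-ey
3   +masc+pl            ach.ch-ey  neel-ey  har-ey  taaz-ey  teesr-ey  kaRw-ey
4   +masc+pl+obl        ach.ch-ey  neel-ey  har-ey  taaz-ey  teesr-ey  kaRw-ey
5   +fem+sg             ach.ch-ee  neel-ee  har-ee  taaz-ee  teesr-ee  kaRw-ee
6   +fem+sg+obl         ach.ch-ee  neel-ee  har-ee  taaz-ee  teesr-ee  kaRw-ee
7   +fem+pl             ach.ch-ee  neel-ee  har-ee  taaz-ee  teesr-ee  kaRw-ee
8   +fem+pl+obl         ach.ch-ee  neel-ee  har-ee  taaz-ee  teesr-ee  kaRw-ee
(b) Category 2 adjectives: one form for all eight tag combinations (+masc+sg, +masc+sg+obl, +masc+pl, +masc+pl+obl, +fem+sg, +fem+sg+obl, +fem+pl, +fem+pl+obl)
round  red    red   old     naughty  hard worker
gaol   sorkh  laal  baasee  shareer  meHnatee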
In this Chapter, noun types and characteristics related to Urdu nouns have been
reviewed. The attributes and the values that these attributes can take
are listed in Table 5.11. These attributes and associated values may be helpful for the
morphological and syntactic analysis of Urdu nouns.
Table 5.11: AttributeValues for Urdu Nouns
Attribute  Values                                                    Comments
N-CLASS    common, proper                                            noun class
N-CONCEPT  abstract, group, spatial, temporal, instrumental,         noun semantic concept
           animate
N-TYPE     mass, count                                               noun type
GENDER     masculine, feminine                                       noun gender
N-FORM     nominative, oblique, vocative                             noun form
NUMBER     singular, plural                                          noun number
CASE       nominative, ergative, dative, accusative, instrumental,   noun case
           locative, travel, infinitive, participant, temporal
PERSON     first, second, third                                      person
BASELANG   Arabic, Persian, Hindi, Turkish, English                  base language
The attributes of Urdu nouns and their respective values are lexical in nature, as
they are associated with words; however, they are also syntactically important, because
they are used in various syntactic agreement requirements.
Chapter 6
ALGORITHMS FOR LEXICON IMPLEMENTATION
6.1 Introduction
This chapter reviews various algorithms and methods for efficient storage
and retrieval of a lexicon. The chapter is organized into two main parts. In the
first part, the lexicon is implemented and tested using hash tables with the least
consideration of morphology, so all word forms are stored separately in the hash table.
Hash table storage is efficient to access but requires more space in memory. In the
second part, lexical transducers, which are specialized finite state automata, are
considered for the storage of the Urdu lexicon. These are efficient in both time and
space but require morphological analysis of the language data.
6.2 Storage of Urdu Lexicon
High word lookup efficiency of the order of O(1), close to perfect hashing, can
be achieved using hash tables with appropriate hash functions. Hashing results in a
simpler and acceptable lexicon design at the cost of some extra space. A more compact
representation for a lexicon is a character tree structure, called a trie. Lexicon
storage using a trie reduces word search time as well as storage space compared to
a simple word list. Further compression is achieved by converting the trie
into a directed acyclic word graph (DAWG) (Ciura and Deorowicz 2001). The DAWG
could also be utilized for automatic separation of word stems from prefixes. The
search time for a DAWG is O(L), where L is the average length of words in the
lexicon. A simple DAWG can be used in a spell-checking application (Ciura and
Deorowicz 2001), but an MT application also requires morphological information.
For an MT application, a specialized form of DAWG called a lexical transducer is a
better choice (Beesley and Karttunen 2003). A lexical transducer is a form of DAWG that
maps an input surface word form to a lexical word form and vice versa. Urdu
morphology rules are inherited from many languages, such as Sanskrit, Arabic and
Persian, which makes a full morphology-based design of the Urdu lexicon, containing
inflected as well as derived words, a difficult task. A comparative study shows
that a lexical transducer implementation, because it requires morphological analysis,
is relatively more complex than hashing, but it is efficient in both search time and
storage space.
6.3 Storage in a Hash Table
Hashing is one of the solutions for the large dictionary problem. Although
retrieval from a hash table is very fast, taking one or a few steps, the main
problem with hashing is that all strings must be stored at full length, which
requires more memory space. A perfect hash function is one in which no
collisions occur, i.e., no two different words produce the same hash value. It
is difficult to find such a perfect hash function; however, a function that is close to
perfect hashing can be found. Hash functions are classified by the way they generate
hash values from data. In the addition method, the hash value is computed by traversing
each character of the word and continually incrementing an initial hash value.
The calculation done on the element value is usually a multiplication by
a prime number. In bitwise-shift hashing, as in the addition method, every character
of the word is used to construct the hash, but the value is calculated
through bitwise left and right shifts; the shift value is normally a prime number.
Some string hashing functions have been implemented and tested for the English
word list and for the Unicode-based Urdu word list, with dictionary sizes varying
from 17,000 words to 75,000 words; details are shown in Table 6.1. The
results listed in Table 6.2 show that a search efficiency close to perfect hashing
can be achieved. A basic algorithm to calculate the hash value is as follows;
other algorithms are modifications of this basic algorithm:
1:
2:
3:
4:
5:
Some of the hash functions are listed here which are used for hashing Urdu
lexical entries. The simple hash function based on addition method is:
hash = 0;
for(i=0; i<word.Length; i++)
hash = hash*29 + (word[i]-'A');
hashIndex = hash % hashSize;
We present some of the rotative hash functions. The JS hash function developed
by Justin Sobel (Partow), based on bitwise shifting, is:
hash = 1315423911;
for (int i = 0; i < word.Length; i++) {
    // mix the shifted hash and the current character into the hash
    hash ^= ((hash << 5) + word[i] + (hash >> 2));
}
// keep the value non-negative, then reduce it to a table index
hashIndex = (hash & 0x7FFFFFFF) % hashSize;
Table 6.1
Word List  Number of Words  HT Size  Average Word Length
English 1  74317            131071   8.5
English 2  25017            32749    7.2
Urdu 1     17476            32749    13.4
Urdu 2     49427            65521    10.9
As shown in Table 6.1, two Unicode-based text files containing Urdu word lists
and two text files containing English word lists are used for testing the hashing
functions. Table 6.1 shows the number of words in each file, the hash table (HT) size,
and the average word length in each file.
There are other hashing functions that are not included in this study, as the focus
of the study was the comparison of hashing as an alternative approach for lexicon
implementation. For the hash table size, the largest prime number smaller than
2^m was used, where N is the total number of words and m is the integer exponent
satisfying 2^(m-1) < N < 2^m; for example, N = 74317 gives m = 17 and a table
size of 131071, as listed in Table 6.1.
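The table-size rule can be checked against the sizes reported in Table 6.1. The sketch below assumes the reading in which the table size is the largest prime below 2^m with 2^(m-1) < N < 2^m, since that is the reading that reproduces the reported sizes; the function names is_prime and hash_table_size are introduced here for illustration only.

```python
def is_prime(n):
    """Trial-division primality test, adequate for table-sized numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def hash_table_size(num_words):
    """Largest prime below 2**m, where 2**(m-1) <= num_words < 2**m."""
    m = num_words.bit_length()
    candidate = 2 ** m - 1
    while not is_prime(candidate):
        candidate -= 1
    return candidate


if __name__ == "__main__":
    for n in (74317, 25017, 17476, 49427):
        print(n, hash_table_size(n))
```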
Table 6.2: Average Word Lookup Searches in a Hash Table
Hash Function  English 1  Urdu 2
Simple         1.7        2.8
RS             1.7        2.6
JS             1.7        2.7
ELF            6.4        3.3
DJB            1.7        3.0
AP             1.7        2.6
Table 6.2 shows the average number of lookup searches required to find a word
in a word list that is implemented as a hash table.
The results show that there are not many collisions and that the average number of
accesses per word is close to the perfect hashing value of one. This average is
calculated by accessing, one by one, all the words in the dictionary file. Linear
open addressing is used for collision resolution in this study.
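The measurement described above can be sketched as follows: every word is inserted with linear open addressing, then every word is looked up again and the number of slots examined is averaged. This is an illustrative sketch rather than the code used in the study; the function names are assumptions, the hash uses the character code directly instead of subtracting 'A', and the table is assumed to be larger than the word list so that probing always terminates.

```python
def simple_hash(word, table_size):
    """Addition-method hash: multiply by 29 and add each character code."""
    h = 0
    for ch in word:
        h = h * 29 + ord(ch)
    return h % table_size


def build_table(words, table_size):
    """Insert words using linear open addressing (step of one slot)."""
    table = [None] * table_size
    for word in words:
        i = simple_hash(word, table_size)
        while table[i] is not None and table[i] != word:
            i = (i + 1) % table_size
        table[i] = word
    return table


def probes(table, word):
    """Count the slots examined before the stored word is found."""
    i = simple_hash(word, len(table))
    count = 1
    while table[i] != word:
        if table[i] is None:
            raise KeyError(word)
        i = (i + 1) % len(table)
        count += 1
    return count


def average_probes(table, words):
    """Average number of lookup searches over all words in the list."""
    return sum(probes(table, w) for w in words) / len(words)


if __name__ == "__main__":
    words = ["hans", "bol", "paRh", "xareed", "deykh"]
    print(average_probes(build_table(words, 11), words))
```

A value of exactly 1.0 means no collisions occurred; each collision adds extra probes for the word that was displaced.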
6.4 Storage using Lexical Transducer
the word stems, prefixes and suffixes as well as a set of rules known as
morphotactics. The morphotactics tell how to combine the roots, stems, prefixes and
suffixes with each other to make meaningful words. For example, there are two words
to represent Muslim: moslem and mosalmaan. To make the antonym non-Muslim, we can
use the prefix Gayr with moslem to make Gayr moslem, but we cannot use it to make
Gayr mosalmaan. These rules that govern which affix can be joined with which stem
are known as morphotactics.
The following subsections cover some basic definitions and simpler data
structures used to introduce lexical transducers and to define a method for
automatically separating stems from affixes.
6.4.1 Trie Tree Structure
A trie, or character tree, is one way to store a lexicon with a smaller storage
space requirement than string storage using a linear list, a binary search tree or
a hash table. A trie is a tree for storing strings in which there is one node
for every common prefix. The strings are stored in extra leaf nodes.
Definition: For a given set of strings S = {A1, A2, ..., AN}, where each string
Ai contains characters from the given alphabet A = {a, b, c, ..., z}, the trie for
the given set S is defined recursively as:
Trie(S) = {Trie(S\1), Trie(S\2), ..., Trie(S\r)}
where S\j means the subset of S consisting of the strings that start with j, stripped
of their initial letter j; the recursion halts when S is empty, which results in an empty trie.
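The recursive definition can be sketched in a few lines. This is only an illustration: it marks the end of a word with a flag on the node instead of storing the string in an extra leaf node, and the names TrieNode, insert, contains and count_nodes are hypothetical.

```python
class TrieNode:
    """One trie node: children keyed by character, plus an end-of-word flag."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def insert(root, word):
    """Add `word`, reusing the nodes of any prefix that is already stored."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def contains(root, word):
    """Follow one branch per character; the cost is proportional to len(word)."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


def count_nodes(node):
    """Total number of nodes in the trie rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children.values())
```

For the three noun forms laRkaa, laRkee and laRkey, the common prefix laRk is stored only once, which is where the saving over plain string storage comes from.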
6.4.2 Finite State Automata
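Definition: A deterministic finite state automaton is a 5-tuple D = (Σ, S, s, F, δ),
where:
Σ: a finite alphabet of input symbols
S: a finite set of states
s ∈ S: the start state
F ⊆ S: the set of final states
δ(r, a): the transition function, giving the state reached from state r on symbol a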
A finite state machine with at most one transition for each symbol and state
combination is a deterministic finite state automaton (DFSA).
(r, ) r.
The language of an acyclic finite state automaton is finite. By merging all equivalent
sub-tries of a full trie into one, we get an acyclic DFSA.
A trie can be compressed to a minimal acyclic finite state automaton, which is
also known as a directed acyclic word graph (DAWG), by using algorithms (Daciuk
1998; Ciura and Deorowicz 2001). A directed acyclic graph represents the suffixes of
a given string, with each edge labeled with a character. The characters along a
path from the root to a node make up the substring that the node represents.
Definition: The deterministic finite state automaton D = (Σ, S, s, F, δ) is called
minimal for a given language L(D) ⊆ Σ* when, for every other deterministic finite
state automaton D' = (Σ, S', s', F', δ') having language L(D') = L(D), the
inequality |S| ≤ |S'| holds, where |S| represents the number of states. This means that a
minimal DFSA has the minimum number of states for the given language. For a non-empty
language, it is minimal if and only if every state is reachable from the starting state,
a final state is reachable from every state, and there are no different equivalent states.
There always exists a unique minimal automaton for a given language (Daciuk
1998).
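The merging of equivalent sub-tries can be sketched as follows. This is a simple bottom-up illustration, not the incremental algorithms of the cited works; the names State, build_trie, minimize, count_states and accepts are hypothetical, and morGee is added to the word list only for illustration. Two states are treated as equivalent when they agree on finality and on their outgoing transitions to already merged states.

```python
class State:
    """One automaton state: a finality flag plus labelled outgoing transitions."""

    def __init__(self):
        self.final = False
        self.edges = {}


def build_trie(words):
    """Build a trie whose word ends are marked as final states."""
    root = State()
    for word in words:
        state = root
        for ch in word:
            state = state.edges.setdefault(ch, State())
        state.final = True
    return root


def minimize(state, register):
    """Replace every sub-trie by one shared representative of its class."""
    for ch, child in list(state.edges.items()):
        state.edges[ch] = minimize(child, register)
    # Children are already canonical, so their identities describe the class.
    signature = (state.final,
                 tuple(sorted((ch, id(t)) for ch, t in state.edges.items())))
    return register.setdefault(signature, state)


def count_states(start):
    """Count the distinct states reachable from `start`."""
    seen, stack = set(), [start]
    while stack:
        s = stack.pop()
        if id(s) not in seen:
            seen.add(id(s))
            stack.extend(s.edges.values())
    return len(seen)


def accepts(start, word):
    """True if following `word` from `start` ends in a final state."""
    state = start
    for ch in word:
        state = state.edges.get(ch)
        if state is None:
            return False
    return state.final
```

For the words laRkaa, laRkee, morGaa and morGee the trie has 17 states, while the merged automaton has 11, because the endings -aa and -ee are shared and all words end in one final state.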
6.4.3 Implementation of Word Insertion
In the trie shown in Figure 6.1, each path from the root to a leaf represents a single
word and the branching in the tree represents successive characters. A trie is certainly
an improvement over plain string storage, but it can be observed that path compression
is possible by representing common suffixes as one path instead of many paths with
the same suffix. The result is then a graph instead of a tree. The terms
states and transitions from automata theory will be used instead of nodes and branches
from graph theory. In Figure 6.1, four paths that end in the same suffix can be
compressed into one with a single final state, which is drawn with a different symbol
from the simple circle that represents an ordinary state. This simple form of DFSA has
one start state and one final state.
One class of algorithms for acyclic DFSA construction works by minimization of
the trie (Daciuk 1998; Ciura and Deorowicz 2001), and another class of algorithms
builds the acyclic DFSA directly from the given set of strings (Mihov; Daciuk
1998). Inserting the same word list (92) results in an acyclic DFSA, which is shown in
Figure 6.2. We can see that the automaton created using the above algorithm is both
deterministic and acyclic. It has a single start state (with no incoming transition) and
a single final state (with no outgoing transition).
The parts of words from start state to suffix state are candidates for being stems.
For finding prefixes, the same algorithm is used, but first all strings are reversed and
finally each suffix found is also reversed, which results in the required prefix.
Implementation and testing of the algorithm showed that, although affixes and stems
are correctly identified, there is also noticeable false detection, and more work on
the algorithm is needed to reduce or remove it. Only those prefixes and suffixes are to
be retained for inclusion in the lexical transducer that have moderate frequency in the
list of words and, at the same time, have morphological significance.
6.5 Lexical Transducers
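[Figure: lexical transducer path for the word laRkaa. Lexical side: l a R k a a +N +sg +masc. Surface side: l a R k a a 0 0 0, where 0 marks an empty symbol.]
laRkaa+N+sg+masc:laRkaa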
Nouns, adjectives and verbs are morphologically open classes of words, some
morphological properties of which have been discussed in previous chapters.
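The pairing laRkaa+N+sg+masc:laRkaa can be sketched as a path of symbol pairs. This is a toy illustration of a single transducer path, not a compiled finite state network; the names EPSILON, PATH, side, analyze and generate are hypothetical.

```python
EPSILON = "0"  # empty symbol on one side of a pair

# One transducer path as (lexical symbol, surface symbol) pairs.
PATH = [
    ("l", "l"), ("a", "a"), ("R", "R"), ("k", "k"), ("a", "a"), ("a", "a"),
    ("+N", EPSILON), ("+sg", EPSILON), ("+masc", EPSILON),
]


def side(path, index):
    """Concatenate one side of a path, skipping empty symbols."""
    return "".join(pair[index] for pair in path if pair[index] != EPSILON)


def analyze(surface, paths):
    """Surface form to lexical strings (lookup in the upward direction)."""
    return [side(p, 0) for p in paths if side(p, 1) == surface]


def generate(lexical, paths):
    """Lexical string to surface forms (lookup in the downward direction)."""
    return [side(p, 1) for p in paths if side(p, 0) == lexical]
```

The same path serves both directions: analysis returns the stem with its tags, and generation returns the surface word.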
6.6 Conclusions
In this chapter, a few algorithms for lexicon implementation are discussed. The
simple word list, due to its large search time, is not an acceptable solution for
real MT applications. Table 6.2 shows the time efficiency of the hash table
implementation of a lexicon. The average number of word lookup searches for Urdu
words ranges from 2.6 to 3.3 for different hash functions, which is acceptable and can
be further improved by using a better collision resolution strategy than linear open
addressing. The advantages of the hash table lexicon implementation are the fast access
time, the lesser morphological knowledge requirement and the easier inclusion of
non-morphological word attributes. The only disadvantage is the greater space requirement,
which is not a big issue for current desktop computing standards.
Tries and directed acyclic word graphs have search time proportional to the
length of words and have lesser space requirements. Table 6.2 shows that average
Urdu word length is less than 15 characters; therefore, for a successful word search we
need about 15 comparisons, while an unsuccessful search in these branching
structures is even faster. These structures are useful for spell checking and automatic
stemming applications. A lexical transducer based lexicon implementation is best suited
for both search time and storage space requirements. However, the knowledge of
stems and affixes as well as morphotactics must be available for the lexical transducer
implementation through morphological analysis.
PART III
SYNTACTICAL ANALYSIS
AND
MODELING
Chapter 7
MODELING URDU NOMINAL SYNTAX
BY IDENTIFYING
CASE MARKERS AND POSTPOSITIONS
In Chapter 3 and Chapter 4, the morphology of various verb forms, noun
forms and adjective forms in Urdu, and the attributes associated with different
morphemes, have been analyzed and listed. These lexical attributes obtained through
morphology are very useful for syntactic analysis based on the Lexical Functional
Grammar. In the approach used in this research, morphological variations are handled
by using finite state transducers (Karttunen 1994; Beesley and Karttunen 2003).
Given the various word forms, the finite state transducers extract useful grammatical
information from the word morphemes. In LFG, these lexical attributes extracted by
finite state transducers become feature-value pairs at the feature-structure level. To
assign syntactic attribute values from the tags extracted by finite state transducers, a
form of mapping table is used. For example, the GEND attribute may get the value MASC
or FEM if the finite state tags have the value +Masc or +Fem for the word under
consideration. When constituent-structure nodes unify, the attributes at the leaf nodes,
which are obtained from lexical entries, unify to generate the overall f-structure.
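The mapping table and the unification step can be sketched as below. Only the +Masc/+Fem to GEND mapping comes from the text; the +Sg/+Pl entries, the flat (non-nested) feature structures and the names TAG_TO_FEATURE, features_from_tags and unify are assumptions made for illustration.

```python
# Mapping from finite state tags to (attribute, value) pairs.
# The NUM entries are assumed here; the text only names GEND.
TAG_TO_FEATURE = {
    "+Masc": ("GEND", "MASC"),
    "+Fem": ("GEND", "FEM"),
    "+Sg": ("NUM", "SG"),
    "+Pl": ("NUM", "PL"),
}


def features_from_tags(tags):
    """Turn transducer tags into feature-value pairs, ignoring unknown tags."""
    return dict(TAG_TO_FEATURE[t] for t in tags if t in TAG_TO_FEATURE)


def unify(f, g):
    """Unify two flat feature structures; return None on a value clash."""
    result = dict(f)
    for attr, value in g.items():
        if attr in result and result[attr] != value:
            return None
        result[attr] = value
    return result
```

Unification succeeds when the two structures carry compatible values and fails, for example, when a masculine and a feminine value meet on the same GEND attribute.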
In this chapter, the NP structure is analyzed and its syntactic combination with
various case markers/postpositions in Urdu is distinguished. A noun phrase (NP) in
Urdu is characterized by a rich case-marking system, which makes its free
phrase order possible. Case markers and postpositions are similar in nature and it is
not easy to find a definition that clearly separates the two. In this chapter, an approach
to distinguish various classes of case markers and postpositions is introduced. The
term case marker or case clitic is generally used for a word that appears with a
noun or a noun phrase such that the resultant phrase is a case-marked noun phrase.
For a postposition, the resultant phrase is a postpositional phrase that acts as an
adjunct to the verb phrase. Some terms that are referred to in this
chapter are defined below.
Transitivity refers to the number of objects a verb requires or takes in a
grammatically well-formed clause or sentence. The argument structure of a verb
always contains a subject and zero, one or two objects. Transitivity refers only to
the objects present in the argument structure of a verb. In grammar modeling theories
like X-bar and HPSG, a subject is treated as a specifier of the verb, while the object
noun phrases appear in complement position. Urdu, in contrast, has a flat phrase
structure with a rich case marking system, which allows a relatively free order of the
daughter phrases of a sentence, and the verb is a sister to the subject noun phrase. The
specifier and verb phrase thus do not appear in Urdu as in English.
Valency refers to the total number of arguments controlled by a predicate. Thus
verb valency counts all the arguments of the verb, including the subject, objects, oblique
case-marked noun phrases and complement phrases. Valency is more relevant for the
analysis of Urdu verb argument structures presented in this chapter for causative
verbs and for other cases that are marked with the marker sey.
Thematic role is the semantic relationship between a predicate (e.g. a verb) and
an argument (e.g. a noun phrase) of a sentence. There are different thematic roles
available in the literature and different authors agree on different roles. The more
widely used thematic roles are briefly reviewed here.
Agent is the one who deliberately performs the action, the one who is the
principal cause of the action and/or the one that controls the event, e.g., 'Hamid ate
the apple'. Experiencer is the one who is affected by sensory, emotional or abstract
input, or the one who is unconsciously participating in the event, e.g., 'Anjom is
shocked' and 'Hamid fears heights'. Beneficiary is the one who benefits from the action,
e.g., 'The teacher teaches Anjom' and 'The teacher gave Anjom the book'. Theme or
Patient is the role of the undergoer of an action, e.g., 'The boy crushed the snake'
and 'The teacher gave Anjom the book'. Instrument is a thing used to carry out the
action, e.g., 'Hamid cut the apple with the knife'. Location is the place in space and
time where the action occurs, e.g., 'Hamid plays cricket in the park'. Goal is the
person or place towards which the action is directed, e.g., 'Hamid is going to the
school' and 'He writes a letter to her'. Source is the person or place from where the
action is initiated, e.g., 'The rain is coming from the west' and 'He received a letter
from the principal'.
The thematic hierarchy presents the relative prominence among the various thematic roles.
The > sign means that the role on the left side has more prominence than the one on the
right side. There are variations in the literature; however, the more widely accepted
hierarchy (Bresnan 2001; Dalrymple 2001) is given in (94).
(94)
These thematic roles are mapped to the grammatical functions in the argument
structure of verbs. The mapping of grammatical functions and thematic roles is called
linking or mapping theory. There are many approaches for mapping, with theoretical
details in (Butt 2005); however, usually agent and experiencer roles are mapped to
subjects; patient and theme roles are mapped to objects; and goal/beneficiary roles are
mapped to indirect objects. Locative, instrument, source and goal roles fill oblique
arguments or are attached as adjuncts, as summarized below.
Grammatical function    Thematic roles
subject                 agent, experiencer
object                  patient, theme
indirect object         goal, beneficiary
oblique arguments       instrument, locative, source, goal
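The summary above can be written as a small lookup table. This is only a data sketch of the usual mapping, with the hypothetical names ROLE_TO_FUNCTIONS and candidate_functions; it does not model the adjunct option or any particular linking theory.

```python
# Usual mapping from thematic roles to candidate grammatical functions.
ROLE_TO_FUNCTIONS = {
    "agent": ["subject"],
    "experiencer": ["subject"],
    "patient": ["object"],
    "theme": ["object"],
    "beneficiary": ["indirect object"],
    "goal": ["indirect object", "oblique argument"],
    "instrument": ["oblique argument"],
    "locative": ["oblique argument"],
    "source": ["oblique argument"],
}


def candidate_functions(role):
    """Grammatical functions a role may fill; an empty list if the role is unknown."""
    return ROLE_TO_FUNCTIONS.get(role, [])
```

The goal role is the only one that appears in two rows of the summary, so it is the only role with two candidate functions here.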
This chapter presents data and analysis to show that the role of the case marker
sey is quite diverse and that it adopts various grammatical functions or thematic roles
in the argument structure of different verbs. The role of sey is described as versatile,
and it is treated as the instrumental case, which adopts different roles (Mohanan
1990; Butt and King 2002). The marker sey marks subjects, objects, instruments,
time and space nouns, postpositional phrases, adverbial phrases, etc. The analysis
presented in this chapter shows that semantic considerations simplify the classification
of these roles. It is also shown that the marker sey marks indirect subjects for
causative form 2 verbs. At the end, the chapter includes evidence of Urdu tetravalent
causative verbs and presents a model for their handling.
7.1 Classification of Case Markers and Postpositions
In most languages with case marking, the case marker is attached morphologically
at the lexical level. The Urdu-Hindi noun changes its form at the lexical level,
which is sometimes referred to as a case (Mohanan 1994; Arsenault 2002). Other
case markers in Urdu-Hindi that help in mapping the verb argument structure appear
as syntactic units. To distinguish between syntactic case marking, morphological case
marking and other postpositions, it is proposed that these be classified based on
the way they are handled or according to their function. The case marking and
postposition system in Urdu/Hindi has been divided into five categories: (i) noun
form, (ii) core case markers, (iii) oblique case markers, (iv) possession markers and
(v) pure postpositions. This division is primarily based on the difference in the
computational modeling required in each case.
The division of case markers may be based on morphological (lexical), structural
(syntactic) and functional (semantic) reasons. The division presented in
this work therefore borrows heavily from the division of case markers presented by
Butt and King (1999), which includes lexical, structural, semantic and quirky case.
However, the division presented in this work separates possession marking and also
uses semantic features to help distinguish core and oblique verb arguments. Figure 7.1
shows the hierarchical structure of the case markers and postpositions in Urdu and Hindi,
which are explained in the following sections.
[Figure 7.1: hierarchical structure of the case markers and postpositions in Urdu-Hindi]
Case Markers/Postpositions in Urdu-Hindi
  Lexical
    (i) Morphological Noun Forms
  Syntactic
    Case Marker (verb arguments)
      (ii) Core
      (iii) Oblique
    (iv) Possession Marker
    (v) Post-Position (adjuncts)
(95)
(a)  ghor-ey aor bakr-ey
     horse-sg.masc.obl and goat-sg.masc.obl
     'horses and goats'
(b)  *ghor aor bakr -ey
     *horse and goat -sg.masc.obl
     'horses and goats'
(96)
(a)  ghor-ey=ney aor bakr-ey=ney
     horse=erg and goat=erg
     'horses and goats'
(b)  ghor-ey aor bakr-ey =ney
     horse and goat =erg
     'horses and goats'
The lexical suffixes do not play a direct role in linking or mapping to the verb
argument structure, as the noun form alone cannot tell which grammatical function a
noun may adopt. The oblique form is used with case markers and postpositions, which
impart verb categorization features. However, the vocative form is used as the subject
in the imperative mood. As the vocative form is governed by the verb in the imperative
mood, it is the only example of lexical case in Urdu or Hindi. The
nominative form appears in the absence of a case marker or postposition. These forms
have already been discussed in the section on morphology, and are reproduced in
Table 7.1.
Table 7.1: Noun Forms in Urdu
Nominative (NOM)     Oblique (OBL)        Vocative (VOC)
laRkaa, boy          laRkey, boy          laRkey, boy
laRkee, girl         laRkee, girl         laRkee, girl
laRkey, boys         laRkaoN, boys        laRkao, boys
laRkeeaN, girls      laRkeeoN, girls      laRkeeo, girls
morGaa, rooster      morGey, rooster      laogao, people
kamrah, room         kamrey, room         bach.chao, children
The core case markers are those that assign nouns a universal grammatical
relation like subject, object and indirect object. These core grammatical relations in a
sentence are directly controlled by the verbal predicate and they help a noun find a
position in the argument structure of the verb. They are counted in verb transitivity as
well as in the valency of the verbal predicate. These core case markers will be discussed
in more detail later in this chapter. The case markers and the corresponding grammatical
relations are summarized as follows:
Case marker    Grammatical relation
no marker      subject, object
ney            subject
kao            object, indirect object, subject
sey            subject, object
The oblique case markers are those that assign a noun the oblique grammatical
relation associated with a semantic role; they are governable by the verbal predicate
through its argument structure. The noun phrase marked with an oblique case is not an
optional phrase in a sentence, as its presence is predictable from the argument
structure of a verb, in contrast to an optional postpositional phrase, which is not
predictable from the argument structure of the verb. As English does not have a case
marking system, the oblique arguments of the verbal predicate are treated as
prepositional phrases. In languages with strong case marking, like Urdu, the oblique
arguments may be treated as case marked rather than as simple postpositional phrases.
For some Australian languages, such as Warlpiri, case-marked oblique phrases have
been observed (Nordlinger 1998). A few markers that act as oblique case markers
are:
(98)   laRk-ey=ney       ferej=sey       paanee      nekaal-aa
       boy-sg.masc=erg   fridge=source   water=nom   take out-perf.sg.masc
       'The boy took the water out from the fridge.'
(99)   aadmee=ney        kamrey=meyN     saamaan     rakh-aa
       man-sg.masc=erg   room=dest       luggage     put-perf.sg.masc
       'The man put the luggage in the room.'
(100)  *laRk-ey=ney      ferej=meyN      paanee      nekaal-aa
       boy-sg.masc=erg   fridge=dest     water=nom   take out-perf.sg.masc
       '*The boy took out the water in the fridge.'
(101)  *aadmee=ney       kamrey=sey      saamaan     rakh-aa
       man-sg.masc=erg   room=source     luggage     put-perf.sg.masc
       '*The man put the luggage from the room.'
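The contrast in (98) to (101) can be sketched as a lexicon check in which each verb states the oblique role it governs. The infinitive spelling rakh-naa, the role labels and the names OBLIQUE_REQUIREMENT, MARKER_ROLE and oblique_ok are assumptions for illustration; the sketch covers only these two verbs and two markers.

```python
# Oblique role that each verb governs through its argument structure.
OBLIQUE_REQUIREMENT = {
    "nekaal-naa": "source",  # to take out: needs a source location
    "rakh-naa": "dest",      # to put: needs a destination location
}

# Role contributed by each oblique marker in these examples.
MARKER_ROLE = {
    "sey": "source",
    "meyN": "dest",
}


def oblique_ok(verb, marker):
    """True if the marker supplies the oblique role that the verb requires."""
    required = OBLIQUE_REQUIREMENT.get(verb)
    return required is not None and MARKER_ROLE.get(marker) == required
```

Under these assumptions the check accepts (98) and (99) and rejects (100) and (101).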
However, for a few liquid objects, the verb nekaal-naa may sometimes be used
with a destination location, and the sentence is well formed without mentioning a source:
(102)  laRk-ey=ney       cap=meyN   (X=sey)    chaaey    nekaal-ee
       boy-sg.masc=erg   cup=dest   X=source   tea=nom   take out-perf.sg.fem
       'A boy took out tea in a cup (from a teapot).'
[Table: columns Gender (masc, fem, masc) and Number (sg, pl)]
7.1.5 Postpositions
The pure postpositions are those that are not controlled by verbal predicates, and
a sentence is complete in its meaning with or without postpositional phrases.
Postpositional phrases are optional in the sense that they are not controlled by the
argument structure of the verb. They are, therefore, counted neither in the transitivity
nor in the valency of a verb. A larger list of postpositions in Urdu is given in Chapter
10. Semantic features of nouns, as employed for case markers, are also important
for better machine translation of postpositional adjunct phrases from one natural
language to another.
(103)  aadmee             kamrey=meyN   khaanaa
       man-sg.masc=nom    room=loc.     food
       'The man is eating food in the room.'
For example, the sentence in (103) is complete even if the postpositional phrase
kamrey meyN (in the room) is omitted. The postpositional phrases add information
to the event but are not directly related to the argument structure of the
verbal predicate. There may be zero or more postpositional phrases, which appear as a
set of adjuncts to a verbal predicate.
7.2 Urdu Case Marking Phrase Structure
(104)  laRk-ey=ney
       boy-sg.obl.masc=erg
[Figure: phrase structure trees for laRk-ey=ney, built from the nodes NP, N (laRk-ey) and K (ney)]
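[Figure 7.3: HPSG based phrase structure rule, proposal 1]
[phrase, HEAD noun [AGR [1], CASE [2]], VAL [SPR ⟨ ⟩, COMPS ⟨ ⟩]]
  →  H: [3] [phrase, HEAD noun [AGR [1], FORM [4]], VAL [SPR ⟨ ⟩, COMPS ⟨ ⟩]]
     M: [word, HEAD casemarker [CASE [2]], VAL [SPR ⟨ [3] NP [FORM [4] obl] ⟩, COMPS ⟨ ⟩]]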
The HPSG based phrase structure rule shown in Figure 7.3 is proposed. The
head daughter (H) is a noun or a noun phrase. The mother NP gets its agreement (AGR)
features, like gender and number, from the head daughter (H) and gets its CASE feature
from the case marker (M). The boxed number 1 on the AGR feature of the mother and
the daughter noun phrase states that these values are the same. Similarly, the
boxed number 2 expresses that the CASE value of the mother NP is required to match
the value of the same attribute of the case marker M. The noun phrase is proposed in
this rule as the specifier of the case marker, which means that whenever there is an
overt case marker, the noun or noun phrase numbered 3 is required. In this rule, the case
marker selects the noun, but the resultant phrase is a noun phrase because the head of
the phrase is designated a noun phrase. With the restriction using the number 4, the
attribute FORM of the specifier of the case marker must match the same attribute of the
noun phrase that fills the specifier slot. This is necessary for the noun-case agreement
requirement that the oblique form of a noun (or a noun phrase) is needed with case
markers.
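(105)  *laRk-ey=ney=ney
       *boy-sg.obl.masc=erg=erg
[Figure 7.4: HPSG based phrase structure rule, proposal 2]
[phrase, HEAD noun [AGR [1], CASE [2]], VAL [SPR ⟨ ⟩, COMPS ⟨ ⟩]]
  →  H: [3] [phrase, HEAD noun [AGR [1], FORM [4] obl, CASE [5] nom], VAL [SPR ⟨ ⟩, COMPS ⟨ ⟩]]
     M: [word, HEAD casemarker [CASE [2]], VAL [SPR ⟨ [3] NP [FORM [4], CASE [5]] ⟩, COMPS ⟨ ⟩]]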
The HPSG based rule of proposal 1 may be used to form noun phrases using the
case markers in Urdu. However, in the absence of a case marker, the default case of
the noun phrase needs to be nominative, which is not mentioned in the rule.
Moreover, the rule may apply recursively and could generate sentences with
cascaded case markers as shown in (105), which means that it may register
more than one case marker for a single noun phrase. To handle these conditions,
the following is proposed. The attribute CASE of each lexical noun is assigned the
value nominative by default, along with the extra constraint that the noun phrase 3 in
the specifier argument of the case marker (M) must have nominative CASE,
through 5, in addition to oblique FORM, through 4. The extended rule, proposal 2,
is shown in Figure 7.4. It takes care of the default nominative case requirement for a
noun phrase in the absence of a case marker and at the same time avoids the recursive
inclusion of cascaded case markers.
It may be noted that the proposal 2 rule does not include the genitive or
possessive marker. It is assumed that the possessive markers are distinct from the
case markers due to the characteristics presented in section 7.1.4 and therefore require
separate treatment. The phrase structure rules for possessive markers are proposed in
section 7.5.
The LFG based phrase structure rule for case marked noun phrase is shown in
(106), which describes that a mother noun phrase (NP) can be constructed with a noun
phrase (NP) followed by a case marker (CM).
(106)  NP  →  NP                        CM
              (↑ NUM) = (↓ NUM)         (↑ CASE) = (↓ CASE)
              (↑ GEND) = (↓ GEND)       (↑ FORM) =c oblique
              (↑ FORM) = (↓ FORM)
              (↓ CASE) =c nom
The functional schemata attached to the daughter NP state that the mother NP's
f-structure takes the NUM, GEND and FORM attributes from the f-structure of the
daughter NP. The daughter NP has a constraint that its CASE value be nominative,
which is needed to avoid the cascaded inclusion of case markers. The functional schemata
attached to the CM node express that the mother NP's CASE value is taken
from the f-structure of CM. A constraint equation at the CM node checks that the FORM
attribute of the mother NP has the value oblique. Indirectly, this constraint applies to
the FORM attribute of the daughter NP, as the mother NP has taken this oblique value
from the daughter NP.
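The effect of rule (106) can be sketched with flat feature dictionaries. This is only an illustration of the constraints, not an LFG parser; the function name attach_case_marker and the dictionary encoding are hypothetical.

```python
def attach_case_marker(noun_phrase, marker_case):
    """Build the mother NP's features, or return None if a constraint fails."""
    # (↓ CASE) =c nom on the daughter NP: blocks cascaded case markers.
    if noun_phrase.get("CASE") != "nom":
        return None
    # (↑ FORM) =c oblique at the CM node: the noun must be in oblique form.
    if noun_phrase.get("FORM") != "oblique":
        return None
    # NUM, GEND and FORM are passed up from the daughter NP.
    mother = {k: noun_phrase[k] for k in ("NUM", "GEND", "FORM") if k in noun_phrase}
    # (↑ CASE) = (↓ CASE) at the CM node: CASE comes from the case marker.
    mother["CASE"] = marker_case
    return mother
```

Applying the function once to an oblique, nominative-by-default noun yields a case-marked NP; applying it again to the result fails, which mirrors the rejection of (105).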
7.3 Analysis for Urdu Case Markers
The following sections present the analysis of Urdu case markers along with
example sentences. The nominative, ergative, dative and accusative cases
have been analyzed extensively in the literature (Mohanan 1994; Butt and King 2002).
A brief review of these case markers is included with a somewhat different
analysis, which includes semantic features of nouns and uses verb valency instead of
verb transitivity. A detailed analysis of the case marked with the sey marker and
its role in causative verbs is then given using semantic features of nouns and verb
valency.
If there is no case marker with the noun (or the noun phrase), the noun is said to
be in nominative case, which is the default case for noun phrases, as shown in (107)
below. Here both boy and book are in nominative form, which assume subject and
object functions respectively.
(107)  laRk-aa             ketaab     xareed-ey      g-aa
       boy-sg.masc=nom     book=nom   buy-subj.obl   AUX-future-sg.masc
       'A boy will buy a book'
(108)  S  →  NP                            NP
             (↑ SUBJ) = ↓                  (↑ OBJ) = ↓
             (↓ CASE) = nom                (↓ CASE) = nom
             (↓ N-CONCEPT) =c animate
The example contains two nominative NPs in a sentence, and both NPs could fill the
subject or the object slot of the verb's argument structure. The subject slot should be
filled with an agent. For nominative subjects, the LFG rule shown in (108) includes a
constraint that an NP can fill the subject slot only if it has the value animate for the
noun concept attribute. The agreement between a verb and a noun is with the highest
nominative argument in the argument structure of the verb. In this example, therefore,
according to the thematic hierarchy shown in (94), the agent (subject) assumes the higher
role and the verb agreement is with laRkaa (the boy), instead of the object ketaab (the
book), which assumes the lower role. The f-structure for the sentence in (107) is shown
in Figure 7.5, where both subject and object have nominative case but the animate
attribute helps to find that a boy is more suitable as a subject.
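The animacy constraint of rule (108) can be sketched for a clause with two nominative NPs. The dictionary encoding and the function name assign_functions are hypothetical, and the sketch deliberately returns None when the animacy constraint alone does not single out a subject.

```python
def assign_functions(np1, np2):
    """Pick SUBJ and OBJ among two nominative NPs using the animate constraint."""
    animate = [np for np in (np1, np2) if np.get("N-CONCEPT") == "animate"]
    if len(animate) != 1:
        return None  # no candidate or two candidates: animacy does not decide
    subj = animate[0]
    obj = np2 if subj is np1 else np1
    return {"SUBJ": subj["PRED"], "OBJ": obj["PRED"]}
```

For sentence (107) the result does not depend on word order, which is in line with the relatively free phrase order of Urdu.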
[Figure 7.5: f-structure for sentence (107), with the attributes PRED, SUBJ [SPEC 'a', CASE nom], OBJ [SPEC 'a', CASE nom] and TENSE future]
A noun phrase marked with the case marker ney expresses the role of an actor
or agent that fills the subject argument in the list of grammatical functions. The
ergative case appears for verbs in a perfective form having valency greater than one.
An example is shown in sentence (109) for the transitive verb xareed-naa (to buy).
(109)  laRkey=ney          ketaab      xareed-ee
       boy-sg.masc=erg     book.nom    buy-perf.sg.fem
       'A boy bought a book.'
The example contains one ergative and one nominative argument in the
sentence. The verb-noun agreement is with the highest nominative argument in the
argument structure of the verb according to the thematic hierarchy shown in (94). In
this example, the subject NP is ergative and the object NP is nominative. Therefore, the
verb agreement is with the object ketaab (the book).
As a general rule, the ergative case marker ney is not used with intransitive
verbs, but there are a few exceptions to this rule for intransitive (monovalent) verbs
like thook-naa (to spit) and moot-naa (to piss), for which the case marker ney is required
and the nominative form is not acceptable. The acceptable and unacceptable usage of
the ergative case for intransitive verbs is shown in (110).
(110)
(a)  woh geraa
     He=nom fall.perf
     'He fell'
(b)  *aes ney geraa
     He=erg fall.perf
     'He fell'
(c)  woh saoyaa
     He=nom sleep.perf
     'He slept'
(d)  *aes ney saoyaa
     He=erg sleep.perf
     'He slept'
(e)  mozafar daraa
     Mozafar=nom scare.perf
     'Mozafar got scared'
(f)  *mozafar ney daraa
     Mozafar=erg scare.perf
     'Mozafar got scared'
(g)  Zafar ney thookaa
     Zafar=erg spit.perf
     'Zafar spat.'
(h)  *Zafar thookaa
     Zafar=nom spit.perf
     'Zafar spat.'
(i)  bakree ney mootaa
     Goat=erg.fem piss.perf
     'Goat pissed'
(j)  *bakree mootee / *bakraa mootaa
     Goat=nom piss.perf
     'Goat pissed'
Some intransitive verbs listed in (111) are usually used without the ergative case,
but they are also known to be acceptable in the ergative case for deliberate and
purposeful actions (Abdul-Haq 1991; Mohanan 1994; Butt and King 2002). A brief survey
was carried out to check contemporary Urdu usage in Lahore and Islamabad, in which the
sentences shown in (111) were presented to a few people. It was found that the ergative
form is scarcely acceptable in a volitional sense for these verbs, and that to show a
volitional effect it is better to use the participial adverbial conjunctive jaan boojh kar
(deliberately).
It is not a general rule that using ergative subjects with intransitive verbs
expresses a volitional effect, and only a few intransitive verbs may require an ergative
subject in perfective tenses to show a volitional effect. It is therefore suggested that
we use the general rule that intransitive verbs of Urdu require nominative subjects. If
there are intransitive verbs that can be used with ergative subjects, they may be
specifically marked for the ergative requirement in the lexicon.
(111)
(a)  woh nahaayaa
     He=nom bathe.perf
     'He bathed'
(b)  *aes ney nahaayaa
     He=erg bathe.perf
     'He bathed (deliberately).'
(c)  woh khaansaa
     He=nom cough.perf
     'He coughed'
(d)  ?aes ney khaansaa
     He=erg cough.perf
     'He coughed (deliberately).'
(e)  Zafar chheenkaa
     Zafar=nom sneeze.perf
     'Zafar sneezed'
(f)  ?Zafar ney chheenkaa
     Zafar=erg sneeze.perf
     'Zafar sneezed (deliberately).'
(g)  mozafar cheeKhaa
     Mozafar=nom scream.perf
     'Mozafar screamed'
(h)  ?mozafar ney cheeKhaa
     Mozafar=erg scream.perf
     'Mozafar screamed (deliberately).'
(i)  woh chelaayaa
     He=nom shout.perf
     'He shouted'
(j)  ?aes ney chelaayaa
     He=erg shout.perf
     'He shouted (deliberately).'
(112)
(a)  Zafar ney sheeshee taoRee
     Zafar=erg bottle=nom break.perf
     'Zafar broke the (glass) bottle.'
(b)  *Zafar sheeshee taoRee
     Zafar=nom bottle=nom break.perf
     'Zafar broke the (glass) bottle.'
(c)  mozafar ney aam khaayaa
     Mozafar=erg mango=nom eat.perf
     'Mozafar ate the mango.'
(d)
(e)  mayN ney baat samjhee
     I=erg communication=nom comprehend.perf
     'I comprehended the communication'
(f)  *mayN baat samjhaa
     I=nom communication=nom comprehend.perf
(g)  mayN ney paRhnaa seekhaa
     I=erg read=nom learn.perf
     'I learned reading'
(h)
Transitive and ditransitive verbs (that is, verbs having valency greater than
one, which includes tetravalent verbs) require, when they appear in perfective form,
subjects marked with the case marker ney, i.e., ergative subjects. The sentences shown
in (112) employ transitive verbs in perfective form; the sentences with a nominative
subject are not acceptable, while the sentences with an ergative subject are acceptable.
However, a few exceptions exist for divalent verbs, which require nominative subjects
even in perfective forms; examples are shown in (113).
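The general rule and its lexically marked exceptions can be sketched as a small decision function. This is a simplification: it defaults to nominative outside the perfective and ignores dative subjects. The infinitive spellings laa-naa, sharmaa-naa and ger-naa and the names ERGATIVE_INTRANSITIVES, NOMINATIVE_EXCEPTIONS and subject_case are assumptions for illustration.

```python
# Intransitive verbs marked in the lexicon as requiring an ergative subject.
ERGATIVE_INTRANSITIVES = {"thook-naa", "moot-naa"}

# Divalent verbs marked in the lexicon as keeping a nominative subject.
NOMINATIVE_EXCEPTIONS = {"laa-naa", "sharmaa-naa"}


def subject_case(verb, valency, perfective):
    """Choose between a nominative and an ergative subject for a verb form."""
    if not perfective:
        return "nom"  # simplification: ergative is tied to the perfective form
    if verb in ERGATIVE_INTRANSITIVES:
        return "erg"
    if verb in NOMINATIVE_EXCEPTIONS:
        return "nom"
    return "erg" if valency > 1 else "nom"
```

Keeping the exceptions as explicit lexicon sets follows the suggestion that such verbs be specifically marked in the lexicon rather than covered by the general rule.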
(113)
(a)  woh ketaab laayaa
     He=nom book=nom bring.perf
     'He brought the book'
(b)  *aes ney ketaab laayee
     He=erg book=nom bring.perf
     'He brought the book'
(c)  woh shaadee sey sharmaayaa
     He=nom marriage=inst embarrass.perf
     'He was embarrassed by the marriage'
(d)  ?aes ney shaadee sey sharmaayaa
     He=erg marriage=inst embarrass.perf
     'He was embarrassed by the marriage'
(114)  ney   (↑ CASE) = ergative
             (↑ N-SEM N-CONCEPT) =c animate
             ((SUBJ ↑) V-FORM) =c perfect
             ((SUBJ ↑) V-VAL) ~= 1
             ((SUBJ ↑) SUBJ) = ↓
[Figure: f-structure with SUBJ [CASE erg], OBJ [CASE nom], TENSE past, V-FORM perfect and a V-VAL attribute]
The functional schema for the LFG based lexical entry of ney, which marks the
ergative case, is shown in (114). The first equation expresses that
the CASE attribute of the mother NP has the value ergative. The second equation, which
is a constraint equation, requires that the mother NP's semantic attribute have the
value animate; this verifies that the ergative case is assigned only to animate
nouns and that inanimate nouns are not marked with the ergative case. The third equation
constrains the verb form to be perfect. The notation (SUBJ ↑) is for inside-out
functional uncertainty, which is used to refer to an f-structure by traversing inside-out
through the hierarchy of f-structures until the required f-structure having the attribute
SUBJ is found. The next constraint equation checks that the verb valency for the ergative
case is not one; the verb valency attribute V-VAL can therefore take the values 2, 3 or
4 for Urdu verbs. The last equation expresses that the noun marked with the ergative case
fills the subject argument of the verb.
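The constraints in the lexical entry of ney can be sketched as a check over flat feature dictionaries. This is an illustration of the conditions only, with no real inside-out traversal; the function name license_ney and the encoding of the clause are hypothetical.

```python
def license_ney(noun, clause):
    """Return the clause with an ergative SUBJ, or None if ney is not licensed."""
    if noun.get("N-CONCEPT") != "animate":      # only animate nouns take ney
        return None
    if clause.get("V-FORM") != "perfect":       # the verb form must be perfect
        return None
    if clause.get("V-VAL") not in (2, 3, 4):    # valency must not be one
        return None
    subject = dict(noun, CASE="ergative")       # ney contributes CASE = ergative
    return dict(clause, SUBJ=subject)           # the marked noun fills SUBJ
```

Nouns that are animated by an external force, as discussed next, would pass this check only if their N-CONCEPT value is set to animate in the lexicon.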
Sometimes, apparently inanimate nouns are assigned the ergative case to mark
them as agents, although this case can be assigned only to animate nouns. These nouns
are not intrinsically animate, but there is some external force or power that imparts
the animate attribute to them. The use of the ergative case for such externally animated
nouns is shown in (115) and (116), and it is assumed that these nouns have the semantic
feature value animate, which allows them to be used in the ergative case.
(115)
a
a a
a a
rayl gaaRee=ney mojhey laahaor
pohanch-aa
dee-aa
train-sg.masc=erg me.pron Lahore.nom help reach-caus1.perf.sg.mas completely
The train caused me reach Lahore.
(116)
a a
a
zzalzzaley=ney
makaan
ger-aa
dee-aa
earthquake-sg.masc=erg house.nom cause fall-caus1.perf.sg.mas completely
The earthquake caused the house fall.
In the dative case, a noun phrase marked with the case marker kao expresses the role
of an indirect object, recipient, beneficiary or receiver as the third argument in the
argument structure of ditransitive verbs, where the other two arguments are the
subject and the object. An Urdu sentence with the dative case is shown in (117), where
'book' is the direct object and the receiver 'boy' is the indirect object marked with
the dative case.
(117) mayN=ney laRk-ey=kao ketaab d-ee
I=erg boy-sg.obl=dat book.nom give-perf.sg.fem
I gave the book to the boy.
(118) laRkey=kao sardee
boy-sg.obl=dat cold.nom
The boy is feeling cold.
(119) laRkey=kao boxaar hao+ga-yaa hay
boy-sg.obl=dat fever.nom happened-perf.sg.masc AUX-pres
The boy has got fever.
PRED
SUBJ      [ NUM sg, CASE erg, N-SEM [ N-CONCEPT animate ] ]
OBJ GOAL  [ CASE dat, N-FORM oblique, N-SEM [ N-CONCEPT animate ] ]
OBJ       [ CASE nom ]
TENSE     past
V-FORM    perfect
V-VAL
Urdu verbs that express a feeling or a change of state of someone do not take
ergative or nominative subjects in their argument structure; they employ the dative
case for subjects, as shown in (118) and (119). Some Urdu verbs that express physical
feelings such as cold (sardee), heat (garmee), hunger (bhook) and thirst (peyaas) are
used in the dative pattern shown in (118). Similarly, a change of state of the subject
is expressed with the dative case as in (119), for verbs such as fever (boxaar),
headache (sar daard), love (peyaar) and hate (nafrat).
Example (120) shows the dative case used to represent an unwilling agent. Here the
dative noun acts as the subject when the infinitive verb form is used with the
auxiliary (or light verb) paR-aa, which represents a forced mood. Another sentence
mood represents a willing agent who has an obligation to do something. This
obligation mood takes a dative subject as shown in (121), where the infinitive form
is used with the present auxiliary hay. The obligation mood is sometimes used with an
ergative subject with the same semantics, but the dative subject is preferred.
(120) Haamed=kao sakool jaanaa paRaa
Hamid-sg=dat school.nom go-inf.sg.masc AUX-forced mood
Hamid went to the school (unwillingly, forcefully).
(121) Haamed=kao sakool jaanaa hay
Hamid-sg=dat school.nom go-inf.sg.masc AUX-pres
Hamid has to go to the school (as a duty, obligation or responsibility).
Example (122) shows a dative agent assuming the subject role in a sentence in the
suggestion mood. This mood uses the infinitive form followed by the mood auxiliary
chaah-ee-ey, which signals recommendation, advisability or suggestion for the agent.
This auxiliary is translated as 'should' in English.
(122) Haamed=kao sakool jaanaa chaaheeey
Hamid-sg=dat school.nom go-inf.sg.masc AUX-suggestion mood
Hamid should go to the school.
The features and constraints applied by kao for the dative case are shown in the
LFG-based lexical entry (123). The first line of the entry assigns the value dative to
the CASE attribute of the mother node. The second line constrains the semantic concept
of the noun to be animate, which means that the dative case can be assigned only to
animate nouns. The curly brackets { and } group alternative choices, which are
separated by the or symbol |. The first choice uses inside-out functional uncertainty
to refer to an outer f-structure having the attribute OBJgoal; in that f-structure the
verb valency is constrained to the value 3, which means that the dative case fills the
object-goal function if the verbal predicate of the corresponding f-structure is
ditransitive. The second choice uses inside-out functional uncertainty to refer to the
outer f-structure with the SUBJ attribute; there the verb valency attribute V-VAL is
constrained to the value 2, which means that a dative subject occurs with transitive
verbs.
(123) kao
(K CASE) = dative
(K N-SEM N-CONCEPT) =c animate
{
((OBJgoal K) V-VAL) =c 3
(OBJgoal ($) K)
|
((SUBJ K) V-VAL) =c 2
(SUBJ ($) K)
}
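The two choices in entry (123) both rely on finding the outer f-structure that contains the marked noun. A minimal Python sketch of that lookup follows, assuming f-structures are modelled as nested dictionaries; the helper names inside_out and dative_function are hypothetical, and the set-membership notation of the entry is ignored for simplicity.

```python
def inside_out(root, attr, inner):
    """Return the f-structure g reachable from root such that g[attr] is inner."""
    stack = [root]
    while stack:
        g = stack.pop()
        if g.get(attr) is inner:          # identity test: the very same f-structure
            return g
        stack.extend(v for v in g.values() if isinstance(v, dict))
    return None


def dative_function(root, noun):
    """Return "OBJgoal" or "SUBJ" if entry (123) licenses kao on noun, else None."""
    if noun.get("N-SEM", {}).get("N-CONCEPT") != "animate":
        return None                       # (K N-SEM N-CONCEPT) =c animate fails
    for attr, valency in (("OBJgoal", 3), ("SUBJ", 2)):
        outer = inside_out(root, attr, noun)
        if outer is not None and outer.get("V-VAL") == valency:
            return attr
    return None


boy = {"CASE": "dative", "N-SEM": {"N-CONCEPT": "animate"}}
give = {"PRED": "give", "V-VAL": 3, "OBJgoal": boy}
print(dative_function(give, boy))  # OBJgoal
```

A real LFG system keeps the inverse links implicitly through unification; the explicit search here is only a way to make the direction of the lookup visible.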
The accusative case of a noun or noun phrase is marked with the case marker kao,
which expresses a direct object, undergoer or patient, usually of a transitive verb.
The accusative marker kao is phonetically identical to the dative marker; however, it
marks a different grammatical function and therefore represents a separate case. The
object in the accusative case typically becomes the subject under passivization. An
example is given in sentence (125), in which 'dog' is in the accusative case and
occupies the patient (mafAool), or object, grammatical function in the argument
structure of the verb. The accusative case is mostly used with transitive verbs to
mark the object, while the dative case is used with ditransitive verbs to mark the
indirect object. The accusative case normally marks animate nouns as objects, just as
the ergative case marks animate nouns as subjects. The accusative marker is necessary
especially for proper animate nouns.
The use of accusative kao with animate nouns is dictated by the verb argument
structure. Sentences (125) to (128) illustrate that both the nominative and the
accusative case can appear in the same structure, and that the permitted case is
dictated by the argument structure of the respective verb. The examples show
phonetically identical verbs with different meanings and argument structures. In (125)
'dog' is in the accusative case, while in (126) it is in the nominative case. The verb
maar-aa is not the same in the two sentences: in (125) it means 'to beat', while in
(126) it means 'to kill'. The verbs 'beat' and 'kill' require different cases to fill
the object role in their argument structures, as shown by the lexical entries in
(124), where 'beat' requires the accusative case and 'kill' requires the nominative
case for the object. Similarly, in sentence (127) the causative verb 'to help someone
take a bath' requires the accusative case, while the causative verb 'to make someone
fly' in (128) requires a nominative object.
(124)
PRED
SUBJ
OBJ
TENSE
V-FORM
V-VAL
CASE
erg
N-CONCEPT animate
N-SEM
proper
N-CLASS
CASE
acc
past
perfect
(125) aakmal=ney kott-ey=kao maar-aa
Akmal=erg dog-sg.obl=acc beat-perf.sg.masc
Akmal beat a dog.
(126) aakmal=ney kott-aa maar-aa
Akmal=erg dog-sg.masc=nom kill-perf.sg.masc
Akmal killed a dog.
(127) bach.ch-aa bakr-ee=kao nehl-aa-taa hay
child-sg.masc=nom goat-sg.fem=acc bath-make.caus1-repeat.sg.masc AUX.pres
A child habitually gives a bath to a goat.
(128) bach.ch-ey=ney kabootar aoR-aa-yaa
child-pl.masc=erg pigeon-sg.masc=nom fly-make.caus1-perf.sg.masc
A child made the pigeon fly.
The argument structures of the verbs used in the above example sentences are shown in
(129) for the particular verb forms and tenses used in the examples. It is assumed
that the argument structure dictates the case selection.
(129) maar-aa <agent: ergative case, patient: accusative case>
maar-aa <agent: ergative case, patient: nominative case>
nehl-aa-taa <agent: nominative case, patient: accusative case>
aoR-aa-yaa <agent: ergative case, patient: nominative case>
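The argument structures in (129) can be stored as a small lookup table, so that the case of the object disambiguates phonetically identical verbs. This is a sketch under stated assumptions: the table name CASE_FRAMES, the function senses_for and the English sense labels (taken from the discussion of (125) to (128)) are introduced here for illustration.

```python
# Hypothetical case frames keyed by (verb form, sense), following (129).
CASE_FRAMES = {
    ("maar-aa", "beat"): {"agent": "ergative", "patient": "accusative"},
    ("maar-aa", "kill"): {"agent": "ergative", "patient": "nominative"},
    ("nehl-aa-taa", "bathe"): {"agent": "nominative", "patient": "accusative"},
    ("aoR-aa-yaa", "make fly"): {"agent": "ergative", "patient": "nominative"},
}


def senses_for(verb_form, patient_case):
    """Return the senses of verb_form whose frame takes a patient in patient_case."""
    return [
        sense
        for (form, sense), frame in CASE_FRAMES.items()
        if form == verb_form and frame["patient"] == patient_case
    ]


print(senses_for("maar-aa", "accusative"))  # ['beat']
print(senses_for("maar-aa", "nominative"))  # ['kill']
```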
The accusative kao is also known to signal specificity (Butt and King 2002) for
inanimate objects (and sometimes for animate objects), as shown in (130). Moreover, if
the verb has no nominative argument, as in (125) and (130), the default verb
agreement is singular masculine. When sentence (130) was presented to native speakers
of Urdu in Lahore and Islamabad, it was found that the specifier is either missing or
implied by default in the sentence (possibly an instance of pro-drop). The more
acceptable form of sentence (130) is shown in (131). For unspecified objects,
sentence (132) is more acceptable. Therefore, it is suggested that kao is not itself a
marker of specificity; rather, a missing or implied pronoun generates the specificity
attribute and requires kao to accompany it.
(130) ? laRk-ey=ney ketaab=kao xareed-aa
boy-sg.masc=erg book.sg.fem=acc buy-perf.sg.masc
The boy bought the/this/that (particular) book.
(131) laRk-ey=ney aes ketaab=kao xareed-aa
boy-sg.masc=erg this.spec book.sg.fem=acc buy-perf.sg.masc
The boy bought this (particular) book.
(132) laRk-ey=ney ketaab xareed-ee
boy-sg.masc=erg book.sg.fem=nom buy-perf.sg.fem
The boy bought a book.
The LFG-based lexical entry for kao expressing the accusative case is shown in (133);
it applies complex constraints. The first line of the entry states that the accusative
marker kao assigns the value accusative to the CASE attribute of the mother node. The
second line states that this f-structure is the object OBJ of some outer f-structure
found by traversing inside-out. Lines 3 to 7 constrain the verb valency attribute
V-VAL to the value 2 or 4. Lines 9 and 10 state that the accusative case can be
assigned if the semantic concept of the noun is animate and its class is proper. Lines
13 and 14 describe another possibility: for an animate or thing object the accusative
case can also be used, but only together with the further constraint
(K SPEC) =c definite, which checks for the presence of a specifier.
(133) kao
(K CASE) = accusative
(OBJ ($) K)
{
((OBJ K) V-VAL) =c 2
|
((OBJ K) V-VAL) =c 4
}
{
(K N-SEM N-CONCEPT) =c animate
(K N-SEM N-CLASS) =c proper
|
{
(K N-SEM N-CONCEPT) =c thing
| (K N-SEM N-CONCEPT) =c animate
}
(K SPEC) =c definite
}
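The nested choices of entry (133) amount to a conjunction of one valency test with a two-way disjunction on the noun. The sketch below spells this out in Python; the function name licenses_accusative and the dictionary layout are assumptions made for illustration, not part of the original grammar.

```python
def licenses_accusative(noun, clause):
    """Evaluate the constraints of entry (133) for an object noun marked with kao."""
    sem = noun.get("N-SEM", {})
    concept = sem.get("N-CONCEPT")
    # { ((OBJ K) V-VAL) =c 2 | ((OBJ K) V-VAL) =c 4 }
    valency_ok = clause.get("V-VAL") in (2, 4)
    # First choice: proper animate noun.
    proper_animate = concept == "animate" and sem.get("N-CLASS") == "proper"
    # Second choice: thing or animate noun that carries a definite specifier.
    specific = concept in ("thing", "animate") and noun.get("SPEC") == "definite"
    return valency_ok and (proper_animate or specific)


akmal = {"N-SEM": {"N-CONCEPT": "animate", "N-CLASS": "proper"}}
book = {"N-SEM": {"N-CONCEPT": "thing"}}
this_book = {"N-SEM": {"N-CONCEPT": "thing"}, "SPEC": "definite"}
transitive = {"V-VAL": 2}
print(licenses_accusative(akmal, transitive))      # True
print(licenses_accusative(book, transitive))       # False
print(licenses_accusative(this_book, transitive))  # True
```

The second and third calls mirror the contrast between (130) and (131): the bare object fails, while the object with an explicit specifier passes.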
The accusative case of Urdu needs a more detailed analysis to describe the usage of
the marker kao. The example in (134) shows the postpositional use of kao, which
typically follows an infinitive. This postpositional kao can be replaced with another
equivalent and more popular Urdu postposition, key leeey, as shown in (135). Both kao
and key leeey are translated by the preposition 'for' in English. Although kao is
sometimes acceptable after an infinitive, key leeey is normally preferred, as it is
unambiguous and more frequently used.
(134) ? Haamed=ney anjom=kao ketaab paRh-ney=kao d-ee
Hamid-sg.m=erg Anjom=dat book.sg.fem=nom read-inf.pl=pp give-perf.sg.fem
Hamid gave Anjom the book for reading.
(135) Haamed=ney anjom=kao ketaab paRh-ney=key leeey d-ee
Hamid-sg.m=erg Anjom=dat book.sg.fem=nom read-inf.pl=pp give-perf.sg.fem
Hamid gave Anjom the book for reading.
(136) Haamed=ney anjom=kao ketaab paRh-ney d-ee
Hamid-sg.m=erg Anjom=dat book.sg.fem=nom read-inf.pl AUX-permissive
Hamid let Anjom read the book.
Sentence (136) is similar to (134), but the verb d-ee differs in meaning and argument
structure between the two. In (134), d-ee means 'give' and requires three arguments: a
giver, a recipient and a gift. In (136), d-ee means 'let' and requires three
arguments: one who allows an action, one who is allowed, and the action that is
allowed.
7.4 Classification of Cases Marked with sey
Nouns (or noun phrases) marked with the case marker sey are mostly characterized as
instrumental case in the Urdu and Hindi literature (Mohanan 1994; Butt and King
2002). However, sey is highly versatile, and nouns marked with it occupy different
grammatical relations. As a case marker, sey fills the subject, object, indirect
subject and oblique argument roles that are controlled by the verb argument
structure; as a postposition, it appears in a postpositional phrase or in an
adverbial phrase that acts as an adjunct to the verb phrase. Sometimes sey is used for
comparison between two things, and sometimes it is used with adjectives. The uses of
sey may therefore be classified according to its function in these various roles,
instead of treating it as a bare instrumental case marker in all cases. In the
following sections, this case marker and/or postposition is modeled for the different
situations.
7.4.1 Agentive Case
An animate noun (or noun phrase) marked with the case marker sey is categorized as
agentive case, and it occupies the subject or indirect subject role in the verb's
argument structure. Sentence (137) shows an agent in the passive voice, where the
focus is on the object 'letter', which appears in the nominative case; the
gender-number agreement of the verb is therefore with the object. In Urdu, the agent
in the active voice is assigned nominative or ergative case, while in the passive
voice it changes to agentive case. In an English passive sentence, the subject and
object positions are interchanged, and it is therefore assumed that the object (in the
active voice) has become the subject (in the passive voice). In Urdu, by contrast, the
positions of the subject and the object are relatively less important because of its
free phrase order.
(137) xatt laRk-ey=sey lekh-aa ga-yaa
letter.sg.masc=nom boy-sg.masc=agent write-perf.sg.masc go-perf.sg.masc
A letter was written by a boy.
(138) xatt (X=sey) lekh-aa ga-yaa
letter.sg.masc=nom (X=agent) write-perf.sg.masc go-perf.sg.masc
A letter was written (by someone).
In the passive sentence (137), in both English and Urdu, the doer of the action is 'a
boy' and the undergoer of the action is 'a letter'; according to the thematic
hierarchy they should therefore fill the subject and object arguments respectively.
However, the analysis becomes troublesome because a well-formed passive sentence can
be produced without an agent, as shown in (138). The analysis of the passive
(majhool) presented in this work assumes that in the passive voice the primary focus
is on the undergoer and the agent becomes secondary, and is therefore sometimes
omitted. It is assumed that the semantic subject is still the agent; if the agent is
omitted from a passive sentence, it is semantically implied, because there is a slot
for the agent in the argument structure of the verb. We cannot assume that an action
has no actor. Therefore, for sentence (138), an unknown agent X is assumed to fill the
writer slot of the verb 'write'.
PRED
SUBJ    [ CASE agent ]   (default value if SUBJ is empty and VOICE is passive)
OBJ     [ CASE nom ]
FOCUS
TENSE   past
VOICE   passive
V-FORM  perfect
V-VAL
This work analyzes the passive by assuming that there is no change in the verb
argument structure. As shown in Figure 7.9, the FOCUS attribute points to OBJ, and a
default SUBJ is assumed if the subject is omitted in a passive sentence. More evidence
is found in the negated form of the passive sentence (137), shown in (139); another
negated passive is shown in (140). These examples express the inability of the agent
marked with sey to perform an action.
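The default-subject assumption can be illustrated with a short Python sketch. It assumes f-structures are plain dictionaries; the function name resolve_passive and the placeholder predicate "X" for the unknown agent are hypothetical choices that follow the notation of (138).

```python
import copy


def resolve_passive(fs):
    """Return a copy of fs with the passive defaults of the analysis applied."""
    out = copy.deepcopy(fs)            # leave the input f-structure untouched
    if out.get("VOICE") == "passive":
        if not out.get("SUBJ"):
            # Omitted agent: assume an unknown agent X in the agent case.
            out["SUBJ"] = {"PRED": "X", "CASE": "agent"}
        # FOCUS points to the same f-structure as OBJ.
        out["FOCUS"] = out.get("OBJ")
    return out


letter_written = {
    "PRED": "write",
    "VOICE": "passive",
    "OBJ": {"PRED": "letter", "CASE": "nom"},
}
resolved = resolve_passive(letter_written)
print(resolved["SUBJ"])                      # {'PRED': 'X', 'CASE': 'agent'}
print(resolved["FOCUS"] is resolved["OBJ"])  # True
```

The argument structure itself is not changed by the function, in line with the assumption that passivization leaves the verb's arguments intact.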
(139) xatt laRk-ey=sey lekh-aa ga-yaa
letter.sg.masc=nom boy-sg.masc=agentive write-perf.sg.masc go-perf.sg.masc
A boy was not able to write a letter.
(140) laRk-ey=sey khaanaa khaa-yaa naheeN jaa-taa
boy-sg.masc=agent food.sg.masc=nom eat-perf.sg.masc not go-perf.sg.masc
The boy is not able to eat food.
(141) sey
(K CASE) = agent
(K N-SEM N-CONCEPT) =c animate
((SUBJ K) V-VAL) =c 2
{
((SUBJ K) NEG) =c +
((SUBJ K) TNS-ASP MOOD) =c inability
|
((SUBJ K) TNS-ASP VOICE) =c passive
}
(SUBJ K)
There is another agentive form of the animate noun, which appears in the argument
structure of causative verb forms; there the noun marked with sey appears as an
agent, as discussed in more detail in section 7.6. The LFG-based lexical entry for the
case marker sey is shown in (141). The first line of the entry marks the CASE as
agentive. The second line constrains the semantic concept of the noun to be animate.
The third line constrains the verb valency to 2, which means that this case is
assigned with transitive verbs. Lines 4 to 9 require that the sentence is either in
the passive voice or is negated with the inability mood. The last line states that
the marked noun becomes the SUBJ of the outer predicate.
7.4.2 Participant Case
(143) Haamed=ney Hameed=sey baat k-ee
Hamid=erg Hameed=participant talk=nom do.perf.sg.fem
Hamid talked with Hameed.
(144) Haamed=ney Hameed=sey madad l-ee
Hamid=erg Hameed=participant help=nom take-perf.sg.fem
Hamid took help from Hameed.
(145) Haamed=ney Hameed=sey waAdah kee-aa
Hamid=erg Hameed=participant promise=nom do-perf.sg.masc
Hamid made a promise to Hameed.
(146) Haamed=ney Hameed=sey sawaal poochh-aa
Hamid=erg Hameed=participant question=nom ask-perf.sg.masc
Hamid asked Hameed a question.
(147) sey
(K CASE) = participant
((OBJ K) SUBJ N-SEM N-CONCEPT) =c animate
((OBJ K) OBJ N-SEM N-CONCEPT) =c animate
(OBJ K)
The lexical entry for the participant case is shown in (147). The first line of the
entry assigns the case participant. The second line constrains SUBJ to be an animate
noun, and the third line constrains OBJ to be an animate noun; in the participant
case, therefore, both the subject and the object nouns are animate. The last line
states that the noun marked as participant case with sey becomes the object of the
predicate. The f-structure of sentence (143) is shown in Figure 7.10.
PRED
SUBJ
OBJ
TENSE
V-FORM
V-VAL
CASE
erg
N-CONCEPT animate
N-SEM
proper
N-CLASS
PRED
'
Hameed
'
CASE
participant
N-CONCEPT animate
N-SEM
proper
N-CLASS
past
perfect
Inanimate nouns (or noun phrases) known in Urdu as instrumental nouns
(aesm-e-aalah) and marked with the case marker sey are categorized as instrumental
case. These nouns are typically used by some agent or actor as an aid to accomplish a
task. Example sentences are given in (148) and (149). Noun phrases in the
instrumental case are oblique grammatical functions and sometimes act as adjuncts to
a sentence. This case is usually translated into English as a prepositional phrase
with the preposition 'with'.
(148) laRk-ey=ney pensel=sey xatt lekh-aa
boy-sg.masc=erg pencil.sg.fem=inst letter write-perf.sg.masc
A boy wrote a letter with the pencil.
(149) maaN=ney chhoor-ee=sey seyb kaat-aa
mother-sg.fem=erg knife-sg.fem=inst apple=nom cut-perf.sg.masc
The mother cut the apple with the knife.
(150) sey
(K CASE) = instrumental
(K N-SEM N-CONCEPT) =c instrument
(OBL-inst K)
The LFG-based lexical entry for the instrumental case is shown in (150); it assigns
the case of the mother noun phrase as instrumental. A constraint ensures that only
nouns whose semantic concept is instrument are assigned this case. The last line
states that the instrumental case fills the oblique argument of the verb's argument
structure. The f-structure of sentence (149) for the instrumental case is shown in
Figure 7.11.
(151) sey
(K CASE) = instrumental
(K N-SEM N-CONCEPT) =c instrument
(K PRED) = 'sey<(K OBJ)>'
(K P-CASE) = sey
(ADJUNCT ($) K)
If the verb argument structure does not allow an instrument, the instrumental phrase
is treated as an adjunct using the lexical entry shown in (151). Line 3 of the entry
makes the instrumental noun phrase the object of the postposition sey. Line 4 sets the
postpositional case attribute P-CASE to sey, marking the phrase as oblique
instrumental. Line 5 makes the postpositional phrase an adjunct to the main predicate.
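The choice between entries (150) and (151) can be pictured as a fallback: the instrument fills the oblique argument when the verb provides one, and is otherwise attached as an adjunct. The Python sketch below is a simplification with hypothetical names (attach_sey_instrument, the verb_takes_instrument flag); a unification-based parser would try both entries rather than branch on a flag.

```python
def attach_sey_instrument(clause, noun, verb_takes_instrument):
    """Attach an instrument noun marked with sey; return the function it fills."""
    if noun.get("N-SEM", {}).get("N-CONCEPT") != "instrument":
        return None                               # =c instrument fails
    if verb_takes_instrument:
        # Entry (150): the noun fills the oblique instrumental argument.
        noun["CASE"] = "instrumental"
        clause["OBL-inst"] = noun
        return "OBL-inst"
    # Entry (151): sey heads a postpositional phrase that joins the adjunct set.
    pp = {"PRED": "sey<OBJ>", "P-CASE": "sey", "CASE": "instrumental", "OBJ": noun}
    clause.setdefault("ADJUNCT", []).append(pp)
    return "ADJUNCT"


knife = {"PRED": "chhoor-ee", "N-SEM": {"N-CONCEPT": "instrument"}}
cut = {"PRED": "cut"}
print(attach_sey_instrument(cut, knife, verb_takes_instrument=True))  # OBL-inst
```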
SUBJ
OBJ
OBL INST
TENSE
V-FORM
V-VAL
N-SEM [ N-CONCEPT animate ]
CASE
nom
CASE
instrumental
past
perfect
Verbs that depict activity related to movement or travel require various inanimate
nouns (or noun phrases), marked with the case marker sey, to convey information about
the means of transportation or vehicle, the path or passage, or the source location.
Sentence (152) shows an example where someone traveled by boarding a vehicle; the
noun representing the vehicle is marked with the case marker sey. If someone travels
on foot without a vehicle, no case marker or postposition is required with the noun
paydal, as shown in (153). Sentence (154) describes a path and (155) describes a
passage followed in a journey.
(152) aos=ney jahaaz=sey safar kee-aa
He/She-sg=erg plane.sg.masc=vehicle travel.sg.masc do-perf.sg.masc
He/She traveled by a plane.
(153) aos=ney paydal safar kee-aa
He/She-sg=erg on foot.sg.masc travel.sg.masc do-perf.sg.masc
He/She traveled on foot.
(154) aos=ney saRak=sey safar kee-aa
He/She-sg=erg road.sg.masc=path travel.sg.masc do-perf.sg.masc
He/She traveled by a road.
(155) woh darwaaz-ey=sey kamrey=meyN aa-ee
He/She-sg=nom door-obl.sg.m=passage room=loc.in come-perf.sg.fem
She came into the room through the door.
(156) woh laahaor=sey aa-yaa hay
He/She-sg=nom Lahore=source come-perf.sg.masc be.pres
He has come from Lahore.
(157) teyl zameen=sey nekal-taa hay
oil-sg-masc=nom earth=source come out-repeat.sg.masc be.pres
The oil comes out from earth. The oil is taken out from underground.
The Urdu cases that describe travel or transport, and that sometimes represent a
path, passage or source as a location, have been illustrated in the examples above.
These cases are usually translated into English with different prepositional phrases
depending on the noun concept, as summarized in the short table below.
Noun Concept    Noun Case           English preposition
vehicle         conveyor            by
path            locative.path       by
passage         locative.passage    through
source          locative.source     from
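The table can be used directly as a lookup when choosing an English preposition for a noun marked with sey. The Python sketch below encodes the four rows; the names SEY_TRAVEL_CASES and translate_sey_phrase are hypothetical and serve only to illustrate the mapping.

```python
# Noun concept -> (noun case, English preposition), copied from the table.
SEY_TRAVEL_CASES = {
    "vehicle": ("conveyor", "by"),
    "path": ("locative.path", "by"),
    "passage": ("locative.passage", "through"),
    "source": ("locative.source", "from"),
}


def translate_sey_phrase(noun_gloss, concept):
    """Return the case label and an English prepositional phrase for noun=sey."""
    case, preposition = SEY_TRAVEL_CASES[concept]
    return case, f"{preposition} {noun_gloss}"


print(translate_sey_phrase("Lahore", "source"))     # ('locative.source', 'from Lahore')
print(translate_sey_phrase("the door", "passage"))  # ('locative.passage', 'through the door')
```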
(158) woh SobaH=sey maqaalah
He/She-sg=nom morning=temporal paper=nom
He has been writing a paper since morning.
(159) woh dao den=sey tomhaaraa aentezzaar kar rahee hay
He/She-sg=nom two days=temporal your=nom wait.root.sg.fem.cont.pres
She has been waiting for you for two days.
(160) woh modat=sey beemaar hay
He/She-sg=nom long=temporal ill=nom be.pres
He/She has been ill for a long time.
(161) sey
(K CASE) = temporal
(K N-SEM N-CONCEPT) =c temporal
(K PRED) = 'sey<(K OBJ)>'
(K P-CASE) = sey
(ADJUNCT ($) K)
An LFG-based lexical entry is shown in (161), which assigns temporal case only to
nouns that bear temporal characteristics. The f-structure of sentence (158) is shown
in Figure 7.12 for the temporal case, where the temporal noun phrase is added as an
adjunct to the f-structure.
PRED
SUBJ
OBJ
ADJUNCT
TENSE
ASPECT
V-VAL
CASE
erg
PERS
3rd
sg
NUM
CASE
nom
CASE
temporal
OBJ
temporal
EPT
N-CONC
N-SEM
[
]
present
progressive
(162) woh jaldee=sey sakool pohanch-ee
He/She-sg=nom hurriedly=adverbial school reach-perf.sg.fem
She reached school hurriedly.
(163) Zafar shaoq=sey sabaq paRh-taa hay
Zafar-sg.m=nom keenly=adverbial lesson read-repeat.sg.m be=pres
Zafar reads the lesson keenly.
(164) mozzafar tawajah=sey caartoon dekh-taa hay
Mozafar-sg.m=nom attentively=adverbial cartoon watch-repeat.sg.m be=pres
Mozafar watches cartoons attentively.
(165) sey
(K CASE) = adverbial
(K N-SEM N-CONCEPT) =c concept
(K PRED) = 'sey<(K OBJ)>'
(K P-CASE) = sey
(ADJUNCT ($) K)
PRED
SUBJ
OBJ
ADJUNCT
TENSE
V-FORM
V-VAL
CASE
erg
PERS
3rd
NUM
sg
CASE
nom
CASE
adverbial
OBJ
-CONCEPT
concept
N
N-SEM
[
]
past
perfect
In the infinitive case, Urdu infinitives (also called verbal nouns) are marked with
sey and sometimes with other markers. Some example sentences with infinitives marked
with sey are shown in (166) to (168). These phrases are normally translated
(166) aosey paRh-ney=sey nafrat hay
He/She=acc/dat read-inf.obl.m=inf hatred=nom be.pres
He/She has hatred for reading.
(167) mojhey ger-ney=sey chaoT lag-ee
I=acc/dat fall-inf.obl.m=inf injury.sg.fem=nom touch-perf.sg.fem
I got injured from falling.
(168) mayN=ney kaamraan=kao baol-ney=sey manA keeaa
I=erg Kamran=acc speak-inf.obl.m=inf forbid-nom
I prohibited Kamran from speaking.
PRED
SUBJ
OBJ
OBL INF
TENSE
V-VAL
PERS
first
sg
NUM
N-SEM [ N-CONCEPT animate ]
acc
1 CASE
P-CASE sey
N-SEM [ N-CONCEPT infinitive ]
SUBJ
1
past
(169) sey
(K CASE) = infinitive
(K PRED) = 'sey<(K OBJ)>'
(K P-CASE) = sey
(ADJUNCT ($) K)
The marker sey is also used in Urdu for comparison between two noun phrases in the
declarative or indicative mood. Two examples are shown in (170) and (171). The
LFG-based lexical entry is shown in (172); it uses a constraint to check that the
semantic concepts of the two nouns being compared are the same. Dissimilar nouns may
not be compared.
(170) yeh jootaa aos=sey behtar hay
this=pro shoe=nom that.pro=comp better AUX.pres
This shoe is better than that (shoe).
(171) Zafar mozzafar=sey lambaa hay
Zafar=nom Mozafar=comp taller AUX.pres
Zafar is taller than Mozafar.
(172) sey
(K CASE) = comparison
((OBJ K) SUBJ N-SEM N-CONCEPT) = ((OBJ K) OBJ N-SEM N-CONCEPT)
(OBJ ($) K)
(174) NP
(175) NP
(176) NP
(177) NP
a a
laRk-ey k-ee
boy-sg.obl.masc
ketaab
PM-sg.fem
a a
gaaR-ee k-aa
car-sg.fem PM-sg.masc
taal-aa
lock.sg.masc
a a
gaaR-ee k-ey
car-sg.fem PM-pl.masc
taal-ey
lock.pl.masc
a
*
laRk-ey
boy-sg.obl.masc
k-ee
PM-sg.fem
a*
gaaR-ee k-aa
car-sg.fem PM-sg.masc
book.sg.fem
NP
NP
N
laRk-ey
PM
k-ee
NP
NP
NP
N
ketaab
N
laRk-ey
CM
ney
Figure 7.15 shows the phrase structures of the possession marker (PM) and the case
marker (CM). To make a well-formed noun phrase, a possession marker requires two noun
phrases, one on its left and one on its right, while a case marker requires only one
noun phrase, which precedes it. Using a possession marker as a case marker results in
phrases like those shown in (176) and (177), which cannot be used where a noun phrase
is required. Such phrases are incomplete noun phrases and need another noun phrase
for completion. In other words, a possession marker has valency for combining with
two noun phrases, while a case marker has valency for combining with one noun phrase.
Figure 7.16 shows HPSG-based lexical entries of the possession markers kaa, kee and
key.
(178) kaa
kee
key
The LFG-based lexical entries for the Urdu possession markers kaa, kee and key are
shown in (178); each requires a possessor noun phrase and a possessee noun phrase in
its argument structure, with associated constraints. The LFG-based phrase structure
rule is shown in (179) and can be applied recursively. The rule checks that the form
of the first noun phrase is oblique. The first NP is followed by a PM. The second NP
passes all of its characteristics, such as number, gender, case, form and other
semantic properties, to the mother NP. Figure 7.17 shows the f-structure of a
possessive noun phrase.
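The behaviour of rule (179) together with the agreement values in Figure 7.16 can be sketched in Python as follows. The table POSSESSION_MARKERS and the function possessive_np are hypothetical names; the agreement values (kaa for singular masculine, key for plural masculine, kee for feminine in either number) are read off Figure 7.16, and the sketch assumes the marker agrees with the possessee.

```python
# (gender, number) of the possessee -> possession marker, after Figure 7.16.
POSSESSION_MARKERS = {
    ("masc", "sg"): "kaa",
    ("masc", "pl"): "key",
    ("fem", "sg"): "kee",
    ("fem", "pl"): "kee",
}

INHERITED = ("NUM", "GEND", "CASE", "N-FORM", "N-SEM")


def possessive_np(possessor, possessee):
    """Build the mother NP of rule (179) from a possessor and a possessee NP."""
    if possessor.get("N-FORM") != "oblique":
        raise ValueError("the possessor NP must be in the oblique form")
    marker = POSSESSION_MARKERS[(possessee["GEND"], possessee["NUM"])]
    # The mother NP inherits its characteristics from the second (possessee) NP.
    mother = {key: possessee[key] for key in INHERITED if key in possessee}
    mother.update({"PM": marker, "POSSESSOR": possessor, "POSSESSEE": possessee})
    return mother


boy = {"PRED": "laRk-ey", "N-FORM": "oblique"}
book = {"PRED": "ketaab", "GEND": "fem", "NUM": "sg", "CASE": "nom"}
np = possessive_np(boy, book)
print(np["PM"], np["GEND"], np["NUM"])  # kee fem sg
```

Because the result is itself an NP dictionary, it can serve as the possessor or possessee of another call, which mirrors the recursive use of the rule.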
HEAD
(a)
VAL
word
PHON
HEAD
(b)
VAL
word
PHON
HEAD
(c)
VAL
possessionmarker
NUM sg
AGR
1
GEND masc
NP
SPR
FORM
obl
NP
COMPS AGR 1
kaa
kee
possessionmarker
NUM
|
sg
pl
AGR
GEND
fem
NP
SPR
FORM
obl
NP
COMPS
AGR 1
key
possessionmarker
pl
NUM
AGR
GEND masc
NP
SPR
FORM obl
NP
COMPS
AGR 1
(179) NP
NP
( N-FORM ) =c oblique
PM
NP
(
(
(
(
(
NUM ) = ( NUM )
GEND ) = ( GEND )
CASE ) = ( CASE )
N-FORM ) = ( N-FORM )
N-SEM ) = ( N-SEM )
PRED
POSSESSOR
POSSESSEE
CASE
GEND
NUMB
N-FORM
N-SEM
CASE
nom
N-FORM oblique
[ N-CONCEPT animate ]
N-SEM
' ketaab '
PRED
CASE
nom
GEND
fem
sg
NUMB
N-FORM nom
[ N-CONCEPT thing ]
N-SEM
nom
fem
sg
nom
[ N-CONCEPT thing ]
Urdu and Hindi are known to form causatives morphologically, in contrast to English,
which uses verbs such as 'make', 'get', 'have', 'help' or 'let' idiomatically to
express causative structures. The causative verb forms (or transitivized verb forms)
in Urdu are normally derived from intransitive and transitive verb roots by adding
the suffixes -aa and -waa. Adding these suffixes to the root form of a verb forms the
stems of new verbs. These stems are morphologically productive like verb roots, which
have been described in Chapter 4 on verb morphology. The analysis presented here
assumes that causativization is normally a valency-increasing morphological process
in Urdu, which changes not only the argument structure of the verb but also the
meaning conveyed. The formation of higher-valency causative argument structures from
univalent and bivalent verbs can be seen in the examples presented in this section.
Example (180) shows a univalent verb ger-naa (to fall), which requires an unergative
subject. Causative form 1 of the verb is ger-aa-naa (to make someone fall), which is a
bivalent verb, as shown in (181). It requires an ergative agent for the perfect verb
form and a nominative agent otherwise. The verb ger-aa-naa requires an accusative
object if the object is animate and a nominative object otherwise. Causative form 2 of
the verb is ger-waa-naa (to make someone fall through someone), which is a trivalent
verb, as shown in (182).
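The valency-increasing view of causativization can be illustrated with a small Python sketch. It assumes that causative form 1 adds one argument and causative form 2 adds two, which is inferred from the examples in this section (ger-naa 1, ger-aa-naa 2, ger-waa-naa 3) rather than stated as a general rule; the function name causativize is hypothetical, and stem vowel changes are not modelled.

```python
CAUSATIVE_SUFFIX = {1: "aa", 2: "waa"}


def causativize(root, valency, form):
    """Return the causative infinitive of root and its valency.

    root    -- verb root in the transliteration of the text, e.g. "ger"
    valency -- valency of the base verb
    form    -- 1 for the -aa causative, 2 for the -waa causative
    """
    if form not in CAUSATIVE_SUFFIX:
        raise ValueError("form must be 1 or 2")
    infinitive = f"{root}-{CAUSATIVE_SUFFIX[form]}-naa"
    return infinitive, valency + form   # assumed: form n adds n arguments


print(causativize("ger", 1, 1))  # ('ger-aa-naa', 2)
print(causativize("ger", 1, 2))  # ('ger-waa-naa', 3)
```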
(180) Haamed ger-aa
Hamid.sg.m=nom fall.perf.sg.m
Hamid fell (down).
(181) Hameed=ney Haamed=kao ger-aa-yaa
Hameed.sg.m=erg Hamid.sg.m=acc fall-make.caus1.perf.sg.m
Hameed caused Hamid to fall (down).
(182) Hameed=ney Haamed=kao aeHmad=sey ger-waa-yaa
Hameed=erg Hamid=acc Ahmad=agent fall-make.caus2.perf.sg.m
Hameed engaged Ahmad to cause Hamid to fall (down).
(183) Hameed=ney Haamed=kao (X=sey) ger-waa-yaa
Hameed=erg Hamid=acc (X=agent) fall-make.caus2.perf.sg.m
Hameed engaged someone to cause Hamid to fall (down).
It is normally argued that the intermediate agent marked with sey is optional: even
when the presence of an intermediate or logical agent is recognized semantically, it
is assumed that its presence is not dictated by the verb argument structure, because
it is syntactically optional (Mohanan 1990; Bhatt and Embick 2003; Butt 2003).
However, this work assumes the following:
1. The intermediate agent marked with sey is governed by the argument
structure of the causative verb form 2.
2. The intermediate agent marked with sey is not optional; however, it is
sometimes omitted because the intermediate agent is already known in the
discourse, requires the least focus, or cannot be stated precisely.
This work presents the following arguments to support these assumptions:
1. The intermediate agent marked with sey cannot be used with causative
verb form 1. Such use is syntactically wrong, which shows that the
intermediate agent does not behave as a normal adjunct.
2. If the intermediate agent marked with sey is omitted, it is semantically
implied. If two sentences have the same words and the same syntactic
structure, such that one employs causative verb form 1 and the other
causative verb form 2, then the interpretation of the two
(184) Hameed=ney Haamed=kao ger-aa-yaa
Hameed.sg.m=erg Hamid.sg.m=acc fall-make.caus1.perf.sg.m
Hameed didn't cause Hamid to fall (down).
(185) Hameed=ney Haamed=kao (X=sey) ger-waa-yaa
Hameed=erg Hamid=acc (X=agent) fall-make.caus2.perf.sg.m
Hameed didn't engage anyone to cause Hamid to fall (down).
(186) Hameed=ney Haamed=kao aeHmad=sey ger-waa-yaa
Hameed=erg Hamid=acc Ahmad=agent fall-make.caus2.perf.sg.m
Hameed didn't engage Ahmad to cause Hamid to fall (down).
An example of the transitive verb son-ee (to listen to something) is shown in
sentence (187). The examples in (188) and (189) show causative forms of this verb.
Causative form 1 is son-aa-ee, which is trivalent and means to make someone listen to
something recited by the agent himself, as shown in sentence (188). Causative form 2
is son-waa-ee, which is tetravalent and means to make someone listen to something
recited by some intermediate agent (including electronic devices), as shown in (189).
(187) Haamed=ney naZam son-ee
Hamid=erg.sg.m poem=nom.sg.f listen.perf.sg.f
Hamid listened to a poem.
(188) Hameed=ney Haamed=kao naZam son-aa-ee
Hameed.sg.m=erg Hamid.sg.m=acc poem=nom.sg.f listen-make.caus1.perf.sg.f
Hameed made Hamid listen to a poem (recited by Hameed).
(189) Hameed=ney Haamed=kao aeHmad=sey naZam son-waa-ee
Hameed=erg Hamid=acc Ahmad=agent poem=nom listen-make.caus2.perf
Hameed made Ahmad recite and made Hamid listen to a poem (recited by Ahmad).
The following is a pair of intransitive and transitive verbs whose causative
formations become ambiguous: the ditransitive forms are phonetically very close but
have different meanings and argument structures.
baol-naa, to speak           intransitive   bol-waa-naa, to make someone speak something   ditransitive
bolaa-naa, to call/invite    transitive     bol-waa-naa, to make someone call someone      ditransitive
a a a a a
maaN=ney bach.ch-ey=sey
sheyr
bol-waa-yaa
lion
speak-caus2-perf
mother.erg child.sg.m=agent
A mother caused/helped a child to speak lion.
129
(192)
bach.ch-ey=ney   baap=kao     bolaa-yaa
child=erg        father=acc   call-perf
'A child called a father.'
(193)
maaN=ney     bach.ch-ey=sey     baap=kao     bol-waa-yaa
mother.erg   child-sg.m=agent   father=acc   summon.caus2.perf
'A mother asked a child to call a father.'
For the sentence in (191), the agent of the action 'speak' is the child, while the
mother is the causer of the action. Similarly, for the sentence in (193), the agent of
the action 'call' is the child. Therefore, for causative form 1 (formed with the
suffix -aa) the causee is in the accusative case, marked with the case marker kao,
while for causative form 2 (formed with the suffix -waa) the causee is in the agent
case, marked with the case marker sey. The examples (194) to (198), taken from (Butt
and King 2002), show that the accusative case is compatible with causative form 1,
while the agent case is compatible with causative form 2. Using the agent case with
causative form 1, or the accusative case with causative form 2, is incorrect.
Therefore, the case selection for the verb argument is dictated by the causative form.
The causative form 1, kat-aa-yaa, is sometimes used in place of kat-waa-yaa with the
same meaning, but strictly it does not exist in Urdu usage, because kat-aa-naa is not
compatible with the agent case.
(194)
anjom=ney   Saddaf=kao/*sey      khaanaa    khel-aa-yaa
Anjom=erg   Saddaf=dat/*agent    food.nom   eat.caus1.perf
'Anjom made Saddaf eat food (gave Saddaf food to eat).'
(195)
anjom=ney   Saddaf=*kao/sey      paodaa      kat-waa-yaa
Anjom=erg   Saddaf=*acc/agent    plant.nom   cut-caus2-perf
'Anjom had Saddaf cut a/*the plant.'
(196)
anjom=ney   Saddaf=kao   meSaalHah   chakh-aa-yaa
Anjom=erg   Saddaf=acc   spice=nom   taste-caus1-perf
'Anjom had Saddaf taste the seasoning.'
(197)
anjom=ney   Saddaf=sey     meSaalHah   chakh-waa-yaa
Anjom=erg   Saddaf=agent   spice=nom   taste-caus2-perf
'Anjom made Saddaf have someone taste the seasoning.'
'Anjom made Saddaf have herself taste the seasoning.'
(198)
anjom=ney   Saddaf=kao   meSaalHah   chakh-waa-yaa
Anjom=erg   Saddaf=acc   spice.nom   taste-caus2-perf
'Anjom made someone have Saddaf taste the seasoning.'
(199) a. fall
         ger-naa<SUBJ>
         ger-aa-naa<SUBJ, OBJ>
         ger-waa-naa<SUBJ, SUBJ2, OBJ>
      b. laugh
         hans-naa<SUBJ>
         hans-aa-naa<SUBJ, OBJ>
         hans-waa-naa<SUBJ, SUBJ2, OBJ>
      c. taste
         chakh-naa<SUBJ, OBJ>
         chakh-aa-naa<SUBJ, OBJ2, OBJ>
         chakh-waa-naa<SUBJ, SUBJ2, OBJ2, OBJ>
      d. eat
         khaa-naa<SUBJ, OBJ>
         khel-aa-naa<SUBJ, OBJ2, OBJ>
         khel-waa-naa<SUBJ, SUBJ2, OBJ2, OBJ>
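The regularity in the argument structures listed above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in this work: the function name causativize and the list representation of grammatical functions are assumptions made for the example, and only intransitive and transitive bases are covered.

```python
# Hypothetical helper: derive the argument list of a causative verb form
# from the grammatical functions of its base verb.

def causativize(base_args, form):
    """Return the grammatical functions for causative form 1 or 2.

    base_args: ["SUBJ"] (intransitive) or ["SUBJ", "OBJ"] (transitive).
    form: 1 for the suffix -aa, 2 for the suffix -waa.
    """
    if base_args == ["SUBJ"]:
        caus1 = ["SUBJ", "OBJ"]            # a causer is added as the new SUBJ
    elif base_args == ["SUBJ", "OBJ"]:
        caus1 = ["SUBJ", "OBJ2", "OBJ"]    # an indirect object is added
    else:
        raise ValueError("only intransitive and transitive bases are covered")

    if form == 1:
        return caus1
    if form == 2:
        # Causative form 2 adds an intermediate agent (indirect subject).
        return [caus1[0], "SUBJ2"] + caus1[1:]
    raise ValueError("form must be 1 or 2")
```

For instance, causativize(["SUBJ", "OBJ"], 2) yields the four functions listed for chakh-waa-naa.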
Argument           NP Case      Thematic Role
subject            ergative     causer/initiator of the action
indirect subject   agentive     causee/agent of the action
indirect object    dative       beneficiary of the action
object             nominative   object of the action
(200)
maaN=ney     baap=sey    bach.ch-ey=kao   khaanaa    khel-waa-yaa
mother=erg   father=ag   child.obl=dat    food.nom   make eat.caus.perf
'The mother caused (asked, requested) the father to give food to the child.'
(201)
maaN=ney     chamchey=sey   bach.ch-ey=kao   khaanaa    khel-aa-yaa
mother.erg   spoon.inst     child.obl.dat    food.nom   make eat.caus.perf
'The mother gave the food to the child by using a spoon', or
'The mother made the child eat food by means of a spoon.'
The sentences in (200) and (201) have four noun phrases with the same case
markers, and each sentence has one verbal predicate. The tetravalent predicate,
khel-waa-yaa, in (200) accepts all four noun phrases as functional arguments, while
the trivalent predicate, khel-aa-yaa, in (201) accepts only three noun phrases as
functional arguments. The spoon in (201) is used as an instrument. A spoon is not
animate and cannot perform the action of its own will, and therefore it cannot take
the position of an agent. The mother in (201) is the actual performer of the action of
making the child eat food, and she uses the spoon to perform it. The instrumental
argument 'spoon' is optional; it is therefore not controlled by the predicate and acts
as an adjunct. It may also be noted that the phrase baap=sey cannot be used in place
of chamchey=sey in (201), whereas chamchey=sey can be used in (200). Figure 7.18 shows
the f-structure with the tetravalent predicate for the sentence in (200), and Figure
7.19 shows the f-structure with the trivalent predicate for the sentence in (201). The
difference between the indirect subject SUBJ2 and an optional ADJUNCT can be seen in
the f-structures.
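[Figure 7.18: f-structure with the tetravalent predicate for the sentence in (200): PRED; SUBJ (CASE erg); SUBJ2 (CASE agent); OBJ2 (PRED 'bachchah', CASE dat); OBJ (CASE nom).]
[Figure 7.19: f-structure with the trivalent predicate for the sentence in (201): PRED; SUBJ (CASE erg); OBJ2 (PRED 'bachchah', CASE dat); OBJ (CASE nom); ADJUNCT.]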
Figure 7.20 shows f-structure with trivalent predicate for the sentence in (196),
which has all the three required grammatical functions. However, Figure 7.21 shows
f-structure with tetravalent predicate for the sentence in (197), which has three
grammatical functions and the intermediate agent is omitted.
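[Figure 7.20: f-structure with the trivalent predicate for the sentence in (196): SUBJ (N-CLASS proper, CASE erg); OBJ2 (PRED 'Sadaf', N-CLASS proper, CASE dat); OBJ (CASE nom).]
[Figure 7.21: f-structure with the tetravalent predicate for the sentence in (197): PRED subcategorizing for SUBJ, OBJ2, OBJ and SUBJ2; SUBJ (N-CLASS proper, CASE erg); OBJ2 (PRED 'Sadaf', N-CLASS proper, CASE dat); OBJ (CASE nom).]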
The LFG-based lexical entry shown in (202) assigns the agent case to animate
nouns marked with sey. If the verb valency is 2, the noun phrase is used as a subject
in a passive-voice sentence or in the inability mood. For verbs in causative form 2
with valency 3 or 4, the agent case marked with sey can be used as an indirect subject.
(202) sey
      (↑ CASE) = agent
      (↑ N-SEM N-CONCEPT) =c animate
      { ((SUBJ ↑) V-VAL) = 2
        { ((SUBJ ↑) NEG) = +
          ((SUBJ ↑) TNS-ASP MOOD) = inability
        | ((SUBJ ↑) TNS-ASP VOICE) = passive }
        (SUBJ ↑)
      | { ((SUBJ2 ↑) V-VAL) = 3
        | ((SUBJ2 ↑) V-VAL) = 4 }
        ((SUBJ2 ↑) V-FORM) = Caus2
        (SUBJ2 ↑) }
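The licensing conditions in (202) can be paraphrased as a small Python check. This is a simplified sketch, not an LFG implementation: it treats all equations as plain tests rather than distinguishing defining from constraining equations, and the function name sey_is_agent and the flat dictionary holding the verb features are assumptions made for the example.

```python
# Hypothetical paraphrase of the conditions in (202) as a boolean test.

def sey_is_agent(noun_concept, gf, verb):
    """Return True if a sey-marked noun may carry agent case as function gf.

    noun_concept: semantic concept of the noun, e.g. "animate".
    gf: "SUBJ" or "SUBJ2".
    verb: dict with keys such as "V-VAL", "V-FORM", "NEG", "MOOD", "VOICE".
    """
    if noun_concept != "animate":
        return False
    if gf == "SUBJ":
        if verb.get("V-VAL") != 2:
            return False
        inability = verb.get("NEG") == "+" and verb.get("MOOD") == "inability"
        passive = verb.get("VOICE") == "passive"
        return inability or passive
    if gf == "SUBJ2":
        return verb.get("V-VAL") in (3, 4) and verb.get("V-FORM") == "Caus2"
    return False
```

Under these assumptions, an animate noun with sey is accepted as SUBJ2 of a tetravalent causative form 2 verb, while an inanimate noun such as a spoon is rejected and must be analysed differently, for example as an instrumental adjunct.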
7.7 Conclusions
In this Chapter, proposals for handling the syntax of the noun phrase in Urdu have
been presented. The use of semantic and verb-valency features to better resolve the
nominative, ergative, dative and accusative cases has been suggested. A rule for
possession markers has been suggested. Noun semantic features were also found useful
for differentiating the cases marked with sey. The agentive case marked with sey for
animate nouns is also used to propose the concept of an indirect subject for the
causative form 2 verbs in Urdu. A method for causative verbs in Urdu based on
morphological valency alternation has been proposed (Butt and King 2006), which
enables generation of a new argument structure for a verb based on causative
morphemes.
Chapter 8
MODELING URDU VERBAL SYNTAX
BY IDENTIFYING
TENSE, ASPECT AND MOOD FEATURES
A verb is a word that is used to describe an action (doing), state (being), or
occurrence (happening). A verb not only carries information about the
argument-structure, but also contains information about tense, aspect, mood and voice.
The argument-structure of a verb describes the number and type of phrases that may be
required to make a well-formed sentence. The tense indicates the time of the action,
state, or occurrence in relation to the time of utterance. The aspect expresses a
feature of the action without reference to time, such as completion, repetition or
duration. The mood of a verb expresses a feature representing the type of an action,
such as command, request, question, wish, or conditionality. The voice expresses the
focus (or topic) of a sentence, e.g., in the active voice the focus is on the subject,
while in the passive voice the focus is on the object.
A verb, in some languages, uses the inflectional affixes to represent tense,
aspect, and mood. In some other languages, it uses tense, aspectual and modal
auxiliaries. Urdu uses both verb auxiliaries and affixes to represent tense, aspect and
mood. As described in Chapter 3, a verb in Urdu can have 60 forms having different
agreement features, while in English a verb has only five forms. Therefore, in Urdu,
the verb-form dependency is relatively complex as compared to the dependency in
English. In addition to this, in Urdu, sometimes a verb form depends on the gender
and number of the object, and sometimes depends on the gender and number of
the subject. Similarly, the auxiliaries also change their form to comply with various
attributes.
In this Chapter, the verbal structure in Urdu is modeled by assembling tense,
aspect and mood features from the verbal morphemes and auxiliaries used in a
sentence. The agreement tables that traditionally appear in Urdu grammars are
presented here to gather agreement information, and based on those tables the
information associated with the various verb morphemes and auxiliaries is collected.
Phrase structure rules, c-structures and f-structures are then proposed to describe
the tense, aspect and mood variations in Urdu.
The verbs in Urdu require agreement with a noun phrase for various attributes,
such as gender, number, person, case and honor form. All nouns in Urdu carry a gender
attribute, with which the verb-forms must agree. To show the agreement dependency
involved in the tense system, Urdu grammars traditionally show the different sentence
formations for a particular tense in a tabular form, normally known as a gardaan or a
paradigm of a tense. The present-repetitive-tense paradigm, which requires
subject-agreement, is shown in Table 8.1, and the present-perfect-tense paradigm,
which requires object-agreement, is shown in Table 8.2. The gender (GEND), number
(NUM), person (PERS) and honor-form (H-FORM) attributes of the subject are shown in
the columns of a table, while the gender and number variation of the object is shown
in sub-tables. Urdu has honor attributes associated with second person pronouns. The
pronoun too 'you' is used either in a frank manner with a friendly tone or in rude
speech with an impolite tone. The pronoun tom 'you' is the formal (or normal) way to
talk with colleagues or with a familiar person. The pronoun aap 'you' expresses a
polite mood, even with younger persons, or is used to show respect. The second person
pronoun is usually singular; however, for plural reference, phrases such as tom laog
'you people', tom saarey 'you all', tom sab 'you all', aap laog 'you people', aap
saarey 'you all', and aap sab 'you all' are used. This means that too appears only as
a singular pronoun, but tom and aap can be used as plural pronouns.
Table 8.1: A Present-Repetitive-Tense Paradigm for a Transitive Verb
Having Subject-Agreement
(a) Singular Feminine Object, book (ketaab)
[The Transliteration and Urdu Script columns are not recoverable. Each sub-table lists the following subject attributes.]

PERS   NUM   H-FORM             GEND
1st    sg                       masc, fem
1st    pl                       masc, fem
2nd    sg    frank, rude        masc, fem
2nd    sg    formal, familiar   masc, fem
2nd    sg    polite, respect    masc, fem
3rd    sg                       masc, fem
3rd    pl                       masc, fem
By observing the present-repetitive-tense paradigm shown in Table 8.1, it may
be seen that the verb-form and the auxiliary-form remain the same for objects having
different number and/or gender attributes. Therefore, for the present-repetitive
tense, the verb-form and the auxiliary-form do not require agreement with the number
and gender of an object. The verb-form and the auxiliary-form agree with the highest
nominative argument of the verb. The subjects of the sentences in Table 8.1 are
nominative, and the verb therefore agrees with the subject.
Table 8.2: A Present-Perfect-Tense Paradigm for a Transitive Verb
Having Object-Agreement
(a) Singular Feminine Object, book (ketaab)
[The Transliteration and Urdu Script columns are not recoverable. Each sub-table lists the following subject attributes.]

PERS   NUM   H-FORM   GEND
1st    sg             masc/fem
1st    pl             masc/fem
2nd    sg    frank    masc/fem
2nd    sg    formal   masc/fem
2nd    sg    polite   masc/fem
3rd    sg             masc/fem
3rd    pl             masc/fem
[Table 8.3: pattern of the present repetitive tense for an optional object (obj) and a verb root/stem (vs); the Urdu Script column is not recoverable]

Transliteration         GEND   PERS   NUM   H-FORM
mayN (obj) vs-taa hooN  masc   1st    sg
mayN (obj) vs-tee hooN  fem    1st    sg
ham (obj) vs-tey hayN   masc   1st    pl
ham (obj) vs-tee hayN   fem    1st    pl
too (obj) vs-taa hay    masc   2nd    sg    frank
too (obj) vs-tee hay    fem    2nd    sg    frank
tom (obj) vs-tey hao    masc   2nd    sg    formal
tom (obj) vs-tee hao    fem    2nd    sg    formal
aap (obj) vs-tey hayN   masc   2nd    sg    polite
aap (obj) vs-tee hayN   fem    2nd    sg    polite
woh (obj) vs-taa hay    masc   3rd    sg
woh (obj) vs-tee hay    fem    3rd    sg
woh (obj) vs-tey hayN   masc   3rd    pl
woh (obj) vs-tee hayN   fem    3rd    pl
Table 8.3 and Table 8.4 show the patterns for the formation of the
present-repetitive tense and the past-repetitive tense, respectively. In these tenses,
the verb-form does not depend on the object. The verb-form depends on the person,
number and gender of the subject. Both the verb morpheme and the auxiliary verb change
their form to agree in number, gender, person and honor form with the subject.
The same sentence formation pattern can be used for intransitive and transitive
verbs. For an intransitive verb, the object is omitted from the pattern, and for a
transitive verb, an object having any gender and number attributes can be placed.
Table 8.5 shows the pattern for the formation of the future tense. The agreement of
the verb-form and auxiliary-form in the future tense is also with the gender, number
and person of a nominative subject.
Table 8.4: The Pattern of the Past Repetitive Tense
for an Optional Object (obj) and a Verb Root/Stem (vs)
[The Urdu Script column is not recoverable.]

Transliteration          GEND   PERS   NUM   H-FORM
mayN (obj) vs-taa thaa   masc   1st    sg
mayN (obj) vs-tee thee   fem    1st    sg
ham (obj) vs-tey they    masc   1st    pl
ham (obj) vs-tee theeN   fem    1st    pl
too (obj) vs-taa thaa    masc   2nd    sg    frank
too (obj) vs-tee thee    fem    2nd    sg    frank
tom (obj) vs-tey they    masc   2nd    sg    formal
tom (obj) vs-tee theeN   fem    2nd    sg    formal
aap (obj) vs-tey they    masc   2nd    sg    polite
aap (obj) vs-tee theeN   fem    2nd    sg    polite
woh (obj) vs-taa thaa    masc   3rd    sg
woh (obj) vs-tee thee    fem    3rd    sg
woh (obj) vs-tey they    masc   3rd    pl
woh (obj) vs-tee theeN   fem    3rd    pl
[Table 8.5: pattern of the future tense. The Transliteration and Urdu Script columns survive only in part; the future auxiliary and the subject attributes of each row are listed below.]

PERS   NUM   H-FORM   GEND   Future auxiliary
1st    sg             masc   gaa
1st    sg             fem    gee
1st    pl             masc   gey
1st    pl             fem    gee
2nd    sg    frank    masc   gaa
2nd    sg    frank    fem    gee
2nd    sg    formal   masc   gey
2nd    sg    formal   fem    gee
2nd    sg    polite   masc   gey
2nd    sg    polite   fem    gee
3rd    sg             masc   gaa
3rd    sg             fem    gee
3rd    pl             masc   gey
3rd    pl             fem    gee
It may also be observed from the above tables that the future and past auxiliaries
do not depend on the person attribute of a subject, while the present auxiliaries do.
In the future tense, the person attribute is marked on the verb morpheme. In the past
tense, the person attribute is marked neither on the verb morpheme nor on the
auxiliary. Such irregular variations in the agreement dependency for verb morphemes
and auxiliaries are shown in Table 8.6 and Table 8.7, respectively.
Table 8.6: The Dependence of Verb Morphemes for the Subject Agreement

Morpheme   Person     Number   Gender      Tense           H-Form
taa        1st, 3rd   sg       masc        present, past
tee        1st, 3rd   sg, pl   fem         present, past
tey        1st, 3rd   pl       masc        present, past
taa        2nd        sg       masc        present, past   frank
tee        2nd        sg, pl   fem         present, past
tey        2nd        sg, pl   masc        present, past   formal, polite
ooN        1st        sg       masc, fem   future
ey         3rd        sg       masc, fem   future
eyN        1st, 3rd   pl       masc, fem   future
ey         2nd        sg       masc, fem   future          frank
ao         2nd        sg       masc, fem   future          formal
eyN        2nd        sg, pl   masc, fem   future          polite
Table 8.7: The Dependence of Auxiliary Verb for the Subject Agreement

Auxiliary   Tense     Gender      Number   Person     H-Form
hooN        present   masc, fem   sg       1st
hay         present   masc, fem   sg       3rd
hayN        present   masc, fem   pl       1st, 3rd
hay         present   masc, fem   sg       2nd        frank
hao         present   masc, fem   sg, pl   2nd        formal
hayN        present   masc, fem   sg, pl   2nd        polite
thaa        past      masc        sg       1st, 3rd
they        past      masc        pl       1st, 3rd
thee        past      fem         sg       1st, 3rd
theeN       past      fem         pl       1st, 3rd
thaa        past      masc        sg       2nd        frank
they        past      masc        pl       2nd        formal, polite
thee        past      fem         sg       2nd
theeN       past      fem         pl       2nd
gaa         future    masc        sg       1st, 3rd
gey         future    masc        pl       1st, 3rd
gee         future    fem         sg, pl   1st, 3rd
gaa         future    masc        sg       2nd        frank
gey         future    masc        sg, pl   2nd        formal, polite
gee         future    fem         sg, pl   2nd
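The subject-agreement patterns of the present and past repetitive tenses (Tables 8.3 and 8.4) can be sketched as a small Python function. It is an illustration only: the function name repetitive_forms and the feature encoding are assumptions made for the example, and the future tense is left out because its person marking sits on the verb morpheme rather than on the auxiliary.

```python
# Hypothetical sketch of the repetitive-tense patterns for a nominative subject.

def repetitive_forms(person, number, gender, h_form=None, tense="present"):
    """Return (verb suffix, tense auxiliary) for a nominative subject.

    person: 1, 2 or 3; number: "sg" or "pl"; gender: "masc" or "fem";
    h_form: "frank", "formal" or "polite" (required for second person).
    """
    # Second-person formal and polite pronouns pattern with the plural forms.
    plural_like = number == "pl" or (person == 2 and h_form in ("formal", "polite"))

    # Verb suffix: feminine is always -tee; masculine is -taa or -tey.
    if gender == "fem":
        suffix = "tee"
    else:
        suffix = "tey" if plural_like else "taa"

    if tense == "present":
        # Present auxiliaries depend on person, not on gender.
        if person == 1:
            aux = "hayN" if number == "pl" else "hooN"
        elif person == 2:
            aux = {"frank": "hay", "formal": "hao", "polite": "hayN"}[h_form]
        else:
            aux = "hayN" if number == "pl" else "hay"
    elif tense == "past":
        # Past auxiliaries depend on gender and number, not on person.
        if gender == "fem":
            aux = "theeN" if plural_like else "thee"
        else:
            aux = "they" if plural_like else "thaa"
    else:
        raise ValueError("only present and past repetitive tenses are covered")
    return suffix, aux
```

The single plural_like flag reflects the observation that tom and aap take the same verb and past-auxiliary forms as plural subjects, even when they refer to one person.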
Table 8.6 shows the agreement of verb morphemes with reference to the person,
number and gender for the present-repetitive, past-repetitive and future tenses, while
Table 8.7 shows the agreement of the auxiliary (helping) verbs with reference to the
person, number and gender for the same tenses. In both of these tables, the agreement
is with the nominative subject. Table 8.8 shows the patterns for the present-perfect
and past-perfect tenses. The object's gender (GEND) and number (NUM) attributes shown
in the columns require agreement with the verb-form and auxiliary-form. The subject
case should be ergative. This dependence of the verb morpheme and auxiliary-form is
summarized in Table 8.9.
Table 8.8: The Pattern of the (a) Present Perfect Tense (b) Past Perfect Tense
for a Subject (sub), an Object (obj) and a Verb Root/Stem (vs)
[The Urdu Script column is not recoverable.]

(a)
Transliteration         GEND   NUM
sub obj vs-aa hay       masc   sg
sub obj vs-ee hay       fem    sg
sub obj vs-ey hayN      masc   pl
sub obj vs-ee hayN      fem    pl

(b)
Transliteration         GEND   NUM
sub obj vs-aa thaa      masc   sg
sub obj vs-ee thee      fem    sg
sub obj vs-ey they      masc   pl
sub obj vs-ee theeN     fem    pl
[Table 8.9: dependence of (a) the verb morpheme and (b) the auxiliary form for object agreement]

(a)
Morpheme   Object Number   Object Gender   Subject Case   Aux
aa         sg              masc            erg
ee         sg, pl          fem             erg
ey         pl              masc            erg
eeN        pl              fem             erg            no

(b)
Auxiliary Form   Tense     Gender      Number
hay              present   masc, fem   sg
hayN             present   masc, fem   pl
thaa             past      masc        sg
they             past      masc        pl
thee             past      fem         sg
theeN            past      fem         pl
The agreement dependency between the verb-form and the noun phrases presented
in the above tables is summarized as general rules in (203). Subject agreement is
observed when the subject bears the nominative case and the verb-form is repetitive or
subjunctive. The present and past tenses appear with the repetitive verb-form, while
the future tense appears with the subjunctive verb-form. The verb for subject
agreement can be intransitive, transitive or ditransitive; therefore, the two object
noun phrases are optional for subject agreement. For object agreement, however, the
object noun phrase is not optional, and therefore object agreement is observed only
for transitive and ditransitive verbs. Moreover, for object agreement, the object must
be in the nominative case. If it is in the accusative case, the verb agrees with none
of the noun phrases and the default singular-masculine verb-form is used.
(203) S_subject-agreement → NP_SUBJ-nominative (NP_OBJ2) (NP_OBJ) V_narrative-form AUX*
      S_subject-agreement → NP_SUBJ-nominative (NP_OBJ2) (NP_OBJ) V_subjunctive-form (AUX_future)
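The choice of the agreement controller described above can be sketched as follows. This is a minimal illustration under stated assumptions: noun phrases are represented as dictionaries with CASE, NUM and GEND keys, the function name verb_agreement is hypothetical, and the check on the verb-form (repetitive, subjunctive or perfective) is left out for brevity.

```python
# Hypothetical sketch: which noun phrase does the verb form agree with?

DEFAULT_AGREEMENT = {"NUM": "sg", "GEND": "masc"}

def verb_agreement(subj, obj=None):
    """Return the NUM/GEND features that the verb form should carry.

    subj, obj: dicts with the keys "CASE", "NUM" and "GEND";
    obj is None for an intransitive verb.
    """
    if subj["CASE"] == "nom":
        controller = subj                 # subject agreement
    elif obj is not None and obj["CASE"] == "nom":
        controller = obj                  # object agreement
    else:
        # No nominative argument: default singular-masculine verb form.
        return dict(DEFAULT_AGREEMENT)
    return {"NUM": controller["NUM"], "GEND": controller["GEND"]}
```

With an ergative subject and a nominative feminine object such as ketaab, the function returns the object's features; with an accusative object it falls back to the default form.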
The dependence for subject and object agreement shown in the above tables
and rules can be directly encoded into LFG-based lexical entries using functional
equations. For example, the lexical entry for the verb xareed-taa, 'buy', is shown in
(204), and the entry for the auxiliary hooN is shown in (205). Using the rule shown in
(206), the verb, V, and the auxiliary, AUX, can combine to form V1, the f-structure of
which is shown in Figure 8.1. It contains constraints on the number, gender, person
and case of the subject.
(204) xareed-taa
      [the five functional equations of this entry are not recoverable]

(205) hooN   AUX
      (↑ TENSE) = present
      (↑ V-FORM) =c repetitive
      (↑ SUBJ NUM) =c sg
      { (↑ SUBJ GEND) =c masc
      | (↑ SUBJ GEND) =c fem }
      (↑ SUBJ CASE) =c nom
      (↑ SUBJ PERS) =c 1st

(206) V1 → V (AUX)*

[Figure 8.1: f-structure of V1, with V-FORM narrative and with constraints on SUBJ: NUM sg, GEND masc, PERS 1st, CASE nom.]
The f-structure for the auxiliary hooN is simpler to form than that of the
other auxiliaries, because hooN appears only for first-person subject-agreement and
therefore has fewer restrictions. The lexical entry for the auxiliary hay, shown
in (207), is more complex, because it requires agreement sometimes with the subject
and sometimes with the object, along with other constraints.
(207) hay   AUX
      (↑ TENSE) = present
      { { { (↑ SUBJ NUM) =c sg
            { (↑ SUBJ GEND) =c masc
            | (↑ SUBJ GEND) =c fem }
            (↑ SUBJ PERS) =c 3rd }
        | { (↑ SUBJ H-FORM) = frank
            { (↑ SUBJ GEND) =c masc
            | (↑ SUBJ GEND) =c fem }
            (↑ SUBJ PERS) =c 2nd } }
        (↑ SUBJ CASE) =c nom
        (↑ V-FORM) =c repetitive
      | { (↑ OBJ NUM) =c sg
          { (↑ OBJ GEND) =c masc
          | (↑ OBJ GEND) =c fem }
          (↑ SUBJ CASE) =c erg }
        (↑ V-FORM) =c perfect }
(208) -taa hay   VM
      (↑ TENSE) = present
      (↑ SUBJ GEND) =c masc
      (↑ SUBJ CASE) =c nom
      (↑ V-FORM) = repetitive
      { (↑ SUBJ NUM) =c sg
        (↑ SUBJ PERS) =c 3rd
      | (↑ SUBJ H-FORM) = frank
        (↑ SUBJ PERS) =c 2nd }

      -aa hay   VM
      (↑ TENSE) = present
      (↑ V-FORM) = perfect
      { (↑ OBJ NUM) =c sg
        (↑ OBJ GEND) =c masc
        (↑ SUBJ CASE) =c erg
      | (↑ SUBJ CASE) =c nom }

(209) xareed-   VB

(210) V2 → VB VM
To simplify the lexical entry, this work proposes to lump the verb-form suffix
and the verb auxiliary into one unit and to term the combination a verb morpheme
(VM), as shown in (208) (Rizvi and Hussain 2002). This simplifies the lexical entries
and reduces the search space during parsing and unification by avoiding multiple
options, for example, the options for the auxiliary in (207). The verb base (VB) is
stored separately, as shown in the lexical entry (209). The VB describes information
about the argument-structure, and the VM describes information about the agreement
requirements. Although this proposal results in extra lexical entries for the VM,
only the VB needs to be stored for each verb instead of all 60 verb-forms; therefore,
the total number of lexical entries is significantly reduced. Moreover, this proposal
is simpler to carry out, because it can be implemented without a morphological
analyzer. As shown in rule (210), the lumped VM can combine with the verb base (VB) to
form a verb V2. The f-structure formed using this rule is shown in Figure 8.2.
[Figure 8.2: f-structure of V2 formed using rule (210): PRED 'xareednaa<SUBJ, OBJ>', TENSE present, V-FORM perfective, with constraints on SUBJ (CASE erg) and OBJ (NUM sg, GEND masc).]
In Urdu, a sentence generally consists of zero or more case-marked noun phrases
(NP) followed by a verb, as shown in (211), where the verb can have the V1 form of
rule (206) or the V2 form of rule (210).
(211) S → NP*            { V1 | V2 }
          (↑ GF) = ↓
The Urdu sentences shown in (212), (213), and (214) give evidence that
representing the verb as V2, a combination of VB and VM, is simpler than representing
it as V1, a combination of V and AUX. The sentence in (212) has the perfective
verb-form xareed-ee followed by the auxiliary hay, representing the present tense. The
sentence in (213) has the same verb-form followed by the past auxiliary thee.
Therefore, when modeling with the V and AUX combination, the ASPECT feature gets the
value 'perfect' from V, and the TENSE feature gets the value 'present' or 'past' from
AUX. However, this scheme requires special handling to find the TENSE feature for the
sentence in (214), which uses the perfective verb-form without an auxiliary verb and
whose TENSE attribute is simple past.
(212)
Haamed=ney   ketaab     xareed-ee       hay
Hamid=erg    book=nom   buy-perf.sg.f   AUX.pres
'Hamid has bought a book.' (TENSE = present, ASPECT = perfect).
(213)
Haamed=ney   ketaab     xareed-ee       thee
Hamid=erg    book=nom   buy-perf.sg.f   AUX.past
'Hamid had bought a book.' (TENSE = past, ASPECT = perfect).
(214)
Haamed=ney   ketaab     xareed-ee
Hamid=erg    book=nom   buy-perf.sg.f
'Hamid bought a book.' (TENSE = past).
(215) -ee        VM   (↑ TENSE) = past
                      (↑ OBJ NUM) =c sg
                      (↑ OBJ GEND) =c fem

      -ee hay    VM   (↑ TENSE) = present
                      (↑ ASPECT) = perfect
                      (↑ OBJ NUM) =c sg
                      (↑ OBJ GEND) =c fem

      -ee thee   VM   (↑ TENSE) = past
                      (↑ ASPECT) = perfect
                      (↑ OBJ NUM) =c sg
                      (↑ OBJ GEND) =c fem
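The effect of storing the verb morphemes in (215) as lumped units can be illustrated with a small lookup in Python. This is a sketch under stated assumptions, not the parser used in this work: the dictionary VM_LEXICON, the function analyse_v2 and the convention of splitting the verb at its first hyphen are hypothetical choices made for the example.

```python
# Hypothetical sketch: a verb is split into a verb base (VB) and a lumped
# verb morpheme (VM); tense and aspect are read directly off the VM entry.

VM_LEXICON = {
    "-ee": {"TENSE": "past",
            "OBJ": {"NUM": "sg", "GEND": "fem"}},
    "-ee hay": {"TENSE": "present", "ASPECT": "perfect",
                "OBJ": {"NUM": "sg", "GEND": "fem"}},
    "-ee thee": {"TENSE": "past", "ASPECT": "perfect",
                 "OBJ": {"NUM": "sg", "GEND": "fem"}},
}

def analyse_v2(verb_phrase):
    """Split e.g. 'xareed-ee hay' into VB and VM and return their features.

    Returns None if the phrase has no suffix or the VM is not in the lexicon.
    """
    base, sep, rest = verb_phrase.partition("-")
    if not sep:
        return None
    features = VM_LEXICON.get("-" + rest)
    if features is None:
        return None
    return {"VB": base, **features}
```

Because the bare suffix -ee has its own entry, the simple past of (214) needs no special handling: analyse_v2("xareed-ee") returns TENSE past without any ASPECT value.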
However, by separating verb base (VB) (the part of a verb, responsible for the
argument structure of the verb) from the morphological affix (the part of a verb that
contains agreement features) and then defining verb morpheme (VM) as the suffix of
the verb including all auxiliary verbs, the above-mentioned case may be handled. The
VM contains information about TENSE, ASPECT, MOOD and agreement-features as
shown in (215). The c-structure formed by using the combination of VB and VM for
the sentence in (214) is shown in Figure 8.3.
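[Figure 8.3: c-structure for the sentence in (214) using VB and VM: S → NP NP V2; NP → N (Haamed) CM (ney); NP → N (ketaab); V2 → VB (xareed) VM (ee). Annotations: ↑=↓, (↓CASE)=nom, (↑PRED)='Haamed', (↑PERS)=3rd, (↑NUM)=sg, (↑GEN)=masc, (↑CASE)=erg, (SUBJ ↑).]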
can be leaves in a c-structure tree (Bresnan 2001). For example, the use of VB and
VM in coordinated structures is difficult to handle. Figure 8.4 shows the c-structure
for the sentence in (212), which obeys the lexical integrity principle by combining
morphologically complete words, i.e., V and AUX.
[c-structure: S → NP NP V1; NP → N (Haamed) CM (ney); NP → N (ketaab); V1 → V (xareed-ee) AUX (hay). Annotations: ↑=↓, (↓CASE)=nom, (↑PRED)='Haamed', (↑PERS)=3rd, (↑NUM)=sg, (↑GEN)=masc, (↑CASE)=erg, (SUBJ ↑).]
Figure 8.4: C-Structure of Haamed ney ketaab xareedee hay using V and AUX
In the following sections, the use of syntactic combination V and AUX will be
preferred over the combination VB and VM. However, it has been worked out that
the use of VB and VM with aspectual and modal auxiliaries can also simplify LFG
modeling equations. Similarly, this combination can also handle the complex
predicate by using the lexical entries for allowed N-V and V-V combination, instead
of composing them at the syntactic level.
8.2 Verb Aspect in Urdu
The verb aspect describes the duration, repetition and/or completion of an event
without reference to its actual position in time. If the action of a verb has been
completed, tamaam, it is termed perfect; otherwise, if the action is incomplete, naa
tamaam, the aspect is referred to as the imperfect, progressive, or continuous form,
jaaree. In the previous section, we have seen that the perfective or imperfective
(repetitive) morpheme is marked directly on the verb and represents the aspect of the
verb. In addition to aspectual morphemes, Urdu employs aspectual auxiliaries to
represent the aspect. In Table 8.2, a verb's perfective form is used to represent the
present perfect tense, while Table 8.10 shows the use of the perfective aspectual
auxiliary chokaa to form the present perfect tense.
[Table 8.10: present perfect tense formed with the perfective aspectual auxiliary chokaa. The column holding the forms of the aspectual auxiliary and the Urdu Script column are not recoverable.]

Subject   Pattern   Tense auxiliary   GEND   PERS   NUM   H-FORM
mayN      obj vs    hooN              masc   1st    sg
mayN      obj vs    hooN              fem    1st    sg
ham       obj vs    hayN              masc   1st    pl
ham       obj vs    hayN              fem    1st    pl
too       obj vs    hay               masc   2nd    sg    frank
too       obj vs    hay               fem    2nd    sg    frank
tom       obj vs    hao               masc   2nd    sg    formal
tom       obj vs    hao               fem    2nd    sg    formal
aap       obj vs    hayN              masc   2nd    sg    polite
aap       obj vs    hayN              fem    2nd    sg    polite
woh       obj vs    hay               masc   3rd    sg
woh       obj vs    hay               fem    3rd    sg
woh       obj vs    hayN              masc   3rd    pl
woh       obj vs    hayN              fem    3rd    pl
Table 8.11: The Attributes Associated with the Aspectual Auxiliary Morphemes
for the Agreement with a Nominative Subject

Morpheme   GEND   NUM      PERS       H-FORM           Tense Auxiliary
-aa        masc   sg       1st, 3rd
-ee        fem    sg, pl   1st, 3rd
-ey        masc   pl       1st, 3rd
-eeN       fem    pl       1st, 3rd                    no
-aa        masc   sg       2nd        frank
-ee        fem    sg, pl   2nd
-ey        masc   sg, pl   2nd        formal, polite
The aspectual auxiliaries in Urdu have the gender and number morphemes -aa,
-ee, and -ey, as shown in Table 8.11. The plural feminine morpheme -eeN appears
only if the tense auxiliary is not used in a sentence. The aspectual auxiliaries with
such morphemes usually follow the verb's root-form (or stem-form). These auxiliaries
require agreement in gender, number, person and honor form with a subject in the
nominative case. In the following sub-sections, some commonly used Urdu aspectual
auxiliaries are described.
8.2.1 Perfective Aspect
The perfective aspect describes that the action or event has ended, and it appears
in the present and past tenses. Urdu has two auxiliaries to show the perfective
aspect. The more frequently used one is chok-aa, an example sentence of which is shown
in (216). It requires agreement with the nominative subject. The other auxiliary in
Urdu that describes completion is l-ee-aa, as shown in (217). This auxiliary has
irregular morphology, appears with transitive verbs, and requires agreement with the
object in gender and number. The LFG-based lexical entries for both perfective
auxiliaries are shown in (218).
There is a semantic difference between these two perfective auxiliaries. The
auxiliary chok-aa tells about the end of an action. For example, the meaning of
sentence (216) is that the event of reading has ended, i.e., the whole book or the
part of the book that was intended to be read has been read. For sentence (217),
however, the meaning is that the whole book has been completely read.
(216)
woh      ketaab     paRh        chok-aa         hay
He=nom   book=nom   read-root   AUX.perf-sg.m   AUX.pres
'He has read the book.'
(217)
aos=ney   ketaab     paRh        l-ee                  hay
He=erg    book=nom   read-root   AUX.completely-sg.f   AUX.pres
'He has (completely) read the book.'
(218) chok-aa   AUX   [functional equations not recoverable]
      l-ee      AUX   [functional equations not recoverable]
The general rule for the formation of sentences employing perfective auxiliaries
is shown in (219), which shows that for auxiliary chok-aa the object NP is optional
and subject case is nominative, while for l-ee-aa, both NPs are required and the
subject case is ergative. Figure 8.5 shows, side by side, f-structures of sentences (216)
and (217). The value of the attribute ASPECT in both f-structures is perfect.
However, in the f-structure for the auxiliary l-ee-aa, another attribute ACTION has a
value complete to show completion of the action with reference to the object.
(219) S_perfective → NP_NOM-SUBJ (NP_OBJ) V_ROOT-FORM AUX_chok-aa AUX_TENSE
      S_perfective → NP_ERG-SUBJ NP_OBJ V_ROOT-FORM AUX_l-ee-aa AUX_TENSE
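The two patterns in (219) differ only in the subject case and in whether the object is required. A minimal Python sketch of this check is given below; the function name perfective_pattern_ok and its arguments are assumptions made for the illustration, and agreement morphology is not checked.

```python
# Hypothetical check of the two sentence patterns for perfective auxiliaries.

def perfective_pattern_ok(aux, subj_case, has_object):
    """Return True if the sentence frame is allowed for the given auxiliary.

    aux: "chok-aa" or "l-ee-aa"; subj_case: "nom" or "erg";
    has_object: whether an object noun phrase is present.
    """
    if aux == "chok-aa":
        # Nominative subject; the object is optional.
        return subj_case == "nom"
    if aux == "l-ee-aa":
        # Ergative subject; the object is required.
        return subj_case == "erg" and has_object
    raise ValueError("unknown perfective auxiliary: " + aux)
```

Under this sketch the frame of (216) is accepted for chok-aa and the frame of (217) for l-ee-aa, while l-ee-aa without an object is rejected.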
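[Figure 8.5: f-structures for the sentences in (216) and (217). For (216): PRED; SUBJ (CASE nom, PERS 3rd, NUM sg); OBJ (CASE nom); TNS-ASP (TENSE present, ASPECT perfect, V-FORM root). For (217): PRED; SUBJ (CASE erg, PERS 3rd, NUM sg); OBJ (CASE nom); TNS-ASP (TENSE present, ASPECT perfect, ACTION complete, V-FORM root).]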
The progressive aspect describes the continuation of an event such that the
event continues for the whole duration of the reference time. Urdu employs aspectual
auxiliary rah-aa having morphemes: -aa, -ee, and -ey, which require subject
agreement. The example sentence is shown in (220) and the rule for the progressive
sentence formation is shown in (221), which can be extended for ditransitive and
higher valency verbs. Figure 8.6 shows the f-structure of the progressive sentence in
(220), which contains a value progressive for the attribute ASPECT.
(220)
woh      ketaab     paRh        rah-aa                 hay
He=nom   book=nom   read-root   AUX.progressive-sg.m   AUX.pres
'He is reading a book.'
[Figure 8.6: f-structure for the progressive sentence in (220): SUBJ (CASE nom, PERS 3rd, NUM sg); OBJ (CASE nom); TNS-ASP (TENSE present, ASPECT progressive, V-FORM root).]
Urdu has aspectual auxiliaries, such as chal-aa and jaa-taa, which show that
an event or action is repeated for shorter and longer durations. In addition to
showing repetition, these auxiliaries, like the English phrase 'keep on', also
describe the persistency or resolve of the agent to perform an action. These
auxiliaries are used with the repetitive-form of a verb, and they require agreement
with the subject in the nominative form. The repetitive-form of a verb, also called
the habitual-form, itself describes the repetition of an action. An example sentence
without a repetitive aspectual auxiliary is shown in (222). The auxiliary jaa-taa can
be used without the auxiliary chal-aa, as shown in (223), but chal-aa always requires
jaa-taa to follow, as shown in (224). The auxiliary chal-aa adds the attributes of
continuation and/or longer duration to the meaning of the auxiliary jaa-taa, and
therefore increases the intensity of the persistency. The rule for the formation of a
repetitive sentence is shown in (225), which requires the verb in the repetitive-form.
(222)
woh ketaab paRh-taa hay
He=nom book=nom read-repeat AUX.pres
He is used to read a book, or, He reads a book (daily or regularly).
(223)
woh ketaab paRh-taa jaa-taa hay
He=nom book=nom read-repeat AUX.repeat AUX.pres
He keeps on reading a book (repeatedly).
(224)
woh ketaab paRh-taa chal-aa jaa-taa hay
He=nom book=nom read-repeat AUX.cont AUX.repeat AUX.pres
He keeps on reading a book (repeatedly and continuously).
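[C-structure of (224): [S [NP [N woh]] [NP [N ketaab]] [V2 [V paRhtaa] [AUX chalaa]
[AUX jaataa] [AUX hay]]], with the node annotations ↑=↓ and (↓ CASE)=nom]
[F-structure of (224): SUBJ [CASE nom, PERS 3rd, NUM sg]; OBJ [CASE nom];
TNS-ASP [TENSE present, ASPECT repetitive, ACTION continuous, V-FORM repetitive]]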
(225) S_repetitive → NP_NOM-SUBJ NP_OBJ V_NARRATIVE-FORM (AUX_chal-aa) (AUX_jaa-taa) AUX_TENSE
There are two more repetitive aspectual auxiliaries in Urdu, which describe
other features of repetition and persistency, such as a predominant action and an
irregular but repeating action. The auxiliary rah-taa describes a predominant action
over the reference time span, as shown in sentence (226), in which the main action
read may be interrupted by other smaller actions, such as eating or drinking, but the
main action is still read, and the interpretation of the sentence is that he usually
keeps on reading a book. The auxiliary kar-taa describes an irregular repetition of an
action, as shown in sentence (227), the f-structure of which is shown in Figure 8.9.
[Figure 8.9, f-structure of (227): PRED; SUBJ [CASE nom, PERS 3rd, NUM sg];
OBJ [CASE nom]; TNS-ASP [TENSE present, ASPECT repetitive, ACTION irregular,
V-FORM perfective]]
(226)
woh ketaab paRh-taa rah-taa hay
He=nom book=nom read-repeat AUX.mostly-sg.m be.pres
He mostly (for the maximum available time) reads a book.
(227)
woh ketaab paRh-aa kar-taa hay
He=nom book=nom read-perf AUX.intermittently-sg.m AUX.pres
He intermittently (often but not regularly) reads a book.
woh ketaab paRh-ney lagaa hay
He=nom book=nom read-inf.m.obl AUX.start be.pres
He has just started to read the book (start = 1), or
He is just going to start reading a book (start = 0).
(230)
woh ketaab paRh-ney waalaa hay
He=nom book=nom read-inf.m.obl AUX.start be.pres
He is going to start reading a book (start = -1).
[F-structure of an inceptive sentence: PRED; SUBJ [CASE nom, PERS 3rd, NUM sg];
OBJ [CASE nom]; TNS-ASP [TENSE present, ASPECT inceptive, ACTION going2start,
V-FORM infinitive]]
The verb mood describes the purpose or the type of an action, such as a fact,
news, command, request, wish, doubt, question, and potential. Languages
express distinctions of various moods either by inflecting the form of the verb or by
using a modal auxiliary. In English, a modal auxiliary, such as should, would,
could, etc., is usually used to show mood, while in Urdu, both modal auxiliaries and
morphological affixation are used to show mood variations. In the following
subsections, the commonly used moods in Urdu are described.
8.3.1 Declarative or News Mood
Haamed beemaar hay
Hamid=nom sick=nom be.pres.sg
Hamid is sick.
[C-structure: [S [NP (↑ SUBJ)=↓ [N Haamed]] [NP (↑ INFO)=↓ [N beemaar]]
[VBE ↑=↓ [V hay]]]
  Haamed: (↑ PRED) = 'Haamed', (↑ CASE) = nom, (↑ PERS) = 3rd, (↑ NUM) = sg, (↑ GEN) = masc
  beemaar: (↑ PRED) = 'beemaar', (↑ NUM) = sg, (↑ GEN) = masc, (↑ CASE) = nom]
[F-structure: SUBJ [PERS 3rd, NUM sg]; INFO [PRED 'beemaar', CASE nom];
TNS-ASP [TENSE present, MOOD declarative]]
[Figure 8.13, c-structure of (233): [S [NP (↑ SUBJ)=↓ [NP [N Haamed] [PM kee]]
[NP [N paydaaesh]]] [NP (↑ LOC)=↓ [NP [N laahaor]] [CM meyN]] [VBE ↑=↓ [V hoo-ee]]]
  Haamed: (↑ PRED) = 'Haamed', (↑ CASE) = nom, (↑ PERS) = 3rd, (↑ NUM) = sg, (↑ GEN) = masc
  hoo-ee: (↑ PRED) = 'haonaa<SUBJ, LOC>', (↑ TNS-ASP TENSE) = past,
          (↑ TNS-ASP V-FORM) = perfective, (↑ TNS-ASP MOOD) = declarative,
          (↑ SUBJ NUM) =c sg, (↑ SUBJ GEND) =c fem, (↑ SUBJ CASE) =c nom]
Although the present form of predicate hay does not require gender
agreement, the perfective forms hoo-aa and hoo-ee require gender agreement with
the subject as shown in the examples (233) and (234). In the case of possessive NP,
the agreement is with the last possessee NP in the chain of possessive NPs.
(233)
Haamed=kee paydaaesh laahaor=meyN hoo-ee
Hamid=gen birth.nom.sg.fem Lahore.loc happen-perf.sg.fem
Hamid's birth took place in Lahore.
(234)
Haamed=kaa janam laahaor=meyN hoo-aa
Hamid=gen birth.nom.sg.masc Lahore.loc happen-perf.sg.masc
Hamid's birth took place in Lahore.
[Figure 8.14, f-structure of (233): PRED; SUBJ [POSSESSOR [CASE nom, PERS 3rd, NUM sg];
POSSESSEE [CASE nom, GEND fem, NUM sg]; CASE nom, GEND fem, NUM sg];
LOC [CASE locative]; TNS-ASP [TENSE past, MOOD declarative, V-FORM perfective]]
The agreement of the verb in sentence (233) is with the subject, which gets its
gender, number and case features from the possessee, as described in the c-structure
and f-structure shown in Figure 8.13 and Figure 8.14. The sg-fem noun paydaaesh
(birth) is used with the sg-fem verb-form hooee, as shown in (235). Similarly, the
sg-masc noun janam (birth) is used with the sg-masc verb-form hooaa, as shown in
sentence (234). The analysis presented in this work is different from (Mohanan 1990).
Mohanan assumed that, for a sentence like the one shown in (236), janam hooaa is an
N-V complex predicate having a genitive subject Haamed kaa and a locative object
laahaor meyN. However, in this work it is assumed that, for the sentence in (236),
Haamed kaa is an incomplete noun phrase until it is joined with another nominative
noun phrase. The phrase laahaor meyN is not a nominative noun phrase; therefore,
paydaaesh is a possessee of the possessor Haamed, which comes after a locative
phrase in the phrase-order. Based on this observation, this work assumes a rule for the
noun phrase order in Urdu and Hindi, as shown in (237). This assumption is also found
useful for the analysis of other moods, like the permissive mood.
(235)
Haamed=kee laahaor=meyN paydaaesh hoo-ee
Hamid=gen Lahore.loc birth.nom.sg.fem happen-perf.sg.fem
Hamid's birth took place in Lahore.
(236)
Haamed=kaa laahaor=meyN janam hoo-aa
Hamid=gen Lahore.loc birth.nom.sg.masc happen-perf.sg.masc
Hamid's birth took place in Lahore.
(237) Assumption: Surface Linear Order for Noun Phrases in Urdu/Hindi:
If case-marking on noun phrases enables identifying arguments for all the
predicates having the argument-structure in a sentence, then the noun phrases
in Urdu and Hindi can take any surface linear order, and even the arguments
of different predicates could be scrambled.
[C-structure of (236): [S [NP (↑ SUBJ)=↓ [NP [N Haamed]] [PM kaa]]
[NP (↑ LOC)=↓ [NP [N laahaor]] [CM meyN]] [VBE ↑=↓ [NP [N janam]] [V hoo-aa]]]
  Haamed: (↑ PRED) = 'Haamed', (↑ CASE) = nom, (↑ PERS) = 3rd, (↑ NUM) = sg, (↑ GEN) = masc
  laahaor meyN: (↑ PRED) = 'laahaor', (↑ CASE) = locative
  janam: (↑ PRED) = 'janam', (↑ NUM) = sg, (↑ GEN) = masc, (↑ CASE) = nom
  hoo-aa: (↑ PRED) = 'haonaa<SUBJ, LOC>', (↑ TNS-ASP TENSE) = past,
          (↑ TNS-ASP V-FORM) = perfective, (↑ TNS-ASP MOOD) = declarative,
          (↑ SUBJ NUM) =c sg, (↑ SUBJ GEND) =c masc, (↑ SUBJ CASE) =c nom]
(238)
Haamed=kaa laahaor=meyN makan hay
Hamid=gen Lahore.loc house.nom.sg.masc be-pres.sg
Hamid's house is in Lahore.
[C-structure of (238): [S [NP (↑ SUBJ)=↓ [NP [N Haamed]] [PM kaa]]
[NP (↑ LOC)=↓ [NP [N laahaor]] [CM meyN]] [VBE ↑=↓ [NP [N makan]] [V hay]]]
  Haamed: (↑ PRED) = 'Haamed', (↑ CASE) = nom, (↑ PERS) = 3rd, (↑ NUM) = sg, (↑ GEN) = masc
  laahaor meyN: (↑ PRED) = 'laahaor', (↑ CASE) = locative
  makan: (↑ PRED) = 'makan', (↑ NUM) = sg, (↑ GEN) = masc, (↑ CASE) = nom]
(239)
yeh Haamed=kee ketaab hay
This Haamed=gen book.nom.sg.fem be-pres.sg
This is Hamid's book.
(240)
yeh ketaab Haamed=kee hay
This book.nom.sg.fem Haamed=gen be-pres.sg
This is Hamid's book (with a focus on the book).
(241)
ketaab Haamed=kee yeh hay
book.nom.sg.fem Haamed=gen This be-pres.sg
This is Hamid's book (with a focus on Hamid).

Haamed kee paydaaesh aapney naanaa key ghar=meyN hoo-ee
Hamids.focus Birth.sg.fem his grandfathers home=location happen-perf.sg.fem
Hamid's birth took place at his grandfather's home.
[C-structure: [NP [N Haamed] [CM ney]] [NP [N aanjom] [CM kao]] [NP [N ketaab]]
[Vdeynaa [VN paRhney] [V d-ee]]
  Node annotations: (↑ OBJ)=↓, (↑ OBJ2)=↓, ↑=↓, (↑ OBJ SUB)=(↓ OBJ2)
  Haamed: (↑ PRED) = 'Haamed', (↑ PERS) = 3rd, (↑ NUM) = sg, (↑ GEN) = masc
  aanjom: (↑ PRED) = 'aanjom', (↑ PERS) = 3rd, (↑ NUM) = sg, (↑ GEN) = fem
  ney: (↑ CASE) = erg
  kao: (↑ CASE) = dat
  ketaab: (↑ PRED) = 'ketaab', (↑ CASE) = nom
  paRhney: (↑ OBJ NUM) =c sg, (↑ OBJ GEND) =c fem]
Figure 8.17: C-Structure of Haamed ney aanjom kao ketaab paRhney dee
sentence (245), which also has three arguments but the object in that case is not a
verbal noun and, therefore, does not have attributes for V-FORM and V-FORM2 as
infinitive and oblique, respectively. Moreover, the phrase paRhney key leeey (for
reading) is a post-positional phrase in (245) and acts as adjunct in the final f-structure.
The f-structure for the sentence (244) is shown in Figure 8.18
[F-structure: PRED; SUBJ [CASE erg, GEND masc, PERS 3rd];
OBJ2 [1] [GEND fem, PERS 3rd];
OBJ [SUBJ [1]; OBJ [CASE nom, GEND fem, NUM sg];
     TNS-ASP [V-FORM infinitive, V-FORM2 oblique]; NUM sg, GEND fem];
TNS-ASP [TENSE past, MOOD permissive, V-FORM perfective]]
Figure 8.18: F-Structures of Haamed ney aanjom kao ketaab paRhney dee
The f-structure in Figure 8.18 assumes that infinitive paRhnaa also has an
argument structure, the subject (SUBJ) of this infinitive verb is aanjom, which is an
indirect object (OBJ2) of the permissive verb deynaa and the object of the infinitive
is ketaab. The infinitive has its own tense-aspect attributes. The gender and number
attributes of this verbal noun are the same as that of the object, which are singular and
feminine.
(246)
Haamed=ney ketaab aanjom=kao paRh-ney d-ee
Hamid=erg book=nom.sg.f Anjom=dat read-inf.obl let-perf.sg.f
Hamid let Anjom read a book.
The sentence in (246) is the same as the sentence in (244), but shows a different
order of noun phrases. The sentence in (244) is the more acceptable form of a
permissive sentence, but the sentence in (246) is also acceptable. In these sentences,
there are four NPs: Hamid is ergative, book is nominative, Anjom is dative and
to read is infinitive. Moreover, there are two verbs with argument-structures: dee
(let) requires three NPs: an ergative, a dative and an infinitive. The infinitive to read
requires two NPs: a dative subject and a nominative object. The arguments of these
verbs are satisfied based on the case-marking, according to the assumption (237) made
in this work. The scrambled c-structure of sentence (246) is shown in Figure 8.19.
The f-structure of sentence (246) is the same as that of sentence (244), shown in
Figure 8.18.
[C-structure: [S [NP [N Haamed] [CM ney]] [NP [N ketaab]] [NP [N aanjom] [CM kao]]
[Vdeynaa [VN paRhney] [V d-ee]]]
  Node annotations: (↑ OBJ2)=↓, (↑ SUBJ)=↓, (↑ OBJ)=↓, ↑=↓, (↑ OBJ SUB)=(↓ OBJ2)
  Haamed: (↑ PRED) = 'Haamed', (↑ PERS) = 3rd, (↑ NUM) = sg, (↑ GEN) = masc]
Figure 8.19: C-Structure of Haamed ney ketaab aanjom kao paRhney dee
(248)
Haamed=ney aanjom=kao ketaab paRh-ney=sey manA kee-aa
Hamid=erg Anjom=dat book=nom.sg.f read-inf.obl=inf prohibit-perf.sg.m
Hamid prohibited Anjom from reading a book.
[F-structure: PRED; SUBJ [CASE erg, GEND masc, PERS 3rd];
OBJ2 [1] [PRED 'aanjom', CASE dative, GEND fem, PERS 3rd];
OBJ [SUBJ [1]; OBJ [CASE nom, GEND fem, NUM sg]; V-FORM infinitive;
     NUM sg, GEND fem, CASE infinitive];
TNS-ASP [TENSE past, MOOD prohibitive, V-FORM perfective]]
Figure 8.20: F-Structures of Haamed ney aanjom kao ketaab paRhney sey
manA keeaa
and may be omitted. Special verb morphemes in Urdu are used to represent variations
within the imperative mood, such as, frank, formal, polite and request as shown in the
following example sentences.
(249)
too ketaab paRh
You=nom.frank book.sg.masc read=imp.frank
You read the book (in a frank or rude way).
(250)
tom ketaab paRh-ao
You=nom.formal book.sg.masc read=imp.formal
You read the book (in a formal way or with someone familiar).
(251)
aap ketaab paRh-eyN
You=nom.polite book.sg.masc read=imp.polite
You read the book (in a polite or respectful way).
(252)
aap ketaab paRh-ee-ey
You=nom.polite book.sg.masc read=imp.request
You read the book, please (in a more-polite way or as a request).
(253)
aap ketaab paRh l-ee-jee-ey
You=nom.polite book.sg.masc read=root AUX.imp.appeal
You, please, read the book (in an extra-polite way or as an appeal).
Table 8.12: The Urdu Imperative Verb Forms for the Imperative Mood
Imperative                 Read            Look             Speak           Sit              Eat
Frank (or Rude)            paRh            deykh            baol            bayTh            khaa
Formal (or Familiar)       paRh-ao         deykh-ao         baol-ao         bayTh-ao         khaa-ao
Polite (or Respect)        paRh-eyN        deykh-eyN        baol-eyN        bayTh-eyN        khaa-eyN
More Polite (or Request)   paRh-ee-ey      deykh-ee-ey      baol-ee-ey      bayTh-ee-ey      khaa-ee-ey
Extra Polite (or Appeal)   paRh leejeeey   deykh leejeeey   baol deejeeey   bayTh jaaeeey    khaa leejeeey
Table 8.12 summarizes the imperative verb forms of some Urdu verbs, which
are used in the imperative mood. Figure 8.21 shows the c-structure tree and Figure
8.22 shows the f-structure for the sentence in (252). The imperative verb-form in the
c-structure places constraints on the subject: it should be a second person pronoun in
the nominative case and in the polite-form. If the second person pronoun is
omitted, then these attributes are implied by default.
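The regular suffix pattern of Table 8.12 can be sketched in Python as follows. This is
only an illustrative sketch: the function name imperative_forms and the dictionary keys
are hypothetical, and the extra-polite (appeal) form is left out because it uses a
verb-specific auxiliary (leejeeey, deejeeey or jaaeeey) rather than a suffix.

```python
# Suffixes read off Table 8.12 (romanized); the frank form is the bare stem.
# The key names are illustrative labels, not terminology fixed by the thesis.
IMPERATIVE_SUFFIXES = {
    "frank": "",
    "formal": "-ao",
    "polite": "-eyN",
    "request": "-ee-ey",
}


def imperative_forms(stem):
    """Return the suffix-based imperative forms of a romanized verb stem."""
    return {level: stem + suffix for level, suffix in IMPERATIVE_SUFFIXES.items()}


forms = imperative_forms("paRh")
print(forms["request"])
```

A lexicon-driven implementation would additionally have to store the auxiliary that each
verb selects for the appeal form.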
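[Figure 8.21, c-structure of (252): [S [NP (↑ SUBJ)=↓ [N aap]] [NP (↑ OBJ)=↓ [N ketaab]]
[V2 ↑=↓ [V paRh-ee-ey]]]
  aap: (↑ PRED) = pronoun, (↑ PERS) = 2nd, (↑ H-FORM) = polite, (↑ CASE) = nom]
[Figure 8.22, f-structure of (252): SUBJ [CASE nom, PERS 2nd, H-FORM polite];
OBJ [CASE nom]; TNS-ASP [MOOD request, V-FORM imperative]]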
The capacitive mood shows the capability of the agent for performing an action.
This mood employs the auxiliary sak-taa to mark that the attribute of this mood is
capability. A general rule for the capacitive mood is shown in (254) and an example
sentence is shown in (255).
(254) S_capacitive → NP_SUBJ-nom NP_OBJ-nom V AUX_sak-taa AUX_tense
(255)
mayN ketaab paRh sak-taa hooN
I=nom.1st.sg book=nom.sg.f read-root AUX-capacitive.sg.m AUX-pres.1st.sg
I can read a book.
[C-structure of (255): [NP [N mayN]] [NP (↑ OBJ)=↓ [N ketaab]]
[V2 ↑=↓ [V paRh] [AUX saktaa] [AUX hooN]]
  mayN: (↑ PRED) = pronoun, (↑ PERS) = 1st, (↑ NUM) = sg, (↑ CASE) = nom
  ketaab: (↑ PRED) = 'ketaab', (↑ NUM) = sg, (↑ GEN) = fem, (↑ CASE) = nom
  hooN: (↑ TENSE) = present
  saktaa: (↑ V-MOOD) = capacitive]
tomheeN ketaabeyN paRh-nee chaaheeey
You=acc.2nd books=nom.pl.fem read-inf.fem AUX-suggestive
You should read books.
[F-structure: PRED; SUBJ [CASE acc, PERS 2nd]; OBJ [CASE nom, GEND fem, NUM pl];
TNS-ASP [TENSE present, MOOD suggestive, V-FORM infinitive]]
(258)
tomheeN ketaab paRh-nee chaaheeey
You=acc.2nd book=nom.sg.fem read-inf.fem AUX-suggestive
You should read a book.
aos=ney/aosey ketaab paRh-nee hay
He=erg/acc book=nom read.inf.sg.masc AUX.pres
He wants to/has to read a book (self-imposed obligation).
(260)
aosey ketaab paRh-naa paR-ee hay
He=*erg/acc book=nom read.inf.sg.masc AUX.compulsive-sg.fem AUX.pres
He has to read a book (externally-imposed obligation, unwillingness).
woh ketaab paRh chokaa hao gaa
He book read.root AUX.perf be.sg.m.3.presumptive
He would have read the book.
(263)
aab tak paakestaan jeet chokaa hao gaa
Until now, Pakistan won.root AUX.perf be.sg.m.3.presumptive
Until now, Pakistan would have won.
(264)
woh sakool jaa rahee hao gee
she school go.root AUX.cont be.sg.f.2.presumptive
She will be going to school.
(265)
woh mare jaa rahey haoN gey
They Murree go.root AUX.cont be.pl.m.2.presumptive
They will be going to Murree.
(266)
kal mayN laahaor jaa rahaa hooN gaa
Tomorrow, I Lahore go.root AUX.cont be.sg.m.1.presumptive
Tomorrow, I will be going to Lahore.
(267) S_dubitative/presumptive → NP_SUBJ-nom (NP_OBJ-nom) V_root AUX_chok-aa AUX_hao AUX_g-aa
      S_dubitative/presumptive → NP_SUBJ-nom (NP_OBJ-nom) V_root AUX_rah-aa AUX_hao AUX_g-aa
The subjunctive mood is also used for praying to Allah for something, as shown
in (271) and also for seeking forgiveness of sins from Allah as shown in (272).
(271)
aal.lah tomheyN beytaa d-ey
Allah=3.nom you=dat.2 son.sg.masc give.sg.3.subjunctive
May Allah give you a son.
(272)
aal.lah meyrey gonaah moAaf kar-ey
Allah=3.nom I=gen.1 sin=nom.pl forgive=nom do.pl.3.subjunctive
May Allah forgive my sins.
Verbal coordination refers to the use of two or more verbs with a common
subject of the sentence, as shown in sentence (273). In this sentence, there is a
single ergative subject Haamed and three actions: eating breakfast, picking up a
bag and going to an office. These three actions are associated with the subject using
the conjunction 'and' (aor). This coordinated structure is well-formed because all
three verbs in the sentence are transitive and require an ergative subject.
(273)
Haamed=ney naashtah kee-aa, bayg aoThaa-yaa, aor daftar chal-aa ga-yaa
Hamid=erg breakfast=nom do.perf, bag=nom pick up.perf, and office=nom go.perf
Hamid ate the breakfast, picked up (his) bag and went to (his) office.

? Haamed=ney naashtah kee-aa, nahaa-yaa, bayg aoThaa-yaa, aor daftar chal-aa ga-yaa
Hamid=erg breakfast=nom do.perf, bath.perf, bag=nom pick up.perf, and office=nom go.perf
Hamid ate breakfast, took bath, picked up bag and went to office.
Sentences (275) and (276) use a transitive verb khaa-naa and an
intransitive verb nahaa-naa in coordination. The transitive verb requires an ergative
subject, while the intransitive verb requires a nominative subject. The perfective
verb-form agrees with the object for a transitive verb and with the subject for an
intransitive verb. For sentence (275), the transitive verb khaa-naa agrees correctly
with the object aam, but the intransitive verb nahaa-naa cannot agree with the
subject, because the subject is ergative; the intransitive verb cannot take an object,
and the default singular-masculine agreement is also not correct. Therefore, the
only way to regard sentence (275) as well-formed is to assume that there is an
implied pronoun woh, like the one shown in sentence (277). Similarly, the verb
khaa-naa in sentence (276) requires an ergative subject, and the sentence is considered
well-formed only if the pronoun aos ney is assumed, like the one shown in sentence
(278). However, such a verbal coordination between an intransitive and a transitive
verb is well-formed for the other, imperfective verb-forms (i.e., infinitive, repetitive,
imperative and subjunctive), because these verb-forms require a nominative subject for
both transitive and intransitive verbs.
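The case part of this argument can be sketched as a small check in Python. The sketch
is an illustration under simplifying assumptions, not the grammar used in this work: the
function names are hypothetical, only the subject case is modelled (agreement is
ignored), and every non-perfective verb-form is taken to require a nominative subject.

```python
def required_subject_case(transitive, verb_form):
    """Subject case demanded by a verb: ergative only for perfective transitives."""
    if verb_form == "perfective" and transitive:
        return "erg"
    return "nom"


def can_share_subject(verbs, verb_form):
    """verbs maps each verb to True (transitive) or False (intransitive).

    Coordinated verbs can share one subject only if they all demand the same case.
    """
    cases = {required_subject_case(t, verb_form) for t in verbs.values()}
    return len(cases) <= 1


verbs = {"khaa-naa": True, "nahaa-naa": False}
print(can_share_subject(verbs, "perfective"))
print(can_share_subject(verbs, "repetitive"))
```

For the pair khaa-naa and nahaa-naa the check fails in the perfective form, as in (275)
and (276), and succeeds in the repetitive form.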
(275)
* naadyah=ney aam khaa-yaa aor nahaa-ee
Nadya=erg.fem mango=nom.masc eat.perf.masc and bath.perf.fem
Nadya ate mango and bathed.
(276)
* naadyah nahaa-ee aor aam khaa-yaa
Nadya=nom.fem bath.perf.fem and mango=nom.masc eat.perf.masc
Nadya bathed and ate mango.
(277)
naadyah=ney aam khaa-yaa aor woh nahaa-ee
Nadya=erg.fem mango=nom.masc eat.perf.masc and she=nom bath.perf.fem
Nadya ate mango and bathed.
(278)
naadyah nahaa-ee aor aos=ney aam khaayaa
Nadya=nom.fem bath.perf.fem and she=erg mango=nom.masc eat.perf.masc
Nadya bathed and ate mango.
The sentences (277) and (278) do not express verbal coordination, because
each verb has its own subject; these sentences represent coordination between
two sentential phrases instead of between two verbs. Verbal coordination is
usually achieved in Urdu using two verbal conjunction types, which are
preferred to the general conjunction 'and' (aor), and both of these act as
participle adverbials. The first type, the perfective verbal conjunction, is formed, as
shown in (279), by adding the auxiliary kar to the verb stem-form. The auxiliary
kar describes that the subject, after completing the first action, performs the other
action. The kar phrase in Urdu has a similar meaning to the having phrase in
English. The example sentences are shown in (280) and (281), and this syntax is
linguistically preferable to the coordination syntax used in (277) and (278).
(280)
Haamed aam khaa kar nahaa-yaa
Hamid=nom.masc mango=nom.masc eat.root AUX.perf.conj bath.perf.sg.masc
Hamid, having eaten the mango, bathed.
Hamid, after eating the mango, bathed.
(281)
Haamed=ney nahaa kar aam khaa-yaa
Hamid=erg.masc bath.root AUX.perf.conj mango=nom.sg.masc eat.perf.sg.masc
Hamid, having bathed, ate the mango.
Hamid, after bathing, ate the mango.
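[Figure 8.25, f-structure of (280): PRED; SUBJ [CASE nom]; TNS-ASP [TENSE past];
ADJUNCT [SUBJ [1] (shared with the main SUBJ); PRED; OBJ [PRED 'aam', CASE nom];
TNS-ASP [ASPECT perfective, MOOD conjunctive]]]
[Figure 8.26, f-structure of (281): SUBJ [CASE nom]; TNS-ASP [TENSE past];
OBJ [CASE nom]; ADJUNCT [SUBJ [1] (shared with the main SUBJ);
TNS-ASP [ASPECT perfective, MOOD conjunctive]]]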
The f-structures of the sentences in (280) and (281) are shown in Figure 8.25 and
Figure 8.26. They treat the verbal conjunctive phrase as an adverbial adjunct to the
main phrase, such that the subject of both phrases is the same. The adverbial adjunct
has perfective aspect and conjunctive mood.
The second type verbal conjunction progressive represents the overlap of two
actions, and one action is performed while the other action is in progress. This is
formed by appending an oblique form of auxiliary hoo-ey to the oblique repetitive
verb form, as shown in (282).
(282) An action accompanies another progressive action
Verbal Conjunction Progressive = Verb Repetitive Form -tey + hoo-ey
(283)
Haamed aam khaa-tey hoo-ey nahaa-yaa
Hamid=nom.masc mango=nom.masc eat.m.obl AUX.m.obl bath.perf.sg.masc
Hamid, while eating the mango, bathed.
[F-structure of (283): PRED; SUBJ [CASE nom]; TNS-ASP [TENSE past];
ADJUNCT [SUBJ [1] (shared with the main SUBJ); PRED; OBJ [PRED 'aam', CASE nom];
TNS-ASP [ASPECT progressive, MOOD conjunctive]]]
both the subject and object. This contrast between the two is shown in sentences (284)
and (285).
8.5 Conclusion
In this chapter, the modeling of the verbal structure is presented for commonly
used Urdu tenses, aspects and moods. Urdu has a rich verb morphology,
which requires agreement with the subject and/or object nouns. The verb morphology
and the auxiliaries together describe various features of tense, aspect and mood. It is
shown that the LFG based c-structures and f-structures can handle such diverse modeling
requirements of Urdu verbal syntax.
Chapter 9
URDU PARSING BY CHUNKING
BASED ON CLOSED WORD CLASSES
USING ORDERED CONTEXT FREE GRAMMAR
Parsing finds the constituent structure of a sentence in a language using a given
grammar, and it is an important prerequisite for various natural language processing
applications. For machine translation systems, parsing the sentences of the source
language is important for the proper syntactic, and afterwards semantic, understanding
of the source language sentences. Unless a proper understanding of the structure of
the source language is attained, the knowledge conveyed by the sentence may not be
extracted, and reliable machine translation may then not be achieved.
The parsing techniques for context free grammar (CFG), also known as rule-based
parsing techniques, are well understood (Grune and Jacobs 1994). Various
parsing techniques are available, each of which is suited to some particular conditions.
The Earley parser, the Tomita parser and the chart parser are widely used for natural
language parsing. However, finding a complete and unambiguous CFG parse in natural
language processing (NLP) is a difficult task until semantic disambiguation is done.
Statistical parsing techniques are achieving good results for natural language parsing
(Manning and Schütze 2003). However, most statistical techniques need a
large corpus and a considerable amount of manual work to tag and parse the corpus
that serves as example data for the statistical tagger and parser. Steven Abney proposed
to approach the parsing of natural languages by first finding correlated chunks of
words (Abney 1991). Chunking divides text into syntactically related,
non-overlapping groups of words. Ramshaw and Marcus performed chunking through a
machine learning method (Ramshaw and Marcus 1995). Their approach categorized
every non-NP chunk as a VP chunk. Buchholz et al. found various chunks: NP, VP, PP,
ADJP and ADVP (Buchholz, Veenstra et al. 1999). Veenstra worked with NP, VP
and PP chunks (Veenstra 1999). In 2000, the shared task of the Conference on
Computational Natural Language Learning (CoNLL) was held on chunk parsing. A JMLR
special issue on shallow parsing was published in 2002.
This work presents a novel Ordered Context Free Grammar (OCFG), which is
an extension of the context free grammar in which each rule has an order and a type
associated with it, just as the probabilistic context free grammar is an extension of CFG
(Yao and Lua 1998) that has a probability value associated with each rule. The formal
definition is given in (286).
(286) Definition: An ordered context free grammar (OCFG) is a four tuple
{W, N, S, R}, where W = {w1, w2, ..., wu} is a set of terminal symbols, like
words in a sentence, N = {n1, n2, ..., nv} is a set of non-terminal symbols, like
noun-phrase in a sentence, S = { n } is a set of one symbol, which is the goal
symbol, and R = {r1, r2, ..., rw} is a set of grammar rules, where each rule has a
unique order number and type (left, right or recursive)
(Rizvi and Hussain 2004).
Each rule r has an order number associated with it, which is used as a priority
during parsing. In addition to the order, each rule has a left, right or recursive type.
A left-rule is applied from left to right, and a right-rule is applied from right to left. A
recursive rule is applied recursively. A rule can have neither an empty left-hand side
nor an empty right-hand side. This means that empty productions are not allowed.
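The constraints of definition (286) can be sketched as a small Python data structure.
This is a minimal sketch and not the program written for this work: the names Rule,
RULE_TYPES and validate are hypothetical, and the two sample rules are assumed for
illustration only.

```python
from dataclasses import dataclass

RULE_TYPES = {"left", "right", "recursive"}


@dataclass(frozen=True)
class Rule:
    order: int   # unique priority used during parsing
    lhs: str     # non-terminal produced by the rule
    rhs: tuple   # symbols consumed by the rule
    kind: str    # "left", "right" or "recursive"


def validate(rules):
    """Check the constraints of definition (286) and return the rules by order."""
    orders = [r.order for r in rules]
    if len(orders) != len(set(orders)):
        raise ValueError("order numbers must be unique")
    for r in rules:
        if not r.lhs or not r.rhs:
            raise ValueError("empty productions are not allowed")
        if r.kind not in RULE_TYPES:
            raise ValueError(f"unknown rule type: {r.kind}")
    return sorted(rules, key=lambda r: r.order)


# Two assumed sample rules, given out of order on purpose.
rules = validate([
    Rule(5, "E", ("E", "O1", "E"), "left"),
    Rule(4, "E", ("E", "O2", "E"), "left"),
])
print([r.order for r in rules])
```

Sorting by the order number gives the sequence in which a parser would try the rules.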
In order to validate the proposed method, a computer program was written and
tested for arithmetic expressions, programming language and natural language parsing
applications. In this section, the parsing of arithmetic expressions is presented to show
that the method not only parses but also takes care of the operator association involved
in arithmetic expressions. In the next sections, the OCFG will be used for parsing by
chunking for the Urdu language.
Order   Type
0       Right/Recur
1       Right
2       Right
3       Left/Recur
4       Left
5       Left
(288) A = B + C * 5.3
3+4*(67)/5+3
The expressions shown in (288) are parsed to explain the method. In the first
step, the input strings are converted to token objects. The parser passes the token array
to the rules one by one, and each rule object applies itself. Rule # 0 is recursive.
Therefore, it looks for the first two objects ID and O3 and, if they are found, it sends
all tokens following O3 to the parser object for parsing. The parser takes this shorter
array and uses the same set of rules one by one to reduce it to E. The shorter array is
shown in square brackets in Figure 9.1. After the recursive call, the rules are applied
to the array within the square brackets until it reduces to E; the recursion then returns
and the parsing that preceded the recursive call is resumed. Within the brackets, rule # 2
is applied from right to left and constructs an E from the N object. Then rule # 1 is
applied twice and creates two more E objects from the ID objects. Rule # 4 is then
applied ahead of rule # 5, which takes care of the precedence requirement. Finally, the
expression E is constructed successfully within the square brackets, and the recursive
call returns the parse to the previous array as an E object, from which rule # 0
constructs the final expression for the assignment.
A = B + C * 5.3              Input expression
ID O3 ID O1 ID O2 N          Token objects
ID = [ ID O1 ID O2 N ]       Rule 0 : Recursion
ID = [ ID O1 ID O2 E ]       Rule 2
ID = [ ID O1 E O2 E ]        Rule 1
ID = [ E O1 E O2 E ]         Rule 1
ID = [ E O1 E ]              Rule 4
ID = [ E ]                   Rule 5
ID = E                       Recursion returns
E                            Rule 0
The grammar (287) is applied to another expression, which has a sub-expression
in brackets, and its parse is shown in Figure 9.2. The sub-expression within the
brackets is also handled using a recursive call, in order to ensure that inner
sub-expressions are evaluated ahead of the expression outside the brackets.
3+4*(67)/5+3                               Input expression
N O1 N O2 ( N O1 N ) O2 N O1 N             Token objects
E O1 E O2 ( E O1 E ) O2 E O1 E             6 times, Rule 2
E O1 E O2 ( [ E O1 E ] ) O2 E O1 E         Rule 3 (Recursion)
E O1 E O2 ( [ E ] ) O2 E O1 E              Rule 5
E O1 E O2 ( E ) O2 E O1 E                  Recursion returns
E O1 E O2 E O2 E O1 E                      Rule 3
E O1 E O2 E O1 E                           Rule 4 : Left
E O1 E O1 E                                Rule 4
E O1 E                                     Rule 5 : Left
E                                          Rule 5
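The two traces can be reproduced with a short Python sketch. It is a reconstruction
under stated assumptions and not the program written for this work: the rule bodies
are inferred from the traces (rule 0: ID O3 followed by a recursively parsed rest;
rule 1: ID to E; rule 2: N to E; rule 3: a bracketed sub-expression parsed recursively;
rule 4: E O2 E to E; rule 5: E O1 E to E), the rules are tried strictly in order 0 to 5,
all function names are hypothetical, and the operator lost between 6 and 7 in the second
expression is assumed to be a minus sign.

```python
import re


def tokenize(text):
    """Map an expression to the token types used in the traces."""
    tokens = []
    for lexeme in re.findall(r"\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/=()]", text):
        if lexeme[0].isdigit():
            tokens.append("N")
        elif lexeme in "+-":
            tokens.append("O1")
        elif lexeme in "*/":
            tokens.append("O2")
        elif lexeme == "=":
            tokens.append("O3")
        elif lexeme in "()":
            tokens.append(lexeme)
        else:
            tokens.append("ID")
    return tokens


def apply_rule(tokens, lhs, rhs, direction):
    """Reduce every match of rhs to lhs, scanning from the left or from the right."""
    out, n = list(tokens), len(rhs)
    changed = True
    while changed:
        changed = False
        positions = range(len(out) - n + 1)
        if direction == "right":
            positions = reversed(positions)
        for i in positions:
            if out[i:i + n] == list(rhs):
                out[i:i + n] = [lhs]
                changed = True
                break
    return out


def reduce_brackets(tokens):
    """Rule 3: parse each innermost bracketed sub-expression by a recursive call."""
    out = list(tokens)
    while "(" in out and ")" in out:
        close = out.index(")")
        opens = [i for i in range(close) if out[i] == "("]
        if not opens or parse(out[opens[-1] + 1:close]) != ["E"]:
            break
        out[opens[-1]:close + 1] = ["E"]
    return out


def parse(tokens):
    out = list(tokens)
    if out[:2] == ["ID", "O3"]:                            # rule 0 (recursive)
        rest = parse(out[2:])
        return ["E"] if rest == ["E"] else out[:2] + rest
    out = apply_rule(out, "E", ("ID",), "right")           # rule 1
    out = apply_rule(out, "E", ("N",), "right")            # rule 2
    out = reduce_brackets(out)                             # rule 3 (recursive)
    out = apply_rule(out, "E", ("E", "O2", "E"), "left")   # rule 4
    out = apply_rule(out, "E", ("E", "O1", "E"), "left")   # rule 5
    return out


print(parse(tokenize("A = B + C * 5.3")))
print(parse(tokenize("3+4*(6-7)/5+3")))
```

Because rule 4 is tried before rule 5, the O2 operators are reduced before the O1
operators, and the left-to-right scan makes operators of equal order associate to the
left.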
9.2 Tokenization
Before looking into the chunking and parsing of Urdu sentences, the tokenization of
Urdu text is discussed. Tokenization is a trivial task for most languages, in which the
space character is used to separate two lexical items. Urdu has many lexical
entries that contain a space and, therefore, the space character alone cannot be used to
separate words. For example, consider the following sentences:
(289)
Taaeyr pankchar hao gayaa hay.
The tyre has been punctured.
(290)
meyN Taylee phaon xareedney key leeey gayaa thaa.
I went for buying a telephone (set).
The character sequences pankchar haonaa, Teylee phaon and key leeey
may be treated as single words, which makes tokenization difficult. Alternatively,
these may be composed at the syntactic level, which makes the syntactic rules more
complex. For example, verbal words such as pankchar haonaa and aenteezaar karnaa
may be composed at the syntactic level as N-V complex predicates, but handling them at
the syntactic level is not easy, because not all nouns can combine with all verbs. Such
words with spaces are normally listed in dictionaries (Feroz-ud-Din 2000) as single
words, and Urdu grammars (Mustafa 1973; Abdul-Haq 1991; Schmidt 1999) do not
present syntax rules to compose them.
The Unicode character set and the Urdu Zabta Takhti (UZT) character set (Afzal and
Hussain 2001) both contain two types of spaces. In the Unicode character set, the normal
space is represented by the hexadecimal value 0x0020, and the zero-width-non-joiner
space has the value 0x200C. In UZT, the second space is named hard-space
and is represented by the hexadecimal value 0x41. The function of both the
zero-width-non-joiner space and the hard-space is to represent a space inside a character
sequence that forms a single word. However, the currently available electronic Urdu text
uses only the normal space, for example the books written in the word processor Inpage,
the text of the newspaper Jang, various Urdu books and websites using Pakistan data
management systems Urdu word processor Urdu 98, and the Unicode based text at the BBC
Urdu news website. Such text requires pre-processing of its sentences before
tokenization. A simple algorithm for the tokenization of an Urdu sentence, which replaces
the soft-space (normal space) with the hard-space, is as follows:
0. given a sorted lexicon containing collocations (i.e., words with soft-spaces)
   such that the collocations having more spaces and greater length are on top
1. for each such collocation, do
2.    search the collocation in the sentence
3.    if found; replace normal-space with hard-space for this collocation
4. separate words in the sentence as tokens at normal-spaces
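The steps above can be sketched in Python as follows. This is a minimal sketch under
stated assumptions: the function name tokenize_urdu is hypothetical, the text is
romanized as in the examples, the Unicode zero-width-non-joiner (0x200C) stands in for
the hard-space, and a collocation is matched only at whitespace boundaries so that it
is not found inside a longer word.

```python
import re

HARD_SPACE = "\u200c"  # zero-width non-joiner, standing in for the hard-space


def tokenize_urdu(sentence, collocations):
    """Join known collocations with hard-spaces, then split at normal spaces."""
    # Step 0: collocations with more spaces and greater length come first.
    ordered = sorted(collocations, key=lambda c: (c.count(" "), len(c)), reverse=True)
    # Steps 1 to 3: replace the normal spaces inside every collocation found.
    for collocation in ordered:
        pattern = r"(?<!\S)" + re.escape(collocation) + r"(?!\S)"
        joined = collocation.replace(" ", HARD_SPACE)
        sentence = re.sub(pattern, lambda _match, j=joined: j, sentence)
    # Step 4: the remaining normal spaces separate the tokens.
    return [token for token in sentence.split(" ") if token]


tokens = tokenize_urdu(
    "meyN Taylee phaon xareedney key leeey gayaa thaa",
    ["key leeey", "Taylee phaon"],
)
print(len(tokens))
```

With this two-entry lexicon, the sentence of (290) is split into six tokens instead of
eight, because Taylee phaon and key leeey each survive as one token.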
9.3 Part of Speech (POS) Tagging
Case                                  Tag
ergative                              CM_ney
dative, accusative                    CM_kao
agentive, instrumental, locative,     CM_sey
Postpositions
Urdu postpositions are different from case-markers, because they do not mark a
basic grammatical relation and are not controlled by the verbal predicate. They
are adjuncts to the main sentence. Like case-markers, they always follow nouns or
noun-phrases. Therefore, they may be used to make a post-positional phrase chunk. As
shown in the list of post-positions given below, most of the post-positions are
composed of two or more basic units, because they contain the space character.
Therefore, the tokenization and tagging of post-positions must take care of the space
issue in order to achieve better chunking results.
Postposition        English Preposition        Tag
meyN                in, into, at (place)       PP
par                 on                         PP
sey                 from, by                   PP
tak                 up to, till                PP
key leeey           for                        PP
key aandar          within, contained in       PP
sey aoopar
key peechhay
key paas
key baAd
sey pahley
key saath
key mottaabeq
key motaAleq
key xelaaf
key sewwaa
kee ttaraH
kee ttaraf
key baarey meyN
kee jagah
key Aelaawah
key ttaor par
kee wajah sey
key ZareeAey sey
key sabab
key qareeb
key darmeyaan
sey baahar
key daoraan
kee xaatter
key saamney
sey Zeyaadah
Possession Markers
Urdu possession markers come between two nouns or noun-phrases and describe a
possessive relation. Therefore, whenever a possession marker is found in a text,
there is a high probability that nouns or noun-phrases surround it, and it may be
used to form a possessive noun-phrase chunk. Note that many postpositions contain
kee and key, which are treated differently from these possession markers. The
easiest way to handle this ambiguity is to tag the postpositions first and then tag
the remaining kaa, kee and key as possession markers. However, possessive phrase
chunks are made ahead of postpositional or case marker phrase chunks.
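The tagging order described above (postpositions first, then the remaining kaa, kee and key as possession markers) can be sketched as below. This is only an illustration under stated assumptions: the small `POSTPOSITIONS` set is a sample taken from the list in this section, the function name is hypothetical, and the tokens are assumed to have been produced by a tokenizer that already keeps multi-word postpositions together.

```python
# A small sample of the postpositions listed in this section (not the full list).
POSTPOSITIONS = {"meyN", "par", "tak", "key leeey", "key aandar",
                 "key baarey meyN", "kee ttaraf"}
POSSESSION_MARKERS = {"kaa", "kee", "key"}


def tag_postpositions_then_possession(tokens):
    """Tag PP tokens first; only untagged kaa/kee/key then become PM."""
    # Pass 1: postpositions, including multi-word ones kept as single tokens.
    tags = ["PP" if tok in POSTPOSITIONS else None for tok in tokens]
    # Pass 2: whatever kaa/kee/key is still untagged is a possession marker.
    for i, tok in enumerate(tokens):
        if tags[i] is None and tok in POSSESSION_MARKERS:
            tags[i] = "PM"
    return list(zip(tokens, tags))


tagged = tag_postpositions_then_possession(
    ["shaahed", "kaa", "beyTaa", "Teyksee", "meyN", "bayTh", "rahaa hay"]
)
```

Tokens left with the tag `None` are the ones that still have to be resolved by the other closed classes or by the open class lexicon.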
Possession Marker    English    Tag
kaa                  of         PM
kee                  of         PM
key                  of         PM
Conjunctions
Subordinating conjunctions introduce a dependent clause. They are mainly used to
break a large sentence apart into smaller clause chunks, which are subsequently
easier to parse.
Subordinating Conjunctions    English                        Tag
aalbatah                      however                        CJS
magar                         but                            CJS
leyken                        nevertheless                   CJS
leyhaaZaa                     therefore                      CJS
aes leeey                     consequently                   CJS
tao                           then                           CJS
tab hee                       after that                     CJS
pher bhee                     yet, despite, in spite of      CJS
keh                           that                           CJS
taa keh                       for the reason                 CJS
Haalaankeh                    in spite of this situation     CJS
jabkeh                        whereas                        CJS
keeonkeh                      because                        CJS
coordinate two phrases or clauses. These are, likewise, used to break a larger
sentence apart into smaller chunks, which are subsequently easier to parse.
Correlative Conjunctions (English Equivalent)    Tag
if / then                                        CJR1, CJR2
whoever / he, she, it                            CJR1, CJR2
whomever / he, she                               CJR1, CJR2
whom, whose / they                               CJR1, CJR2
because / therefore                              CJR1, CJR2
either / or                                      CJR1, CJR2
neither / nor                                    CJR1, CJR2
not only / but also                              CJR1, CJR2
whatever / the same                              CJR1, CJR2
whatever, as much the same                       CJR1, CJR2
where / there                                    CJR1, CJR2
wherever / there                                 CJR1, CJR2
when / then                                      CJR1, CJR2
until                                            CJR1, CJR2
whenever / then                                  CJR1, CJR2
Interjections
Interjection    Tag
shaabaash       IJ
shaandaar       IJ
aoo             IJ
aey             IJ
aooh            IJ
aof             IJ
waah            IJ
xoob            IJ
aarey           IJ
Pronouns
A word used in place of a noun is called a pronoun. The subject pronouns stand for
the three persons when they take the position of the subject. The subject pronoun
forms may appear in the nominative and ergative cases, i.e., they may be used with
the ergative marker ney or they may appear alone in the nominative case.
Subject Pronouns    English                     Tag
meyN                I                           PNS
ham                 we                          PNS
too                 you                         PNS
tom                 you                         PNS
aap                 you                         PNS
woh                 he, she, they, it (far)     PNS
yeh                 it (near)                   PNS
aes                 he, she, this               PNS
aos                 he, she, that               PNS
aen                 they, these                 PNS
aon                 they, those                 PNS
The object pronouns are used in place of the three persons when these appear as the
object. These pronoun forms cannot appear with other case markers, such as the
ergative or agentive markers, because they already bear dative or accusative case.
Object Pronouns
mojhey
hameyN, ham kao
tojhey
tomheyN, tom kao
aap kao
aesey, aes kao
aonheyN
English
me
us
you
you
you
him, her, it
him, her, it
them
them
Tag
PNO
PNO
PNO
PNO
PNO
PNO
PNO
PNO
PNO
The possessive pronouns are used in place of nouns to show a possessive
relationship. These pronouns must be followed by a noun or a noun-phrase to
complete the possessive relationship.
Possessive Pronouns
meyraa, meyree, meyrey
hamaaraa, hamaaree, hamaarey
teyraa, teyree, teyrey
tomhaaraa, tomhaaree, tomhaarey
aes kaa, aes kee, aes key
English
my, mine
our, ours
your, yours
your, yours
his, her, hers, its
his, her, hers, its
his, her, hers, its
his, her, hers, its
Tag
PNP
PNP
PNP
PNP
PNP
PNP
PNP
PNP
xaod
aapas
aeyk doosrey
English
myself, ourselves, himself, herself
myself, ourselves, himself, herself
themselves
one another
Tag
PNR
PNR
PNR
PNR
Pronoun             English                        Tag
sab                 all                            PNI
kaoee               anyone, anybody, anything      PNI
kochh               some, something, few           PNI
yahaan, wahaan      here, there                    PNI
aedhar, aodhar      here, there                    PNI
har aeyk            each, everyone, everybody      PNI
daonaoN             both                           PNI
kaee                many, several                  PNI
baaqee              others                         PNI
Negation Markers
The negation markers are used to make negative sentences. They behave like
adverbs, as they add negation to the meaning of the sentence.
Negation Markers    English    Tag
nah                 no         NM
naheeN              no, nil    NM
mat                 no         NM
kabhee naheeN       never      NM
The question markers are used to make interrogative sentences. These k-words
add a question to the meaning of the sentence.
Question Markers                English                 Tag
keyaa                           what                    QM
keeooN                          why                     QM
keysey                          how                     QM
kaon                            who                     QM
kab                             when                    QM
kahaaN                          where                   QM
kedhar                          where                   QM
kes jagah                       where                   QM
kes kaa, kes kee, kes key       whose                   QM
kes leeey                       for what                QM
kes waqt                        at what time            QM
kes ttaraf                      in which direction      QM
kes semmat                      in which direction      QM
Auxiliaries                     Tag
be (present), is, are           AUX
be (past), was, were            AUX
will, shall                     AUX
Perfective Aspect               APA
Perfective Aspect               APA
Progressive Aspect              APrA
Repetitive Aspect               ARA
Repetitive Aspect               ARA
Inceptive Aspect                AIA
Inceptive Aspect                AIA
Compulsive Mood                 ACoM
Capacitive Mood                 ACaM
Capacitive Mood                 ACaM
Suggestive Mood                 ASM
Declarative Mood (happen)       ADM
Permissive Mood                 APeM
Prohibitive Mood                APrM
The cardinal and ordinal numbers are open class words, and finite state
morphology may be used to generate them. These numbers follow adjectives
or noun phrases. However, names of days, names of weeks, and other time and date
nouns may be treated as a closed class of words. Similarly, country names, city
names and currencies may be made a closed class of words.
cardinal numbers          NC
ordinal numbers           NO

Title                     Tag
President                 TLE
President of Pakistan     TLE
Prime Minister            TLE
Verbal Morphemes
In Chapter 7, two proposals for composing the verb phrase were given: the verb
phrase, VP, may be composed by combining a verb form, V, with various
auxiliaries, AUX, or alternatively by combining a verb base, VB, with a verbal
morpheme, VM, which lumps all the auxiliaries into the actual morpheme. The VMs
may be treated as a closed class of words, because they can be attached to many
different VBs. A list of a few VMs is as follows:
taa hooN
tee hoon
tey hayN
tee hayN
taa hay
tee hay
tey hao
tee hao
ee
aa
eeN
ey
ee hay
aa hay
ee hayN
ey hayN
ee thaa
aa thee
ee theeN
ey thay
chokaa hooN
chokee hooN
chokey hayN
chokee hayN
chokey hao
chokee hao
chokaa hay
chokee hay
ooN gaa
ooN gee
eyN gey
eyN gee
ao gey
ao gee
ey gaa
ey gee
eyN gey
eyN gee
ao
ooN
9.4 Chunking
The morphologically closed word classes have been discussed in the previous
section. Case markers and possession markers are used to make the noun phrase
chunk, NP. Postpositions are used to make the postpositional phrase chunk, PPP.
The verbal morphemes and auxiliaries are used to make the verb phrase chunk, VP.
The conjunctions are used in recursive rules that break larger sentences apart
into smaller sentences. The interjections, negation markers and question k-words
are not directly useful in chunking; they are used as an adjunct phrase, AJP, in
the main sentence. Closed word classes are therefore quite useful as an aid for
chunking, owing to their linguistic characteristics.
The chunking scheme presented in this research utilizes the ordered context free
grammar (OCFG) presented in the earlier sections of this chapter. The recursive
chunking OCFG rules are listed in (291), while the non-recursive OCFG rules are
given in (292).
(291)   1.   S → S CJC S
        2.   S → S CJS S
        3.   S → CJR1 S CJR2 S
(292)   4.   NP → PNS
        6.   PPP → (N | NP) PP
        7.   NP → PNP (N | NP)
        13.  NP → (NP,) NP CJC NP
             V1 → V (AUX)*
             V2 → VB VM
The outline of the algorithm adopted for parsing, by chunking and then using the
ordered context free grammar, is as follows:
1. Tokenize the sentence into words.
2. Tag tokens, starting with the morphologically closed classes of words.
3. Tag the remaining tokens with morphologically open classes of words, using
linguistic guesses based on the already tagged closed class tokens.
4. Apply the chunking rules in the given order to make chunks.
5. Use the parsing rules on the tagged and chunked sentence to achieve a full
parse.
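Step 4 of the outline, applying chunking rules in a fixed order, can be sketched as below. This is an illustrative sketch rather than the thesis implementation. The rules for PNS, PNP, PP and VB VM follow rules quoted in this chapter; the possessive rule with PM and the final bare-noun rule are assumptions added for the example, and the rule order shown is hypothetical.

```python
# Each rule is (chunk label, pattern); every pattern position lists the accepted tags.
RULES = [
    ("NP", [{"PNS"}]),                          # NP  -> PNS
    ("NP", [{"PNP"}, {"N", "NP"}]),             # NP  -> PNP (N | NP)
    ("NP", [{"N", "NP"}, {"PM"}, {"N", "NP"}]),  # assumed possessive NP rule
    ("PPP", [{"N", "NP"}, {"PP"}]),             # PPP -> (N | NP) PP
    ("VP", [{"VB"}, {"VM"}]),                   # verb base + verbal morpheme
    ("NP", [{"N"}]),                            # assumed: a bare noun is an NP
]


def chunk(tags):
    """Apply each rule in order, scanning left to right, and return the chunk sequence."""
    seq = list(tags)
    for label, pattern in RULES:
        i = 0
        while i + len(pattern) <= len(seq):
            window = seq[i:i + len(pattern)]
            if all(tag in accepted for tag, accepted in zip(window, pattern)):
                # Replace the matched tags by the chunk label and retry at the
                # same position, so that chained matches are also reduced.
                seq[i:i + len(pattern)] = [label]
            else:
                i += 1
    return seq


# Tags of "woh aapnee behen key ghar jaa rahee hay"
chunks = chunk(["PNS", "PNP", "N", "PM", "N", "VB", "VM"])
```

Because the rules are tried strictly in list order, changing the order can change the chunks that are produced, which is the point of using an ordered grammar.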
woh aapnee behen key ghar jaa rahee hay.
She is going to her sister's house.

woh    aapnee    behen    key    ghar    jaa    rahee hay
Tagging with closed classes finds a pronoun (PNS) woh, a possessive
pronoun (PNP) aapnee, a possession marker (PM) key and a verb morpheme (VM)
rahee hay:
woh    aapnee    behen    key    ghar    jaa    rahee hay
PNS    PNP                PM                    VM

woh    aapnee    behen    key    ghar    jaa    rahee hay
PNS    PNP       N        PM     N       VB     VM
Chunking rules are applied next. The PNP is combined with the following noun
to form a noun phrase (NP) using rule 7 and then rule 8 is used to form another
possessive noun phrase. The VB and VM are combined to form verb phrase by using
rule 12. The result of chunking is:
woh    aapnee    behen    key    ghar    jaa    rahee hay
PNS    PNP       N        PM     N       VB     VM
[woh]  [[aapnee behen] key ghar]         [jaa rahee hay]
NP     NP                                VP
Figure 9.3: Parse Tree of Sentence woh aapnee behen key ghar jaa rahee hay
After chunking, only three tokens remain: NP, NP and VP. Parsing is now achieved
by using rule 20; this parse would have been difficult in the absence of
chunking. The resulting parse tree is shown in Figure 9.3.
(295) kamrey meyN takhtah seeah, meyz aaor korsee hay
      There is a blackboard, a table and a chair in the room.
Tagging with closed classes finds that only three tokens belong to these classes,
i.e., a coordinating conjunction (CJC) aaor, a postposition (PP) meyN and a verb
auxiliary (AUX) hay. The rest of the tokens are not found in the portion of the
lexicon representing closed classes:
kamrey   meyN   takhtah seeah   meyz   aaor   korsee   hay
         PP                            CJC             AUX
Tagging with open classes, using a knowledgeable guess, is done next. We know
that a PP must follow a noun, N, and that a conjunction must join a list of similar
tokens. When the sentence is short and the CJC is surrounded by a list of nouns,
breaking the sentence into smaller sentences is not appropriate, and in that
condition rule 13 is used. The lexicon search finds four nouns:
kamrey   meyN   takhtah seeah   meyz   aaor   korsee   hay
N        PP     N               N      CJC    N        AUX
Next, after tagging is finished, the chunking rules are applied. This yields one
PPP chunk and one NP chunk, by using rule 6 and rule 13, respectively, as shown:
[kamrey meyN]   [takhtah seeah, meyz aaor korsee]   hay
PPP             NP                                  AUX
After chunking, the resulting three tokens are two noun phrases (NP)s followed
by a verb auxiliary AUX. A parsing rule 19 is used to generate complete parse of the
sentence. Tree produced is shown in Figure 9.4.
[Figure 9.4: parse tree with S dominating PPP, NP and AUX]
This section shows how the parsing by chunking method can be used to parse
longer sentences taken from news websites, such as the newspaper Jang
(http://www.jang.net) and BBC (http://www.bbc.co.uk/urdu). A sample sentence
taken from the newspaper Jang is shown here:
The recursive chunking rule (S → S CJS S) separates this bigger sentence into two
smaller chunks at the subordinating conjunction, CJS, keh 'that'; the result is
two smaller sentences, as shown below:
S wazeer e aAzzam paakestaan shaokat Aazeez ney aenTarneyshnal karekeT
kaonsal kao yaqeen dahaanee karaaee hay
CJS keh
S aagar paakestaan kao bhaarat kee jagah aaee see see chaympeeanz Taraafee
kee meyzbaanee mel jaaey tao Hakoomat e paakestaan aaee see see kao teyks
meyN chhooT dey gee
The recursive chunking rule (S → CJR1 S CJR2 S) separates the second part of the
above sentence into two still smaller chunks at the pair of correlative
conjunctions, CJR1 aagar 'if' and CJR2 tao 'then'; the result is the smaller
chunks shown below:
S wazeer e aAzzam paakestaan shaokat Aazeez ney aenTarneyshnal karekeT
kaonsal kao yaqeen dahaanee karaaee hay
CJS keh
CJR1 aagar
S paakestaan kao bhaarat kee jagah aaee see see chaympeeanz Taraafee kee
meyzbaanee mel jaaey
CJR2 tao
S Hakoomat e paakestaan aaee see see kao teyks meyN chhooT dey gee
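The recursive splitting at conjunctions illustrated above can be sketched as a small recursive function. This is a simplified illustration under assumptions: the input is a list of (token, tag) pairs, only the tags CJS, CJR1 and CJR2 trigger a split, and coordinating conjunctions (CJC) are deliberately left alone. The function name and the short test sentence are hypothetical.

```python
SPLIT_TAGS = ("CJS", "CJR1", "CJR2")


def split_clauses(tagged):
    """Split a tagged sentence into S chunks at the first conjunction, recursively."""
    for i, (token, tag) in enumerate(tagged):
        if tag in SPLIT_TAGS:
            left = split_clauses(tagged[:i])
            right = split_clauses(tagged[i + 1:])
            return left + [(tag, token)] + right
    # No conjunction left: the remaining tokens form one smaller sentence.
    if not tagged:
        return []
    return [("S", " ".join(token for token, _ in tagged))]


example = [("woh", None), ("kahtaa hay", None), ("keh", "CJS"),
           ("aagar", "CJR1"), ("baaresh", None), ("tao", "CJR2"), ("ghar", None)]
clauses = split_clauses(example)
```

The output keeps the conjunctions as separate items between the S chunks, mirroring the layout of the listing above.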
Now, by applying the chunking rules for postpositional phrases, PPPs, noun
phrases, NPs, and verb phrases, VPs, the resultant chunks of the original sentence
are shown as follows:
CJS
CJR1
S
CJR2
S
NP
NP
The major limitation in adopting this algorithm for general parsing is the handling
of compound nouns containing spaces. In this research, compound nouns such as
aenTarneyshnal karekeT kaonsal, Hakoomat e paakestaan, aaee see see
chaympeeanz Taraafee, etc. have been included in the lexicon with hard spaces,
so that they are treated as single nouns. Similarly, the title wazeer e aAzzam
paakestaan has been treated as a single word in the lexicon.
9.7 Results and Analysis
Successful Parses
85%
75%
The method has been tested on the sentences given in Appendix C and on all
the example sentences given in this document. Please note that compound words are
manually added to the lexicon as single tokens. The given results may not be
considered precise until the method is tested on a standard large corpus of
sentences and a better tokenization algorithm is developed. For news website
sentences, manual tokenization of compound nouns and their inclusion in the
lexicon is performed before chunking and parsing are carried out.
The chunking method makes NP, PPP and VP chunks based on case markers,
possession markers, postpositions, and verbal auxiliaries and morphemes. It
utilizes broad POS categories divided into closed and open class words. The open
class words are collected in separate files to reduce the search space. Although
tagging closed class words ahead of open class words requires two passes over the
input token array, it results in a 50% improvement in efficiency, because the
search space is reduced by using separate files for closed class words, nouns,
verbs and adjectives. Moreover, closed class words provide useful hints; for
example, the predecessor of a case marker must be a noun, so only the noun file
needs to be searched. Finally, after the NP, PPP and VP chunks have been formed,
the remaining parsing rules are fewer and simpler.
The parsing method presented does not require the calculation and storage of a
parsing table, which is required in tabular parsing methods. Nor does it generate
all the possible parses of the given sentence, as some methods, e.g., the chart
parser, do. It needs to store only the n token objects in an array and the m rule
objects. Therefore, the space efficiency of this method may be considered good.
9.8 Conclusions
The language oriented parsing method presented in this chapter for Urdu uses
chunking, which exploits the linguistic characteristics of the morphologically
closed word classes in Urdu to make chunks. The simple tokenization algorithm
presented in this chapter relies on compound Urdu words with soft spaces being
manually included in the lexicon; a better tokenization algorithm needs to be
developed for the Urdu script. Tagging is initiated with the closed classes of
words, which not only reduces the search space but is also useful in guessing and
chunking the neighboring open class words through their linguistic
characteristics. Chunking results in a shallow parse of the sentence and reduces
the number of rules for the final parsing stage. Proper identification of NP, PPP
and VP phrases through chunking also reduces ambiguity for most of the sentences
containing case markers and postpositions. However, for declarative (or news) mood
sentences, lexical ambiguity is not resolved. Reduction of ambiguity in natural
language parsing results in more reliable machine translation.
The method generates only one parse tree for a given sentence; therefore,
lexically and syntactically ambiguous sentences, for which more than one parse is
acceptable, may not be handled by this method. Moreover, the method shows poor
results for verbal conjunctions and for sentences having long distance
dependencies. To improve its accuracy, it is suggested that LFG feature based
unification be carried out during parsing to ensure proper agreement.
Alternatively, some statistical technique may be adopted for the tagging and
chunking of Urdu text.
Chapter 10
CONCLUSIONS
10.1 Summary and Conclusions
This work models a computational grammar for the formation of Urdu words and
sentences. The frequently used constructions in Urdu have been investigated under
the framework of Lexical Functional Grammar (LFG), and proposals have been
presented for handling Urdu specific issues. The grammar formulation proposed in
this work can be utilized for many natural language processing applications, such
as grammar checking, machine translation, text summarization, text
categorization, information extraction, speech processing and knowledge
engineering.
The morphological analysis of verbs, nouns and adjectives has been performed
and implemented using the Xerox finite state lexicon compiler LEXC and the Xerox
finite state tool XFST. XFST is a morphological analysis tool, useful for
analyzing lexical data and morphological information, which builds a finite state
network usually referred to as a lexical transducer. The lexical transducer looks
up the surface morphological form of a word in a lexicon to find its lexical form,
and looks down a lexical form to give the corresponding surface form.
For the syntactic analysis, the most frequently used sentence constructions in
Urdu have been modeled. Lexical Functional Grammar (LFG) has mainly been used
for the mathematical formulation of Urdu grammar, and its implementation has
been carried out using the Xerox Linguistic Environment (XLE). XLE is a useful
tool for the development and testing of language grammars: it can incorporate
morphological analysis from XFST, and by implementing syntactic rules, sentences
are parsed into c-structures and f-structures. Some Urdu syntactic concepts have
also been modeled under Head-driven Phrase Structure Grammar (HPSG), which also
serves as a comparison between LFG and HPSG.
The Urdu verb has a rich morphology, and its verb forms can be divided into five
categories, i.e., infinitive, perfective, repetitive, subjunctive and imperative.
Urdu has three stem forms, named the root form, the causative form 1 and the
causative form 2. Each of these three stem forms, by the attachment of various
morphemes, results in 20 verb forms, making a total of 60 verb forms for a single
Urdu verb. A finite-state automaton has been presented in Chapter 3 to represent
these 60 forms. The attributes and corresponding value sets have been selected for
representing verbal information in the Urdu lexicon. As in most languages, these
attributes are person, gender, number, tense, aspect, mood, verb form, and honor
form. Compared with English, the honor form attribute for imperative verb forms
is additionally required in Urdu, and similarly the verb form, mood and aspect
attributes have some different values in Urdu.
Urdu noun and adjective morphology has been investigated, and the attributes
necessary to represent lexical information relating to nouns and adjectives have
been collected. Every noun in Urdu bears a gender attribute, which takes either a
feminine or a masculine value, unlike English, which has no such attribute for
inanimate nouns, and unlike German, which takes an additional neuter value.
However, only some nouns in Urdu have an overt gender morpheme; therefore, for
most Urdu nouns the gender attribute has to be adopted from traditional
dictionaries. Nouns have the nominative form if they appear without a case-marker
or post-position, the oblique form if they appear with a case-marker or
post-position, and the vocative form in subjunctive mood. Again, not all nouns
have a visible morpheme to distinguish the nominative, oblique and vocative
forms. The adjectives also have gender, number and form morphemes, which require
agreement with the noun. The attributes required to represent lexical information
related to various noun categories or characteristics, and the corresponding
values they take, have been collected. The semantic class of the noun, which tells
the type of noun, i.e., animate, instrument, location, etc., is also selected, and
is found useful in classifying different cases.
The review and implementation of algorithms for constructing a computational
lexicon has been carried out. Some hash functions have been implemented for
constructing a lexicon without morphological considerations. Similarly, some
deterministic finite-state automaton minimization algorithms have been
implemented to construct a lexicon using lexical transducers. A comparison between
the two approaches showed that the hash table implementation requires more memory
space; however, it has fast access time, requires less morphological knowledge and
is more dynamically adjustable. The lexical transducer based implementation
requires morphological analysis, but needs less memory space and also has fast
access time.
The formulation of the noun-phrase syntax in Urdu has been carried out. As
Urdu is a richly case-marked language, nouns are accompanied by various
case-markers and post-positions to form phrases that fill various grammatical
roles in the argument structure of the verb. To better differentiate the various
roles adopted by noun-phrases, a classification of case-markers and
post-positions has been proposed. This classification is based on differences in
modeling and conceptualization, such as whether a noun phrase should be handled
morphologically or syntactically, whether it should be
A roman script has also been proposed, which is used for the transcription of
Urdu sentences in this thesis. The characters of this roman script are selected in
such a way that computerized transfer of text between the Urdu script and this
roman script is possible in both directions. Care is also taken that the mapped
characters in these scripts are phonetically the same, or as close as possible.
10.2 Future Directions
A standard large corpus of Urdu text in Urdu script may be developed, containing
sentences that cover the various constructions in Urdu. The same corpus may be
made available in the roman script, so that automatic conversion from roman
script to Urdu script and vice versa may be tested. The corpus may be utilized for
automatic tagging, chunking and parsing applications, and for comparing and
evaluating various proposals for these applications. It may also be utilized for
automatic or semi-automatic extraction of world knowledge from the given text. The
same corpus may also be made bilingual, so that various statistics-based and
example-based machine translation studies may be made between Urdu and other
languages. Moreover, this bilingual parallel corpus may be utilized for machine
translation testing and validation studies, for example to evaluate and compare
two machine translation systems.
Moreover, text corpora taken from various sources, such as newspapers,
literary work, editorial work, older Urdu books, text books, Islamic books and TV
plays, may be developed and compared for the differences that may exist between
them. Using these corpora, a more systematic computational linguistic study may
be made, for example of the usage of the case markers and postpositions ney and
kao.
A specialized morphological analyzer for Urdu text, based on finite state
transducers, may be developed to cover various aspects of Urdu morphology. In this
thesis, only inflectional morphology has been studied. The Urdu morphological
analyzer may cover the basic verb, noun and adjective forms covered in this
thesis, as well as derivational morphology, such as the formation of nouns from
verbs, verbs from nouns, or adjectives from nouns. The work of (Kaplan and Kay
1994) may be utilized to cover irregular morphological constructions in Urdu. The
morphological analyzer may cover various other morphological conversions of nouns
to verbs, such as N-V complex predicate formation. It may also cover the
construction of compound nouns and adjectives. The morphological analyzer may be
built with an interface such that its output can be utilized by other modules,
e.g., by a parser or syntax analyzer.
The LFG based grammar implementation of syntax may be further improved by
studying and analyzing more sentence constructions from Urdu texts, and by
collecting more example data for particular constructions, e.g., more usages of
case markers and post-positions in Urdu. Many other sentence constructions in Urdu
that are not covered in the thesis may be studied, and the rules for them
incorporated in the syntax grammar; these may include conditional statements,
correlatives, complementizers, multi-gap constructions, anaphora resolution, etc.
The parsing algorithm presented, based on OCFG, needs further improvement.
The tokenization and tagging algorithms need enhancement. The chunking may be
improved by incorporating LFG based unification information, so that if the
unification fails, the parse is rejected. For robust testing of the parsing
algorithm based on Urdu chunking, a standard corpus of Urdu text would be useful.
Statistical chunking techniques may be implemented to validate the rule order and
the results based on OCFG. Alternatively, standard parsing techniques, like chart
parsing, may be employed along with specialized Urdu rules to eliminate unwanted
parse trees.
As discussed in Chapter 3, context free grammar (CFG) may be used to model
natural languages; however, it requires more part of speech (POS) categories as
well as more rules. To model the same linguistic phenomena, lexical functional
grammar (LFG) based modeling requires fewer POS categories and fewer rules, but
LFG has the overhead of attribute unification. A detailed time and space
complexity study may be made to compare implementations of natural language
grammars using CFG, LFG or HPSG.
The ideas of Urdu computational grammar developed for machine translation
may be utilized for various other Urdu NLP applications, such as, grammar checker,
text summarization, question-answer systems, expert systems, text categorization,
information extraction, intelligent search applications, speech processing and
knowledge engineering.
Appendix A
ROMAN SCRIPT FOR URDU LANGUAGE
Writing a language in Latin characters, when it normally uses another script, is
commonly referred to as roman script. Various roman script character sets for
representing Urdu and Hindi already exist, but with most of them it is difficult
to transfer text bi-directionally between the scripts by computer, especially
without a dictionary. In this appendix, a new character set is proposed so that
computational transfer of text is possible between the Urdu script and the
proposed roman script.
Urdu is written in Arabic-Persian script with some additional characters, while
Hindi is written in the Devanagari script. Otherwise, Urdu and Hindi have a common
syntactic structure, and most of the commonly used vocabulary is the same. For the
proposed roman script, the mapping is also given for Hindi; however, the transfer
of text between roman script and Hindi may require a small dictionary to
disambiguate some Urdu characters.
The characters are selected to be phonetically close to English characters, so
that an English reader may read the script easily; however, in some places it was
not possible to reduce the ambiguity in the script, especially for vowel sounds.
The following tables give the mapping of characters between the Urdu, Roman and
Hindi scripts.
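A character-level mapping of this kind can be applied mechanically in both directions. The sketch below is only an illustration: it covers a handful of consonants whose Urdu code points and roman hex codes appear in Table A.1, it leaves every other character unchanged, and it omits vowels entirely. Mapping the roman letter h back to U+06BE is an assumption made for the example, and the function names are hypothetical.

```python
# Urdu code point -> roman letter, for a subset of the consonants in Table A.1.
URDU_TO_ROMAN = {
    "\u0628": "b", "\u067e": "p", "\u062a": "t", "\u0679": "T",
    "\u062c": "j", "\u062f": "d", "\u0688": "D", "\u0631": "r",
    "\u0691": "R", "\u0633": "s", "\u0641": "f", "\u0642": "q",
    "\u06a9": "k", "\u06af": "g", "\u0644": "l", "\u0645": "m",
    "\u0646": "n",
    "\u06be": "h",  # assumption: aspiration mark written as a following h
}
ROMAN_TO_URDU = {roman: urdu for urdu, roman in URDU_TO_ROMAN.items()}


def to_roman(text):
    """Replace each mapped Urdu character by its roman letter; keep the rest."""
    return "".join(URDU_TO_ROMAN.get(ch, ch) for ch in text)


def to_urdu(text):
    """Inverse of to_roman for the characters covered by the mapping."""
    return "".join(ROMAN_TO_URDU.get(ch, ch) for ch in text)


aspirated_b = to_roman("\u0628\u06be")
```

A full converter would also need the ambiguous consonants and the position-dependent vowel rules of the later tables, which a one-character lookup cannot express.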
Table A.1: Mapping of Unambiguous Consonants
Urdu Unicode   Hindi Unicode   IPA Unicode   Roman   Hex        Dec
0628           092C            0062          b       62         98
0628           092D            0062+02B0     bh      62+68      98+104
067E           092A            0070          p       70         112
067E           092B            0070+02B0     ph      70+68      112+104
062A           0924            0074          t       74         116
062A           0925            0074+02B0     th      74+68      116+104
0679           091F            0288          T       54         84
0679           0920            0288+02B0     Th      54+68      84+104
062C           091C            02A4          j       6A         106
062C           091D            02A4+02B0     jh      6A+68      106+104
0686           091A            02A7          ch      63+68      99+104
0686           091B            02A7+02B0     chh     63+68+68   99+104+104
062E           0959            0078                  4B         75
062F           0926            0064          d       64         100
062F           0927            0064+02B0     dh      64+68      100+104
0688           0921            0256          D       44         68
0688           0922            0256+02B0     Dh      44+68      68+104
0631           0930            0072          r       72         114
0691           095C            027D          R       52         82
0691           095D            027D+02B0     Rh      52+68      82+104
0633           0938            0073          s       73         115
0634           0937            0283          sh      9A         154
063A           095A            0263          G       47         71
0641           095E            0066          f       66         102
0642           0958            0071          q       71         113
06A9           0915            006B          k       6B         107
06A9           0916            006B+02B0     kh      6B+68      107+104
06AF           0917            0261          g       67         103
06AF           0918            0261+02B0     gh      67+68      103+104
0644           0932            006C          l       6C         108
0645           092E            006D          m       6D         109
0646           0928            006E          n       6E         110
Urdu Unicode   Hindi Unicode   IPA Unicode   Roman   Hex        Dec
0648           0935            028B          w       77         119
06BE, 0647     0939            0266          h       68         104
06C1           0939            0266          h       68         104
0621, 0654
06CC           092F            006A          y       79         121
06D2           092F            006A          Y       59         89
Unicode
062B
Hindi
Unicode
0938+093C
IPA
Unicode
Roman
th
C
Hex
Dec
74+68 116+104
43
67
062D
0630
0939+093C
0127
48
72
095B
007A
5A
90
0632
095B
007A
7A
122
0698
095B
0292
0635
0938
0073
0636
z
zh
X
S
095B
007A
0637
0924
0074
tt
74+74 116+116
0638
095B
007A
7A+7A 122+122
0639
06BA
0901
zz
A
N
7A+68 122+104
58
88
53
83
4A
74
41
65
4E
78
Unicode
Hindi
Unicode
IPA
Unicode
Roman
Hex
Dec
0936
Sh
53+68
83+104
0283
091E
0272
nn
6E+6E
110+110
0923
0273
4E
78
0919
014B
ng
6E+67
110+103
0929
nNn
0933
0934
lLl
0931
rr
6E+4E+6E 110+78+110
4C
76
6C+4C+6C 108+76+108
72+72
114+114
Unicode
0627
064E
0622
+a
064E+0627
a +a
0627+0650
0650
+ a +a 0627+0650
+06CC
+ a a
0650+06CC
0626
+a
0627+064F
initial only
after
093E
0259
61
97
consonant
word
0906
0061
a
word
0907
0069
i
ae initial only 61+65 97+101
after
093F
0069
65
101
i
e
consonant
97+101+
word
0908
after
0940
ee
only
word
0909
Unicode
consonant
word
+
+a 0627+064F
090A
oo initial only
+0648
after
+
064F+0648
0942
oo consonant
word final
0624
0942
oo
only
+ a +a
090B
re
Hex
Dec
6F
111
6F+6F
111+111
6F+6F
111+111
6F+6F
111+111
72+65
114+101
+ a
+ a
+ a
+ a
+ a+a
+ a+a
a
a
a
Unicode
Hindi
Unicode
IPA
Unicode
Roman
Hex
Dec
0650+06D2
090F
ey
65+79 101+121
0650+06D2
0947
ey
65+79 101+121
064E+06D2
0910
ay
61+79 97+121
064E+06D2
0948
ay
61+79 97+121
064E+0648
0913
ao
61+6F 97+111
064E+0648
094B
ao
61+6F 97+111
0627+064E
+0648
0627+064E
+0648
0914
ao
61+6F 97+111
094C
ao
61+6F 97+111
064D
064B
064C
Appendix B
ALGORITHMS FOR WORD REPRESENTATION
In this Appendix, the algorithms for word insertion into a trie and for the
minimization of a DFA (Mihov; Daciuk 1998; Ciura and Deorowicz 2001), which are
used and implemented in Chapter 6, are given.
B.1 Algorithm for Word Insertion into a Trie
automaton, is as follows:
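As a hedged illustration of the kind of procedure this section names, word insertion into a trie can be sketched in Python as below. This is a generic textbook formulation, not the listing of the thesis; the class and function names are hypothetical.

```python
class TrieNode:
    """One state of the trie: outgoing transitions plus a final-state flag."""

    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_final = False  # True if a word ends in this state


def insert_word(root, word):
    """Follow the existing prefix of the word, then add states for the rest."""
    node = root
    for ch in word:
        # setdefault reuses an existing transition or creates a new state.
        node = node.children.setdefault(ch, TrieNode())
    node.is_final = True


def contains(root, word):
    """Return True only if the whole word was inserted earlier."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_final


lexicon = TrieNode()
for entry in ["ghar", "ghaRee", "behen"]:
    insert_word(lexicon, entry)
```

Words that share a prefix, such as ghar and ghaRee here, share the states for that prefix, which is what the minimization step later extends to shared suffixes.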
For a minimal acyclic DFSA, there can be more than one final state. Therefore,
final states are divided into two classes: (a) the terminal final state (TFS),
which has no out-going transitions, and of which there is only one in an
automaton; and (b) the intermediate final state (IFS), which can have out-going
as well as in-coming transitions, and of which there can be many in an automaton.
Each state is stored with a flag to tell whether it is a final state or not. The
algorithm to find the prefix of the word needs a slight modification for such
final states, and the algorithm to check for minimization is as follows.
Algorithm B.3.1:
Maintain a list of the recently traversed states for the current word while
finding the prefix and adding the rest of the word. The length of the list is
one greater than the length of the word.
FOR each state in the list, from last to first
    IF there is an equivalent state in the automaton
        Redirect the transitions coming into the current state to the
        equivalent state already in the automaton, and delete the current state.
    ELSE add the current state to the automaton.
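The last-to-first check for equivalent states can be sketched as follows. This is a simplified variant in the style of incremental minimal acyclic automaton construction, and it assumes that the words are inserted in sorted order; it is not the thesis implementation, and all names in it are hypothetical.

```python
class State:
    def __init__(self):
        self.final = False
        self.edges = {}  # character -> State

    def signature(self):
        # Two states are equivalent if they agree on finality and on their
        # outgoing transitions to (already minimized) target states.
        return (self.final, tuple(sorted((c, id(s)) for c, s in self.edges.items())))


def _minimize(path, register, down_to):
    """Check the states on the path from last to first, as in Algorithm B.3.1."""
    while len(path) > down_to:
        parent, ch, child = path.pop()
        sig = child.signature()
        if sig in register:
            parent.edges[ch] = register[sig]  # redirect to the equivalent state
        else:
            register[sig] = child             # keep the state in the automaton


def build_minimal_dfa(sorted_words):
    root, register, path, prev = State(), {}, [], ""
    for word in sorted_words:
        k = 0  # length of the common prefix with the previous word
        while k < min(len(word), len(prev)) and word[k] == prev[k]:
            k += 1
        _minimize(path, register, k)
        node = path[-1][2] if path else root
        for ch in word[k:]:
            nxt = State()
            node.edges[ch] = nxt
            path.append((node, ch, nxt))
            node = nxt
        node.final = True
        prev = word
    _minimize(path, register, 0)
    return root


def accepts(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.edges.values())
    return len(seen)


dfa = build_minimal_dfa(sorted(["tap", "taps", "top", "tops"]))
```

With unsorted input this simplified version would produce wrong results, because a state could be registered before all of its outgoing transitions are known; handling arbitrary insertion order requires extra steps that are not shown here.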
Appendix C
SAMPLE SENTENCES FOR PARSING
In this Appendix, the sentences used for the parsing by chunking algorithm
presented in Chapter 9 are listed.
C.1 Basic Sentences
This section presents the chunking of basic sentences. In the following list, each
sentence is presented on the left with its transliteration, and the corresponding
tokens and chunks are given on the right.
aab woh bohat xaosh hay
woh aapnee behan key ghar jaa rahee hay
Teyksee aa rahee hay
shaahed kaa beyTaa saRak key kenaarey khaRaa hay
shaahed kaa beyTaa aapnaa haath helaa rahaa hay
shaahed kaa beyTaa Teyksee meyN bayTh rahaa hay
Teyksee samanaabaad jaa rahee hay
[PNS] [N PM N] AUX => NP NP AUX
PNS N AUX => NP NP AUX
Adv PNS [Adj NP] AUX => AJP NP NP AUX
N N AUX => NP NP AUX
PNS [[PNP N] PM N] [VB VM] => NP NP V2
PNS [N PM N] [VB VM] => NP NP V2
[PNS PM N] [N PP] AUX => NP PPP AUX
[[PNS PM N] PM N] [N PP] AUX => NP PPP AUX
[Adj N] N [VB VM] => NP NP NP V2
N [VB VM] => NP V2
[Adj N] [VB VM] => NP V2
[N PM N] [N PM N] [VB VM] => NP NP V2
[N PM N] [PNP N] [VB VM] => NP NP V2
[N PM N] [N PP] [VB VM] => NP PPP V2
N N [VB VM] => NP NP V2
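The relation between a bracketed token line and its chunk line in the list above can be illustrated with a small Python sketch. It is only a heuristic written for this illustration, not the chunking algorithm of Chapter 10: it assumes that every top-level bracket group (or bare tag) becomes one chunk, that a group containing VB is a verb chunk (V2), that a group containing PP is a postpositional phrase (PPP), that a bare AUX stays AUX, and that everything else is an NP. The function names are hypothetical, and tags such as Adv are not covered.

```python
def top_level_groups(tagged):
    """Split a bracketed tag string into its top-level groups of tags."""
    groups, current, depth = [], [], 0
    for part in tagged.replace("[", " [ ").replace("]", " ] ").split():
        if part == "[":
            depth += 1
        elif part == "]":
            depth -= 1
            if depth == 0:               # a top-level bracket just closed
                groups.append(current)
                current = []
        else:
            current.append(part)
            if depth == 0:               # a bare tag is a group of its own
                groups.append(current)
                current = []
    return groups


def label(group):
    """Assumed labelling rule, chosen to match the listed examples."""
    if "VB" in group:
        return "V2"
    if "PP" in group:
        return "PPP"
    if group == ["AUX"]:
        return "AUX"
    return "NP"


def chunk_labels(tagged):
    return [label(group) for group in top_level_groups(tagged)]


example = chunk_labels("[N PM N] [N PP] [VB VM]")
```

For the token line "[N PM N] [N PP] [VB VM]" the sketch yields the chunks NP, PPP and V2, which agrees with the corresponding chunk line in the list.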
aab woh aes kee Teyksee meyN bayTh rahee hay
shaahed kaa beyTaa aapney maamooN key ghar pohanch gayaa hay
deewaaraoN par chaarT lagey hayN
hamaarey sakool meyN aeyk kholaa meydaan hay
ham baRey shaoq sey sakool jaatey hayN
aostaad hameyN peyaar aaor meHnat sey paRhaatey hayN
ham aapney aostaadooN kaa aadab kartey hayN
meyN ney ketaab xareedee hay
This section presents the chunking of sentences taken from news websites. These
sentences are relatively complex and require manual tokenization before chunking.
In the following list, a few sentences are presented on the left with their
transliteration, and their chunking is given on the right.
saomaalee kaoryaa ney jommaAah kao aeTmee boHraan key Hal keyleeey aamreekaa kao
Gayr mashroott moZaakraat key peyshkash kee hay aaor kahaa hay keh aagar aamreekaa
boHraan kaa Hal chaahtaa hay tao faoree moZaakraat karey

molk kee maAaashee Saorat e Haal kaa Zekar kartey hooey Sadar mosharaf ney kahaa
keh paakestaan kashkaol ley kar naheeN ghoom rahaa aaor aeqteSaadee ttaor par
aobhartaa hooaa molk hay

Sadar ney kahaa key woh Hakoomat sey kehtey hayN keh mehngaaee kanTraol karao aaor
qeematooN meyN kamee laaoo

mangal kao paakestaan Teem kao aos waqt shadeed dhachkah lagaa jab aal raaoonDar
Aabdaolrazaaq ghooTney kee aenjree kee wajah sey warlD kap sey baahar hao gaaey

DaakTarooN ney farekchar kee tashxeeS kee aaor Aabdaolrazaaq kao teen haftey aaraam
kaa mashwarah deeyaa hay jabkeh aenheeN fezeeao tharaapee keeleeey mazeed teen
haftey darkaar haoN gey

aeyk sawaal key jawaab meyN aenJamaamolHaq ney kahaa keh aen kao aopanneng
karney kaa mashwarah deyney waaley faJool baat kar rahey hayN aalbatah woh beyTeng
aarDar meyN aopar khelney kee kaoshesh Jaroor kareeN gey
S
CJS
S
CJC
S
S
CJS
S
CJS
S
CJC
S
S
CJS
CJR1
S
CJR2
S
aenJamaamolHaq ney kahaa keh meych meyN jeet baolar delwaatey hayN leyken saarey
gayaarah khelaaReeyooN kao aachchee kaarkardagee kaa moJaaharah karnaa hao gaa
S
CJS
S
S
CJC
S
CJS
S
S
CJS
S
CJS
S
S
CJS
S
CJS
S
Appendix D
CONSTITUENT STRUCTURES
In this Appendix, the constituent structures (c-structures) corresponding to the
feature structures (f-structures) shown in Chapter 7 are given to elaborate the
creation of the f-structures.
D.1 C-Structure for the F-Structure in Figure 7.5
S
NP
(↑ OBJ) = ↓
(↓ CASE) = nom
N
laRkaa
N
ketaab
(↑ PRED) = laRkaa
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = anim
V1
NP
(↑ SUBJ) = ↓
(↓ CASE) = nom
(↓ N-CON) =c anim
↑=↓
V
xareedey
(↑ PRED) = ketaab
(↑ NUM) = sg
(↑ GEN) = fem
(↓ N-CON) = thing
AUX
gaa
(↑ PRED) = xareednaa<S,O>
(↑ SUBJ NUM) =c sg
(↑ V-FORM) = subjunctive3
(↑ TENSE) = future
(↑ SUBJ NUM) =c sg
(↑ SUBJ GEN) =c masc
(↑ SUBJ CASE) =c nom
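How the annotations of such a c-structure give rise to an f-structure can be sketched in Python. This is a simplified illustration under stated assumptions, not the mechanism of any particular LFG system: f-structures are modelled as nested dictionaries, defining equations are applied by unification, and constraining equations (=c) are only checked afterwards. The function names unify and holds are hypothetical; the feature values are taken from the annotations for laRkaa ketaab xareedey gaa.

```python
def unify(a, b):
    """Merge two f-structures; fail if they give one attribute two values."""
    result = dict(a)
    for attr, value in b.items():
        if attr not in result:
            result[attr] = value
        elif isinstance(result[attr], dict) and isinstance(value, dict):
            result[attr] = unify(result[attr], value)
        elif result[attr] != value:
            raise ValueError(f"clash on {attr}: {result[attr]!r} vs {value!r}")
    return result


def holds(fs, path, expected):
    """Check a constraining equation: the value must already be present."""
    node = fs
    for attr in path:
        if not isinstance(node, dict) or attr not in node:
            return False
        node = node[attr]
    return node == expected


# Contributions of the nodes, read off the annotations of the tree.
subj = {"PRED": "laRkaa", "NUM": "sg", "GEN": "masc",
        "N-CON": "anim", "CASE": "nom"}
obj = {"PRED": "ketaab", "NUM": "sg", "GEN": "fem",
       "N-CON": "thing", "CASE": "nom"}
verb = {"PRED": "xareednaa<S,O>", "V-FORM": "subjunctive3"}
aux = {"TENSE": "future"}

sentence = {}
for contribution in ({"SUBJ": subj}, {"OBJ": obj}, verb, aux):
    sentence = unify(sentence, contribution)

# Constraining equations contributed by the auxiliary gaa.
constraints = [(("SUBJ", "NUM"), "sg"),
               (("SUBJ", "GEN"), "masc"),
               (("SUBJ", "CASE"), "nom")]
well_formed = all(holds(sentence, path, value) for path, value in constraints)
```

A feminine subject would leave the f-structure unifiable but would fail the check on SUBJ GEN, which is the difference between a defining and a constraining equation.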
S
NP
N
laRkey
(↑ PRED) = laRkaa
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = anim
NP
N
ketaab
CM
ney
(↑ CASE) = erg
(↑ N-CON) =c anim
(↑ V-FORM) =c perfective
(↑ V-VAL) ~= 1
(↑ SUBJ) = ↓
(↑ PRED) = ketaab
(↑ NUM) = sg
(↑ GEN) = fem
(↓ N-CON) = thing
V1
↑=↓
V
xareedee
(↑ PRED) = xareednaa<S,O>
(↑ OBJ NUM) =c sg
(↑ OBJ GEN) =c fem
(↑ V-FORM) = perfective
(↑ V-VAL) = 2
S
NP
NP
CM
ney
N
mayN
CM
kao
N
laRkey
(↑ PRED) = laRkaa
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = anim
(↑ PRED) = mayN
(↑ PERS) = 1
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = anim
V1
NP
↑=↓
N
ketaab
V
dee
(↑ PRED) = ketaab
(↑ NUM) = sg
(↑ GEN) = fem
(↓ N-CON) = thing
(↑ PRED) = deynaa<S,O,IO>
(↑ OBJ NUM) =c sg
(↑ OBJ GEN) =c fem
(↑ V-FORM) = perfective
(↑ V-VAL) = 3
(↑ CASE) = dat
(↑ N-CON) =c anim
(↑ V-VAL) =c 3
(↑ OBJgoal) = ↓
(↑ CASE) = erg
(↑ N-CON) =c anim
(↑ V-FORM) =c perfective
(↑ V-VAL) ~= 1
(↑ SUBJ) = ↓
S
NP
NP
V1
↑=↓
N
aakmal
CM
ney
CM
kao
N
kott-ey
(↑ PRED) = kottaa
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = anim
(↑ CASE) = acc
(↑ N-CON) =c anim
(↑ V-VAL) =c 2
(↑ OBJ) = ↓
V
maraa
(↑ PRED) = marnaa<S,O>
(↑ SUBJ CASE) =c erg
(↑ OBJ CASE) =c acc
(↑ V-FORM) = perfective
(↑ V-VAL) = 2
S
NP
V1
↑=↓
N
xatt
(↑ PRED) = xatt
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = thing
V
lekh-aa
(↑ PRED) = lekhnaa<S,O>
(↑ V-VAL) = 2
(↑ SUBJ CASE) =c agent
(↑ OBJ CASE) =c nom
AUX
gayaa
(↑ VOICE) = passive
(↑ V-VAL) =c 2
(↑ NUM) = sg
(↑ GEN) = masc
(↑ SUBJ N-CON) =c animate
(↑ OBJ N-CON) =c thing
S
NP
N
maaN
CM
ney
(↑ PRED) = maaN
(↑ NUM) = sg
(↑ GEN) = fem
(↓ N-CON) = anim
NP
N
bachchey
CM
kao
(↑ PRED) = bachchaa
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = anim
(↑ CASE) = erg
(↑ N-CON) =c anim
(↑ V-FORM) =c perfective
(↑ V-VAL) ~= 1
(↑ SUBJ) = ↓
NP
N
baap
NP
CM
sey
(↑ PRED) = khaanaa
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = thing
(↑ PRED) = baap
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = anim
(↑ CASE) = dat
(↑ N-CON) =c anim
(↑ V-VAL) =c 4
(↑ OBJ2) = ↓
N
khaanaa
(↑ CASE) = agent
(↑ N-CON) =c anim
(↑ V-VAL) =c 4
(↑ V-FORM2) =c caus2
(↑ SUBJ2) = ↓
V1
↑=↓
V
khelwaayaa
(↑ PRED) = khaanaa< >
(↑ OBJ NUM) =c sg
(↑ OBJ GEN) =c masc
(↑ V-FORM) = perfective
(↑ V-FORM2) = caus2
(↑ V-VAL) = 4
S
NP
N
maaN
CM
ney
(↑ PRED) = maaN
(↑ NUM) = sg
(↑ GEN) = fem
(↓ N-CON) = anim
NP
N
bachchey
CM
kao
(↑ PRED) = bachchaa
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = anim
(↑ CASE) = erg
(↑ N-CON) =c anim
(↑ V-FORM) =c perfective
(↑ V-VAL) ~= 1
(↑ SUBJ) = ↓
NP
N
chamchey
NP
CM
sey
(↑ PRED) = chamchey
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = instrument
(↑ CASE) = dat
(↑ N-CON) =c anim
(↑ V-VAL) =c 3
(↑ OBJ2) = ↓
N
khaanaa
(↑ PRED) = khaanaa
(↑ NUM) = sg
(↑ GEN) = masc
(↓ N-CON) = thing
(↑ CASE) = instrument
(↑ N-CON) =c instrument
(OBL ↑)
V1
↑=↓
V
khelaayaa
(↑ PRED) = khaanaa< >
(↑ OBJ NUM) =c sg
(↑ OBJ GEN) =c masc
(↑ V-FORM) = perfective
(↑ V-FORM2) = caus1
(↑ V-VAL) = 3
Appendix E
URDU GRAMMAR IMPLEMENTATION
The concepts of Urdu grammar proposed in this research work are implemented
using Xerox linguistic tools. The Xerox Finite State Tool (XFST) and the Finite State
Lexicon Compiler (LEXC) are used to implement Urdu morphology. The Xerox
Linguistic Environment (XLE) is used for syntactic analysis based on Lexical
Functional Grammar (LFG). XLE performs tokenization and morphological analysis
of the given sentences using the output of LEXC, and then uses the syntax rules to
parse the sentences into c-structures and f-structures. In this appendix, the
morphological and syntactic rules for Urdu grammar are listed in the format of the
Xerox linguistic tools.
E.1 Morphology Implementation
The Finite State Lexicon Compiler (LEXC) compiles an input source file into a
lexical transducer with the command compile-source filename. Once the finite state
transducer has been compiled successfully, it may be saved with the command
save-source filename, and various commands may then be used to analyse the
transducer. XFST also takes the input of LEXC and may be used to analyse and extend
the automata for irregular forms. The following is the source file that may be
input to LEXC.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Urdu Morphology
Multichar_Symbols
+Verb
+Root +Caus1 +Caus2 +Repeat +Perf +Inf +Subj +Impr ! Verb Forms
+Rude +Formal +Polite +Request ! 2nd Person Honorific Forms
+Noun
+Common +Proper ! Noun class
+Abstract +Group +Spatial +Temporal
+Instrument +Animate +Thing ! Noun Concept
+Mass +Count ! Noun Type
+Nominative +Oblique +Vocative ! Noun Form
+Arabic +Persian +Hindi +Turkish +English ! Base Language
+Adjective
+Aux
+Future +Present +Past ! Tense
+Perfect +Progress +Cont +Incept ! Aspect
+Comp +Decl ! Mood
! verb root forms
! from verb root form we go to stem form
! VerbRoot1 = verb roots that can generate Caus1 but not Caus2 Form
! VerbRoot2 = verb roots that can generate Caus1 and Caus2 Form
! VerbRoot3 = verb roots that can generate Caus2 but not Caus1 Form
! VerbRoot4 = verb roots that cannot generate Causative Forms
LEXICON Verbs
! VerbRoot1 = verb roots that can generate Caus1 but not Caus2 Form
dokhnaa+Verb:dokh VerbRoot1; ! pain
behnaa+Verb:beh VerbRoot1; ! flow
! VerbRoot2 = verb roots that can generate Caus1 and Caus2 Form
hansnaa+Verb:hans VerbRoot2; ! laugh
paRhnaa+Verb:paRh VerbRoot2; ! read
maangnaa+Verb:maang VerbRoot2; ! ask for
lekhnaa+Verb:lekh VerbRoot2; ! write
kaatnaa+Verb:kaat VerbRoot2; ! cut
pohanchnaa+Verb:pohanch VerbRoot2; ! reach
darnaa+Verb:dar VerbRoot2; ! fear
chalnaa+Verb:chal VerbRoot2; ! walk
deykhnaa+Verb:deykh VerbRoot2; ! look
chakhnaa+Verb:chakh VerbRoot2; ! taste
lagnaa+Verb:lag VerbRoot2; ! touch
! VerbRoot3 = verb roots that can generate Caus2 but not Caus1 Form
kholnaa+Verb:khol VerbRoot3; ! open
bolnaa+Verb:bol VerbRoot3; ! speak
nekalnaa+Verb:nekal VerbRoot3; ! come out
xareednaa+Verb:xareed VerbRoot3; ! buy
poochhnaa+Verb:poochh VerbRoot3; ! ask about, ask
! VerbRoot4
bataanaa+Verb:bataa VerbRoot4; ! tell
deynaa+Verb:d VerbRoot4a; ! give
leynaa+Verb:l VerbRoot4a; ! take
deynaa+Verb:dee VerbRoot4b; ! give
leynaa+Verb:lee VerbRoot4b; ! take
VerbStem; ! want
VerbStem; ! fell
VerbStem; ! cough
saonaa+Verb:sao VerbStem1; ! sleep
raonaa+Verb:rao VerbStem1; ! weep
khaanaa+Verb:khaa VerbStem1; ! eat
khelaanaa+Verb+Caus1:khelaa VerbStem1; ! eat causative form 1
khelwaanaa+Verb+Caus2:khelwaa VerbStem1; ! eat causative form 2
seenaa+Verb:see VerbStem2; ! sew
karnaa+Verb:kee VerbStem3a; ! do
karnaa+Verb:kar VerbStem3b; ! do
safar=karnaa+Verb:safar=kee VerbStem3a; ! travel
safar=karnaa+Verb:safar=kar VerbStem3b; ! travel
aenteZaar=karnaa+Verb:aenteZaar=kee VerbStem3a; ! wait
aenteZaar=karnaa+Verb:aenteZaar=kar VerbStem3b; ! wait
jaa+Verb+Perf:ga GendNumb3; ! go
jaanaa+Verb:jaa VerbStem5; ! go
LEXICON VerbRoot1 !
+Caus1:aa VerbStem1;
0:0 VerbStem1;
LEXICON VerbRoot2 !
+Caus1:aa VerbStem1;
+Caus2:waa VerbStem1;
0:0 VerbStem;
LEXICON VerbRoot3 !
+Caus2:waa VerbStem1;
0:0 VerbStem;
LEXICON VerbRoot4 !
0:0 VerbStem;
LEXICON VerbStem
+Root:0 #;
+Inf:n Infinitive;
+Repeat:t Repetitive;
+Perf:0 Perfective;
+Subj:0 Subjunctive;
+Impr:0 Imperative;
LEXICON Infinitive
0:0 GendNumb1;
LEXICON Repetitive
0:0 GendNumb2;
LEXICON Perfective
0:0 GendNumb2;
LEXICON Subjunctive
+1st+Sg:ooN #;
+1st+Pl:eyN #;
+2nd+Rude:0 #;
#;
#;
#;
#;
#;
LEXICON Imperative
+2nd+Rude:0 #;
+2nd+Formal:ao #;
+2nd+Polite:eyN #;
+2nd+Request:eeey #;
LEXICON VerbStem1
+Root:0 #;
+Inf:n Infinitive;
+Repeat:t Repetitive;
+Perf:0 Perfective2;
+Subj:0 Subjunctive;
+Impr:0 Imperative;
LEXICON Perfective2
0:0 GendNumb3;
LEXICON VerbStem2
+Root:0 #;
+Inf:n Infinitive;
+Repeat:t Repetitive;
+Perf:0 Perfective3;
+Subj:0 Subjunctive2;
+Impr:0 Imperative2;
LEXICON Perfective3
0:0 GendNumb4;
LEXICON Subjunctive2
+1st+Sg:ooN #;
+1st+Pl:eyN #;
+2nd+Rude:0 #;
+2nd+Formal:ao #;
+2nd+Polite:eyN #;
+2nd+Request:jeeey #;
+3rd+Sg:ey #;
+3rd+Pl:eyN #;
LEXICON Imperative2
+2nd+Rude:0 #;
+2nd+Formal:ao #;
+2nd+Polite:eyN #;
+2nd+Request:jeeey #;
LEXICON GendNumb1
+Masc:aa #;
+Fem:ee #;
+Obl:ey #;
LEXICON GendNumb2
+Sg+Masc:aa #;
+Sg+Fem:ee #;
+Pl+Masc:ey #;
+Pl+Fem:eeN #;
#;
#;
#;
#;
LEXICON GendNumb4
+Sg+Masc:aa #;
+Sg+Fem:0 #;
+Pl+Masc:ey #;
+Pl+Fem:N #;
LEXICON VerbStem3a
+Perf:0 Perfective3;
+Subj+2nd+Request:jeeey #;
+Impr+2nd+Request:jeeey #;
LEXICON VerbStem3b
+Root:0 #;
+Inf:n Infinitive;
+Repeat:t Repetitive;
+Subj+1st+Sg:ooN #;
+Subj+3rd+Sg:ey #;
+Subj+1st+Pl:eyN #;
+Subj+3rd+Pl:eyN #;
+Subj+2nd+Rude:0 #;
+Subj+2nd+Formal:ao #;
+Subj+2nd+Polite:eyN #;
+Impr+2nd+Rude:0 #;
+Impr+2nd+Formal:ao #;
+Impr+2nd+Polite:eyN #;
LEXICON VerbRoot4a
+Perf:0 GendNumb2a;
+Subj+1st+Sg:ooN #;
+Subj+3rd+Sg:ey #;
+Subj+1st+Pl:eyN #;
+Subj+3rd+Pl:eyN #;
+Subj+2nd+Formal:ao #;
+Subj+2nd+Polite:eyN #;
+Subj+2nd+Rude:ey #;
+Impr+2nd+Formal:ao #;
+Impr+2nd+Polite:eyN #;
+Impr+2nd+Rude:ey #;
LEXICON VerbRoot4b
+Inf:n GendNumb1;
+Repeat:t GendNumb2;
+Subj+2nd+Request:jeeey #;
+Impr+2nd+Request:jeeey #;
LEXICON VerbStem5
+Root:0 #;
+Inf:n GendNumb1;
+Repeat:t GendNumb2;
+Subj+1st+Sg:ooN #;
+Subj+3rd+Sg:ey #;
+Subj+1st+Pl:eyN #;
+Subj+3rd+Pl:eyN #;
+Subj+2nd+Rude:0 #;
+Subj+2nd+Formal:ao #;
#;
#;
#;
#;
#;
#;
#;
#;
#;
#;
LEXICON Nouns
! CAT 1a: animate nouns that end in suffix -aa that generates to fem
laRkaa+Noun+Animate:laRk N_Cat1a; ! boy
daadaa+Noun+Animate:daad N_Cat1a; ! grand father
kotaa+Noun+Animate:kot N_Cat1a; ! dog
bakraa+Noun+Animate:bakr N_Cat1a; ! (masc.) goat
! CAT 1b: animate nouns that end in suffix -ah that generates to fem
bachchah+Noun+Animate:bachch N_Cat1b; ! child
! CAT 2: inanimate masc nouns that do not end with a suffix
xatt+Noun+Thing+Masc+Count:xatt N_Cat2; ! letter
jahaaz+Noun+Thing+Masc+Count:jahaaz N_Cat2; ! plane
den+Noun+Temporal+Masc:den N_Cat2; ! day
sawaal+Noun+Abstract+Masc:sawaal N_Cat2; ! question
saRak+Noun+Spatial+Masc:saRak N_Cat2; ! road
teyl+Noun+Thing+Masc+Mass:teyl N_Cat2; ! oil
seyb+Noun+Thing+Sg+Masc+Count:seyb N_Cat2; ! apple
! CAT 3: inanimate fem nouns that do
ketaab+Noun+Thing+Fem+Count:ketaab
pencel+Noun+Instrument+Fem:pencel
baat+Noun+Abstract+Fem:baat
madad+Noun+Abstract+Fem:madad
SobaH+Noun+Temporal+Fem:SobaH
moddat+Noun+Temporal+Fem:moddat
suffix -ah
N_Cat4b; ! door
N_Cat4b; ! promise
N_Cat4b; ! room
N_Cat4b; !
N_Cat4b; ! spoon
N_Cat4b; ! spice,
! CAT 4b: animate masc nouns that end in suffix -ah and don't be fem
parendah+Noun+Animate:parend N_Cat4b; ! bird
! CAT 5: inanimate fem nouns that end in suffix -ee
cheThee+Noun+Thing+Fem+Count:cheTh N_Cat5; ! note
seeRhee+Noun+Spatial+Fem+Count:seeRh N_Cat5; ! stairs
N_Cat5; ! knife (smaller)
#;
#;
#;
#;
#;
#;
#;
#;
#;
#;
LEXICON N_Cat5
! Noun Category 5
+Sg+Fem:ee #;
+Sg+Fem+Oblique:ee #;
+Pl+Fem:eeaN #;
+Pl+Fem+Oblique:eeooN #;
LEXICON Adjectives
! CAT 1-a: Adjectives that end in a suffix -aa
achchaa+Adjective:achch Adj_Cat1a; ! good
neelaa+Adjective:neel Adj_Cat1a; ! blue
haraa+Adjective:har Adj_Cat1a; ! green
teesraa+Adjective:teesr Adj_Cat1a; ! third
kaRwaa+Adjective:kaRw Adj_Cat1a; ! harsh
Aux_Suffix1;
Aux_Suffix1;
! future tense
! perfective aspect
Aux_Suffix1;
Aux_Suffix1;
Aux_Suffix1;
Aux_Suffix1;
Aux_Suffix1;
!
!
!
!
!
thaa+Aux+Past:th Aux_Suffix2; ! past tense
hooaa+Aux+Decl:hoo Aux_Suffix2; ! declarative mood
hay+Aux+Present:h Aux_Suffix3; ! present tense
LEXICON Aux_Suffix1
+1st+Sg+Masc:aa #;
+2nd+Rude+Masc:aa #;
+3rd+Sg+Masc:aa #;
+1st+Pl+Masc:ey #;
+3rd+Pl+Masc:ey #;
+2nd+Formal+Masc:ey #;
+2nd+Polite+Masc:ey #;
+1st+Sg+Fem:ee #;
+2nd+Rude+Fem:ee #;
+3rd+Sg+Fem:ee #;
+1st+Pl+Fem:ee #;
+3rd+Pl+Fem:ee #;
+2nd+Formal+Fem:ee #;
+2nd+Polite+Fem:ee #;
LEXICON Aux_Suffix2
+1st+Sg+Masc:aa #;
+2nd+Rude+Masc:aa #;
+3rd+Sg+Masc:aa #;
+1st+Pl+Masc:ey #;
+3rd+Pl+Masc:ey #;
+2nd+Formal+Masc:ey #;
+2nd+Polite+Masc:ey #;
+1st+Sg+Fem:ee #;
+2nd+Rude+Fem:ee #;
+3rd+Sg+Fem:ee #;
+2nd+Formal+Fem:ee #;
+1st+Pl+Fem:eeN #;
+3rd+Pl+Fem:eeN #;
+2nd+Polite+Fem:eeN #;
LEXICON Aux_Suffix3
+1st+Sg+Masc:ooN #;
+1st+Sg+Fem:ooN #;
+1st+Pl+Masc:ayN #;
+1st+Pl+Fem:ayN #;
+3rd+Sg+Masc:ay #;
+3rd+Pl+Masc:ay #;
+3rd+Sg+Fem:ay #;
+3rd+Pl+Fem:ay #;
+2nd+Rude+Masc:ay #;
+2nd+Rude+Fem:ay #;
+2nd+Formal+Masc:ao #;
+2nd+Formal+Fem:ao #;
+2nd+Polite+Masc:ayN #;
+2nd+Polite+Fem:ayN #;
progressive aspect
continuing progressive
inceptive aspect
inceptive aspect
compulsive mood
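The way continuation classes in the LEXC source above chain an entry through successive sub-lexicons can be illustrated with a small Python sketch. It is not a replacement for LEXC and does not build a transducer; it merely enumerates the (analysis, surface) pairs that a few of the listed entries describe. The dictionary LEXICONS copies a handful of entries from the listing, the Root lexicon is an assumption made for the sketch, and the function name expand is hypothetical. An empty string stands for the LEXC symbol 0.

```python
# Each entry is (upper side, lower side, continuation class).
LEXICONS = {
    "Root": [("", "", "Verbs")],                      # assumed entry point
    "Verbs": [("hansnaa+Verb", "hans", "VerbRoot2")],
    "VerbRoot2": [("+Caus1", "aa", "VerbStem1"),
                  ("+Caus2", "waa", "VerbStem1"),
                  ("", "", "VerbStem")],
    "VerbStem": [("+Inf", "n", "Infinitive")],
    "VerbStem1": [("+Inf", "n", "Infinitive")],
    "Infinitive": [("", "", "GendNumb1")],
    "GendNumb1": [("+Masc", "aa", "#"),
                  ("+Fem", "ee", "#"),
                  ("+Obl", "ey", "#")],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (analysis, surface form) pair reachable from `lexicon`."""
    if lexicon == "#":                   # '#' ends the word
        yield upper, lower
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, upper + up, lower + low)


forms = dict(expand())
```

With these entries the analysis hansnaa+Verb+Caus1+Inf+Masc is paired with the surface form hansaanaa, and hansnaa+Verb+Inf+Fem with hansnee.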
N-S_BASE: ^ = ! ;
N-T_BASE: ^ = ! ;
N-F_BASE*: ^ = ! ,
C-F_BASE*: ^ = ! .
V -->
V-S_BASE: ^ = ! ;
V-T_BASE: ^ = ! ;
V-F_BASE*: ^ = ! ,
C-F_BASE*: ^ = ! .
paRhnaa
"read"
maangnaa
"ask for"
lekhnaa
kaatnaa
"laugh"
"write"
"cut"
pohanchnaa
ONLY. "reach"
darnaa
ONLY. "fear"
chalnaa
ONLY. "walk"
deykhnaa
chakhnaa
chakhaanaa
"look"
ONLY. "taste"
ONLY. "taste, caus1"
kholnaa
bolnaa
ONLY.
ONLY. "taste,
"touch"
"open"
"speak"
nekalnaa
xareednaa
ONLY. "buy"
poochhnaa
question"
chalaanaa
ONLY. "drive"
banaanaa
ONLY. "make"
nahaanaa
ONLY. "bathe"
ONLY. "come"
bolaanaa
ONLY. "call"
bolwaanaa
kehnaa
"say"
"want"
cahnaa
ONLY.
"stay"
gernaa
ONLY.
khaansnaa
ONLY. "cough"
khaanaa
ONLY. "eat"
khelaanaa
khelwaanaa
ONLY.
"fell"
"food"
saonaa
ONLY. "sleep"
raonaa
ONLY. "weep"
seenaa
ONLY. "sew"
deynaa
ONLY. "give"
leynaa
ONLY. "take"
karnaa
ONLY. "do"
safar=karnaa
ONLY. "travel"
aenteZaar=karnaa
ONLY. "wait"
jaanaa
ONLY.
ONLY.
ONLY.
+Masc
ONLY. "go"
ONLY.
+2nd
ONLY.
+3rd
ONLY.
+Past
ONLY.
ONLY.
ONLY.
+Repeat
ONLY.
+Perf
ONLY.
+Inf
+Obl
ONLY.
+Caus1
ONLY.
+Caus2
ONLY.
+Subj
+Impr
ONLY.
+Formal
ONLY.
+Polite
ONLY.
+Request
ONLY.
"Noun Features"
" ... Noun Class ... "
ONLY.
ONLY.
ONLY.
+Proper
+Common
+Mass
ONLY.
ONLY.
+Group
ONLY.
+Spatial
ONLY.
ONLY.
+Animate
+Thing
ONLY.
ONLY.
ONLY.
ONLY.
+Persian
+Hindi
+Turkish
+English
ONLY.
"Auxiliary Features"
" ... Aspect ... "
+Perfect
+Progress
+Cont
ONLY.
+Incept
ONLY.
ONLY.
+Comp
ONLY.
----
The Xerox Linguistics Environment (XLE) takes LFG-based syntax rules and
generates c-structures and f-structures. The following is the listing of the rules
used to analyse Urdu sentences.
PIEAS URDU CONFIG (1.0)
ROOTCAT S.
FILES Pronouns.lfg Templates.lfg VerbMorphemes.lfg .
LEXENTRIES (all all).
RULES (PIEAS URDU_SYN) (PIEAS URDU_MORPH).
TEMPLATES (PIEAS URDU).
MORPHOLOGY (PIEAS URDU).
FEATURES (PIEAS URDU).
GOVERNABLERELATIONS SUBJ SUBJ2 OBJ OBJ2 OBL-?+ COMP.
SEMANTICFUNCTIONS ADJUNCT TOPIC FOCUS.
EPSILON e.
----
PIEAS URDU FEATURES (1.0)
NUM: -> $ { sg pl }.
PERS: -> $ { 1 2 3 }.
GEND: -> $ { fem masc }.
CASE: -> $ { nom erg dat acc agent mutual instrument temporal
movement adverbial}.
N-SEM: -> << [ N-CONCEPT N-TYPE N-CLASS N-LANG H-MOOD DIST ].
N-TYPE: -> $ { count mass }.
N-CLASS: -> $ { common proper }.
N-LANG: -> $ { arabic persian hindi turkish english }.
N-CONCEPT: -> $ {abstract group spatial temporal instrument animate
thing}.
H-MOOD: -> $ { rude formal polite request }.
DIST: -> $ { near far }.
N-FORM: -> $ { nominative oblique vocative }.
V-FORM: -> $ { root perfective repetitive infinitive subjunctive
imperative }.
V-FORM2: -> $ { oblique causative1 causative2 }.
V-HFORM: -> $ { rude formal polite request }.
V-TENSE: -> $ { present past future }.
V-VAL: -> $ { 1 2 3 4 }.
TNS-ASP: -> << [ TENSE MOOD ASPECT ACTION VOICE].
TENSE: -> $ { present past future }.
MOOD: -> $ { indicative subjunctive permissive imperative }.
ASPECT.
ACTION.
VOICE: -> $ { active passive }.
P-CASE.
SPEC: -> $ { definite indefinite }.
----
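The purpose of such a feature declaration can be illustrated with a short Python sketch. It assumes that each declared feature is read as a closed set of legal atomic values and checks a nested f-structure against a few of the declarations listed above. The dictionary FEATURES and the function undeclared_values are hypothetical names introduced for this illustration and do not describe how XLE itself performs the check.

```python
# A few of the declarations from the FEATURES section, as closed value sets.
FEATURES = {
    "NUM": {"sg", "pl"},
    "PERS": {"1", "2", "3"},
    "GEND": {"fem", "masc"},
    "V-VAL": {"1", "2", "3", "4"},
    "TENSE": {"present", "past", "future"},
    "VOICE": {"active", "passive"},
}


def undeclared_values(fs, declared=FEATURES):
    """Return (feature, value) pairs whose value is not in the declared set."""
    problems = []
    for attr, value in fs.items():
        if isinstance(value, dict):
            # Descend into embedded f-structures such as SUBJ or OBJ.
            problems.extend(undeclared_values(value, declared))
        elif attr in declared and value not in declared[attr]:
            problems.append((attr, value))
    return problems


good = {"SUBJ": {"NUM": "sg", "GEND": "masc"}, "TENSE": "future"}
bad = {"SUBJ": {"NUM": "sg", "GEND": "neuter"}, "TENSE": "future"}
```

The first f-structure uses only declared values, whereas the second is reported because neuter is not among the declared values of GEND.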
| (^ OBJ2)=!
}.
OBLF = {
(^ OBL-sey-inst)=!
| (^ OBL-sey-temp)=!
| (^ SUBJ2)=!
"| (^ OBJ)=!"
}.
}
}.
S_perf --> NP: (^ SUBJ)=!
(^ CASE) =c nom
(^ N-SEM N-CONCEPT) =c animate;
NP: (^ OBJ)=!
(^ CASE) =c nom;
V: ^ = !
(^ V-FORM) =c root;
Aux: ^ = !
(^ TNS-ASP ASPECT) =c perfective
(^ SUBJ PERS) =c (! PERS)
(^ SUBJ NUM) =c (! NUM);
Aux: ^ = !
(^ SUBJ PERS) =c (! PERS)
(^ SUBJ NUM) =c (! NUM).
V2 --> VS VM.
" without finite state morphology, auxiliaries
lumped into morphmes which are being joined at syntax level "
V1 --> V : ^ = !; " with finite state morphology "
(Aux*: ^ = !).
NP --> (Adj:
(^ NUM) =c (! NUM)
(^ GEND) =c (! GEND)
) " optional Adjective "
{ N: ^ = !; "either a case marked noun"
Case: ^ = !
|
Pronoun: ^ = !; "or a case marked pronoun"
Case: ^ = !
|
N: (^ CASE) = nom "or an unmarked noun"
{ (OBJ ^) | (SUBJ ^)}
|
Pronoun: (^ CASE) = nom "or an unmarked pronoun"
{ (OBJ ^) | (SUBJ ^)}
}.
PP --> N: (^ OBJ) = !;
PostPos: ^ = !
(ADJUNCT ($) ^).
----
URDU LEX LEXICON (1.0)
" ~~~~~~~~~~~~~~~~~
Case Clitics
~~~~~~~~~~~~~~~~~ "
ney
Case * @(N-FORM
{
" either
@(N-CASE
(^ N-SEM
oblique)
'kao' marks a dative case "
dat)
N-CONCEPT) =c animate
| " or "
" marks an instrumental noun "
@(N-CASE instrument)
(^ N-SEM N-CONCEPT) =c instrument
(OBL-sey-inst ($) ^)
| " or "
" marks a temporal noun "
@(N-CASE temporal)
(^ N-SEM N-CONCEPT) =c temporal
(OBL-sey-temp ($) ^)
}.
"
ee
VM * @(V-TENSE past)
(^ OBJ NUM) = sg
(^ OBJ GEND) = fem
(^ SUBJ CASE) =c erg.
aa
VM * @(V-TENSE past)
{
(^ OBJ CASE) ~= acc
(^ OBJ GEND) = masc "object-verb agreement"
(^ OBJ NUM) = sg
|
(^ OBJ CASE) =c acc
(^ GEND) = masc
"default sg-masc agreement"
(^ NUM) = sg
}
(^ SUBJ CASE) =c erg.
ey
VM * @(V-TENSE past)
(^ OBJ NUM) = pl
(^ OBJ GEND) = masc
(^ SUBJ CASE) =c erg.
eeN
VM * @(V-TENSE past)
(^ OBJ NUM) = pl
(^ OBJ GEND) = fem
(^ SUBJ CASE) =c erg.
ee-hay
VM * @(V-TENSE present)
@(V-ASPECT perfect)
(^ OBJ NUM) = sg
(^ OBJ GEND) = fem
(^ SUBJ CASE) =c erg
(^ OBJ CASE) ~= acc.
aa-hay
VM * @(V-TENSE present)
@(V-ASPECT perfect)
(^ SUBJ CASE) =c erg
{ (^ OBJ NUM) = sg
| (^ OBJ NUM) = pl
(^ OBJ CASE) =c acc}
VM * @(V-TENSE present)
@(V-ASPECT perfect)
(^ OBJ NUM) = pl
(^ OBJ GEND) = fem
(^ OBJ CASE) ~= acc
(^ SUBJ CASE) =c erg.
ey-hayN
VM * @(V-TENSE present)
@(V-ASPECT perfect)
(^ OBJ NUM) = pl
(^ OBJ GEND) = masc
(^ SUBJ CASE) =c erg.
ee-thee
VM * @(V-TENSE past)
@(V-ASPECT perfect)
(^ OBJ NUM) = sg
(^ OBJ GEND) = fem
(^ SUBJ CASE) =c erg.
aa-thaa
VM * @(V-TENSE past)
@(V-ASPECT perfect)
(^ SUBJ CASE) =c erg
{ (^ OBJ NUM) = sg
| (^ OBJ NUM) = pl
(^ OBJ CASE) =c acc}
{ (^ OBJ GEND) = masc
| (^ OBJ GEND) = fem
(^ OBJ CASE) =c acc}.
VM * @(V-TENSE past)
@(V-ASPECT perfect)
(^ OBJ NUM) = pl
(^ OBJ GEND) = masc
(^ SUBJ CASE) =c erg.
ooN-gaa
VM * @(V-TENSE future)
(^ SUBJ PERS) = 1
(^ SUBJ NUM) = sg
(^ SUBJ GEND) = masc
(^ SUBJ CASE) =c nom.
ooN-gee
VM * @(V-TENSE future)
(^ SUBJ PERS) = 1
(^ SUBJ NUM) = sg
(^ SUBJ GEND) = fem
(^ SUBJ CASE) =c nom.
eyN-gey
VM * @(V-TENSE future)
{(^ SUBJ PERS) = 1
|(^ SUBJ PERS) = 3}
(^ SUBJ NUM) = pl
VM * @(V-TENSE future)
{ (^ SUBJ PERS) = 1
| (^ SUBJ PERS) = 3 }
(^ SUBJ NUM) = pl
(^ SUBJ GEND) = fem
(^ SUBJ CASE) =c nom.
ao-gey
VM * @(V-TENSE future)
(^ SUBJ PERS) = 2
{ (^ SUBJ NUM) = sg
| (^ SUBJ NUM) = pl }
(^ SUBJ GEND) = masc
(^ SUBJ CASE) =c nom.
ao-gee
VM * @(V-TENSE future)
(^ SUBJ PERS) = 2
{ (^ SUBJ NUM) = sg
| (^ SUBJ NUM) = pl }
(^ SUBJ GEND) = fem
(^ SUBJ N-SEM N-CONCEPT) =c animate
(^ SUBJ CASE) =c nom.
ey-gaa
VM * @(V-TENSE future)
(^ SUBJ PERS) = 3
(^ SUBJ NUM) = sg
(^ SUBJ GEND) = masc
(^ SUBJ N-SEM N-CONCEPT) =c animate
(^ SUBJ CASE) =c nom.
ey-gee
VM * @(V-TENSE future)
(^ SUBJ PERS) = 3
(^ SUBJ NUM) = sg
(^ SUBJ GEND) = fem
(^ SUBJ N-SEM N-CONCEPT) =c animate
(^ SUBJ CASE) =c nom.
aa-gayaa
VM * @(V-TENSE past)
@(V-VOICE passive)
@(V-ASPECT perfect)
(^ SUBJ CASE) =c agent.
baat-kee
@(V-SUBJ-OBJ khaanaa)
(^ SUBJ CASE) =c agent.
V2 * @(V-SUBJ-OBJ baat-karnaa)
(^ SUBJ CASE) =c erg
(^ OBJ CASE) =c mutual
(^ SUBJ N-SEM N-CONCEPT) =c animate
(^ OBJ N-SEM N-CONCEPT) =c animate.
waAdah-keeaa
V2 * @(V-SUBJ-OBJ waAdah-karnaa)
(^ SUBJ CASE) =c erg
(^ OBJ CASE) =c mutual
(^ SUBJ N-SEM N-CONCEPT) =c animate
(^ OBJ N-SEM N-CONCEPT) =c animate.
sawaal-poochhaa
V2 * @(V-SUBJ-OBJ sawaal-poochhnaa)
(^ SUBJ CASE) =c erg
(^ OBJ CASE) =c mutual
(^ SUBJ N-SEM N-CONCEPT) =c animate
(^ OBJ N-SEM N-CONCEPT) =c animate.
aap
aes
Det
* (^ SPEC) = definite
@(DIST near).
aos
Det
* (^ SPEC) = definite
@(DIST far).
yeh
aos
woh
aenhooN
aonhooN
aen
---" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Post Positions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
meyN
sey_pp
" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Verb Stems (without FSM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ "
xareed VS * @(V-SUBJ-OBJ xareednaa).
maar VS * @(V-SUBJ-OBJ maarnaa)
{ (^ OBJ CASE) =c acc
| (^ OBJ CASE) =c nom }.
lekh VS * @(V-SUBJ-OBJ lekhnaa).
" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nouns (without FSM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ "
skool N * @(PRED %stem). " school "
jaldee N * @(PRED %stem). " hurriedly "
shaoq N * @(PRED %stem). " interest "
tawajah N * @(PRED %stem). " concentration "
sardee N * @(PRED %stem). " cold "
----
REFERENCES
Abdul-Haq, M. (1991). Qwaed-e-Urdu. New Delhi, Anjuman Taraqi-e-Urdu.
Abney, S., Ed. (1991). Parsing by chunks. Principle-Based Parsing, Kluwer Academic
Publishers.
Afzal, M. and S. Hussain (2001). Urdu Computing Standards: Urdu Zabta Takhti
(UZT) 1.01. IEEE International Multitopic Conference INMIC 2001, Lahore,
LUMS.
Aho, A. V., R. Sethi, et al. (1986). Compilers: Principles, Techniques, and Tools.
Boston, MA, USA, Addison-Wesley Longman Publishing Co., Inc.
Arnold, D., L. Balkan, et al. (1994). Machine Translation: An Introductory Guide.
London, NCC Blackwell.
Arsenault, P. E. (2002). Toward an HPSG Account of Case in Hindi, University of
Hyderabad.
Beaven, J. L. (1992). Shake and Bake Machine Translation. Proceedings of the 14th
conference on Computational linguistics, Nantes, France, Association for
Computational Linguistics.
Beesley, K. R. and L. Karttunen (2003). Finite State Morphology, CSLI Publications.
Bhatt, R. and D. Embick (2003). Causative Derivations in Hindi. Austin.
Bod, R., R. Scha, et al. (2003). Data-Oriented Parsing. California, CSLI Publications.
Bresnan, J., Ed. (1982). The Mental Representation of Grammatical Relations, MIT
Press.
Bresnan, J. (2001). Lexical-Functional Syntax. Oxford, Blackwell Publishers.
Bresnan, J. (2001). Lexical Functional Syntax. Surrey, Blackwell.
Buchholz, S., J. Veenstra, et al. (1999). Cascaded Grammatical Relation Assignment.
EMNLP/VLC-99, University of Maryland, USA.
Butt, M. (1995). The Structure of Complex Predicates in Urdu. Stanford, California,
CSLI Publications.
Butt, M. (2003). The Morpheme That Wouldn't Go Away. London.
Butt, M. (2003). Tense and Aspect in Urdu. Paris.
INDEX
Aspect
inceptive, 153
perfective, 148
progressive, 150
repetitive, 151
Case
accusative, 108
agentive, 112
dative, 106
ergative, 102
infinitive, 120
instrumental, 116
participant, 114
temporal, 118
travel, 117
Causative Verbs, 125
Head-driven Phrase Structure Grammar, 20, 43, 91
lexical entries, 45
sign, 43
valance, 48
Lexical Functional Grammar, 20, 23, 28
a-structure, 29
c-structure, 30
f-structure, 31
Mapping Theory, 91
Mood
capacitive, 164
compulsive, 166
declarative, 154
imperative, 162
permissive, 159
presumptive, 166
prohibitive, 161
subjunctive, 167
suggestive, 165
Morphology
derivational, 53
inflectional, 52
morphemes, 52
root, 53
stem, 53
Noun
case, 75
form, 74
forms, 93
gender, 73
HPSG, 98
morphology, 76
number, 74
phrase structure, 97
types, 71
Passive Voice, 113
Possession Markers, 122
Thematic Roles, 91
Verb
agreement, 136
aspect, 147
coordination, 168
ditransitive, 55
intransitive, 54
mood, 153
tense, 138
transitive, 54
transitivity, 53
valancy, 54
Verb Aspect, 69
Verb Form, 55
causative, 56
imperative, 62
infinitive, 58
perfective, 60
repetitive, 59
root, 55
stem, 56
subjunctive, 62
Verb Mood, 69