

Learning Morphological Disambiguation Rules for Turkish

Deniz Yuret and Ferhan Türe
Dept. of Computer Engineering, Koç University, İstanbul, Turkey
[email protected]  [email protected]

Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 328–334, New York, June 2006. © 2006 Association for Computational Linguistics

Abstract

In this paper, we present a rule based model for morphological disambiguation of Turkish. The rules are generated by a novel decision list learning algorithm using supervised training. Morphological ambiguity (e.g. lives = live+s or life+s) is a challenging problem for agglutinative languages like Turkish where close to half of the words in running text are morphologically ambiguous. Furthermore, it is possible for a word to take an unlimited number of suffixes, therefore the number of possible morphological tags is unlimited. We attempted to cope with these problems by training a separate model for each of the 126 morphological features recognized by the morphological analyzer. The resulting decision lists independently vote on each of the potential parses of a word and the final parse is selected based on our confidence on these votes. The accuracy of our model (96%) is slightly above the best previously reported results which use statistical models. For comparison, when we train a single decision list on full tags instead of using separate models on each feature we get 91% accuracy.

1 Introduction

Morphological disambiguation is the task of selecting the correct morphological parse for a given word in a given context. The possible parses of a word are generated by a morphological analyzer. In Turkish, close to half the words in running text are morphologically ambiguous. Below is a typical word, "masalı", with three possible parses.

masal+Noun+A3sg+Pnon+Acc (= the story)
masal+Noun+A3sg+P3sg+Nom (= his story)
masa+Noun+A3sg+Pnon+NomˆDB+Adj+With (= with tables)

Table 1: Three parses of the word "masalı"

The first two parses start with the same root, masal (= story, fable), but the interpretation of the following +ı suffix is the Accusative marker in one case, and third person possessive agreement in the other. The third parse starts with a different root, masa (= table), followed by a derivational suffix +lı (= with) which turns the noun into an adjective. The symbol ˆDB represents a derivational boundary and splits the parse into chunks called inflectional groups (IGs).¹

We will use the term feature to refer to individual morphological features like +Acc and +With; the term IG to refer to groups of features split by derivational boundaries (ˆDB); and the term tag to refer to the sequence of IGs following the root.

¹ See (Oflazer et al., 1999) for a detailed description of the morphological features used in this paper.
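
To make this terminology concrete, the sketch below splits a parse string of the form shown in Table 1 into its root, IGs, and features. The function and the exact string handling (splitting on the boundary marker, written here as ^DB) are our own illustration, not the analyzer's actual interface.

def split_parse(parse):
    """Split a morphological parse string into (root, IGs, features).

    Assumes the format of Table 1: features joined by '+' and derivational
    boundaries marked with '^DB'. Illustrative only.
    """
    chunks = parse.split("^DB")            # one chunk per inflectional group
    first = chunks[0].split("+")
    root, igs = first[0], [first[1:]]      # the root is the first token
    igs += [c.lstrip("+").split("+") for c in chunks[1:]]
    features = [f for ig in igs for f in ig]
    return root, igs, features

root, igs, feats = split_parse("masa+Noun+A3sg+Pnon+Nom^DB+Adj+With")
# root == "masa"; igs == [["Noun", "A3sg", "Pnon", "Nom"], ["Adj", "With"]]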

Morphological disambiguation is a useful first step for higher level analysis of any language, but it is especially critical for agglutinative languages like Turkish, Czech, Hungarian, and Finnish. These languages have a relatively free constituent order, and syntactic relations are partly determined by morphological features. Many applications including syntactic parsing, word sense disambiguation, text to speech synthesis and spelling correction depend on accurate analyses of words.

An important qualitative difference between part of speech tagging in English and morphological disambiguation in an agglutinative language like Turkish is the number of possible tags that can be assigned to a word. Typical English tag sets include less than a hundred tag types representing syntactic and morphological information. The number of potential morphological tags in Turkish is theoretically unlimited. We have observed more than ten thousand tag types in our training corpus of a million words. The high number of possible tags poses a data sparseness challenge for the typical machine learning approach, somewhat akin to what we observe in word sense disambiguation.

One way out of this dilemma could be to ignore the detailed morphological structure of the word and focus on determining only the major and minor parts of speech. However, (Oflazer et al., 1999) observes that the modifier words in Turkish can have dependencies to any one of the inflectional groups of a derived word. For example, in "mavi masalı oda" (= the room with a blue table) the adjective "mavi" (= blue) modifies the noun root "masa" (= table) even though the final part of speech of "masalı" is an adjective. Therefore, the final part of speech and inflection of a word do not carry sufficient information for the identification of the syntactic dependencies it is involved in. One needs the full morphological analysis.

Our approach to the data sparseness problem is to consider each morphological feature separately. Even though the number of potential tags is unlimited, the number of morphological features is small: the Turkish morphological analyzer we use (Oflazer, 1994) produces tags that consist of 126 unique features. For each unique feature f, we take the subset of the training data in which one of the parses for each instance contains f. We then split this subset into positive and negative examples depending on whether the correct parse contains the feature f. These examples are used to learn rules using the Greedy Prepend Algorithm (GPA), a novel decision list learner.

To predict the tag of an unknown word, first the morphological analyzer is used to generate all its possible parses. The decision lists are then used to predict the presence or absence of each of the features contained in the candidate parses. The results are probabilistically combined, taking into account the accuracy of each decision list, to select the best parse. The resulting tagging accuracy is 96% on a hand tagged test set.
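
The sketch below lays out this prediction pipeline in schematic Python. The analyzer, the per-feature decision lists, and the scoring function are placeholders for the components described in Sections 3 and 4; the names and signatures are our own illustration, not an actual implementation.

def disambiguate(window, analyzer, dlists, score):
    """Pick the best parse for the middle word of a five word window.

    window   : list of five surface words centered on the target
    analyzer : word -> list of candidate parses (each a set of features)
    dlists   : feature -> trained decision list; called on the window,
               returns True/False for "feature present"
    score    : (parse, votes) -> combined score (see Section 4.4)
    """
    target = window[len(window) // 2]
    candidates = analyzer(target)              # all possible parses
    features = set().union(*candidates)        # the features to be voted on
    votes = {f: dlists[f](window) for f in features if f in dlists}
    return max(candidates, key=lambda parse: score(parse, votes))
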
A more direct approach would be to train a single decision list using the full tags as the target classification. Given a word in context, such a decision list assigns a complete morphological tag instead of predicting individual morphological features. As such, it does not need the output of a morphological analyzer and should be considered a tagger rather than a disambiguator. For comparison, such a decision list was built, and its accuracy was determined to be 91% on the same test set.

The main reason we chose to work with decision lists and the GPA algorithm is their robustness to irrelevant or redundant features. The input to the decision lists includes the suffixes of all possible lengths and character type information within a five word window. Each instance ends up with 40 attributes on average, which are highly redundant and mostly irrelevant. GPA is able to sort out the relevant features automatically and build a fairly accurate model. Our experiments with Naive Bayes resulted in significantly worse performance. Typical statistical approaches include the tags of the previous words as inputs in the model. GPA was able to deliver good performance without using the previous tags as inputs, because it was able to extract equivalent information implicit in the surface attributes. Finally, unlike most statistical approaches, the resulting models of GPA are human readable and open to interpretation, as Section 3.1 illustrates.

The next section will review related work. Section 3 introduces decision lists and the GPA training algorithm. Section 4 presents the experiments and the results.

2 Related Work

There is a large body of work on morphological disambiguation and part of speech tagging using a variety of rule-based and statistical approaches. In the rule-based approach a large number of hand crafted rules are used to select the correct morphological parse or POS tag of a given word in a given context (Karlsson et al., 1995; Oflazer and Tür, 1997). In the statistical approach a hand tagged corpus is used to train a probabilistic model which is then used to select the best tags in unseen text (Church, 1988; Hakkani-Tür et al., 2002). Examples of statistical and machine learning approaches that have been used for tagging include transformation based learning (Brill, 1995), memory based learning (Daelemans et al., 1996), and maximum entropy models (Ratnaparkhi, 1996). It is also possible to train statistical models using unlabeled data with the expectation maximization algorithm (Cutting et al., 1992). Van Halteren (1999) gives a comprehensive overview of syntactic word-class tagging.

Previous work on morphological disambiguation of inflectional or agglutinative languages includes unsupervised learning for Hebrew (Levinger et al., 1995), maximum entropy modeling for Czech (Hajič and Hladká, 1998), a combination of statistical and rule-based disambiguation methods for Basque (Ezeiza et al., 1998), and transformation based tagging for Hungarian (Megyesi, 1999).

Early work on Turkish used a constraint-based approach with hand crafted rules (Oflazer and Kuruöz, 1994). A purely statistical morphological disambiguation model was recently introduced (Hakkani-Tür et al., 2002). To counter the data sparseness problem the morphological parses are split across their derivational boundaries and certain independence assumptions are made in the prediction of each inflectional group.

A combination of three ideas makes our approach unique in the field: (1) the use of decision lists and a novel learning algorithm that combines statistical and rule based techniques, (2) the treatment of each individual feature separately to address the data sparseness problem, and (3) the lack of dependence on previous tags, relying on surface attributes alone.

3 Decision Lists

We introduce a new method for morphological disambiguation based on decision lists. A decision list is an ordered list of rules where each rule consists of a pattern and a classification (Rivest, 1987). In our application the pattern specifies the surface attributes of the words surrounding the target, such as suffixes and character types (e.g. upper vs. lower case, use of punctuation, digits). The classification indicates the presence or absence of a morphological feature for the center word.

3.1 A Sample Decision List

We will explain the rules and their patterns using the sample decision list in Table 2, trained to identify the feature +Det (determiner).

Rule  Class  Pattern
1     1      W=˜çok R1=+DA
2     1      L1=˜pek
3     0      W=+AzI
4     0      W=˜çok
5     1      –

Table 2: A five rule decision list for +Det

The value in the class column is 1 if word W should have a +Det feature and 0 otherwise. The pattern column describes the required attributes of the words surrounding the target word for the rule to match. The last (default) rule has no pattern, matches every instance, and assigns them +Det. This default rule captures the behavior of the majority of the training instances, which had +Det in their correct parse. Rule 4 indicates a common exception: the frequently used word "çok" (meaning very) should not be assigned +Det by default, since "çok" can also be used as an adjective, an adverb, or a postposition. Rule 1 introduces an exception to rule 4: if the right neighbor R1 ends with the suffix +DA (the locative suffix) then "çok" should receive +Det. The meanings of the various symbols used in the patterns are given in Table 3.

W       target word               A  [ae]
L1, L2  left neighbors            I  [ıiuü]
R1, R2  right neighbors           D  [dt]
==      exact match               B  [bp]
=˜      case insensitive match    C  [cç]
=+      is a suffix of            K  [kgğ]

Table 3: Symbols used in the rule patterns. Capital letters on the right represent character groups useful in identifying phonetic variations of certain suffixes, e.g. the locative suffix +DA can surface as +de, +da, +te, or +ta depending on the root word ending.

When the decision list is applied to a window of words, the rules are tried in order from the most specific (rule 1) to the most general (rule 5). The first rule that matches is used to predict the classification of the center word. The last rule acts as a catch-all; if none of the other rules have matched, this rule assigns the instance a default classification. For example, the five rule decision list given above classifies the middle word in "pek çok alanda" (matches rule 1) and "pek çok insan" (matches rule 2) as +Det, but "insan çok daha" (matches rule 4) as not +Det.
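
To illustrate how such a list is applied, the sketch below encodes the five rules of Table 2 as (class, predicate) pairs over a three word window and returns the class of the first matching rule. The predicates are hand-written approximations of the pattern language (the character groups of Table 3 are expanded explicitly) and are meant as an illustration, not the actual matcher.

def ends_with_DA(word):
    # stand-in for the pattern R1=+DA: the locative suffix +DA can
    # surface as +da, +de, +ta, or +te (character groups D and A, Table 3)
    return word.lower().endswith(("da", "de", "ta", "te"))

# Rules from Table 2, most specific first. The window is (L1, W, R1);
# class 1 means "assign +Det to the center word W".
RULES = [
    (1, lambda L1, W, R1: W.lower() == "çok" and ends_with_DA(R1)),   # rule 1
    (1, lambda L1, W, R1: L1.lower() == "pek"),                       # rule 2
    (0, lambda L1, W, R1: W.lower().endswith(                         # rule 3: W=+AzI
        tuple(a + "z" + i for a in "ae" for i in "ıiuü"))),
    (0, lambda L1, W, R1: W.lower() == "çok"),                        # rule 4
    (1, lambda L1, W, R1: True),                                      # rule 5: default
]

def predict_det(L1, W, R1):
    """Return 1 if the center word W should receive +Det, else 0."""
    for cls, matches in RULES:
        if matches(L1, W, R1):
            return cls                # the first matching rule decides

assert predict_det("pek", "çok", "alanda") == 1    # matches rule 1
assert predict_det("pek", "çok", "insan") == 1     # matches rule 2
assert predict_det("insan", "çok", "daha") == 0    # matches rule 4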

One way to interpret a decision list is as a sequence of if-then-else constructs familiar from programming languages. Another way is to see the last rule as the default classification, the previous rule as specifying a set of exceptions to the default, the rule before that as specifying exceptions to these exceptions, and so on.

3.2 The Greedy Prepend Algorithm (GPA)

To learn a decision list from a given set of training examples the general approach is to start with a default rule or an empty decision list and keep adding the best rule to cover the unclassified or misclassified examples. The new rules can be added to the end of the list (Clark and Niblett, 1989), the front of the list (Webb and Brkic, 1993), or other positions (Newlands and Webb, 2004). Other design decisions include the criteria used to select the "best rule" and how to search for it.

The Greedy Prepend Algorithm (GPA) is a variant of the PREPEND algorithm (Webb and Brkic, 1993). It starts with a default rule that matches all instances and classifies them using the most common class in the training data. Then it keeps prepending the rule with the maximum gain to the front of the growing decision list until no further improvement can be made. The algorithm can be described as follows:

GPA(data)
  dlist ← NIL
  default-class ← MOST-COMMON-CLASS(data)
  rule ← [if TRUE then default-class]
  while GAIN(rule, dlist, data) > 0
      do dlist ← prepend(rule, dlist)
         rule ← MAX-GAIN-RULE(dlist, data)
  return dlist

The gain of a candidate rule in GPA is defined as the increase in the number of correctly classified instances in the training set as a result of prepending the rule to the existing decision list. This is in contrast with the original PREPEND algorithm, which uses the less direct Laplace preference function (Webb and Brkic, 1993; Clark and Boswell, 1991).

To find the next rule with the maximum gain, GPA uses a heuristic search algorithm. Candidate rules are generated by adding a single new attribute to the pattern of each rule already in the decision list. The candidate with the maximum gain is prepended to the decision list and the process is repeated until no more positive gain rules can be found. Note that if the best possible rule has more than one extra attribute compared to the existing rules in the decision list, a suboptimal rule will be selected. The original PREPEND uses an admissible search algorithm, OPUS, which is guaranteed to find the best possible candidate (Webb, 1995), but we found OPUS to be too slow to be practical for a problem of this scale.

We picked GPA for the morphological disambiguation problem because we find it to be fast and fairly robust to the existence of irrelevant or redundant attributes. The average training instance has 40 attributes describing the suffixes of all possible lengths and character type information in a five word window. Most of this information is redundant or irrelevant to the problem at hand. The number of distinct attributes is on the order of the number of distinct word-forms in the training set. Nevertheless GPA is able to process a million training instances for each of the 126 unique morphological features and produce a model with state of the art accuracy in about two hours on a regular desktop PC.²

² Pentium 4 CPU 2.40GHz.
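
The following Python rendering of the algorithm is a sketch under simplifying assumptions: instances are (attribute set, class) pairs, a rule pattern is the set of attributes it requires, and candidates are generated by the one-attribute extension scheme described above. It follows the published pseudocode but is not the authors' implementation.

from collections import Counter

def classify(dlist, attrs):
    """Return the class of the first rule whose pattern is a subset of attrs."""
    for pattern, cls in dlist:
        if pattern <= attrs:
            return cls
    return None

def gain(rule, dlist, data):
    """Extra instances classified correctly when rule is prepended to dlist."""
    before = sum(classify(dlist, attrs) == cls for attrs, cls in data)
    after = sum(classify([rule] + dlist, attrs) == cls for attrs, cls in data)
    return after - before

def gpa(data, classes=(0, 1)):
    """Greedy Prepend Algorithm over (frozenset_of_attributes, class) pairs."""
    default = Counter(cls for _, cls in data).most_common(1)[0][0]
    dlist = [(frozenset(), default)]           # default rule matches everything
    attributes = set().union(*(attrs for attrs, _ in data))
    while True:
        # candidates: one new attribute added to an existing rule's pattern
        candidates = [(pattern | {a}, cls)
                      for pattern, _ in dlist
                      for a in attributes
                      for cls in classes]
        best = max(candidates, key=lambda rule: gain(rule, dlist, data))
        if gain(best, dlist, data) <= 0:
            return dlist
        dlist = [best] + dlist                 # prepend the maximum gain rule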

4 Experiments and Results

In this section we present the details of the data, the training and testing procedures, the surface attributes used, and the accuracy results.

4.1 Training Data

Our training data consists of about 1 million words of semi-automatically disambiguated Turkish news text. For each one of the 126 unique morphological features, we used the subset of the training data in which instances have the given feature in at least one of their generated parses. We then split this subset into positive and negative examples depending on whether the correct parse contains the given feature. A decision list specific to that feature is created using GPA based on these examples.

Some relevant statistics for the training data are given in Table 4.

documents          2383
sentences          50673
tokens             948404
parses             1.76 per token
IGs                1.33 per parse
features           3.29 per IG
unique tokens      111467
unique tags        11084
unique IGs         2440
unique features    126
ambiguous tokens   399223 (42.1%)

Table 4: Statistics for the training data

4.2 Input Attributes

Once the training data is selected for a particular morphological feature, each instance is represented by surface attributes of five words centered around the target word. We have tried larger window sizes but no significant improvement was observed. The attributes computed for each word in the window consist of the following:

1. The exact word string (e.g. W==Ali'nin).

2. The lowercase version (e.g. W=˜ali'nin). Note: all digits are replaced by 0's at this stage.

3. All suffixes of the lowercase version (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.). Note: certain characters are replaced with capital letters representing the character groups given in Table 3. These groups help the algorithm recognize different forms of a suffix created by the phonetic rules of Turkish: for example the locative suffix +DA can surface as +de, +da, +te, or +ta depending on the ending of the root word.

4. Attributes indicating the types of characters at various positions of the word (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST).

Each training instance is represented by 40 attributes on average. The GPA procedure is responsible for picking the attributes that are relevant to the decision. No dictionary information is required or used, therefore the models are fairly robust to unknown words. One potentially useful source of attributes is the tags assigned to previous words, which we plan to experiment with in future work.
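
As an illustration of attribute types 2–4, the sketch below computes the surface attributes of a single word. The attribute names and the character-group substitution are reconstructed from the description above and Table 3, and are not necessarily the exact encoding used in the experiments.

import re

# Character groups from Table 3, applied when generating suffix attributes.
GROUPS = {**{c: "A" for c in "ae"}, **{c: "I" for c in "ıiuü"},
          **{c: "D" for c in "dt"}, **{c: "B" for c in "bp"},
          **{c: "C" for c in "cç"}, **{c: "K" for c in "kgğ"}}

def word_attributes(word):
    """Illustrative surface attributes of one word (attribute types 2-4)."""
    attrs = []
    lower = re.sub(r"\d", "0", word.lower())     # digits are replaced by 0's
    attrs.append("W=~" + lower)                  # 2. the lowercase version
    grouped = "".join(GROUPS.get(c, c) for c in lower)
    for i in range(1, len(grouped)):             # 3. suffixes of all lengths
        attrs.append("W=+" + grouped[i:])
    # 4. character types at various positions of the word
    attrs.append("W=UPPER-FIRST" if word[0].isupper() else "W=LOWER-FIRST")
    if "'" in word[1:-1]:
        attrs.append("W=APOS-MID")
    attrs.append("W=LOWER-LAST" if word[-1].islower() else "W=UPPER-LAST")
    return attrs

# word_attributes("Ali'nin") yields W=~ali'nin, W=+lI'nIn, ..., W=+nIn, W=+In,
# W=+n, W=UPPER-FIRST, W=APOS-MID, W=LOWER-LAST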

4.3 The Decision Lists

At the conclusion of the training, 126 decision lists are produced of the form given in Table 2. The number of rules in each decision list ranges from 1 to 6145. The longer decision lists are typically for part of speech features, e.g. distinguishing nouns from adjectives, and contain rules specific to lexical items. The average number of rules is 266. To get an estimate of the accuracy of each decision list, we split the one million word data into training, validation, and test portions using the ratio 4:1:1. The training set accuracy of the decision lists is consistently above 98%. The test set accuracies of the 126 decision lists range from 80% to 100% with the average at 95%. Table 5 gives the six worst features, with test set accuracy below 89%; these are the most difficult to disambiguate.

87.89%  +Acquire  To acquire (noun)
86.18%  +PCIns    Postposition subcat.
85.11%  +Fut      Future tense
84.08%  +P3pl     3rd plural possessive
80.79%  +Neces    Must
79.81%  +Become   To become (noun)

Table 5: The six features with the worst test set accuracy.

4.4 Correct Tag Selection

To evaluate the candidate tags, we need to combine the results of the decision lists. We assume that the presence or absence of each feature is an independent event with a probability determined by the test set accuracy of the corresponding decision list. For example, if the +P3pl decision list predicts YES, we assume that the +P3pl feature is present with probability 0.8408 (see Table 5). If the +Fut decision list predicts NO, we assume the +Fut feature is present with probability 1 − 0.8511 = 0.1489. To avoid zero probabilities we cap the test set accuracies at 99%.

Each candidate tag indicates the presence of certain features and the absence of others. The probability of the tag being correct under our independence assumption is the product of the probabilities for the presence and absence of each of the 126 features as determined by our decision lists. For efficiency, one can neglect the features that are absent from all the candidate tags because their contribution will not affect the comparison.
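
A small sketch of this combination step, with the 0.99 cap mentioned above; the function and variable names are our own illustration of the scoring rule, not code from the system.

def tag_score(tag_features, predictions, accuracies, cap=0.99):
    """Score of a candidate tag under the independence assumption.

    tag_features : set of features the candidate tag contains
    predictions  : feature -> True/False, the decision list votes
    accuracies   : feature -> test set accuracy of that feature's decision list
    """
    score = 1.0
    for feature, predicted_present in predictions.items():
        acc = min(accuracies[feature], cap)       # avoid zero probabilities
        p_present = acc if predicted_present else 1.0 - acc
        # multiply by P(present) if the tag contains the feature, else P(absent)
        score *= p_present if feature in tag_features else 1.0 - p_present
    return score

# With the figures quoted above: a tag that contains +P3pl but not +Fut,
# where the +P3pl list predicts YES and the +Fut list predicts NO.
score = tag_score({"+P3pl"},
                  {"+P3pl": True, "+Fut": False},
                  {"+P3pl": 0.8408, "+Fut": 0.8511})
# score == 0.8408 * (1 - 0.1489)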

4.5 Results

The final evaluation of the model was performed on a test data set of 958 instances. The possible parses for each instance were generated by the morphological analyzer and the correct one was picked manually. 40% of the instances were ambiguous, and these had 3.9 parses on average. The disambiguation accuracy of our model was 95.82%. The 95% confidence interval for the accuracy is [0.9457, 0.9708].

An analysis of the mistakes in the test data shows that at least some of them are due to incorrect tags in our training data. The training data was semi-automatically generated and thus contained some errors. Based on hand evaluation of the differences between the training data tags and the GPA generated tags, we estimate the accuracy of the training data to be below 95%. We ran two further experiments to see if we could improve on the initial results.

In our first experiment we used our original model to re-tag the training data. The re-tagged training data was used to construct a new model. The resulting accuracy on the test set increased to 96.03%, not a statistically significant improvement.

In our second experiment we used only unambiguous instances for training. Decision list training requires negative examples, so we selected random unambiguous instances as positive and negative examples for each feature. The accuracy of the resulting model on the test set was 82.57%. The problem with selecting unambiguous instances is that certain common disambiguation decisions are never represented during training. More careful selection of negative examples and a sophisticated bootstrapping mechanism may still make this approach workable.

Finally, we decided to see if our decision lists could be used for tagging rather than disambiguation, i.e. given a word in context, decide on the full tag without the help of a morphological analyzer. Even though the number of possible tags is unlimited, the most frequent 1000 tags cover about 99% of the instances. A single decision list trained with the full tags was able to achieve 91.23% accuracy using 10000 rules. This is a promising result and will be explored further in future work.

5 Contributions

We have presented an automated approach to learning morphological disambiguation rules for Turkish using a novel decision list induction algorithm, GPA. The only input to the rules is the surface attributes of a five word window. The approach can be generalized to other agglutinative languages which share the common challenge of a large number of potential tags. Our approach for resolving the data sparseness problem caused by the large number of tags is to generate a separate model for each morphological feature. The predictions for individual features are probabilistically combined based on the accuracy of each model to select the best tag. We were able to achieve an accuracy around 96% using this approach.

Acknowledgments

We would like to thank Kemal Oflazer of Sabancı University for providing us with the Turkish morphological analyzer, training and testing data for disambiguation, and valuable feedback.

References

Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

Church, K. W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136–143.

Clark, P. and Boswell, R. (1991). Rule induction with CN2: Some recent improvements. In Kodratoff, Y., editor, Machine Learning – Proceedings of the Fifth European Conference (EWSL-91), pages 151–163, Berlin. Springer-Verlag.

Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3:261–283.

Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P. (1992). A practical part-of-speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, pages 133–140.

Daelemans, W. et al. (1996). MBT: A memory-based part of speech tagger-generator. In Ejerhed, E. and Dagan, I., editors, Proceedings of the Fourth Workshop on Very Large Corpora, pages 14–27.

Ezeiza, N. et al. (1998). Combining stochastic and rule-based methods for disambiguation in agglutinative languages. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (COLING/ACL98), pages 379–384.

Hajič, J. and Hladká, B. (1998). Tagging inflective languages: Prediction of morphological categories for a rich, structured tagset. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (COLING/ACL98), pages 483–490, Montreal, Canada.

Hakkani-Tür, D. Z., Oflazer, K., and Tür, G. (2002). Statistical morphological disambiguation for agglutinative languages. Computers and the Humanities, 36:381–410.

Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A. (1995). Constraint Grammar – A Language Independent System for Parsing Unrestricted Text. Mouton de Gruyter.

Levinger, M., Ornan, U., and Itai, A. (1995). Learning morpho-lexical probabilities from an untagged corpus with an application to Hebrew. Computational Linguistics, 21(3):383–404.

Megyesi, B. (1999). Improving Brill's POS tagger for an agglutinative language. In Fung, P. and Zhou, J., editors, Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 275–284, College Park, Maryland, USA.

Newlands, D. and Webb, G. I. (2004). Alternative strategies for decision list construction. In Proceedings of the Fourth Data Mining Conference (DM IV 03), pages 265–273.

Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing, 9(2):137–148.

Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. In Proceedings of the Workshop on Linguistically Interpreted Corpora, EACL 99, Bergen, Norway.

Oflazer, K. and Kuruöz, İ. (1994). Tagging and morphological disambiguation of Turkish text. In Proceedings of the 4th Applied Natural Language Processing Conference, pages 144–149. ACL.

Oflazer, K. and Tür, G. (1997). Morphological disambiguation by voting constraints. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL97, EACL97), Madrid, Spain.

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Rivest, R. L. (1987). Learning decision lists. Machine Learning, 2:229–246.

van Halteren, H., editor (1999). Syntactic Wordclass Tagging. Text, Speech and Language Technology. Kluwer Academic Publishers.

Webb, G. I. (1995). OPUS: An efficient admissible algorithm for unordered search. JAIR, 3:431–465.

Webb, G. I. and Brkic, N. (1993). Learning decision lists by prepending inferred rules. In Proceedings of the AI '93 Workshop on Machine Learning and Hybrid Systems, pages 6–10, Melbourne.


