A Hybrid Model For Part-of-Speech Tagging and Its Application To Bengali
Sandipan Dandapat, Sudeshna Sarkar, Anupam Basu
Ami bhAta khAi (I rice eat) → PRP NN VB
Ami khAi bhAta (I eat rice) → PRP VB NN
bhAta Ami khAi (Rice I eat) → NN PRP VB
bhAta khAi Ami (Rice eat I) → NN VB PRP
khAi Ami bhAta (Eat I rice) → VB PRP NN
khAi bhAta Ami (Eat rice I) → VB NN PRP

Part-of-speech tagging using linguistic rules is a difficult problem for such a free word order language. An HMM can capture the language model from the perspective of POS tagging.

We are considering 40 different tags for POS tagging. A POS tagger is an essential tool for the design and development of Natural Language Processing applications. A major problem in NLP is word sense disambiguation. A larger tag set increases the ambiguity problem, but it also reduces the parsing complexity. An important task in natural language processing is parsing: given a POS-tagged sentence, local word groups are easier to identify if we have a large number of tags. A large tag set also facilitates shallow parsing. Our goal is to achieve high accuracy using a large tag set.

III. BACKGROUND WORK

Different approaches have been used for part-of-speech tagging. Some previous work focused on rule-based, linguistically motivated part-of-speech tagging, such as the work of Brill (1992, 1994) [1]. Brill's tagger uses a two-stage architecture: the input tokens are initially tagged with their most likely tags, and an automatically acquired set of lexical rules is employed to identify unknown words. TnT is a stochastic HMM tagger which uses a suffix analysis technique to estimate lexical probabilities for unknown tokens, based on properties of words in the training corpus which share the same suffix.

Recent stochastic methods achieve high accuracy in part-of-speech tagging tasks. They resolve ambiguity on the basis of the most likely interpretation. Markov models have been widely used to disambiguate part-of-speech categories. There have been two types of work – one using a tagged corpus and the other using an untagged corpus.

The first model uses a pre-tagged corpus. A bootstrapping method for training was designed by Deroualt and Merialdo [Deroualt and Merialdo, 1986] [2]. In this model they used a small pre-tagged corpus to determine the initial model. This initial model is used to tag more text, and the tags are manually corrected to retrain the model. Church used the Brown corpus to estimate the probabilities [Church, 1988] [3]. Existing methods assume a large annotated corpus and/or a dictionary. It is often the case that we have no annotated corpus, or only a small one, at the time of developing a part-of-speech tagger for a new language.

The second model uses an untagged corpus. Supervised methods are not always applicable when a large annotated corpus is unavailable. There have been several works that have used unsupervised learning to learn an HMM model for POS tagging. The Baum-Welch algorithm [Baum, 1972] [4] can be used to learn an HMM from un-annotated data. The maximum entropy model is powerful enough to achieve high accuracy in the tagging task [Ratnaparkhi, 1996] [5]. It uses a rich feature representation and generates a tag probability distribution for each word.

[Cutting et al., 1992] [6] used a Hidden Markov Model for part-of-speech tagging. The methodology uses a lexicon and some untagged text for accurate and robust tagging. There are three modules in this system – tokenizer, training and tagging. The tokenizer identifies an ambiguity class (set of tags) for each word. The training module takes a sequence of ambiguity classes as input and uses the Baum-Welch algorithm to produce a trained HMM; training is performed on a large corpus. The tagging module buffers sequences of ambiguity classes between sentence boundaries. These sequences are disambiguated by computing the maximal path through the HMM with the Viterbi algorithm. In our POS tagging for Bengali we also use the Baum-Welch algorithm for learning from an untagged corpus, but instead of learning completely from the untagged data we additionally use tagged data to determine the initial HMM model. Like Cutting, we also take the help of ambiguity classes, but our ambiguity classes are taken from a Morphological Analyzer, and instead of using the ambiguity classes both at learning and decoding time, we use them only at decoding time.

Another model for the tagging task combines an unsupervised Hidden Markov Model with maximum entropy [Kazama et al., 2001] [7]. The methodology uses unsupervised learning of an HMM together with a maximum entropy model. The HMM is trained by the Baum-Welch algorithm on an un-annotated corpus, using 320 states for the initial HMM model. The HMM parameters are then used as the features of the maximum entropy model. The system uses a small annotated corpus to assign the actual tag corresponding to each state.

IV. HIDDEN MARKOV MODELING

Hidden Markov Models (HMMs) have been widely used in various NLP tasks. A Hidden Markov Model is a probabilistic finite state machine having a set of states (Q), an output alphabet (O), transition probabilities (A), output probabilities (B) and initial state probabilities (π).

Q = {q1, q2, …, qn} is the set of states and O = {o1, o2, …, om} is the set of observations.

A = {aij = P(qj at t+1 | qi at t)}, where P(a | b) is the conditional probability of a given b, t ≥ 1 is time, and qi belongs to Q. aij is the probability that the next state is qj given that the current state is qi.
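As a concrete illustration, the transition component A can be stored as a nested map from the current tag to a probability distribution over next tags. The tag names and numbers below are made-up placeholders, not the probabilities learned in this work:

```python
# Illustrative transition probabilities: A[qi][qj] = P(next tag qj | current tag qi).
# The tag subset and values are invented for illustration only.
A = {
    "PRP": {"NN": 0.3, "VB": 0.7},
    "NN":  {"NN": 0.2, "VB": 0.6, "PRP": 0.2},
    "VB":  {"NN": 0.5, "PRP": 0.5},
}

# Each row of A must be a probability distribution over next tags (sums to 1).
for qi, row in A.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```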
B = {bik = P(ok | qi)}, where ok belongs to O. bik is the probability that the output is ok given that the current state is qi.

π = {pi = P(qi at t=1)} denotes the initial probability distribution over states.

In our HMM model, states correspond to part-of-speech tags and observations correspond to words. We aim to learn the parameters of the HMM using our corpus; the HMM will then be used to assign the most probable tag to each word of an input sentence. We use a bi-gram model. We tried supervised learning from the tagged corpus alone but, possibly because the corpus size is so small, we achieved an accuracy of only 65%. Therefore we decided to use a raw corpus in addition to the tagged corpus.

The HMM probabilities are updated using both the tagged and the untagged corpus. For the tagged corpus, counts are used to update the probabilities; for the untagged corpus, the EM algorithm is used.

V. A HYBRID TAGGING MODEL

We first outline our training method. The training module is based on partially supervised learning: it makes use of some tagged data and more untagged data. We estimate the transition and emission probabilities from this partially supervised learning.

A. Training

In the training module we use both types of sentences – tagged and untagged.

Tagged Data: Five hundred tagged sentences for supervised learning.

Untagged Data: Raw data for re-estimating the parameters (50,000 words).

First we describe how we learn from the tagged data, and then we outline the learning process from the untagged data.

Our algorithm runs for a number of iterations. First we process the tagged data by supervised learning; then in each iteration it processes the untagged data and updates the transition probabilities, i.e. p(tag | previous tag), and the emission probabilities, i.e. p(word | tag), of the Hidden Markov Model. Using tagged data, each word maps to one state because the correct part-of-speech is known. Using untagged data, each word maps to all states because the part-of-speech tags are not known, i.e. all states are considered possible. In supervised learning, we calculate the bi-gram counts of a particular tag given a previous tag from the tagged corpus. Estimating counts from the untagged data is achieved using the Baum-Welch algorithm. In each iteration of the Baum-Welch algorithm we get some expected counts and add them to the previous counts. For the first iteration the previous counts are the counts from the tagged data; in the second iteration the previous counts are the counts after the first iteration. Finally, the Baum-Welch algorithm ends up holding the counts from the tagged training data plus those from the raw data. We use ten iterations of the algorithm to modify the initial counts estimated from the tagged data.

We calculate the transition probabilities 'A' and emission probabilities 'B' from the above counts. The transition probability of the next tag given the current tag is calculated simply by the following formula:

P(ti | ti-1) = C(ti-1 ti) / C(ti-1)

where ti is the current tag, ti-1 is the previous tag, C(ti-1 ti) is the count of the bi-gram ti-1 ti, and C(ti-1) is the total number of bi-grams starting with ti-1.

For the emission probability we count each word together with the tag assigned to it in the tagged data. The emission probability of a word given a particular tag is then computed by the above formula, with ti the word and ti-1 the tag. We also use add-one smoothing to avoid zero transition and emission probabilities.

B. Decoding

The decoding module finds the most probable tag sequence for a sentence. We use the Viterbi algorithm to calculate the most probable path (best tag sequence) for a given word sequence (sentence). Instead of considering all possible tags for each word in the test data, we consider only the tags given by the Morphological Analyzer. We feed each word to our Morphological Analyzer, which outputs all possible parts-of-speech of that word. Considering all possible tags from the tagset increases the number of paths, but the use of the Morphological Analyzer reduces the number of paths, as shown in the following figure. As an example, consider the sentence "Ami ekhana chA khete yAba".

Fig. 1. Tag lattice for the sentence "Ami ekhana chA khete yAba", with the candidate tags given by the Morphological Analyzer: Ami (PP), ekhana (NN, PT), chA (NN, VIS), khete (NN), yAba (VF).
A word is unknown to the HMM if it has not occurred during training. However, even for an unknown word the Morphological Analyzer gives all possible tags of the word, and these possible part-of-speech tags are used during tagging. In Fig. 1, each word has different possible tags given by the Morphological Analyzer; for example, the word chA has two different tags, NN and VIS. Using the above restriction on the tags for each word, together with the transition and emission probabilities from the partially supervised model, the most probable path (best tag sequence) for a given word sequence is found using the Viterbi tagging algorithm. The best path maximizes the following quantity:

argmax over t1…tn of Π (i = 1 to n) p(ti | ti-1) p(wi | ti)

This approach offers an overall high accuracy even if only a small tagged corpus is available.

VI. EXPERIMENT RESULTS

The system performance is evaluated in two ways. Firstly, the system is tested by leave-one-out cross-validation (LOOCV), i.e. from N tagged files we use N-1 for training and 1 file for testing, and this is repeated for each of the N tagged files. This evaluation technique is applied to three approaches to determine the precision. In our POS tagging evaluation we use 20 files, each consisting of 25 sentences.

precision = (correctly tagged words by the system) / (total no. of words in the evaluation set)

We have tested three different approaches to POS tagging.

Method 1: POS tagging using only supervised learning.
Method 2: POS tagging using partially supervised learning and decoding the best tag sequence without the Morphological Analyzer restriction.
Method 3: POS tagging using partially supervised learning and decoding the best tag sequence with the Morphological Analyzer restriction.

The evaluation results are given in the following table:

            Method 1   Method 2   Method 3
Precision   64.31      67.6       96.28

The above table indicates the high 96.28% accuracy of the hybrid system. To confirm this precision figure we tried another approach to evaluating the system. We took 100 sentences (1003 words) randomly from the CIIL corpus and tagged them manually; the sentences taken from the CIIL corpus are more complex than the sentences used in the tagged data. The precision is calculated using the above formula.

            Method 1   Method 2   Method 3
Precision   59.93      61.79      84.37

On this data set the precision is much lower. Many errors are due to the incomplete lexicon used in our Morphological Analyzer and also the unavailability of a proper noun identifier. Morphological errors are of two types – a particular word is not found by the Morphological Analyzer, or the Morphological Analyzer does not cover all possible tags of a word. To find out the actual accuracy of our model we manually entered the possible parts-of-speech for all the words of the test set that are not covered by the Morphological Analyzer. We also made a list of all proper nouns in our test data set, and at the time of evaluation we marked all proper nouns from that list. We tested these modifications with Method 3 and obtained an average precision of 95.18%.

            Method 3
Precision   95.18

VII. CONCLUSION AND FUTURE WORK

This paper presents a model for POS tagging for a relatively free word order language, Bengali. On the basis of our preliminary experiments the system is found to have an accuracy of 95%. The system uses a small set of tagged sentences; it also uses an untagged corpus and a morphological analyzer. The precision is affected by the incomplete lexicon of the Morphological Analyzer and by errors in the untagged corpus. It is expected that system accuracy will increase by correcting the typographical errors in the untagged corpus and by increasing the coverage of the Morphological Analyzer. A rule-based component can also be added to the model to detect and correct the remaining errors. The POS tagger is useful for chunking, clause boundary identification and other NLP applications.

REFERENCES

[1] E. Brill, "A Simple Rule-Based Part-of-Speech Tagger", University of Pennsylvania, 1992.
[2] A. M. Deroualt and B. Merialdo, "Natural Language Modeling for Phoneme-to-Text Transcription", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.
[3] K. W. Church, "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text", Proceedings of the Second Conference on Applied Natural Language Processing (ACL), 1988.
[4] L. E. Baum, "An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of a Markov Process", Inequalities, 1972.
[5] A. Ratnaparkhi, "A Maximum Entropy Part-of-Speech Tagger", Proceedings of the Empirical Methods in NLP Conference, University of Pennsylvania, 1996.
[6] D. Cutting et al., "A Practical Part-of-Speech Tagger", Proceedings of the Third Conference on Applied Natural Language Processing, 1992.
[7] J. Kazama et al., "A Maximum Entropy Tagger with Unsupervised Hidden Markov Models", NLPRS, 2001.
[8] J. Allen, "Natural Language Understanding", pages 195-203.
[9] D. Jurafsky and J. H. Martin, "Speech and Language Processing", pages 287-320, Pearson Education.