A Hybrid Model For POS Tagging
Abstract— This paper describes our work on Bengali Part of Speech (POS) tagging using a corpus-based approach. There are several approaches to part-of-speech tagging. This paper deals with a model that uses a combination of supervised and unsupervised learning with a Hidden Markov Model (HMM). We make use of a small tagged corpus and a large untagged corpus. We also make use of a Morphological Analyzer. Bengali is a highly ambiguous and relatively free word order language. We have obtained an overall accuracy of 95%.

Keywords—Natural Language Processing, Machine Learning and Statistical Technology.

I. INTRODUCTION

bhAta khAi Ami (Rice eat I) → NN VB PRP
khAi Ami bhAta (Eat I rice) → VB PRP NN
khAi bhAta Ami (Eat rice I) → VB NN PRP

Part-of-speech tagging using linguistic rules is a difficult problem for such a free word order language. An HMM can capture the language model from the perspective of POS tagging.

Rule-based systems need context rules for POS tagging. Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words. These rules are often known as context frame rules.

Stochastic tagging techniques make use of a corpus. The most common stochastic tagging technique uses a Hidden Markov Model (HMM). The states usually denote the POS tags, and the probabilities are estimated from a tagged training corpus or an untagged corpus in order to compute the most likely POS tags for the words of an input sentence. Stochastic tagging techniques can be of two types depending on the training data. Supervised stochastic tagging techniques use tagged training data; the maximum entropy tagger [5], for example, uses a rich feature representation and generates a tag probability distribution for each word.

We are considering 40 different tags for POS tagging. A POS tagger is an essential tool for the design and development of Natural Language Processing applications. A major problem of NLP is word sense disambiguation. A larger tag set increases the ambiguity problem, but it also reduces the parsing complexity: parsing is an important task in natural language processing, and given a POS-tagged sentence, local word groups are easier to identify if we have a large number of tags. A large tag set also facilitates shallow parsing. Our goal is to achieve high accuracy using a large tag set.

III. BACKGROUND WORK

Different approaches have been used for part-of-speech tagging. Some previous work focused on rule-based, linguistically motivated part-of-speech tagging by Brill (1992, 1994) [1]. Brill's tagger uses a two-stage architecture: the input tokens are initially tagged with their most likely tags, and an automatically acquired set of lexical rules is employed to identify unknown words. TNT is a stochastic HMM tagger which uses a suffix analysis technique to estimate lexical probabilities for unknown tokens, based on properties of words in the training corpus which share the same suffix.

Recent stochastic methods achieve high accuracy in part-of-speech tagging tasks. They resolve the ambiguity on the basis of the most likely interpretation. Markov models have been widely used to disambiguate part-of-speech categories. There have been two types of work – one using a tagged corpus and the other using an untagged corpus.

The first model uses a pre-tagged corpus. A bootstrapping method for training was designed by Deroualt and Merialdo [2]. In this model they used a small pre-tagged corpus to determine the initial model. This initial model is used to tag more text, and the tags are manually corrected to retrain the model. Church used the Brown corpus to estimate the probabilities [3]. Existing methods assume a large annotated corpus and/or a dictionary. It is often the case that we have no annotated corpus, or only a small one, at the time of developing a part-of-speech tagger for a new language.

The second model uses an untagged corpus. Supervised methods are not always applicable when a large annotated corpus is not available. There have been several works that have used unsupervised learning to learn an HMM for POS tagging. The Baum-Welch algorithm [4] can be used to learn an HMM from un-annotated data. The maximum entropy model is also powerful enough to achieve high accuracy in the tagging task.

[Cutting et al., 1992] [6] used a Hidden Markov Model for part-of-speech tagging. The methodology uses a lexicon and some untagged text for accurate and robust tagging. There are three modules in this system – tokenizer, training and tagging. The tokenizer identifies an ambiguity class (set of tags) for each word. The training module takes a sequence of ambiguity classes as input and uses the Baum-Welch algorithm to produce a trained HMM; training is performed on a large corpus. The tagging module buffers sequences of ambiguity classes between sentence boundaries. These sequences are disambiguated by computing the maximal path through the HMM with the Viterbi algorithm. In our POS tagging for Bengali we also use the Baum-Welch algorithm for learning from an untagged corpus, but instead of learning completely from the untagged data we also use tagged data to determine the initial HMM model. Like Cutting, we also take help of ambiguity classes; however, our ambiguity classes are taken from the Morphological Analyzer, and instead of using ambiguity classes both at the time of learning and decoding, we use them only at the time of decoding.

Another model was designed for the tagging task by combining an unsupervised Hidden Markov Model with maximum entropy [7]. The methodology uses unsupervised learning of an HMM and a maximum entropy model. Training of the HMM is done by the Baum-Welch algorithm with an un-annotated corpus, using 320 states for the initial HMM model. The HMM parameters are then used as the features of a Maximum Entropy model, and a small annotated corpus is used to assign the actual tag corresponding to each state.

IV. HIDDEN MARKOV MODELING

Hidden Markov Models (HMMs) have been widely used in various NLP tasks. A Hidden Markov Model is a probabilistic finite state machine having a set of states (Q), an output alphabet (O), transition probabilities (A), output probabilities (B) and initial state probabilities (π).

Q = {q1, q2, …, qn} is the set of states and O = {o1, o2, …, om} is the set of observations.

A = {aij = P(qj at t+1 | qi at t)}, where P(a | b) is the conditional probability of a given b, t is the time index, and qi belongs to Q. aij is the probability that the next state is qj given that the current state is qi.

B = {bik = P(ok | qi)}, where ok belongs to O. bik is the probability that the output is ok given that the current state is qi.

Π = {pi = P(qi at t=1)} denotes the initial probability distribution over states.
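As a minimal illustration (not the paper's implementation), the components Q, O, A, B and Π defined above can be represented directly as dictionaries. The tags, words and probability values below are invented toy numbers chosen only to show how a joint tag/word sequence probability is composed.

```python
# Toy HMM with the components defined above: states Q (POS tags),
# observations O (words), initial distribution pi, transitions A,
# and emissions B. All probability values here are invented.

Q = ["NN", "VB", "PRP"]          # states = POS tags
O = ["bhAta", "khAi", "Ami"]     # observations = words

pi = {"NN": 0.5, "VB": 0.2, "PRP": 0.3}       # P(q_i at t=1)

A = {  # a_ij = P(q_j at t+1 | q_i at t)
    "NN":  {"NN": 0.2, "VB": 0.6, "PRP": 0.2},
    "VB":  {"NN": 0.3, "VB": 0.1, "PRP": 0.6},
    "PRP": {"NN": 0.4, "VB": 0.5, "PRP": 0.1},
}

B = {  # b_ik = P(o_k | q_i)
    "NN":  {"bhAta": 0.8, "khAi": 0.1, "Ami": 0.1},
    "VB":  {"bhAta": 0.1, "khAi": 0.8, "Ami": 0.1},
    "PRP": {"bhAta": 0.1, "khAi": 0.1, "Ami": 0.8},
}

def joint_probability(tags, words):
    """P(tags, words) = pi(t1)·b(t1,w1) · prod_i a(t_{i-1},t_i)·b(t_i,w_i)."""
    p = pi[tags[0]] * B[tags[0]][words[0]]
    for i in range(1, len(tags)):
        p *= A[tags[i - 1]][tags[i]] * B[tags[i]][words[i]]
    return p

print(joint_probability(["NN", "VB", "PRP"], ["bhAta", "khAi", "Ami"]))  # ≈ 0.092
```

With these toy numbers the tag sequence NN VB PRP for "bhAta khAi Ami" gets probability 0.5·0.8 · 0.6·0.8 · 0.6·0.8 ≈ 0.092; decoding (Section V.B) amounts to maximizing this quantity over tag sequences.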
In our HMM model, states correspond to part-of-speech tags and observations correspond to words. We aim to learn the parameters of the HMM using our corpus. The HMM is then used to assign the most probable tag to each word of an input sentence. We use a bi-gram model. We first tried purely supervised learning from the tagged corpus but, possibly because the corpus size is so small, achieved an accuracy of only 65%. Therefore we decided to use a raw corpus in addition to the tagged corpus.

The HMM probabilities are updated using both the tagged and the untagged corpus. For the tagged corpus, sampling is used to update the probabilities. When using the untagged corpus, the EM algorithm is used to update the probabilities.

V. A HYBRID TAGGING MODEL

We will first outline our training method. The training module is based on partially supervised learning. It makes use of some tagged data and more untagged data. We estimate the transition and emission probabilities from this partially supervised learning.

A. Training

In the training module we use both types of sentences – tagged and untagged.

Tagged data: five hundred tagged sentences for supervised learning.
Untagged data: raw data for re-estimating the parameters (50,000 words).

First we describe how we learn using the tagged data, and then we outline the learning process from the untagged data.

Our algorithm runs for a number of iterations. First we process the tagged data by supervised learning; then in each iteration it processes the untagged data and updates the transition probabilities, i.e. p(tag | previous tag), and emission probabilities, i.e. p(word | tag), of the Hidden Markov Model. Using tagged data, each word maps to one state, as the correct part-of-speech is known. Using untagged data, each word maps to all states, because the part-of-speech tags are not known, i.e. all states are considered possible. In supervised learning, we calculate the bi-gram counts of a particular tag given a previous tag from the tagged corpus. The untagged data are then used to re-estimate the parameters by modifying the initial counts estimated from the tagged data.

We calculate the transition probabilities A and emission probabilities B from the above counts. The transition probability of the next state given the current state is calculated simply by the following formula:

P(ti | ti-1) = C(ti-1 ti) / (total number of bi-grams starting with ti-1)

where ti is the current tag and ti-1 is the previous tag.

For calculating the emission probability, we calculate the unigram count of a word along with the tag assigned to it in the tagged data. We then calculate the emission probability of a word given a particular tag by using the above formula, where ti is the word and ti-1 is the tag. We also use add-one smoothing to avoid zero transition and emission probabilities.

B. Decoding

The decoding module finds the most probable tag sequence for a sentence. We use the Viterbi algorithm to calculate the most probable path (best tag sequence) for a given word sequence (sentence). Instead of considering all possible tags for each word in the test data, we consider only the possible tags given by the Morphological Analyzer. We feed each word to our Morphological Analyzer, which outputs all possible parts-of-speech of that word. Considering all possible tags from the tagset increases the number of paths, but the use of the Morphological Analyzer reduces the number of paths, as shown in the following figure for the example sentence "Ami ekhana chA khete yAba".

[Figure: tag lattice for "Ami ekhana chA khete yAba" – Ami (PP); ekhana (NN, PT); chA (NN, VIS); khete (NN); yAba (VF) – showing the reduced set of paths given by the Morphological Analyzer.]
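The supervised counting step with add-one smoothing, and Viterbi decoding restricted to the Morphological Analyzer's tag list, can be sketched as follows. This is a simplified illustration, not the paper's system: the two-sentence mini-corpus, the three-tag tagset, and the analyzer lookup table are invented, the EM re-estimation from untagged data is omitted, and falling back to all tags for unknown words is an assumption of this sketch.

```python
from collections import defaultdict

# Invented toy tagged corpus and tagset (for illustration only).
tagged_corpus = [
    [("Ami", "PRP"), ("bhAta", "NN"), ("khAi", "VB")],
    [("bhAta", "NN"), ("khAi", "VB"), ("Ami", "PRP")],
]
TAGS = ["NN", "VB", "PRP"]

# Bi-gram tag counts and (tag, word) counts from the tagged data.
trans_count = defaultdict(lambda: defaultdict(int))
emit_count = defaultdict(lambda: defaultdict(int))
vocab = set()
for sent in tagged_corpus:
    prev = "<s>"                      # sentence-start pseudo-tag
    for word, tag in sent:
        trans_count[prev][tag] += 1
        emit_count[tag][word] += 1
        vocab.add(word)
        prev = tag

def p_trans(prev, tag):
    """P(tag | prev) with add-one smoothing over the tag set."""
    total = sum(trans_count[prev].values())
    return (trans_count[prev][tag] + 1) / (total + len(TAGS))

def p_emit(tag, word):
    """P(word | tag) with add-one smoothing over the vocabulary."""
    total = sum(emit_count[tag].values())
    return (emit_count[tag][word] + 1) / (total + len(vocab))

# Hypothetical Morphological Analyzer: word -> set of possible tags.
morph = {"Ami": {"PRP"}, "bhAta": {"NN"}, "khAi": {"VB", "NN"}}

def viterbi(words):
    """Best tag sequence, considering only the analyzer's tags per word."""
    # paths: last tag -> (probability of best path, best path so far)
    paths = {}
    for t in morph.get(words[0], set(TAGS)):
        paths[t] = (p_trans("<s>", t) * p_emit(t, words[0]), [t])
    for w in words[1:]:
        new_paths = {}
        for t in morph.get(w, set(TAGS)):  # fall back to all tags (assumption)
            prob, path = max(
                (pp * p_trans(pt, t), path) for pt, (pp, path) in paths.items()
            )
            new_paths[t] = (prob * p_emit(t, w), path + [t])
        paths = new_paths
    return max(paths.values())[1]

print(viterbi(["Ami", "bhAta", "khAi"]))  # -> ['PRP', 'NN', 'VB']
```

Restricting each word to its ambiguity class keeps the lattice small, exactly as in the figure above: "khAi" is scored only as VB or NN rather than against all 40 tags.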
The most probable path (best tag sequence) for a given word sequence is found using the Viterbi tagging algorithm. The best probable path is calculated by the following formula:

argmax over t1 … tn of Π (i = 1 to n) p(ti | ti-1) p(wi | ti)

This approach offers an overall high accuracy even if only a small tagged corpus is used for the purpose.

VI. EXPERIMENT RESULTS

The system performance is evaluated in two ways. Firstly, the system is tested with a Leave One Out Cross Validation (LOOCV) method, i.e. from N tagged files we use N-1 for training and 1 file for testing; this is done for each individual file from the N tagged files. This evaluation technique is applied to three approaches to determine the precision. In our POS tagging evaluation we use 20 files, each consisting of 25 sentences.

precision = (correctly tagged words by the system) / (total no. of words in the evaluation set)

We have tested three different approaches to POS tagging.

Method 1: POS tagging using only supervised learning.
Method 2: POS tagging using partially supervised learning and decoding the best tag sequence without the Morphological Analyzer restriction.
Method 3: POS tagging using partially supervised learning and decoding the best tag sequence with the Morphological Analyzer restriction.

The evaluation results are given in the following table:

            Method 1   Method 2   Method 3
Precision     64.31      67.6       96.28

The above table indicates the high 96.28% accuracy of the hybrid system. To ensure the correctness of this precision we tried another approach to evaluating the system. We took 100 sentences (1003 words) randomly from the CIIL corpus and tagged them manually; the sentences taken from the CIIL corpus are more complex than the sentences used in the tagged data. The precision is calculated using the above formula.

On this data set the precision is much lower. Many errors are due to the incomplete lexicon used in our Morphological Analyzer and also the unavailability of a proper noun identifier. Morphological errors are of two types – a particular word is not found by the Morphological Analyzer, or the Morphological Analyzer does not cover all possible tags of a word. To find out the actual accuracy of our model, we manually entered the possible parts-of-speech for all the words of the test set that are not covered by the Morphological Analyzer. We also made a list of all possible proper nouns in our test data set, and at the time of evaluation we marked all proper nouns from that list. We tested the above modification over Method 3 and obtained an average precision of 95.18%.

            Method 3
Precision     95.18

VII. CONCLUSION AND FUTURE WORK

This paper presents a model for POS tagging for a relatively free word order language, Bengali. On the basis of our preliminary experiments, the system is found to have an accuracy of 95%. The system uses a small set of tagged sentences. It also uses an untagged corpus and a morphological analyzer. The precision is affected by the incomplete lexicon in the Morphological Analyzer and errors in the untagged corpus. It is expected that system accuracy will increase by correcting the typographical errors in the untagged corpus and also by increasing the accuracy of the Morphological Analyzer. A rule-based component can also be added to the model to detect and correct the existing errors. The POS tagger is useful for chunking, clause boundary identification and other NLP applications.

REFERENCES

[1] E. Brill, "A simple Rule-Based Part-of-Speech Tagger", University of Pennsylvania, 1992.
[2] A. M. Deroualt and B. Merialdo, "Natural Language modeling for phoneme-to-text transposition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.
[3] K. W. Church, "A statistical parts program and noun phrase parser for unrestricted text", Proceedings of the Second Conference on Applied Natural Language Processing (ACL), 1988.
[4] L. E. Baum, "An inequality and associated maximization technique in statistical estimation on probabilistic functions of a Markov process", Inequalities, 1972.
[5] A. Ratnaparkhi, "A maximum entropy Part-of-speech tagger", Proceedings of the Empirical Methods in NLP Conference, University of Pennsylvania, 1996.
[6] D. Cutting, "A practical part-of-speech tagger", Proceedings of the Third Conference on Applied Natural Language Processing, 1992.
[7] J. Kazama, "A maximum entropy tagger with unsupervised Hidden Markov Model", NLPRS, 2001.
[8] J. Allen, "Natural Language Understanding", pages 195-203.
[9] D. Jurafsky and J. H. Martin, "Speech and Language Processing", pages 287-320, Pearson Education.