
TRANSACTIONS ON ENGINEERING, COMPUTING AND TECHNOLOGY V1 DECEMBER 2004 ISSN 1305-5313

A Hybrid Model for Part-of-Speech Tagging and its Application to Bengali

Sandipan Dandapat, Sudeshna Sarkar and Anupam Basu


Abstract—This paper describes our work on Bengali Part-of-Speech (POS) tagging using a corpus-based approach. There are several approaches to part-of-speech tagging; this paper deals with a model that uses a combination of supervised and unsupervised learning with a Hidden Markov Model (HMM). We make use of a small tagged corpus and a large untagged corpus, as well as a Morphological Analyzer. Bengali is a highly ambiguous and relatively free word order language. We have obtained an overall accuracy of 95%.

Keywords—Natural Language Processing, Machine Learning and Statistical Technology.

Manuscript submitted on November 4, 2004.
Sandipan Dandapat is with the Dept. of Computer Sc. and Engg., Indian Institute of Technology Kharagpur, West Bengal, India; e-mail: sandipan_242@yahoo.com.
Dr. Sudeshna Sarkar is with the Dept. of Computer Sc. and Engg., Indian Institute of Technology Kharagpur, West Bengal, India; e-mail: [email protected].
Prof. Anupam Basu is with the Dept. of Computer Sc. and Engg., Indian Institute of Technology Kharagpur, West Bengal, India; e-mail: [email protected].

I. INTRODUCTION

Part-of-Speech (POS) tagging is a technique for the automatic annotation of lexical categories: it assigns an appropriate part-of-speech tag to each word in a sentence of a language. POS tagging is widely used for linguistic text analysis and is an essential task for many natural language processing activities. A POS tagger takes a sentence as input and assigns a unique part-of-speech tag to each lexical item of the sentence. POS tagging is used as an early stage of linguistic text analysis in many applications, including subcategory acquisition, text-to-speech synthesis, and alignment of parallel corpora. There are a variety of techniques for POS tagging. Two approaches to POS tagging are:

1. Supervised POS tagging
2. Unsupervised POS tagging

Supervised tagging techniques require a pre-tagged corpus, whereas unsupervised tagging techniques do not. Both supervised and unsupervised tagging can be of two types: rule based and stochastic.

Rule-based systems need context rules for POS tagging. Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words; these rules are often known as context frame rules.

Stochastic tagging techniques make use of a corpus. The most common stochastic tagging technique uses a Hidden Markov Model (HMM), whose states usually denote the POS tags. The probabilities are estimated from a tagged training corpus or an untagged corpus in order to compute the most likely POS tags for the words of an input sentence. Stochastic tagging techniques can be of two types depending on the training data. Supervised stochastic tagging techniques use only tagged data; however, the supervised method requires a large amount of tagged data to achieve a high level of accuracy. Unsupervised stochastic techniques, on the other hand, do not require a pre-tagged corpus but instead use sophisticated computational methods to automatically induce word groupings (i.e. tag sets) and, based on these automatic groupings, calculate the probabilistic values needed by stochastic taggers.

Our approach is a combination of both supervised and unsupervised stochastic techniques for training an HMM. We use a Morphological Analyzer for Bengali in our POS tagging technique: the Morphological Analyzer takes a word as input and gives all possible POS tags for that word.
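For concreteness, a minimal Python sketch of such an analyzer interface is shown below. This is not the authors' implementation; the toy lexicon entries are taken from the tags shown later in Fig. 1, and the fallback to the full tagset for uncovered words is our assumption.

```python
# Hypothetical sketch of the Morphological Analyzer interface described above:
# given a word, return the set of all POS tags it can take.
# The lexicon below is a toy stand-in for the real Bengali analyzer.
MORPH_LEXICON = {
    "Ami": {"PP"},
    "ekhana": {"NN", "PT"},
    "chA": {"NN", "VIS"},
    "yAba": {"VF"},
}

def morph_analyze(word, full_tagset=frozenset({"NN", "VB", "PP"})):
    """Return all possible POS tags for `word`.

    If the word is not covered by the lexicon, fall back to the full
    tagset (an assumption for this sketch): an uncovered word must be
    treated as ambiguous among all tags.
    """
    return MORPH_LEXICON.get(word, set(full_tagset))
```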

II. LINGUISTIC CHARACTERISTICS OF BENGALI

Present-day Bengali has two literary styles: one is called "Sadhubhasa" (elegant language) and the other "Chaltibhasa" (current language). The former is the traditional literary style based on Middle Bengali of the sixteenth century. The latter is practically a creation of the present century, and is based on the cultivated form of the dialect spoken in Kolkata by educated people originally coming from districts bordering the lower reaches of the Hooghly. Our POS tagger deals with Chaltibhasa.

Bengali is a relatively free word order language compared with European languages. For example, consider the simple English sentence

I eat rice → PRP VB NN

The possible Bengali equivalents of the above English sentence are:

Ami bhAta khAi (I rice eat) → PRP NN VB
Ami khAi bhAta (I eat rice) → PRP VB NN
bhAta Ami khAi (Rice I eat) → NN PRP VB


bhAta khAi Ami (Rice eat I) → NN VB PRP
khAi Ami bhAta (Eat I rice) → VB PRP NN
khAi bhAta Ami (Eat rice I) → VB NN PRP

Part-of-speech tagging using linguistic rules is a difficult problem for such a free word order language. An HMM can capture the language model from the perspective of POS tagging.

We consider 40 different tags for POS tagging. A POS tagger is an essential tool for the design and development of Natural Language Processing applications. A major problem in NLP is word sense disambiguation. A larger tag set reduces the ambiguity problem, and it also reduces the parsing complexity: parsing is an important task in natural language processing, and given a POS tagged sentence, local word groups are easier to identify if we have a large number of tags. A large tag set also facilitates shallow parsing. Our goal is to achieve high accuracy using a large tag set.

III. BACKGROUND WORK

Different approaches have been used for part-of-speech tagging. Some previous work focused on rule-based, linguistically motivated part-of-speech tagging, such as the work by Brill (1992, 1994) [1]. Brill's tagger uses a two-stage architecture: the input tokens are initially tagged with their most likely tags, and an automatically acquired set of lexical rules is employed to identify unknown words. TNT is a stochastic HMM tagger which uses a suffix analysis technique to estimate lexical probabilities for unknown tokens, based on properties of words in the training corpus that share the same suffix.

Recent stochastic methods achieve high accuracy in part-of-speech tagging tasks. They resolve ambiguity on the basis of the most likely interpretation. Markov models have been widely used to disambiguate part-of-speech categories. There have been two types of work – one using a tagged corpus and the other using an untagged corpus.

The first type uses a pre-tagged corpus. A bootstrapping method for training was designed by Derouault and Merialdo [2]. In this model they used a small pre-tagged corpus to determine the initial model; this initial model is then used to tag more text, and the tags are manually corrected to retrain the model. Church used the Brown corpus to estimate the probabilities [3]. Existing methods assume a large annotated corpus and/or a dictionary, but it is often the case that no annotated corpus, or only a small one, is available when developing a part-of-speech tagger for a new language.

The second type uses an untagged corpus. Supervised methods are not always applicable when a large annotated corpus is not available, and several works have used unsupervised learning to learn an HMM for POS tagging. The Baum-Welch algorithm [4] can be used to learn an HMM from un-annotated data. The maximum entropy model is powerful enough to achieve high accuracy in the tagging task [5]; it uses a rich feature representation and generates a tag probability distribution for each word.

Cutting et al. [6] used a Hidden Markov Model for part-of-speech tagging. Their method uses a lexicon and some untagged text for accurate and robust tagging. There are three modules in this system – tokenizer, training and tagging. The tokenizer identifies an ambiguity class (set of tags) for each word. The training module takes a sequence of ambiguity classes as input and uses the Baum-Welch algorithm on a large corpus to produce a trained HMM. The tagging module buffers sequences of ambiguity classes between sentence boundaries; these sequences are disambiguated by computing the maximal path through the HMM with the Viterbi algorithm. In our POS tagging for Bengali we also use the Baum-Welch algorithm to learn from an untagged corpus, but instead of learning entirely from the untagged data we also use tagged data to determine the initial HMM model. Like Cutting et al. we take the help of ambiguity classes, but our ambiguity classes are taken from the Morphological Analyzer, and instead of using the ambiguity classes both at learning and at decoding time we use them only at decoding time.

Another model designed for the tagging task combines an unsupervised Hidden Markov Model with maximum entropy [7]. The methodology uses unsupervised learning of an HMM together with a maximum entropy model. The HMM is trained by the Baum-Welch algorithm on an un-annotated corpus, starting from an initial model with 320 states. The HMM parameters are then used as features of the maximum entropy model, and a small annotated corpus is used to assign the actual tag corresponding to each state.

IV. HIDDEN MARKOV MODELING

Hidden Markov Models (HMMs) have been widely used in various NLP tasks. A Hidden Markov Model is a probabilistic finite state machine having a set of states (Q), an output alphabet (O), transition probabilities (A), output probabilities (B) and initial state probabilities (Π).

Q = {q1, q2, ..., qn} is the set of states and O = {o1, o2, ..., om} is the set of observations.

A = {aij = P(qj at t+1 | qi at t)}, where P(a | b) is the conditional probability of a given b, t ≥ 1 is time, and qi belongs to Q; aij is the probability that the next state is qj given that the current state is qi.

B = {bik = P(ok | qi)}, where ok belongs to O; bik is the probability that the output is ok given that the current state is qi.

Π = {pi = P(qi at t = 1)} denotes the initial probability distribution over the states.
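As a concrete illustration of these definitions, the following sketch encodes a toy HMM with the components Q, O, A, B and Π as plain Python dictionaries. The states, words and probability values are invented for illustration; they are not taken from the paper's model.

```python
# Toy HMM matching the definitions above: states Q, observations O,
# transition probabilities A, emission probabilities B, and initial
# distribution Pi. All numbers are illustrative only.
Q = ["NN", "VB"]                       # states = POS tags
O = ["bhAta", "khAi"]                  # observations = words

A = {("NN", "VB"): 0.7, ("NN", "NN"): 0.3,    # A[(qi, qj)] = P(qj at t+1 | qi at t)
     ("VB", "NN"): 0.6, ("VB", "VB"): 0.4}

B = {("NN", "bhAta"): 0.8, ("NN", "khAi"): 0.2,   # B[(qi, ok)] = P(ok | qi)
     ("VB", "bhAta"): 0.1, ("VB", "khAi"): 0.9}

Pi = {"NN": 0.5, "VB": 0.5}            # Pi[qi] = P(qi at t = 1)
```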


In our HMM model, states correspond to part-of-speech tags and observations correspond to words. We aim to learn the parameters of the HMM using our corpus; the HMM is then used to assign the most probable tag to each word of an input sentence. We use a bi-gram model. We first tried supervised learning from the tagged corpus but, possibly because the corpus size is so small, achieved an accuracy of only 65%. Therefore we decided to use a raw corpus in addition to the tagged corpus.

The HMM probabilities are updated using both the tagged and the untagged corpus. For the tagged corpus, sampling is used to update the probabilities; for the untagged corpus, the EM algorithm is used.

V. A HYBRID TAGGING MODEL

We first outline our training method. The training module is based on partially supervised learning: it makes use of some tagged data and more untagged data. We estimate the transition and emission probabilities from this partially supervised learning.

A. Training

In the training module we use both types of sentences – tagged and untagged:

Tagged Data: Five hundred tagged sentences, for supervised learning.
Untagged Data: Raw data (50,000 words), for re-estimating the parameters.

First we describe how we learn using the tagged data; then we outline the learning process from the untagged data.

Our algorithm runs for a number of iterations. First we process the tagged data by supervised learning; then in each iteration it processes the untagged data and updates the transition probabilities, i.e. p(tag | previous tag), and the emission probabilities, i.e. p(word | tag), of the Hidden Markov Model. Using tagged data, each word maps to one state, as the correct part-of-speech is known. Using untagged data, each word maps to all states, because the part-of-speech tags are not known, i.e. all states are considered possible. In supervised learning, we calculate the bi-gram counts of a particular tag given the previous tag from the tagged corpus.

We use the untagged data (50,000 words) to re-estimate the bi-gram counts from tag to tag, and also to re-estimate the unigram counts of a word given a particular tag. This re-estimation of counts from the untagged data is achieved using the Baum-Welch algorithm. In each iteration of the Baum-Welch algorithm we obtain some expected counts and add them to the previous counts. For the first iteration, the previous counts are the counts from the tagged data; in the second iteration, the previous counts are the counts after the first iteration. Finally, the Baum-Welch algorithm ends up holding the training-plus-raw counts. We use ten iterations of the algorithm for modifying the initial counts estimated from the tagged data.
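The count-blending loop described above might be sketched as follows. The helper `expected_counts`, which would run one forward-backward (Baum-Welch E-step) pass over the untagged corpus, is assumed rather than shown; only the accumulation scheme from the text is illustrated.

```python
from collections import Counter

def train_hybrid(tagged_trans, tagged_emit, untagged_corpus,
                 expected_counts, iterations=10):
    """Blend supervised counts with Baum-Welch expected counts.

    `tagged_trans` and `tagged_emit` are Counters of (prev_tag, tag)
    bi-grams and (tag, word) unigrams from the tagged data.
    `expected_counts` is a user-supplied function (assumed, not shown)
    that runs one forward-backward pass over the untagged corpus and
    returns expected bi-gram and emission counts under the current model.
    """
    trans = Counter(tagged_trans)   # iteration 0: tagged counts only
    emit = Counter(tagged_emit)
    for _ in range(iterations):     # the paper uses ten iterations
        exp_trans, exp_emit = expected_counts(untagged_corpus, trans, emit)
        trans.update(exp_trans)     # add expected counts to previous counts
        emit.update(exp_emit)
    return trans, emit              # training-plus-raw counts
```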
We calculate the transition probabilities A and the emission probabilities B from the above counts. The transition probability of the next state given the current state is calculated simply by the following formula:

P(ti | ti-1) = C(ti-1 ti) / C(ti-1)

where ti is the current tag, ti-1 is the previous tag, C(ti-1 ti) is the count of the bi-gram ti-1 ti, and C(ti-1) is the total number of bi-grams starting with ti-1.

For calculating the emission probability we count the unigrams of each word along with the tag assigned to it in the tagged data. We then calculate the emission probability of a word given a particular tag by using the above formula, where ti is the word and ti-1 is the tag. We also use add-one smoothing to avoid zero transition and emission probabilities.
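Under the formula above, a minimal sketch of how the smoothed transition and emission probabilities could be computed from the blended counts is given below. The add-one denominators (tagset size and vocabulary size) are our reading of the smoothing step, which the paper does not spell out.

```python
from collections import Counter

def transition_probs(trans_counts, tagset):
    """P(ti | ti-1) = (C(ti-1 ti) + 1) / (C(ti-1) + |tagset|),
    i.e. the paper's ratio with add-one smoothing (denominator assumed)."""
    prev_totals = Counter()
    for (prev, _cur), c in trans_counts.items():
        prev_totals[prev] += c
    return {(prev, cur): (trans_counts.get((prev, cur), 0) + 1)
                         / (prev_totals[prev] + len(tagset))
            for prev in tagset for cur in tagset}

def emission_probs(emit_counts, tagset, vocab):
    """P(word | tag) with add-one smoothing over the vocabulary."""
    tag_totals = Counter()
    for (tag, _w), c in emit_counts.items():
        tag_totals[tag] += c
    return {(tag, w): (emit_counts.get((tag, w), 0) + 1)
                      / (tag_totals[tag] + len(vocab))
            for tag in tagset for w in vocab}
```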
B. Decoding

The decoding module finds the most probable tag sequence for a sentence. We use the Viterbi algorithm to calculate the most probable path (best tag sequence) for a given word sequence (sentence). Instead of considering all possible tags for each word of the test data, we consider only the tags given by the Morphological Analyzer: we feed each word to our Morphological Analyzer, which outputs all possible parts-of-speech of that word. Considering all possible tags from the tagset increases the number of paths, but the use of the Morphological Analyzer reduces the number of paths, as shown in the following figure. For example, consider the sentence "Ami ekhana chA khete yAba".

[Figure 1: Possible tags taken from the Morphological Analyzer for each word of "Ami ekhana chA khete yAba": Ami (PP), ekhana (NN, PT), chA (NN, VIS), khete (NN), yAba (VF).]

A word is unknown to the HMM if it has not occurred during training. However, even for an unknown word the Morphological Analyzer gives all possible tags of the word, and these possible part-of-speech tags are then used during decoding. In Fig. 1, each word is shown with the possible tags given by the Morphological Analyzer; for example, the word chA has two different tags, NN and VIS. Using this restriction on the tags of each word, together with the transition and emission probabilities from the partially supervised model, the best probable path (best tag sequence) for a given word sequence is found using the Viterbi tagging algorithm.


The best probable path is calculated by the following formula:

argmax over (t1 ... tn) of ∏(i=1..n) p(ti | ti-1) p(wi | ti)

This approach offers an overall high accuracy even if only a small tagged corpus is used for the purpose.
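The decoder can be sketched as a standard bi-gram Viterbi search in which each word's candidate tags are restricted to the set returned by the Morphological Analyzer, as described above. The probability tables and the `morph_analyze` function are assumed to come from the training step and the analyzer interface sketched earlier; the use of log-probabilities and of Π for the first word are implementation choices, not details taken from the paper.

```python
import math

def viterbi_tag(words, trans_p, emit_p, init_p, morph_analyze):
    """Return the tag sequence maximizing the product of
    p(ti | ti-1) * p(wi | ti), restricting each word's candidate tags
    to those given by the Morphological Analyzer.
    Assumes `words` and every candidate set are non-empty."""
    # log-probabilities avoid underflow on long sentences
    lp = lambda x: math.log(x) if x > 0 else float("-inf")
    candidates = [morph_analyze(w) for w in words]

    # initialise with the first word's candidate tags
    best = {t: (lp(init_p.get(t, 0.0)) + lp(emit_p.get((t, words[0]), 0.0)), [t])
            for t in candidates[0]}
    for i in range(1, len(words)):
        nxt = {}
        for t in candidates[i]:
            # best predecessor for tag t at position i
            score, path = max(
                ((s + lp(trans_p.get((prev, t), 0.0))
                    + lp(emit_p.get((t, words[i]), 0.0)), p + [t])
                 for prev, (s, p) in best.items()),
                key=lambda x: x[0])
            nxt[t] = (score, path)
        best = nxt
    return max(best.values(), key=lambda x: x[0])[1]
```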
VI. EXPERIMENT RESULTS

The system performance is evaluated in two ways. First, the system is tested with Leave-One-Out Cross-Validation (LOOCV): from N tagged files we use N-1 files for training and 1 file for testing, and this is done for each individual file of the N tagged files. This evaluation technique is applied to three approaches to determine the precision. In our POS tagging evaluation we use 20 files, each consisting of 25 sentences.

precision = (Correctly tagged words by the system) / (Total no. of words in the evaluation set)
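The precision measure and the leave-one-out protocol amount to a few lines of code. The sketch below is a generic rendering of the evaluation loop, not the authors' script; the file format and the `train_fn`/`tag_fn` hooks are assumptions.

```python
def precision(predicted, gold):
    """Fraction of words whose predicted tag matches the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def loocv_precision(files, train_fn, tag_fn):
    """Leave-one-out: train on N-1 tagged files, test on the held-out one.

    Each file is assumed to be a list of (words, tags) sentence pairs;
    `train_fn` builds a model from files, `tag_fn` tags a word list.
    """
    scores = []
    for i, held_out in enumerate(files):
        model = train_fn(files[:i] + files[i + 1:])
        pred_tags, gold_tags = [], []
        for words, tags in held_out:
            pred_tags.extend(tag_fn(model, words))
            gold_tags.extend(tags)
        scores.append(precision(pred_tags, gold_tags))
    return sum(scores) / len(scores)   # average precision over the N folds
```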
We have tested three different approaches to POS tagging:

Method 1: POS tagging using only supervised learning.
Method 2: POS tagging using partially supervised learning, decoding the best tag sequence without the Morphological Analyzer restriction.
Method 3: POS tagging using partially supervised learning, decoding the best tag sequence with the Morphological Analyzer restriction.

The evaluation results are given in the following table:

            Method 1   Method 2   Method 3
Precision     64.31      67.6       96.28

The above table indicates the high 96.28% accuracy of the hybrid system. To verify the correctness of this precision we tried another approach to evaluating the system. We took 100 sentences (1003 words) randomly from the CIIL corpus and tagged them manually; the sentences taken from the CIIL corpus are more complex than the sentences used in the tagged data. The precision is calculated using the above formula.

            Method 1   Method 2   Method 3
Precision     59.93      61.79      84.37

On this data set the precision is much lower. Many errors are due to the incomplete lexicon used in our Morphological Analyzer and also to the unavailability of a proper noun identifier. Morphological errors are of two types – either a particular word is not found in the Morphological Analyzer, or the Morphological Analyzer does not cover all possible tags of a word. To find out the actual accuracy of our model we manually entered the possible parts-of-speech for all the words of the test set that are not covered by the Morphological Analyzer. We also made a list of all proper nouns in our test data set, and at the time of evaluation we marked all proper nouns from that list. We tested this modification of Method 3 and obtained an average precision of 95.18%.

            Method 3
Precision     95.18

VII. CONCLUSION AND FUTURE WORK

This paper presents a model for POS tagging of a relatively free word order language, Bengali. On the basis of our preliminary experiments the system is found to have an accuracy of 95%. The system uses a small set of tagged sentences; it also uses an untagged corpus and a morphological analyzer. The precision is affected by the incomplete lexicon of the Morphological Analyzer and by errors in the untagged corpus. It is expected that the system accuracy will increase by correcting the typographical errors in the untagged corpus and by increasing the accuracy of the Morphological Analyzer. A rule-based component could also be added to the model to detect and correct the remaining errors. The POS tagger is useful for chunking, clause boundary identification and other NLP applications.

REFERENCES

[1] E. Brill, "A Simple Rule-Based Part-of-Speech Tagger", University of Pennsylvania, 1992.
[2] A. M. Derouault and B. Merialdo, "Natural language modeling for phoneme-to-text transcription", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.
[3] K. W. Church, "A stochastic parts program and noun phrase parser for unrestricted text", Proceedings of the Second Conference on Applied Natural Language Processing (ACL), 1988.
[4] L. E. Baum, "An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process", Inequalities, 1972.
[5] A. Ratnaparkhi, "A Maximum Entropy Part-of-Speech Tagger", Proceedings of the Empirical Methods in NLP Conference, University of Pennsylvania, 1996.
[6] D. Cutting et al., "A Practical Part-of-Speech Tagger", Proceedings of the Third Conference on Applied Natural Language Processing, 1992.
[7] J. Kazama, "A Maximum Entropy Tagger with Unsupervised Hidden Markov Models", NLPRS, 2001.
[8] J. Allen, "Natural Language Understanding", pages 195–203.
[9] D. Jurafsky and J. H. Martin, "Speech and Language Processing", pages 287–320, Pearson Education.
