
A hybrid model for Part-of-Speech tagging and its application to Bengali
Sandipan Dandapat, Sudeshna Sarkar, Anupam Basu

Manuscript submitted on November 04, 2004.
Sandipan Dandapat, Indian Institute of Technology Kharagpur, West Bengal, India; e-mail: sandipan_242@yahoo.com
Prof. Sudeshna Sarkar, Dept. of Computer Sc. and Engg., Indian Institute of Technology Kharagpur.
Prof. Anupam Basu, Dept. of Computer Sc. and Engg., Indian Institute of Technology Kharagpur.

Abstract— This paper describes our work on Bengali Part-of-Speech (POS) tagging using a corpus-based approach. There are several approaches to part-of-speech tagging; this paper deals with a model that combines supervised and unsupervised learning using a Hidden Markov Model (HMM). We make use of a small tagged corpus and a large untagged corpus, as well as a Morphological Analyzer. Bengali is a highly ambiguous and relatively free word order language. We have obtained an overall accuracy of 95%.
Keywords— Bengali, Hidden Markov Model, Morphological Analyzer, Part-of-Speech tagging.
I. INTRODUCTION

Part-of-Speech (POS) tagging is a technique for the automatic annotation of lexical categories: it assigns an appropriate part-of-speech tag to each word in a sentence of a language. A POS tagger takes a sentence as input and assigns a unique part-of-speech tag to each lexical item of the sentence. POS tagging is an essential task for many natural language processing activities and is used as an early stage of linguistic text analysis in many applications, including subcategory acquisition, text-to-speech synthesis, and alignment of parallel corpora. There are a variety of techniques for POS tagging. Two approaches to POS tagging are:

1. Supervised POS tagging
2. Unsupervised POS tagging

A supervised tagging technique requires a pre-tagged corpus, whereas an unsupervised tagging technique does not. Both supervised and unsupervised tagging can be of two types: rule-based and stochastic.
A rule-based system needs context rules for POS tagging. Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words; these rules are often known as context frame rules.

A stochastic tagging technique makes use of a corpus. The most common stochastic tagging technique uses a Hidden Markov Model (HMM), whose states usually denote the POS tags. The probabilities are estimated from a tagged training corpus or an untagged corpus in order to compute the most likely POS tags for the words of an input sentence. Stochastic tagging techniques can be of two types depending on the training data. Supervised stochastic tagging techniques use only tagged data; however, the supervised method requires a large amount of tagged data to achieve a high level of accuracy. Unsupervised stochastic techniques, on the other hand, do not require a pre-tagged corpus but instead use computational methods to automatically induce word groupings (i.e. tag sets) and, based on these automatic groupings, calculate the probabilistic values needed by stochastic taggers.

Our approach combines supervised and unsupervised stochastic techniques for training an HMM. We also use a Morphological Analyzer for Bengali in our POS tagging technique. The Morphological Analyzer takes a word as input and gives all possible POS tags for the word.

II. LINGUISTIC CHARACTERISTICS OF BENGALI

Present-day Bengali has two literary styles. One is called "Sadhubhasa" (elegant language) and the other "Chaltibhasa" (current language). The former is the traditional literary style based on Middle Bengali of the sixteenth century. The latter is practically a creation of the present century, and is based on the cultivated form of the dialect spoken in Kolkata by educated people originally coming from districts bordering the lower reaches of the Hoogly. Our POS tagger deals with Chaltibhasa.

Bengali is a relatively free word order language compared with European languages. For example, consider the simple English sentence

I eat rice → PRP VB NN

The possible Bengali equivalents of the above English sentence are:

Ami bhAta khAi (I rice eat) → PRP NN VB
Ami khAi bhAta (I eat rice) → PRP VB NN
bhAta Ami khAi (Rice I eat) → NN PRP VB
bhAta khAi Ami (Rice eat I) → NN VB PRP
khAi Ami bhAta (Eat I rice) → VB PRP NN
khAi bhAta Ami (Eat rice I) → VB NN PRP

Part-of-speech tagging using linguistic rules is a difficult problem for such a free word order language. An HMM can capture the language model from the perspective of POS tagging.
We consider 40 different tags for POS tagging. A POS tagger is an essential tool for the design and development of Natural Language Processing applications, and a major problem of NLP is word sense disambiguation. A larger tag set reduces the ambiguity problem but it also reduces the parsing complexity: given a POS tagged sentence, local word groups are easier to identify if we have a large number of tags, and a large tag set also facilitates shallow parsing. Our goal is to achieve high accuracy using a large tag set.
III. BACKGROUND WORK

Different approaches have been used for part-of-speech tagging. Some previous work has focused on rule-based, linguistically motivated part-of-speech tagging, such as the work of Brill (1992, 1994) [1]. Brill's tagger uses a two-stage architecture: the input tokens are initially tagged with their most likely tags, and an automatically acquired set of lexical rules is employed to identify unknown words. TnT is a stochastic HMM tagger which uses a suffix analysis technique to estimate lexical probabilities for unknown tokens, based on properties of words in the training corpus which share the same suffix.

Recent stochastic methods achieve high accuracy in part-of-speech tagging tasks. They resolve ambiguity on the basis of the most likely interpretation. Markov models have been widely used to disambiguate part-of-speech categories. There have been two types of work: one using a tagged corpus and the other using an untagged corpus.

The first model uses a pre-tagged corpus. A bootstrapping method for training was designed by Deroualt and Merialdo [2]. In this model they used a small pre-tagged corpus to determine the initial model; this initial model is used to tag more text, and the tags are manually corrected to retrain the model. Church used the Brown corpus to estimate the probabilities [3]. Existing methods assume a large annotated corpus and/or a dictionary, but it is often the case that we have no annotated corpus, or only a small one, at the time of developing a part-of-speech tagger for a new language.
The second model uses an untagged corpus. Supervised methods are not always applicable when a large annotated corpus is not available, and several works have used unsupervised learning to learn an HMM for POS tagging. The Baum-Welch algorithm [4] can be used to learn an HMM from un-annotated data. The maximum entropy model is also powerful enough to achieve high accuracy in the tagging task [5]; it uses a rich feature representation and generates a tag probability distribution for each word.

Cutting et al. [6] used a Hidden Markov Model for part-of-speech tagging. Their methodology uses a lexicon and some untagged text for accurate and robust tagging. There are three modules in this system: tokenizer, training and tagging. The tokenizer identifies an ambiguity class (set of tags) for each word. The training module takes a sequence of ambiguity classes as input and uses the Baum-Welch algorithm, run over a large corpus, to produce a trained HMM. The tagging module buffers sequences of ambiguity classes between sentence boundaries; these sequences are disambiguated by computing the maximal path through the HMM with the Viterbi algorithm. In our POS tagging for Bengali we also use the Baum-Welch algorithm to learn from an untagged corpus, but instead of learning completely from the untagged data we also use tagged data to determine the initial HMM model. Like Cutting et al. we take the help of ambiguity classes, but our ambiguity class is taken from the Morphological Analyzer, and instead of using the ambiguity class both at learning time and at decoding time, we use it only at decoding time.

Another model was designed for the tagging task by combining an unsupervised Hidden Markov Model with maximum entropy [7]. The methodology uses unsupervised learning of an HMM together with a maximum entropy model. Training of the HMM is done by the Baum-Welch algorithm with an un-annotated corpus, using 320 states for the initial HMM model. These HMM parameters are used as the features of the maximum entropy model, and a small annotated corpus is used to assign the actual tag corresponding to each state.

IV. HIDDEN MARKOV MODELING

Hidden Markov Models (HMMs) have been widely used in various NLP tasks. A Hidden Markov Model is a probabilistic finite state machine having a set of states (Q), an output alphabet (O), transition probabilities (A), output probabilities (B) and initial state probabilities (π).

Q = {q1, q2, ..., qn} is the set of states and O = {o1, o2, ..., om} is the set of observations.

A = {aij = P(qj at t+1 | qi at t)}, where P(a | b) is the conditional probability of a given b, t ≥ 1 is time, and qi belongs to Q. aij is the probability that the next state is qj given that the current state is qi.
B = {bik = P(ok | qi)}, where ok belongs to O. bik is the probability that the output is ok given that the current state is qi.

π = {pi = P(qi at t = 1)} denotes the initial probability distribution over the states.

In our HMM, states correspond to part-of-speech tags and observations correspond to words. We aim to learn the parameters of the HMM from our corpus; the HMM is then used to assign the most probable tag to each word of an input sentence. We use a bi-gram model. We first tried purely supervised learning from the tagged corpus but, possibly because the corpus size is so small, we achieved an accuracy of only 65%. Therefore we decided to use a raw corpus in addition to the tagged corpus.
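The parameter set above maps naturally onto a few tables keyed by tags and words. The following minimal sketch (our illustration, not the authors' code; all names are invented) shows one way such a bi-gram HMM could be held in Python:

    from collections import defaultdict

    class BigramHMM:
        """A bi-gram HMM for POS tagging: states are tags, observations are words."""
        def __init__(self, tagset):
            self.tagset = list(tagset)            # Q: the states (POS tags)
            self.transition = defaultdict(float)  # A: (prev_tag, tag) -> P(tag | prev_tag)
            self.emission = defaultdict(float)    # B: (tag, word) -> P(word | tag)
            self.initial = defaultdict(float)     # pi: tag -> P(tag at t = 1)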
The HMM probabilities are updated using both the tagged and the untagged corpus. For the tagged corpus, sampling is used to update the probabilities; for the untagged corpus, the EM algorithm is used.

V. A HYBRID TAGGING MODEL

We first outline our training method. The training module is based on partially supervised learning: it makes use of some tagged data and a larger amount of untagged data, and we estimate the transition and emission probabilities from this partially supervised learning.
A. Training

In the training module we use both types of sentences, tagged and untagged.

Tagged data: five hundred tagged sentences, for supervised learning.
Untagged data: raw data for re-estimating the parameters (50,000 words).

First we describe how we learn using the tagged data; then we outline the learning process from the untagged data.

Our algorithm runs over a number of iterations. First we process the tagged data by supervised learning; then each iteration processes the untagged data and updates the transition probabilities, i.e. p(tag | previous tag), and the emission probabilities, i.e. p(word | tag), of the Hidden Markov Model. Using tagged data, each word maps to one state, as the correct part-of-speech is known; using untagged data, each word maps to all states, because the part-of-speech tags are not known, i.e. all states are considered possible. In supervised learning, we calculate the bi-gram counts of a particular tag given a previous tag from the tagged corpus.
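As an illustration, the supervised pass amounts to simple counting over the tagged sentences. In the sketch below (ours, not the authors'), tagged_sentences is assumed to be a list of sentences, each a list of (word, tag) pairs:

    from collections import defaultdict

    def count_from_tagged(tagged_sentences):
        """Collect tag bi-gram counts and word-tag counts from a tagged corpus."""
        bigram_counts = defaultdict(float)  # (prev_tag, tag) -> count
        emit_counts = defaultdict(float)    # (tag, word) -> count
        for sentence in tagged_sentences:
            prev_tag = "<s>"                # sentence-start marker (our convention)
            for word, tag in sentence:
                bigram_counts[(prev_tag, tag)] += 1.0
                emit_counts[(tag, word)] += 1.0
                prev_tag = tag
        return bigram_counts, emit_counts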
We use the untagged data (50,000 words) to re-estimate the bi-gram counts from tag to tag and also to re-estimate the unigram counts of a word given a particular tag. This re-estimation of counts from the untagged data is achieved using the Baum-Welch algorithm. In each iteration of the Baum-Welch algorithm we obtain some expected counts and add them to the previous counts: for the first iteration the previous counts are the counts from the tagged data, and for the second iteration they are the counts after the first iteration. Finally the Baum-Welch algorithm ends up holding the training counts plus the raw counts. We use ten iterations of the algorithm to modify the initial counts estimated from the tagged data.
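The count-blending loop described above can be outlined as follows. This is a schematic reading of the paper, not the authors' implementation: expected_counts stands for one forward-backward (Baum-Welch) E-step over the raw corpus under the current model, which is not shown here.

    def reestimate(bigram_counts, emit_counts, untagged_sentences, iterations=10):
        """Add expected counts from untagged data to the supervised counts."""
        for _ in range(iterations):
            # E-step over the raw corpus (forward-backward, not shown).
            exp_bigram, exp_emit = expected_counts(bigram_counts, emit_counts,
                                                   untagged_sentences)
            # Accumulate: previous counts plus this iteration's expected counts.
            for key, c in exp_bigram.items():
                bigram_counts[key] += c
            for key, c in exp_emit.items():
                emit_counts[key] += c
        return bigram_counts, emit_counts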
We calculate the transition probabilities A and the emission probabilities B from the above counts. The transition probability of the next tag given the current tag is calculated simply by the following formula:

P(ti | ti-1) = C(ti-1 ti) / (total number of bi-grams starting with ti-1)

where ti is the current tag and ti-1 is the previous tag.

For the emission probability, we count the unigrams of each word along with the tag assigned to it in the tagged data, and calculate the probability of a word given a particular tag using the same formula, with the tag in place of ti-1 and the word in place of ti. We also use add-one smoothing to avoid zero transition and emission probabilities.
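Concretely, with add-one smoothing the estimates take the following form (again a sketch under our assumptions; tagset and vocab are the known tag and word inventories):

    from collections import defaultdict

    def estimate_probs(bigram_counts, emit_counts, tagset, vocab):
        """Relative-frequency estimates with add-one smoothing."""
        tag_total = defaultdict(float)    # C(t): bi-grams starting with tag t
        for (prev_tag, _), c in bigram_counts.items():
            tag_total[prev_tag] += c
        emit_total = defaultdict(float)   # C(t): tokens emitted from tag t
        for (tag, _), c in emit_counts.items():
            emit_total[tag] += c
        # P(t | t_prev) = (C(t_prev t) + 1) / (C(t_prev) + |tagset|)
        transition = {(tp, t): (bigram_counts.get((tp, t), 0.0) + 1.0)
                               / (tag_total[tp] + len(tagset))
                      for tp in tagset for t in tagset}
        # P(w | t) = (C(t, w) + 1) / (C(t) + |vocab|)
        emission = {(t, w): (emit_counts.get((t, w), 0.0) + 1.0)
                            / (emit_total[t] + len(vocab))
                    for t in tagset for w in vocab}
        return transition, emission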

B. Decoding

The decoding module finds the most probable tag sequence for a sentence. We use the Viterbi algorithm to calculate the most probable path (best tag sequence) for a given word sequence (sentence). Instead of considering all possible tags for each word of the test data, we consider only the tags given by the Morphological Analyzer: we feed each word to our Morphological Analyzer, which outputs all possible parts-of-speech of that word. Considering all possible tags from the tagset increases the number of paths, but the use of the Morphological Analyzer reduces the number of paths, as illustrated in Figure 1 for the sentence "Ami ekhana chA khete yAba".

[Figure 1 shows the tag lattice for this sentence, with the candidate tags proposed by the Morphological Analyzer: Ami → PP; ekhana → NN, PT; chA → NN, VIS; khete → NN; yAba → VF.]
Figure 1: Possible tags are taken from the Morphological Analyzer
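The saving is substantial: with a 40-tag tagset, a five-word sentence admits 40^5 ≈ 1.02 × 10^8 candidate tag sequences, whereas under the tags shown in Figure 1 (one or two per word) the lattice contains only 1 × 2 × 2 × 1 × 1 = 4 paths.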

A word is unknown to the HMM if it has not occurred during training. However, even for an unknown word the Morphological Analyzer gives all possible tags of the word, and these possible part-of-speech tags are used during tagging. In Figure 1, each word has the possible tags given by the Morphological Analyzer; for example, the word chA has two different tags, NN and VIS. Using the above restriction on the tags of each word, and the transition and emission probabilities from the partially supervised model, the most probable path (best tag sequence) for a given word sequence is found with the Viterbi tagging algorithm. The best path is the tag sequence t1 ... tn maximizing

∏ (i = 1 to n) p(ti | ti-1) p(wi | ti)

This approach offers high overall accuracy even if only a small tagged corpus is used.
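A compact version of this restricted Viterbi search might look as follows (a sketch, not the authors' code; ma_tags(word) stands for the Morphological Analyzer lookup, and transition and emission are the smoothed tables estimated above):

    def viterbi(words, transition, emission, ma_tags, start="<s>"):
        """Most probable tag sequence, restricted to MA-proposed tags."""
        best = {start: (1.0, [])}         # tag -> (best score, best path)
        for word in words:
            new_best = {}
            for tag in ma_tags(word):     # consider only MA-proposed tags
                candidates = []
                for prev, (score, path) in best.items():
                    p = (transition.get((prev, tag), 1e-9)  # floor unseen pairs
                         * emission.get((tag, word), 1e-9))
                    candidates.append((score * p, path + [tag]))
                new_best[tag] = max(candidates)
            best = new_best
        return max(best.values())[1]

In practice the products would be accumulated in log space to avoid underflow on long sentences.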
VI. EXPERIMENTAL RESULTS

The system performance is evaluated in two ways. First, the system is tested by leave-one-out cross-validation (LOOCV): from N tagged files we use N-1 files for training and 1 file for testing, and this is done for each of the N tagged files in turn. This technique is applied to three approaches in order to determine the precision. In our POS tagging evaluation we use 20 files, each consisting of 25 sentences.
precision = (correctly tagged words by the system) / (total number of words in the evaluation set)
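The whole evaluation can be sketched as below (illustrative only; train builds a tagger from tagged files, tag applies it, and each file is a list of sentences of (word, tag) pairs):

    def precision(predicted, gold):
        """Fraction of words whose predicted tag matches the gold tag."""
        return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

    def loocv(files, train, tag):
        """Leave-one-out cross-validation over N tagged files."""
        scores = []
        for i, held_out in enumerate(files):
            model = train(files[:i] + files[i + 1:])  # train on the other N-1
            gold = [t for sent in held_out for _, t in sent]
            pred = [t for sent in held_out
                    for t in tag(model, [w for w, _ in sent])]
            scores.append(precision(pred, gold))
        return sum(scores) / len(scores)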
We have tested three different approaches to POS tagging.

Method 1: POS tagging using only supervised learning.
Method 2: POS tagging using partially supervised learning, decoding the best tag sequence without the Morphological Analyzer restriction.
Method 3: POS tagging using partially supervised learning, decoding the best tag sequence with the Morphological Analyzer restriction.
The evaluation results are given in the following table:

                Method 1   Method 2   Method 3
Precision (%)    64.31      67.6       96.28

The table indicates the high accuracy, 96.28%, of the hybrid system. To verify the correctness of this precision we tried another approach to evaluating the system: we took 100 sentences (1,003 words) at random from the CIIL corpus and tagged them manually, the sentences taken from the CIIL corpus being more complex than those used in the tagged data. The precision is calculated using the above formula.

                Method 1   Method 2   Method 3
Precision (%)    59.93      61.79      84.37

On this data set the precision is much lower. Many errors are due to the incomplete lexicon used in our Morphological Analyzer and to the unavailability of a proper noun identifier. Morphological errors are of two types: a particular word is not found by the Morphological Analyzer, or the Morphological Analyzer does not cover all possible tags of a word. To find the actual accuracy of our model we manually entered the possible parts-of-speech for all the words of the test set not covered by the Morphological Analyzer. We also made a list of all possible proper nouns in our test data set, and at the time of evaluation we marked all proper nouns from that list. We tested these modifications over Method 3 and obtained an average precision of 95.18%.

                Method 3
Precision (%)    95.18

VII. CONCLUSION AND FUTURE WORK

This paper presents a model of POS tagging for a relatively free word order language, Bengali. On the basis of our preliminary experiments the system is found to have an accuracy of 95%. The system uses a small set of tagged sentences, an untagged corpus and a morphological analyzer. The precision is affected by the incomplete lexicon of the Morphological Analyzer and by errors in the untagged corpus. It is expected that the system accuracy will increase by correcting the typographical errors in the untagged corpus and by increasing the accuracy of the Morphological Analyzer. A rule-based component can also be added to the model to detect and correct the remaining errors. The POS tagger is useful for chunking, clause boundary identification and other NLP applications.

REFERENCES

[1] E. Brill, "A Simple Rule-Based Part-of-Speech Tagger", University of Pennsylvania, 1992.
[2] A. M. Deroualt and B. Merialdo, "Natural Language Modeling for Phoneme-to-Text Transposition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.
[3] K. W. Church, "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text", Proceedings of the Second Conference on Applied Natural Language Processing (ACL), 1988.
[4] L. E. Baum, "An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of a Markov Process", Inequalities, 1972.
[5] A. Ratnaparkhi, "A Maximum Entropy Part-of-Speech Tagger", Proceedings of the Empirical Methods in Natural Language Processing Conference, University of Pennsylvania, 1996.
[6] D. Cutting et al., "A Practical Part-of-Speech Tagger", Proceedings of the Third Conference on Applied Natural Language Processing, 1992.
[7] J. Kazama et al., "A Maximum Entropy Tagger with Unsupervised Hidden Markov Models", NLPRS, 2001.
[8] J. Allen, "Natural Language Understanding", pages 195-203.
[9] D. Jurafsky and J. H. Martin, "Speech and Language Processing", pages 287-320, Pearson Education.
