Developing A POS Tagger For Magahi: A Comparative Study: Ritesh KUMAR Bornini LAHIRI Deepak ALOK


Developing a POS tagger for Magahi: A Comparative Study

Ritesh KUMAR (1), Bornini LAHIRI (1), Deepak ALOK (1)

(1) CENTRE FOR LINGUISTICS, Jawaharlal Nehru University, India

[email protected], [email protected], [email protected]

ABSTRACT

In this paper, we present a comparative study of four state-of-the-art sequential taggers applied to Magahi data for part-of-speech (POS) annotation. Magahi is one of the smaller Indo-Aryan languages, spoken in the eastern state of Bihar in India. It is an extremely resource-poor language, and this is the first attempt to develop some kind of Natural Language Processing (NLP) resource for the language. The four taggers that we test are: the Support Vector Machines (SVM) based SVMTool, the Hidden Markov Model (HMM) based TnT tagger, the Maximum Entropy based MxPost tagger and the Memory based MBT tagger. All these taggers are trained on a minuscule dataset of around 50,000 words using 33 tags from the BIS tagset for Indian languages and tested on around 13,000 words. The performance of all these taggers is tested against a frequency-based baseline tagger. While all these taggers perform worse than on the English data, the best performance is given by the Maximum Entropy tagger after tuning of certain parameters. The paper discusses the results of the taggers and the ways in which the performance of the taggers could be improved for Magahi.

KEYWORDS: POS TAGGERS, MAGAHI TAGGER, SVM, TNT TAGGER, MXPOST TAGGER, MBT TAGGER

Proceedings of the 10th Workshop on Asian Language Resources, pages 105-114, COLING 2012, Mumbai, December 2012.


1 Introduction

Historically, Magahi has been classified in different ways by different scholars. While Grierson (1903) puts Magahi under the Eastern group of the Outer sub-branch of Indo-Aryan languages, others like Turner have clubbed the 'Bihari' languages with Eastern and Western Hindi (Masica 1991). In the classification given by Chatterji (1926), Magahi is kept together with the other languages of the Eastern group, which is separate from Western Hindi. Jeffers (1976) gives a classification which is very similar to that of Grierson.

At present, Magahi is spoken mainly in the eastern states of India, including Bihar and Jharkhand, along with some parts of West Bengal and Orissa. There are three main varieties of Magahi spoken today (Verma, 1991):

- Central Magahi of Patna, Gaya, Hazaribagh (in Bihar)
- South-Eastern Magahi of Ranchi (in Jharkhand) and some parts of Orissa
- Eastern Magahi of Begusarai and Munger (in Bihar)

Some other scholars like Verma (2003) and Grierson (1903) have also classified the South-Eastern and Eastern varieties together.

1.1 Magahi: Socio-Political Situation

Socially, Magahi is considered a dialect of Hindi, even though historically as well as linguistically Magahi is distinct enough from Hindi to be called a distinct language. This social attitude towards Magahi, where it is considered a dialect (and an inferior/distorted form) of Hindi, has emanated largely from the political representation of the language as a dialect (or, Mother Tongue in the Census) of Hindi as well as the close lexical affinity of the two languages (Kumar, et al., 2011). As a result of this socio-political attitude, Magahi has remained a largely ignored language outside linguistic studies despite the presence of quite a large population of Magahi speakers (counting up to 13,978,565 according to the Census of India, 2001).

1.2 Linguistic Features of Magahi

There have been very few linguistic studies on Magahi. However, a basic (although not completely accurate) description of Magahi is given by Verma (2003). A basic description of the linguistic features is given here. An initial analysis of the Magahi sound system shows that it has 35 phonemic sounds: 27 consonants and 8 vowels. Some of the major phonological features which distinguish Magahi from Hindi include the absence of word-initial consonant clusters, the absence of word-initial glides and the absence of word-medial and word-final dental laterals. Morphologically, it is a nominative-accusative, inflected language with almost free word order of constituents within phrases and sentences. Both nouns and adjectives have two basic forms. While one form is the basic form (as in or 'horse', son 'gold', r 'white', etc.), the other one is the derived form (as in or-b, son-m, r-k). The affixes used in the derived forms are the affixal particles which are used for different linguistic functions like specificity, definiteness etc. (Alok, 2010, 2012).


Unlike Hindi (which is a Noun Class language with two classes, also equated with Masculine and Feminine gender in the language), Magahi is a classifier language. It has three mensural classifiers: o/to, mni, sun (Alok, 2012). These classifiers encode information about how the referent is measured and are different from the other generally found classifiers, which characterise a noun in terms of certain inherent properties (Aikhenvald, 2000, 2006). Among these, while o/to measures nouns in terms of length or discrete quantity, mni and sun are used for measuring nouns in terms of amount (and so are used with the mass nouns) (Alok, 2012). It is to be noted that broadly these are numeral classifiers, since they are always attached to the numerals and quantifiers in a noun phrase and never to the noun itself. The presence of classifiers could prove to be a very strong indicator for the part-of-speech annotation of quantifiers and nouns in Magahi.

Syntactically, Magahi nouns do not have number and gender agreement with verbs. There are only a few nouns in Magahi which could be inflected for number. Moreover, adjectives also agree with such nouns in terms of number as well as sex (it should be noted here that sex refers to the natural sex of the noun in the case of animates and not the Noun class as it is used for Hindi, since such agreement could occur only with the animates for which males and females are distinctly recognised in the language). Verbs also agree with subjects in person and honorificity. It is to be noted that verbs could also agree with object as well as addressee honorificity if the object or the addressee is honorific.

1.3 Part-of-speech Annotation and Magahi

Part-of-speech annotation is generally considered the most basic step in developing any kind of NLP application. In recent times, several statistical and machine-learning based approaches have been applied to the task of POS annotation. Some of the major and most successful approaches include Hidden Markov Models (Brants, 2000), Maximum Entropy taggers (Ratnaparkhi, 1996), Transformation-based learning (Brill, 1994, 1995), Memory-based learning (Daelemans, et al., 2003) and Support Vector Machines (Cortes & Vapnik, 1995), besides several others. All these taggers are trained and evaluated on the WSJ corpus in English. On this corpus, all of them achieve very comparable accuracy, each differing only slightly from the others. In this paper, we have applied four of these POS taggers, viz., the HMM tagger, the MaxEnt tagger, the Memory-based tagger and the SVM tagger, to Magahi data for the purpose of developing a POS tagger for Magahi. The idea is to test which of these gives the best performance on the given dataset with their default settings. The performance is compared against a maximum-frequency baseline tagger.
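A maximum-frequency baseline of the kind mentioned above can be sketched in a few lines. The version below is only an illustration under stated assumptions: the paper does not describe how its baseline treats unseen words, so falling back to the globally most frequent tag is an assumption, and the tokens `w1` to `w5` are placeholders rather than Magahi data.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn the most frequent tag per word and the most frequent tag overall."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    # For every word seen in training, keep only its most frequent tag.
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    # Assumption: unseen words receive the globally most frequent tag.
    default_tag = tag_counts.most_common(1)[0][0]
    return lexicon, default_tag


def tag_baseline(words, lexicon, default_tag):
    """Tag each word with its most frequent training tag, or the default tag."""
    return [(w, lexicon.get(w, default_tag)) for w in words]


# Toy training data with placeholder tokens (not real Magahi words).
train = [
    [("w1", "N__NN"), ("w2", "V__VM")],
    [("w1", "N__NN"), ("w3", "PSP"), ("w2", "V__VM")],
    [("w1", "JJ"), ("w4", "N__NN")],
]
lexicon, default_tag = train_baseline(train)
baseline_output = tag_baseline(["w1", "w5", "w2"], lexicon, default_tag)
print(baseline_output)
```

Here `w1` keeps its majority tag, while the unseen `w5` falls back to the default tag. Such a tagger uses no context at all, which is why it serves as a lower bound for the sequential taggers.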

2 Experimental Setup

2.1 Dataset

We have used around 50,000 manually POS-tagged words for training each of the taggers, and they are tested on around 11,000 words. The corpus consists of data taken from a collection of Magahi folktales. Since Magahi is largely a spoken language and there is very scant availability of written material, the collection of folktales was the most readily available as well as the most standardised written data.


2.2 Tagset

For the annotation of Magahi data, we have used a modified version of the BIS standard tagset for Indian Languages. The complete tagset, which consists of 33 tags, is given in Table 1 (the corpus tagged with this tagset is the same as described in Kumar, et al. (2011), but the tagset is slightly modified).

| Sl. No | Top level | Subtype (level 1) | Label | Annotation Convention | Examples (in IPA) |
|---|---|---|---|---|---|
| 1 | Noun | | N | N | |
| 1.1 | | Common | NN | N__NN | "678 (boy), ""ri7 (a small bridge-like st.), l9te (naked) |
| 1.2 | | Proper | NNP | N__NNP | )ul*a |
| 1.3 | | Nloc | NST | N__NST | a8i7, )i"a8i7 |
| 2 | Pronoun | | PR | PR | |
| 2.1 | | Personal | PRP | PR__PRP | m, mni7 |
| 2.2 | | Reflexive | PRF | PR__PRF | )ne |
| 2.3 | | Relative | PRL | PR__PRL | e, ekr |
| 2.4 | | Reciprocal | PRC | PR__PRC | )ne |
| 2.5 | | Wh-word | PRQ | PR__PRQ | k, ke |
| 2.6 | | Indefinite | PRI | PR__PRI | koi, kekr |
| 3 | Demonstrative | | DM | DM | |
| 3.1 | | Deictic | DMD | DM__DMD | ' : a:, : a: |
| 3.2 | | Relative | DMR | DM__DMR | e, un |
| 3.3 | | Wh-word | DMQ | DM__DMQ | kekr, kun |
| 3.4 | | Indefinite | DMI | DM__DMI | i, |
| 4 | Verb | | V | V | |
| 4.1 | | Main | VM | V__VM | lukn (to see), ); :"n (to wash clothes), urn (to get entangled) |
| 4.2 | | Auxiliary | VAUX | V__VAUX | i, li7, t <i7 |
| 5 | Adjective | | JJ | JJ | "kit (short and well-built), bt <)ros (uselessly talkative) |
| 6 | Adverb | | RB | RB | "bk (with splash), "br-"br (a manner of eating) |
| 7 | Postposition | | PSP | PSP | ke, me, )r, ore |
| 8 | Conjunction | | CC | CC | |
| 8.1 | | Co-ordinator | CCD | CC__CCD | , bki7, bluk |
| 8.2 | | Subordinator | CCS | CC__CCS | k eki, t <, ki |
| 9 | Particles | | RP | RP | |
| 9.1 | | Default | RPD | RP__RPD | t <, bi7 |
| 9.2 | | Classifier | CL | RP__CL | o, to |
| 9.3 | | Interjection | INJ | RP__INJ | re, e, "i7, b)re |
| 9.4 | | Intensifier | INTF | RP__INTF | t t , tu tu , bk-bk |
| 9.5 | | Negation | NEG | RP__NEG | n, mt <, bin |
| 10 | Quantifiers | | QT | QT | ek, ) il, ku" |
| 10.1 | | General | QTF | QT__QTF | tni7sun, !ermni7 |
| 10.2 | | Cardinals | QTC | QT__QTC | ek, !u, ir |
| 10.3 | | Ordinals | QTO | QT__QTO | ) il, !sr |
| 11 | Residuals | | RD | RD | |
| 11.1 | | Foreign word | RDF | RD__RDF | A word in foreign script. |
| 11.2 | | Symbol | SYM | RD__SYM | For symbols such as $, & etc. |
| 11.3 | | Punctuation | PUNC | RD__PUNC | Only for punctuations |
| 11.4 | | Unknown | UNK | RD__UNK | |
| 11.5 | | Echowords | ECH | RD__ECH | ()ni7-) uni7, (kn-) un |

TABLE 1: Magahi Tagset
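The annotation convention column of Table 1 joins the top-level category and the subtype label into a single tag. As a small sketch, assuming the separator is a double underscore (as in `N__NN`) and that categories without subtypes (such as `PSP`) are written bare, the hierarchical tags can be split back into their two levels. The helper name `split_label` is hypothetical.

```python
def split_label(label):
    """Split a tag such as 'N__NN' into (top_level, subtype).

    Tags without a subtype (for example 'PSP') return None as the subtype.
    """
    top, sep, sub = label.partition("__")
    return (top, sub) if sep else (label, None)


def group_by_top_level(labels):
    """Group full tags under their top-level category."""
    groups = {}
    for label in labels:
        top, _ = split_label(label)
        groups.setdefault(top, []).append(label)
    return groups


sample_tags = ["N__NN", "N__NNP", "PR__PRP", "PSP", "RD__PUNC"]
print(group_by_top_level(sample_tags))
```

Collapsing subtypes to the top level in this way would allow a coarse-grained evaluation alongside the fine-grained one; that use is a suggestion and is not something reported in the paper.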

2.3 Tagger Tools

We have used the following tools to train different taggers on Magahi data:

- MxPost for the Maximum-entropy tagger (Ratnaparkhi, 1996). It uses contextual features like the preceding words and tags, as well as morphological features of the words like suffixes and prefixes, for tagging any given word.
- MBT for the Memory-based tagger (Daelemans, et al., 2003). It creates two separate taggers after training. One is used exclusively for known words and uses contextual features; the other is used exclusively for unknown words and uses lexical information along with the contextual features. These features are customisable as per the needs of the user.
- TnT for the HMM-based tagger (Brants, 2000). As the name of the tagger itself suggests (Trigrams 'n' Tags), it uses trigrams and the tags of the preceding words as features for training.
- SVMTool for the SVM-based tagger (Gimenez and Marquez, 2004). This tool provides an interface for using SVM-Light (Joachims, 1999). The features that can be used with this tool are similar to those of the other tools, viz., morphological features of the word and of the context, as well as the tags of the preceding words.

In general, we have used these tools with the default/recommended settings for carrying out the experiments. Each of these tools was trained on exactly the same corpus, consisting of around 50,000 tokens.
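The kinds of features these tools rely on (neighbouring words, tags of the preceding words, and prefixes and suffixes of the current word) can be illustrated with a minimal sketch. This is not the feature template of MxPost, MBT, TnT or SVMTool; the function name `extract_features`, the feature names and the affix length of 3 are assumptions made for illustration.

```python
def extract_features(words, tags_so_far, i, affix_len=3):
    """Build a feature dictionary for the word at position i.

    words       : list of tokens in the sentence
    tags_so_far : tags already assigned to words[0..i-1]
    affix_len   : maximum prefix/suffix length (an assumed value)
    """
    word = words[i]
    feats = {
        "word": word,
        # Contextual features: neighbouring words and preceding tags.
        "prev_word": words[i - 1] if i > 0 else "<S>",
        "next_word": words[i + 1] if i + 1 < len(words) else "</S>",
        "prev_tag": tags_so_far[i - 1] if i > 0 else "<S>",
        "prev2_tag": tags_so_far[i - 2] if i > 1 else "<S>",
    }
    # Morphological features: prefixes and suffixes up to affix_len characters.
    for n in range(1, min(affix_len, len(word)) + 1):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    return feats


# Placeholder tokens and tag, not real Magahi data.
features = extract_features(["ab", "cde", "f"], ["T1"], 1)
print(features)
```

Affix features are what allow a tagger to make an informed guess about words it has never seen, which matters for a small training corpus where a sizeable share of test tokens is unknown.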

3 Results and Analysis

The results obtained from the four tools are summarised in Table 2 below.

| Tagger | Known Words (86%) | Unknown Words (14%) | Overall |
|---|---|---|---|
| TnT | 89.75% | 67.57% | 86.09% |
| MBT | 89.15% | 72.97% | 86.22% |
| MxPost | NA | NA | 89.61% |
| SVMTool | 81.89% | 18.11% | 41.46% |
| Baseline | NA | NA | 71.18% |

TABLE 2: Comparison of the taggers

As mentioned above, each tagger was also tested on exactly the same dataset, which consisted of around 13,000 tokens. Out of these 13,000 tokens, around 86% were known tokens (i.e. they were also present in the training set) while 14% were unknown tokens (they were encountered by the tagger for the first time in the test set itself). As shown in the table, MxPost gives the best overall performance; however, since its evaluation results are not calculated separately for known and unknown words, the break-up is not known. MBT and TnT give comparable overall results, but MBT is significantly more accurate with unknown words. The most dismal performance is given by SVMTool, which could be explained only by the extremely small dataset and the presence of a large number of classes which need to be classified.

An error analysis of the data shows that the major source of error in the annotation is the serial verb constructions in all the three taggers. While MxPost performs slightly better, in general, detection of the second verb (if it is a compound verb) or the noun of the second verb complex (if it is a conjunct verb) proved to be very problematic. Another source of error was a complete absence of examples of certain closed-class categories in the training set, viz., interjections. Besides these two, as expected, lexical ambiguity was also one of the minor sources of annotation error.
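The known/unknown break-up reported for the taggers can be computed from any tagger's output with a short script. The sketch below is an assumption about how such an evaluation might be implemented, not the authors' evaluation code; a token counts as known if its word form occurs anywhere in the training vocabulary, and the data shown are placeholders.

```python
def accuracy_breakdown(train_vocab, gold, predicted):
    """Return accuracy on known tokens, unknown tokens and overall.

    train_vocab : set of word forms seen in training
    gold        : list of (word, gold_tag) pairs from the test set
    predicted   : list of predicted tags, aligned with gold
    """
    stats = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (word, gold_tag), pred_tag in zip(gold, predicted):
        bucket = "known" if word in train_vocab else "unknown"
        stats[bucket][1] += 1
        stats[bucket][0] += int(gold_tag == pred_tag)
    total_correct = sum(correct for correct, _ in stats.values())
    total = sum(count for _, count in stats.values())
    result = {k: (c / n if n else None) for k, (c, n) in stats.items()}
    result["overall"] = total_correct / total if total else None
    return result


# Toy example with placeholder tokens and tags.
vocab = {"a", "b"}
gold = [("a", "X"), ("b", "Y"), ("c", "X"), ("a", "X")]
pred = ["X", "X", "X", "Y"]
breakdown = accuracy_breakdown(vocab, gold, pred)
print(breakdown)
```

Because the overall figure is a count-weighted combination of the two buckets, a tagger that is weak on unknown words can still look reasonable overall when unknown tokens are a small share of the test set.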

4 Conclusion and Way Ahead


In this paper, we have presented a comparison of four state-of-the-art POS taggers with respect to their performance on the Magahi data. The best overall accuracy, as well as accuracy on the known and unknown words individually, is given by the maximum-entropy based MxPost tagger. However, the accuracy (just below 90%) is much below the generally expected accuracy of POS taggers. As the error analysis shows, all the taggers perform poorly on very similar kinds of words, so combining different taggers could not solve the problem. The two steps which could be taken to increase the accuracy of the tagger include:

- A list of closed-class words will be prepared, which would be able to handle the cases where the absence of the word in the training set has led to the error by the tagger.
- Some explicit disambiguation rules will also be used in certain cases where a sufficiently large number of examples is not present in the training corpus to discriminate between the contexts of occurrence of a particular tag of the word.

A third possible step could be to increase the training set size (which is in any case pretty small by general standards). However, this is very resource-intensive because of the lack of easily available data in the language. Moreover, as per our current analysis, a hybrid system like this is expected to give a performance at par with most of the other state-of-the-art POS taggers for Indian languages.
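The hybrid approach outlined above (a statistical tagger followed by a closed-class word list and a few explicit rules) could take roughly the following shape. Everything in this sketch is hypothetical: the paper does not give the word list or the rules, so the lexicon entry, the sample rule that re-tags a main verb directly following another main verb as an auxiliary, and the function names are assumptions chosen only to show the mechanism.

```python
def apply_closed_class_list(tagged, closed_class):
    """Override the tagger's output for words found in a closed-class list."""
    return [(word, closed_class.get(word, tag)) for word, tag in tagged]


def apply_rules(tagged, rules):
    """Apply disambiguation rules left to right; the first matching rule wins."""
    out = list(tagged)
    for i, (word, tag) in enumerate(out):
        prev_tag = out[i - 1][1] if i > 0 else None
        for rule in rules:
            new_tag = rule(word, tag, prev_tag)
            if new_tag is not None:
                out[i] = (word, new_tag)
                break
    return out


def second_verb_rule(word, tag, prev_tag):
    """Hypothetical rule: a main verb right after a main verb becomes auxiliary."""
    if tag == "V__VM" and prev_tag == "V__VM":
        return "V__VAUX"
    return None


# Placeholder tokens standing in for the output of a statistical tagger.
tagger_output = [("w1", "N__NN"), ("w2", "V__VM"), ("w3", "V__VM"), ("w_inj", "RD__UNK")]
closed_class = {"w_inj": "RP__INJ"}  # hypothetical closed-class entry

corrected = apply_rules(
    apply_closed_class_list(tagger_output, closed_class),
    [second_verb_rule],
)
print(corrected)
```

Running the list lookup before the rules means the rules see the corrected closed-class tags as context; the opposite order is equally possible and would need to be checked against real data.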

Acknowledgments

We would like to thank our supervisors for their constant support and guidance.

References

Alok, D. (2010). Magahi Noun Particles. Paper presented in 4th International Students' Conference of Linguistics in India (SCONLI-4), Mumbai, India, February 20-22, 2010.

Alok, D. (2012). A language without Articles: The case of Magahi. Unpublished M.Phil. Dissertation, Jawaharlal Nehru University, New Delhi.

Brants, T. (2000). TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied NLP Conference (ANLP-2000), pages 224-231.

Brill, E. (1994). Some Advances in Transformation-Based Part of Speech Tagging. Proceedings of AAAI, Vol. 1, pages 722-727.

Brill, E. (1995). Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4): 543-565.

Cortes, C. and Vapnik, V. (1995). Support Vector Networks. Machine Learning, 20: 273-297.

Daelemans, W., Zavrel, J., van den Bosch, A., van der Sloot, K. (2003). MBT: Memory Based Tagger, version 2.0, Reference Guide. ILK Research Group Technical Report Series 03-13, Tilburg.

Gimenez, J. and Marquez, L. (2004). SVMTool: A general POS tagger generator based on support vector machines. In 4th International Conference on Language Resources and Evaluation, pages 168-176, Lisbon, Portugal.

Joachims, T. (1999). Making Large-scale SVM Learning Practical. In Schölkopf, B., Burges, C., Smola, A., eds.: Advances in Kernel Methods - Support Vector Learning, pages 41-56. MIT Press, Boston, MA, USA.

Kumar, R., Lahiri, B. and Alok, D. (2011). Challenges in Developing LRs for Non-Scheduled Languages: A Case of Magahi. In Proceedings of the 5th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC '11), Adam Mickiewicz University, pages 60-64.

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, pages 133-142.
