
Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation

Ayush Maheshwari1, Ashim Gupta2, Amrith Krishna3, Ganesh Ramakrishnan1,
G. Anil Kumar4, Jitin Singla5

1 Indian Institute of Technology Bombay
2 University of Utah
3 Uniphore Inc.
4 Dept. of Physics, Indian Institute of Technology Roorkee
5 Dept. of Biosciences and Bioengineering, Indian Institute of Technology Roorkee

Abstract

Sanskrit is a low-resource language with a rich heritage. Digitized Sanskrit corpora reflective of the contemporary usage of Sanskrit, specifically in prose, are heavily under-represented at present, and no such English-Sanskrit parallel dataset is publicly available. We release Sāmayik, a dataset of more than 42,000 parallel English-Sanskrit sentences drawn from four different corpora, to bridge this gap. Moreover, we also release benchmarks adapted from existing multilingual pretrained models for Sanskrit-English translation. We include training splits from our contemporary dataset as well as the Sanskrit-English parallel sentences from the training split of Itihāsa, a previously released classical-era machine translation dataset containing Sanskrit.

1 Introduction

Sanskrit is a classical language with a rich heritage spanning more than three millennia. Moreover, it is a language in sustenance, with more than two million active speakers (McCartney, 2019; Chandramouli, 2011). While Sanskrit is a heritage-rich language, it is still considered a low-resource language (Hellwig, 2010–2021; Maheshwari et al., 2022). Moreover, the available corpora often cover content that is vastly divergent in terms of domain, chronology, stylistic features, usage, syntactic features (Hellwig, 2009), and even typological features such as word order (Krishna et al., 2021; Tubb and Boose, 2007). In this work, we release a parallel Sanskrit-English dataset that covers multiple corpora representing contemporary Sanskrit.

Itihāsa currently forms the largest parallel machine translation corpus containing Sanskrit as one of the languages (Aralikatte et al., 2021). It is a Sanskrit-English dataset containing 93,000 pairs of verses in Sanskrit and their corresponding translations in English. These sentences were collected from two epics written in the classical era, with their translations dating to the early half of the twentieth century. Similarly, the Digital Corpus of Sanskrit (DCS) currently forms the largest monolingual dataset in Sanskrit (Hellwig, 2010–2021). DCS contains more than 600,000 monolingual sentences in Sanskrit, spanning a chronology of around 2,000 years categorized into pre-classical literature (1500 BCE - 100 BCE), classical literature (300 CE - 800 CE), and modern literature (900 CE to now; Krishna et al., 2018). However, the currently available digitized content in modern literature is also mostly confined to that written until the first half of the twentieth century.

Identifying this gap in Sanskrit sentences representing contemporary Sanskrit, mostly focusing on content written from the second half of the twentieth century to now, we release our dataset Sāmayik.1 The name is a Sanskrit term that translates to the "sayings of the contemporary world". Sāmayik consists of around 43,000 parallel sentence pairs, collected from four different sources. These cover spoken content on contemporary world affairs, interpretations of literary works, pedagogical content, etc. In Table 1, we provide statistics for each individual corpus and for our complete dataset; we describe each corpus in Section 2. The oldest corpus in our collection is the English-Sanskrit Bible, whose Sanskrit translation was performed in 1851; it forms less than 20% of the overall dataset. The Sanskrit component of the remaining corpora was created either in the latter half of the twentieth century or in the current century. The latest corpus in our collection contains content as recent as 2022, from Sanskrit and English transcriptions of 'Mann Ki Baat', a podcast currently in production.

1 The data will be released after publication.
Sanskrit is a morphologically rich language and is lexically productive. Moreover, sentence constructions in Sanskrit follow relatively free word order. Sentences written in verse form have to adhere to prescribed meter patterns as per prosody; hence, word order need not follow a fixed pattern. However, sentences written in prose tend to follow Subject-Object-Verb (SOV) ordering. While Itihāsa consists of two epics written in verse form, DCS also consists mostly of content in verse. On the contrary, our corpus focuses on sentences written in prose.

In addition to our dataset, we release benchmarks by adapting pre-trained models for Sanskrit-English neural machine translation. Currently, no pre-trained models include Sanskrit in their benchmarks. Hence, we adapt three pre-trained multilingual seq2seq models for the task, namely ByT5 (Xue et al., 2022), mBART (Liu et al., 2020), and IndicBART (Dabre et al., 2022). IndicBART is a pre-trained model fine-tuned specifically for several Indic languages and English. Further, all the Indic languages are transliterated into the Devanagari script, which is widely used for Sanskrit as well. Similarly, ByT5 is a token-free model that tokenizes inputs at the Unicode byte level, and the Devanagari script for Sanskrit is part of the Unicode specifications.

With both mBART and IndicBART, we observe negligible out-of-vocabulary subword tokens, and we observe that IndicBART currently reports the best BLEU score on our dataset, with a BLEU score of 27.25. Further, we include Itihāsa's training split in our training data for comprehensiveness. Additionally, we explore the utility of Hindi as a bridge language in training NMT models for English-Sanskrit. Here, we utilize a subset of the parallel sentences between all three languages, an auxiliary set of Sanskrit-Hindi pairs, and additional publicly available Hindi-Sanskrit pairs.

2 Sāmayik

Sāmayik is an English-Sanskrit machine translation dataset consisting of 42,961 sentences from four different corpora. The primary aim of the dataset is to include translation pairs containing Sanskrit prose written in the modern era. Here, we give a brief description of each of the corpora involved and the steps involved in processing these sentences. All the sentences in Sanskrit are aligned with their corresponding parallel sentence(s) in English.

Bible - The New Testament: We release the New Testament of the Bible aligned with its corresponding English version. We use the Sanskrit version released by the Calcutta Baptist Missionaries, originally published in 1851.2 The New Testament contains 7,840 sentences from 260 chapters. Each verse is generally indexed by the book name and chapter name, followed by the verse number. For the English version of the Bible, we rely on Christodouloupoulos and Steedman (2015), where the English sentences also follow the same indexing scheme. Given the one-to-one correspondence at the sentence level between the English and Sanskrit sentences, the mapping was straightforward. We finally obtain a total of 7,840 parallel sentences. Further, three fluent speakers of both English and Sanskrit have verified the alignments for a sample of 100 sentences.

2 https://www.bible.com/bible/2104/MAT.1.SAN-DN
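Since both versions share the same book-chapter-verse indexing, the alignment reduces to a key join. Below is a minimal sketch of how such an alignment could be performed; the file names and the "Book Chapter:Verse<TAB>text" line format are illustrative assumptions, not the exact layout of the released data.

```python
# Sketch: align two Bible versions that share book-chapter-verse indexing.
# Assumes each line looks like "John 3:16<TAB>verse text"; this format is
# an illustrative assumption, not the exact layout of the released files.

def load_verses(path):
    """Map a 'Book Chapter:Verse' key to its verse text."""
    verses = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, text = line.strip().partition("\t")
            if text:
                verses[key] = text
    return verses

english = load_verses("bible_english.txt")    # hypothetical file names
sanskrit = load_verses("bible_sanskrit.txt")

# Keep only verses present in both versions.
pairs = [(english[k], sanskrit[k]) for k in english if k in sanskrit]
print(f"{len(pairs)} aligned verse pairs")
```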
Mann ki Baat (MKB)3 - MKB is a monthly Indian radio program hosted by the Prime Minister of India, originally in Hindi, from 2014 to 2022. Each episode is an address to the nation discussing social, cultural, and contemporary topics, including conversations with individuals. Official translations of the transcripts are available in several Indian languages, but not in Sanskrit. However, unofficial Sanskrit translations by experts are available in the public domain.4 We use these expert translations and manually align the Sanskrit sentences with the official English transcripts of 25 episodes. Additionally, these Sanskrit translations are further verified by in-house language experts. The MKB corpus consists of 4,061 sentences with a total of 47,838 words.

3 https://pmonradio.nic.in/
4 https://sanskritdocuments.org/sites/manogatam/

Gītā Sopānaṁ - We extract sentences from the Sanskrit learning book Gītā Sopānaṁ, published in 2009. We ask language experts, well versed in both English and Sanskrit, to translate these sentences into English. Gītā Sopānaṁ is a self-learning book for learning Sanskrit through stories. It mostly contains short, simple sentences with a focus on teaching grammar rather than expanding vocabulary. Accordingly, the dataset contains 6,465 unique words over a total of 6,130 sentences.
Dataset         Mann Ki Baat  Spoken Tutorials  GitaSopanam  Bible    Total
#sentences      4,061         24,930            6,130        7,840    42,961
#words          47,838        245,666           26,581       102,526  422,770
#unique words   19,761        38,349            6,465        37,193   95,838

Table 1: Statistics for each corpus in Sāmayik.
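For reference, per-corpus counts like those in Table 1 can be computed with simple whitespace tokenization. The sketch below assumes one sentence per line and whitespace-delimited words; both are assumptions about the release format, not a documented specification.

```python
# Sketch: compute #sentences, #words, and #unique words for one corpus,
# assuming one sentence per line and whitespace tokenization (both are
# assumptions about the release format).

def corpus_stats(path):
    n_sentences, n_words, vocab = 0, 0, set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            n_sentences += 1
            n_words += len(tokens)
            vocab.update(tokens)
    return n_sentences, n_words, len(vocab)

print(corpus_stats("mann_ki_baat.san.txt"))  # hypothetical file name
```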

Spoken Tutorials5 - The Spoken Tutorial project is a large corpus of video tutorials for training students to use open-source software. These tutorials are created by domain experts and translated into several languages by language experts. We scraped6 videos and transcripts from their website for which both the English and the corresponding Sanskrit translations are available. We extracted the transcripts of 254 videos, each around 10 minutes in duration on average. We deploy experts with knowledge of both English and Sanskrit to manually align the transcripts of each video. The final corpus contains 24,930 sentences comprising 245,666 words.

5 https://spoken-tutorial.org/
6 The website content is licensed under a CC 4.0 license.

3 Preliminary Experiments

3.1 Systems

mBART (Liu et al., 2020) is a multilingual pretrained seq2seq model trained using a similar objective to that employed in BART (Lewis et al., 2020). We employ mBART-50, trained on a large multilingual corpus covering 50 languages. In our experiments, we observe that using the SLP1 encoding for Sanskrit leads to the best results.
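SLP1 is a lossless romanization scheme for Sanskrit. As a sketch of the preprocessing this implies, the indic-transliteration Python package can convert Devanagari input to SLP1 before it is fed to mBART; the paper does not name its converter, so this tool choice is an assumption.

```python
# Sketch: SLP1 romanization of Devanagari input as a preprocessing step
# before mBART. The indic-transliteration package
# (pip install indic-transliteration) is an assumed tool choice.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

devanagari = "धर्मक्षेत्रे कुरुक्षेत्रे"  # example input
slp1 = transliterate(devanagari, sanscript.DEVANAGARI, sanscript.SLP1)
print(slp1)

# SLP1 is lossless, so the original script can be recovered:
print(transliterate(slp1, sanscript.SLP1, sanscript.DEVANAGARI))
```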
IndicBART (Dabre et al., 2022) is also a multilingual pretrained seq2seq model following the pretraining objective of BART. However, here the pretraining corpora are specifically from Indic languages and English. While different Indic languages use different scripts, these are losslessly converted to Devanagari before tokenization during pretraining. Hence, we use the Devanagari script for encoding Sanskrit, while English retains its Roman script.
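The conversion is lossless because the major Indic scripts occupy parallel 128-codepoint Unicode blocks, so the mapping is essentially a fixed codepoint offset. The toy sketch below illustrates only this block shift; production pipelines (e.g., the Indic NLP Library, which IndicBART's preprocessing may or may not match) handle additional special cases.

```python
# Sketch: the "lossless conversion to Devanagari" that IndicBART relies on.
# Major Indic scripts sit in parallel 128-codepoint Unicode blocks, e.g.,
# Bengali U+0980-U+09FF vs. Devanagari U+0900-U+097F, so the mapping is a
# fixed offset. This toy version handles only the block shift.
def to_devanagari(text: str, block_start: int) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if block_start <= cp < block_start + 0x80:
            out.append(chr(cp - block_start + 0x0900))
        else:
            out.append(ch)  # spaces, punctuation, Latin, etc. pass through
    return "".join(out)

print(to_devanagari("বাংলা", 0x0980))  # a Bengali word re-expressed in Devanagari
```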
ByT5 (Xue et al., 2022) is a token-free pretrained seq2seq model following the pretraining objective of T5, or more specifically mT5. Being token-free, it uses a fixed vocabulary of 256 byte values, operating directly on the UTF-8 bytes of the input.
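To make the byte-level view concrete: each Devanagari codepoint occupies three UTF-8 bytes, so ByT5 sees a sequence roughly three times longer than the character count. A minimal illustration in plain Python (the string is an arbitrary example):

```python
# Sketch: what byte-level "tokenization" sees for Devanagari text.
# Each Devanagari codepoint encodes to 3 UTF-8 bytes, so sequences are
# roughly 3x longer than the character count before the model is applied.
text = "संस्कृतम्"
byte_ids = list(text.encode("utf-8"))
print(len(text), "characters ->", len(byte_ids), "bytes")
print(byte_ids[:6])  # the first two characters' byte values (0-255)
```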
3.2 Metrics

We evaluate these models on both BLEU and ChrF. BLEU is a word-level n-gram precision-based metric, whereas ChrF is a character-level n-gram F-score. Given that Sanskrit is a morphologically rich language with more than 1,400 possible inflected forms (Krishna et al., 2021), we believe ChrF is better suited to capturing morpho-syntactic aspects.
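Both metrics are available in the sacrebleu package. The paper does not state which scoring tool it used, so sacrebleu here is an assumption (a common default for MT evaluation), and the strings are illustrative:

```python
# Sketch: corpus-level BLEU and ChrF with sacrebleu (pip install sacrebleu).
# The paper does not name its scorer, so this tool choice is an assumption;
# reported scores can vary with the tool and its settings.
import sacrebleu

hypotheses = ["अहं गृहं गच्छामि"]    # system outputs (illustrative)
references = [["अहं गृहं गच्छामि"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, ChrF = {chrf.score:.1f}")
```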
3.3 Results

We split our dataset of around 42k sentences into 80% for the training set and the rest for evaluation; the evaluation set is split equally into development and test sets. We perform preliminary experiments by fine-tuning the pre-trained models mBART, ByT5, and IndicBART for English-Sanskrit translation.
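A sketch of the 80/10/10 split described above follows; the shuffling seed and exact procedure are assumptions, and for comparable numbers the official splits released with the dataset should be used rather than re-splitting.

```python
# Sketch: an 80/10/10 train/dev/test split as described above. The seed
# and shuffling procedure are assumptions; use the official released
# splits for comparability rather than re-splitting.
import random

pairs = load_parallel_pairs()  # hypothetical loader for (en, sa) pairs
random.Random(0).shuffle(pairs)

n = len(pairs)
train = pairs[: int(0.8 * n)]
dev = pairs[int(0.8 * n) : int(0.9 * n)]
test = pairs[int(0.9 * n) :]
print(len(train), len(dev), len(test))
```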
Implementation Details - All models are trained using HuggingFace Transformers (Wolf et al., 2020). Both source and target sequences are truncated to 512 tokens, and we use each model's pre-trained tokenizer on our dataset. We use a batch size of 128 and the standard cross-entropy loss with label smoothing of 0.1, optimized with AdamW (Loshchilov and Hutter, 2019). Each model is trained for a maximum of 30 epochs with a learning rate of 1e-3 and a weight decay of 1e-4. To accommodate the bigger models (ByT5 and mBART) in memory, we introduce gradient accumulation and increase the number of epochs so as to maintain the effective batch size and the number of optimization steps.
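A sketch of this configuration in HuggingFace Transformers follows. The values named in the paper (30 epochs, lr 1e-3, weight decay 1e-4, label smoothing 0.1, effective batch size 128) are reproduced; the 16 x 8 accumulation split and all remaining arguments are illustrative assumptions.

```python
# Sketch of the reported fine-tuning configuration using HuggingFace
# Transformers. Hyperparameters follow the paper; the 16 x 8 accumulation
# split and the output path are illustrative assumptions.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="en-sa-mt",            # hypothetical output path
    num_train_epochs=30,
    learning_rate=1e-3,
    weight_decay=1e-4,
    label_smoothing_factor=0.1,
    per_device_train_batch_size=16,   # assumed split of the batch:
    gradient_accumulation_steps=8,    # 16 x 8 = effective batch of 128
    predict_with_generate=True,
    optim="adamw_torch",              # AdamW, as in the paper
)
```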
Model       BLEU   ChrF
mBART       19.4   33.2
ByT5        18.8   29.5
IndicBART   27.3   45.7

Table 2: BLEU and ChrF scores on the test set for English-Sanskrit translation with different pre-trained models.

In Table 2, we present test scores for English-Sanskrit translation. We observe that IndicBART achieves far better BLEU and ChrF scores than mBART and ByT5. This suggests that a model pretrained on data from the same language family transfers better to a new Indic language than models pretrained on a mix of language families.
References

Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, and Anders Søgaard. 2021. Itihasa: A large-scale corpus for Sanskrit to English translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 191–197, Online. Association for Computational Linguistics.

C Chandramouli. 2011. Census of India 2011. Provisional Population Totals. New Delhi: Government of India.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49:375–395.

Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh Khapra, and Pratyush Kumar. 2022. IndicBART: A pre-trained model for Indic natural language generation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1849–1863, Dublin, Ireland. Association for Computational Linguistics.

Oliver Hellwig. 2009. Extracting dependency trees from Sanskrit texts. In Sanskrit Computational Linguistics: Third International Symposium, Hyderabad, India, January 15-17, 2009, Proceedings, pages 106–115. Springer.

Oliver Hellwig. 2010–2021. DCS - The Digital Corpus of Sanskrit.

Amrith Krishna, Bishal Santra, Sasi Prasanth Bandaru, Gaurav Sahu, Vishnu Dutt Sharma, Pavankumar Satuluri, and Pawan Goyal. 2018. Free as in free word order: An energy based model for word segmentation and morphological tagging in Sanskrit. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2550–2561, Brussels, Belgium. Association for Computational Linguistics.

Amrith Krishna, Bishal Santra, Ashim Gupta, Pavankumar Satuluri, and Pawan Goyal. 2021. A graph-based framework for structured prediction tasks in Sanskrit. Computational Linguistics, 46(4):785–845.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Ayush Maheshwari, Nikhil Singh, Amrith Krishna, and Ganesh Ramakrishnan. 2022. A benchmark and dataset for post-OCR text correction in Sanskrit. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6258–6265, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Patrick McCartney. 2019. Sustainably-speaking yoga: Comparing Sanskrit in the 2001 and 2011 Indian censuses. In The GLOCAL in Asia 2019. The GLOCAL Unit, SOAS University of London.

Gary A Tubb and Emery R Boose. 2007. Scholastic Sanskrit. American Institute of Buddhist Studies.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306.
