Natural Language Processing For Indian Language: NASSCOM 2019

A Language relatedness perspective

Natural Language Processing

for Indian Languages

A Language Relatedness Perspective

Anoop Kunchukuttan
Microsoft AI & Research
Machine Translation & Speech Group, Hyderabad

[email protected]

AI Deep Dive Workshop at NASSCOM DSAI-CoE, 30th August 2019

Outline

• Motivation

• Relatedness between Indian Languages

• Utilizing Relatedness between Indian Languages

• IndicNLP Library

• Datasets, Services and Standards

• Summary
Diversity of Indian Languages

Highly multilingual country

Greenberg Diversity Index 0.9

• 4 major language families (map source: Quora)

• 1600 dialects
• 22 scheduled languages
• 125 million English speakers
• 8 languages in the world’s top 20 languages
• 11 languages with more than 25 million speakers
• 30 languages with more than 1 million speakers
Sources: Wikipedia, Census of India 2011
Indian Languages on the Internet

[Charts: Internet users by language, projected for 2021 (in million); Internet user base in India (in million)]
Source: Indian Languages: Defining India’s Internet, KPMG-Google Report 2017
Major Applications requiring Indian language support
Challenges on language adoption on the Internet

How do we improve support for Indian languages?

Improving Indian Language Support
Applications and websites which support rich experiences:

• Search
• Recommendation
• Translation
• Information Extraction & Categorization
• Question & Answering
• Transliteration
• Entity Identification
• Entity Linking
Machine Learning is the dominant NLP Paradigm
An ML Pipeline for Text Classification
• Training pipeline: text instance → feature vector; a training set of positive and negative instances is used to train a classifier → model f(x)
• Test pipeline: text instance → feature vector → decision function sign(f(x)) → class
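The training and test pipelines above can be sketched in a few lines. This is a minimal illustration, not the setup used in the talk: the bag-of-words features, the perceptron update rule and the toy vocabulary are all assumptions chosen to keep the example self-contained.

```python
def featurize(text, vocab):
    """Turn a text instance into a feature vector (word counts plus a bias term)."""
    tokens = text.lower().split()
    return [float(tokens.count(term)) for term in vocab] + [1.0]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(value):
    return 1 if value >= 0 else -1

def train_perceptron(examples, vocab, epochs=10):
    """Training pipeline: learn weights w so that f(x) = w . x separates the classes."""
    w = [0.0] * (len(vocab) + 1)
    for _ in range(epochs):
        for text, label in examples:  # label is +1 (positive) or -1 (negative)
            x = featurize(text, vocab)
            if sign(dot(w, x)) != label:
                # Misclassified: move the decision boundary towards this example.
                w = [wi + label * xi for wi, xi in zip(w, x)]
    return w

def classify(text, w, vocab):
    """Test pipeline: the decision function sign(f(x))."""
    return sign(dot(w, featurize(text, vocab)))

vocab = ["good", "bad"]                      # hypothetical toy vocabulary
training_set = [("good movie", 1), ("bad movie", -1)]
model = train_perceptron(training_set, vocab)
```

Every new language needs its own labelled training set for a model like this, which is the scalability problem the following slides discuss.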
Scalability Challenges in ML solutions

• NLP requires human expertise → difficult and expensive to replicate for every language
  • Annotated data
  • Linguistic knowledge inputs

• Expense cannot be justified for all languages

• Difficult to deploy and maintain systems for multiple languages

Let’s look at examples of different kinds of annotations …
Monolingual Corpora – easy to collect

Digital Content available varies by languages

Sentiment Analysis - Simple Annotation

Each text is labelled Positive, Negative or Neutral

An example of a text classification problem

Named Entity Annotation –
More time consuming, but does not require a lot of expertise
Parallel Corpora – large requirement, needs good language skills

English | Hindi
A boy is sitting in the kitchen | एक लड़का रसोई में बैठा है
A boy is playing tennis | एक लड़का टेनिस खेल रहा है
A boy is sitting on a round table | एक लड़का एक गोल मेज़ पर बैठा है
Some men are watching tennis | कुछ आदमी टेनिस देख रहे हैं
A girl is holding a black book | एक लड़की ने एक काली किताब पकड़ी है
Two men are watching a movie | दो आदमी चलचित्र देख रहे हैं
A woman is reading a book | एक औरत एक किताब पढ़ रही है
A woman is sitting in a red car | एक औरत एक काली कार में बैठी है

Parse Tree – needs good linguistic expertise
Need for a Unified Approach for Indic NLP
Expensive to create datasets for each language

• Can we utilize resources developed for some languages for other languages?
• Can diverse input from different languages lead to better
generalization?
• Can we support multiple languages with reduced effort & cost for
deployment and maintenance?
• Can we use unsupervised data sources?
• Can we utilize relatedness between Indian languages?
Broad Goal: Build NLP Applications that can work on different languages
English → Machine Translation System → Hindi

Tamil → Machine Translation System → Punjabi

Can we improve English-Hindi translation using a Tamil-Punjabi model?

Can we do English → Punjabi translation even if this data is not seen in training?
Can we train a single model for all translation pairs?
Unified Approaches to Indic NLP draw on:
• Linguistic Underpinnings of Relatedness
• Standards
• Algorithms & Methods
• Services/APIs
• Datasets
A Typical Deep Learning NLP Pipeline
Text → Tokens → Token Embeddings → Text Embedding → Application-specific Deep Neural Network layers → Output (text or otherwise)

How do we transfer information across languages?

A Typical Multilingual NLP Pipeline
The same pipeline, with a question or requirement at each stage:
• Text: pre-process to facilitate similar embeddings across languages?
• Token Embeddings: similar tokens across languages should have similar embeddings
• Text Embedding: similar text across languages should have similar embeddings
• Output: how to support multiple target languages?
Outline

• Motivation

• Relatedness between Indian Languages

• Utilizing Relatedness between Indian Languages

• IndicNLP Library

• Datasets, Services and Standards

• Summary
Relatedness between Indian
Languages
Why are Indian languages related?
Related Languages

• Related by Genealogy: Language Families
  Dravidian, Indo-European, Turkic
  (Jones, Rasmus, Verner, 18th & 19th centuries, Raymond ed. (2005))
• Related by Contact: Linguistic Areas
  Indian Subcontinent, Standard Average European
  (Trubetzkoy, 1923)

Related languages may not belong to the same language family!

Language Families
Group of languages related through descent from a common ancestor,
called the proto-language of that family

Basis of classification

Regularity of sound change is the basis of studying genetic relationships

Source: Eifring & Theil (2005)

Words that correspond in this way across related languages are called cognates
Language Families in India
4 major language families

Indo-Aryan: North India and Sri Lanka (branch of Indo-European)

Dravidian: South India & pockets in the North

Tibeto-Burman: North-East and along the Himalayan ranges

Austro-Asiatic: pockets in Central India, North-East, Nicobar Islands

Plus:
• Andamanese family
• Unknown language of the Sentinelese
Cognates in Indian Languages
Indo-Aryan
English | Vedic Sanskrit | Hindi | Punjabi | Gujarati | Marathi | Odia | Bengali
bread | Rotika | chapātī, roṭī | roṭi | paũ, roṭlā | chapāti, poli, bhākarī | pauruṭi | (pau-)ruṭi
fish | Matsya | Machhlī | machhī | māchhli | māsa | mācha | machh
hunger | bubuksha, kshudhā | Bhūkh | pukh | bhukh | bhūkh | bhoka | khide
language | bhāshā, vāNī | bhāshā, zabān | boli, zabān, pasha | bhāshā | bhāshā | bhāsā | bhasha
ten | Dasha | Das | das, daha | das | dahā | dasa | dôsh

Dravidian
English | Tamil | Malayalam | Kannada | Telugu
fruit | pazham, kanni | pazha.n, phala.n | haNNu, phala | pa.nDu, phala.n
fish | mInn | matsya.n, mIn, mIna.n | mInu, matsya, jalavAsi, mIna | cepalu, matsyalu, jalaba.ndhu
hunger | paci | vishapp, udarArtti, kShutt, pashi | hasivu, hasiv.e | Akali
language | pAShai, m.ozhi | bhASha, m.ozhi | bhASh.e | bhAShA, paluku
ten | pattu | patt, dasha.m, dashaka.m | hattu | padi

Source: Wikipedia and IndoWordNet
Linguistic Area (Sprachbund)
• To the layperson, Dravidian & Indo-Aryan languages would seem closer to
each other than English & Indo-Aryan
• Linguistic Area: A group of languages (at least 3) that have common
structural features due to geographical proximity and language contact
(Thomason 2000)

• Not all features may be shared by all languages in the linguistic area
Examples of linguistic areas:
• Indian Subcontinent (Emeneau, 1956; Subbarao, 2012)
• Balkans

Borrowed Words

Indo-Aryan words in Dravidian languages
Most classical languages borrow heavily from Sanskrit

Sanskrit word | Dravidian Language | Loanword in Dravidian Language | English
cakram | Tamil | cakkaram | wheel
matsyah | Telugu | matsyalu | fish
ashvah | Kannada | ashva | horse
jalam | Malayalam | jala.m | water

Dravidian words in Indo-Aryan languages

• A matter of great debate
• Could probably be of Munda origin also
• See writings of Kuiper, Witzel, Zvelebil, Burrow, etc.
• Proposal of Dravidian borrowing even in early Rg Vedic texts
Borrowed Syntactic Features
Retroflex Sounds in Indo-Aryan Languages: ट ठ ड ढ ण
• Found in Indo-Aryan, Dravidian and Munda language families
• Not found in Indo-European languages outside the Indo-Aryan branch
Echo Words: generally mean "etcetera" or "things like this"
hi: cAya-vAya, te: pulI-gulI, ta: v.elai-k.elai
• Standard feature in all Dravidian languages
• Not found in Indo-European languages outside the Indo-Aryan branch
SOV word order in Munda languages
• Their Mon-Khmer cousins have SVO word order
• Munda languages were originally SVO, but have become SOV over time
Similarities between Indian
languages
Key Similarities between related languages
Marathi:
भारताच्या स्वातंत्र्यदिनानिमित्त अमेरिकेतील लॉस एन्जल्स शहरात कार्यक्रम आयोजित करण्यात आला
bhAratAcyA svAta.ntryadinAnimitta ameriketIla lOsa enjalsa shaharAta kAryakrama Ayojita karaNyAta AlA

Marathi (segmented):
भारता च्या स्वातंत्र्य दिना निमित्त अमेरिके तील लॉस एन्जल्स शहरा त कार्यक्रम आयोजित करण्यात आला
bhAratA cyA svAta.ntrya dinA nimitta amerike tIla lOsa enjalsa shaharA ta kAryakrama Ayojita karaNyAta AlA

Hindi:
भारत के स्वतंत्रता दिवस के अवसर पर अमरीका के लॉस एन्जल्स शहर में कार्यक्रम आयोजित किया गया
bhArata ke svata.ntratA divasa ke avasara para amarIkA ke losa enjalsa shahara me.n kAryakrama Ayojita kiyA gayA

Lexical: share significant vocabulary (cognates & loanwords)

Morphological: correspondence between suffixes/post-positions

Syntactic: share the same basic word order

Lexical Similarity
(Words having similar form and meaning)
• Cognates: a common etymological origin
  roTI (hi), roTlA (pa): bread
  bhai (hi), bhAU (mr): brother

• Named Entities: do not change across languages
  mu.mbaI (hi), mu.mbaI (pa), mu.mbaI (pa)
  keral (hi), k.eraLA (ml), keraL (mr)

• Loan Words: borrowed without translation
  matsya (sa), matsyalu (te): fish
  pazha.m (ta), phala (hi): fruit

• Fixed Expressions/Idioms: MWE with non-compositional semantics
  dAla me.n kuCha kAlA honA (hi), dALa mA kAIka kALu hovu (gu): something fishy

Enables sharing of data across languages

But, be warned of ……
False Friends: Similar spelling ; different meaning

• Different origin: pAnI (hi) [water] → panI (ml) [fever]

• Semantic shift: bala means hair (hi, frequent sense) and
baLa means child (mr)

Short words:
jaLa  → jAla


How similar are Indian Languages?

Estimate lexical similarity from a parallel corpus (computed on the ILCI corpus)

Longest Common Subsequence Ratio (LCSR) for a sentence pair:

\[ LCSR(s_1, s_2) = \frac{LCS(s_1, s_2)}{\max(len(s_1), len(s_2))} \]

LCSR for a language pair, averaged over the set of parallel sentence pairs P(L_1, L_2):

\[ LCSR(L_1, L_2) = \frac{1}{|P(L_1, L_2)|} \sum_{(s_1, s_2) \in P(L_1, L_2)} LCSR(s_1, s_2) \]
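The two LCSR definitions above translate directly into code. The sketch below is an illustration only: the function names are made up for this example, and it compares strings character by character, so it assumes both sides have already been brought into a common script or romanization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(s1, s2):
    """LCSR for one sentence (or word) pair: LCS length over the longer length."""
    longest = max(len(s1), len(s2))
    return lcs_length(s1, s2) / longest if longest else 0.0

def corpus_lcsr(sentence_pairs):
    """LCSR for a language pair: mean LCSR over all parallel sentence pairs."""
    pairs = list(sentence_pairs)
    if not pairs:
        return 0.0
    return sum(lcsr(s1, s2) for s1, s2 in pairs) / len(pairs)
```

For the cognate pair roTI and roTlA from the lexical similarity slide, the common subsequence is "roT", so `lcsr("roTI", "roTlA")` is 3/5.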
Syntactic Similarity
• Almost all Indian languages have SOV word order
• SOV word order determines relative order between:
• Noun-adposition
• Noun-genitive
• Noun-Relative clause
• Verb-Auxiliary
• Word order plays a very important role in most NLP applications
• Language Modelling
• Machine Translation
• Relatively Free Word Order
Morphological Similarity
• Inflectionally rich
• Sometimes agglutinative
• Function words with largely 1-1 correspondence
• Similar internal word structure and compositional semantics
• Similar case-marking systems
Orthographic Similarity

• highly overlapping phoneme sets

• mutually compatible orthographic systems

• similar grapheme to phoneme mappings

Indic Scripts

All major Indic scripts are derived from the Brahmi script

First seen in Ashoka’s edicts
• Same script used for multiple languages
• Devanagari used for Sanskrit, Hindi, Marathi, Konkani, Nepali, Sindhi, etc.
• Bangla script used for Assamese too
• Multiple scripts used for same language
• Sanskrit traditionally written in all regional scripts
• Punjabi: Gurmukhi & Shahmukhi, Sindhi: Devanagari & Perso-Arabic
Common characteristics
Abugida scripts:
• primary consonants with secondary vowel diacritics (maatras)
• rarely found outside of the Brahmi family

• Largely overlapping character set, but the visual rendering differs
• Traditional ordering of characters is same (varnamala)
• Dependent (maatras) and Independent vowels
• Consonant clusters (क्क, क्ष)
• Special symbols like:
  • anusvaara (nasalization), visarga (aspiration)
  • halanta/pulli (vowel suppression), nukta (Persian/Arabic sounds)
• Basic Unit is the akshar (a pseudo-syllable)
Syllable as Basic Unit

akshara, the fundamental organizing principle of Indian scripts

(CONSONANT)➕ VOWEL

Examples: की (kI), प्रे (pre)

Pseudo-Syllable
True Syllable ⇒ Onset, Nucleus and Coda
Orthographic Syllable ⇒ Onset, Nucleus
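Splitting a word into orthographic syllables can be approximated with a regular expression over Unicode ranges. The sketch below covers Devanagari only and is an assumption-laden illustration, not the segmenter used in the talk: the character classes are taken from the Devanagari Unicode block, a consonant without an explicit vowel sign is treated as carrying the inherent vowel, and characters outside the listed ranges (digits, avagraha, punctuation) are simply skipped.

```python
import re

# Character classes from the Devanagari Unicode block (U+0900-U+097F).
CONSONANT = "\u0915-\u0939\u0958-\u095F"
INDEPENDENT_VOWEL = "\u0904-\u0914"
VOWEL_SIGN = "\u093E-\u094C"      # dependent vowels (maatras)
MODIFIER = "\u0901-\u0903"        # candrabindu, anusvaara, visarga
NUKTA = "\u093C"
HALANT = "\u094D"

# One akshara: zero or more (consonant + halant) pairs forming the onset cluster,
# a final consonant, then an optional vowel sign or word-final halant,
# then an optional modifier. An independent vowel forms an akshara on its own.
AKSHARA = re.compile(
    f"(?:[{CONSONANT}]{NUKTA}?{HALANT})*[{CONSONANT}]{NUKTA}?"
    f"(?:[{VOWEL_SIGN}]|{HALANT})?[{MODIFIER}]?"
    f"|[{INDEPENDENT_VOWEL}][{MODIFIER}]?"
)

def orthographic_syllables(word):
    """Return the list of orthographic syllables (aksharas) in a Devanagari word."""
    return AKSHARA.findall(word)
```

Because other Brahmi-derived scripts lay out their Unicode blocks in parallel with Devanagari, the same pattern can in principle be reused after script conversion, though that is not verified here.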
Organized as per sound phonetic principles

Shows various symmetries
Benefits due to script design
• Common design and standardization enables easy conversion from one script to another
• Makes exploiting lexical similarity possible
• Phonetic scripts: helps capture similarity between characters

Source: Singh, 2006

Useful for natural language processing: transliteration, speech recognition, text-to-speech

The Periodic Table and Indic Scripts
Dmitri Mendeleev is said to have been inspired by the two-dimensional
organization of Indic scripts to create the periodic table
http://swarajyamag.com/ideas/sanskrit-and-mendeleevs-periodic-table-of-elements/

India as a linguistic area gives us robust reasons
for writing a common or core grammar of many of
the languages in contact
~ Anvita Abbi

Outline

• Motivation

• Relatedness between Indian Languages

• Utilizing Relatedness between Indian Languages

• IndicNLP Library

• Datasets, Services and Standards

• Summary
Utilizing Relatedness between
Indian Languages
Orthographic Similarity

Lexical Similarity

Syntactic Similarity
Utilizing Orthographic Similarity
Script Conversion
• Read any script in any script
• Unicode standard enables consistent script conversion

unicode_codepoint(char) - Unicode_range_start(L1) + Unicode_range_start(L2)

केरला → কেরলা, કેરલા
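The offset formula above can be tried out directly. This is a minimal sketch under stated assumptions: the block start values are the Unicode block starts for each script, the language-code keys and function name are chosen for this example, and every block is treated as 128 code points wide. It ignores code points that are unassigned in the target script and shared punctuation such as the danda, which a production converter would need to handle.

```python
# Start of the Unicode block for the script of each language (keys are illustrative).
UNICODE_RANGE_START = {
    "hi": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "gu": 0x0A80,  # Gujarati
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}
BLOCK_SIZE = 0x80  # each of these script blocks spans 128 code points

def convert_script(text, src_lang, tgt_lang):
    """Map each character to the same offset in the target script's block."""
    src_start = UNICODE_RANGE_START[src_lang]
    tgt_start = UNICODE_RANGE_START[tgt_lang]
    converted = []
    for char in text:
        offset = ord(char) - src_start
        if 0 <= offset < BLOCK_SIZE:
            converted.append(chr(tgt_start + offset))
        else:
            converted.append(char)  # spaces, Latin text, etc. pass through unchanged
    return "".join(converted)
```

Calling `convert_script` on the Devanagari spelling of Kerala with target `"bn"` or `"gu"` reproduces the Bengali and Gujarati forms shown on the slide.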
Multilingual Acronym Generation
Simple application of script conversion

Need to build Latin to Indic script mappings only once

ACL → ए सी एल, এ সী এল, ಏ ಸೀ ಏಲ
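Acronym generation can be sketched as a single Latin-to-Devanagari letter table followed by offset-based script conversion. Everything here is illustrative: the table holds only the three letters that appear on the slide (a real table would cover the full alphabet), and the block starts and function names are assumptions of this example.

```python
# Latin letter names spelled in Devanagari; only the letters from the slide's example.
LETTER_NAMES_DEVANAGARI = {"A": "\u090f", "C": "\u0938\u0940", "L": "\u090f\u0932"}

DEVANAGARI_START = 0x0900
SCRIPT_START = {"hi": 0x0900, "bn": 0x0980, "kn": 0x0C80}  # Unicode block starts

def from_devanagari(text, tgt_lang):
    """Shift Devanagari characters to the same offsets in the target script block."""
    shift = SCRIPT_START[tgt_lang] - DEVANAGARI_START
    return "".join(
        chr(ord(ch) + shift) if 0x0900 <= ord(ch) < 0x0980 else ch
        for ch in text
    )

def transliterate_acronym(acronym, tgt_lang):
    """Spell out each letter in Devanagari once, then convert to the target script."""
    spelled = " ".join(LETTER_NAMES_DEVANAGARI[letter] for letter in acronym.upper())
    return from_devanagari(spelled, tgt_lang)
```

The design point is the one the slide makes: the Latin-to-Indic table is built once, and every other script comes from conversion rather than from a new table.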
Multilingual Transliteration
Train a joint transliteration model for multiple Indian languages to English & vice-versa

Training data:
• Hindi → English corpus
• Bengali → English corpus
• Telugu → English corpus

Example of Multi-task Learning: similar tasks help each other

Zero-shot transliteration is possible: perform Kannada → English transliteration even if the network has not seen that data

How:
• Concat training sets
• Share network parameters across languages
• Output layer for each target language

Original script:
Malayalam | കോഴിക്കോട് | kozhikode
Hindi | केरल | kerala
Kannada | ಬೆಂಗಳೂರು | bengaluru

Convert to a common script:
Malayalam | कोऴिक्कोट् | kozhikode
Hindi | केरल | kerala
Kannada | बेंगळूरु | bengaluru
Unsupervised Transliteration

Learn:
Transliteration model (TX) for source language (F) to target language (E)

Inputs:
• Monolingual word lists (WF and WE)
• Phonetic Representations of words
  • Represent each Indic character as a feature vector
  • Define a similarity measure based on the feature vector
Linguistically-informed phonetic priors
Priors capture how similar two characters are (use Dirichlet Priors)
Effect of linguistic priors

• Both the rule-based approach and the use of linguistic priors outperform Ravi & Knight’s (2009) model
• Linguistic priors give a significant increase in top-1 accuracy over the rule-based approach
• Good top-10 accuracy, which the rule-based approach cannot provide
Syllable-based Transliteration
(Atreya, et al 2015)

Syllable as the basic transliteration unit

Hindi | Kannada | English
वि द्या ल य | ವಿ ದ್ಯಾ ಲ ಯ | vi dya lay
अ र्जु न | ಅ ರ್ಜು ನ | a rju n
Transliteration Accuracies

Syllable-level transliteration (VS) outperforms character-level transliteration (CS)

Utilizing Relatedness between
Indian Languages
Orthographic Similarity

Lexical Similarity

Syntactic Similarity
Multilingual Word Embeddings

\[ embed(y) = f(embed(x)) \]
(Source: Khapra and Chandar, 2016)

where x, y are source and target words, and embed(w) is the embedding for word w
Bilingual Lexicon Induction
Given a mapping function and source/target words and embeddings:
Can we extract a bilingual dictionary?
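[Figure: the mapped embedding of paanii lies among target-language words such as water, H2O, liquid, hydrogen and oxygen]

Find the nearest neighbor of the mapped embedding:

\[ y' = W(embed(paanii)), \qquad \arg\max_{y \in Y} \cos(embed(y), y') \Rightarrow water \]

A standard intrinsic evaluation task for judging the quality of cross-lingual embeddings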
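The nearest-neighbor step above can be sketched in pure Python. The two-dimensional embeddings and the mapping matrix W below are made-up toy values, and the function names are chosen for this example; real systems learn W from a seed dictionary and work with vectors of a few hundred dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def apply_mapping(W, x):
    """Matrix-vector product: map a source embedding into the target space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def induce_translation(source_word, source_emb, target_emb, W):
    """Return the target word whose embedding is closest to the mapped source embedding."""
    mapped = apply_mapping(W, source_emb[source_word])
    return max(target_emb, key=lambda y: cosine(target_emb[y], mapped))

# Toy data (hypothetical values chosen only to make the example run).
source_emb = {"paanii": [1.0, 0.0]}
target_emb = {"water": [0.1, 0.9], "hydrogen": [0.9, 0.2], "oxygen": [0.7, 0.7]}
W = [[0.0, 1.0], [1.0, 0.0]]
```

Repeating `induce_translation` over a source vocabulary yields a bilingual dictionary, which is then scored against a gold lexicon (for example with Precision@1).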
The case of related languages
Concat
• Concat monolingual corpora and train embeddings
• Same words will have same embeddings
• Subword information in both languages considered by FastText

Identity
• For identical words, just assign corresponding embedding for word in other language
embedding(ghar,marathi) = embedding (ghar,hindi)

Enhanced embedding representation

• Add features to monolingual embeddings to capture character occurrence
• Learn bilingual embeddings on these enhanced monolingual embeddings

Example: ghar → [original embedding ; char co-occurrence features]
Evaluation
Precision@1:

Method | En-German | En-Italian | En-Finnish
Baseline (B) | 40.27 | 39.40 | 26.47
B + identity (I) | 51.73 | 44.07 | 42.63
B + enhanced (E) | 50.33 | 48.40 | 29.63
B + I + E | 55.40 | 47.13 | 43.54
Multilingual Neural Machine Translation
(Zoph et al., 2016; Nguyen et al., 2017; Lee et al., 2017; Dabre et al., 2018)

We want Gujarati → English translation ➔ but little parallel corpus is available

We have a lot of Marathi → English parallel corpus

Marathi, Gujarati → Concatenate Parallel Corpora → Shared Encoder → Shared Attention Mechanism → Shared Decoder → English
Combine Corpora from different languages
(Nguyen and Chiang, 2017)

Gujarati-English corpus:
I am going home | હુ ઘરે જવ છૂ
It rained last week | છેલ્લા આઠવડિયા મા વર્સાદ પાડ્યો

Marathi-English corpus:
It is cold in Pune | पुण्यात थंड आहे
My home is near the market | माझा घर बाजाराजवळ आहे

Convert Script, then Concat Corpora:

I am going home | हु घरे जव छू
It rained last week | छेल्ला आठवडिया मा वर्साद पाड्यो
It is cold in Pune | पुण्यात थंड आहे
My home is near the market | माझा घर बाजाराजवळ आहे
Zeroshot Translation

Training: Marathi → English model

Inference: Konkani → English, using the same model
Training Multilingual NMT systems
Method 1: Joint Training
Sample from parallel corpora C1, C2 to get C1’, C2’ → combine parallel corpora → train

Method 2: Fine-tuning
Train on C2 → model for C2 → fine-tune on C1 → model tuned for C1
Subword-level Representation of Corpora
I am going home | हु घरे जव छू
It rained last week | छे_ ल्ला आठवडि_ या मा वर्सा_ द पाड्यो
It is cold in Pune | पुण्या त थंड आहे
My home is near the market | माझा घर बा_ जारा_ जवळ आहे

• Words don’t match exactly across languages: subwords are needed to utilize lexical similarity
• Possible Representations: Character, character n-grams, syllables, morph, Byte-Pair Encoded (BPE) Units
• BPE is very popular:
• unsupervised segmentation
• language-independent
• Identifies frequent substrings
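How BPE identifies frequent substrings without supervision can be shown with a compact version of the usual merge-learning loop. This is a teaching sketch, not a replacement for a production tokenizer: the function names, the `</w>` end-of-word marker and the toy word counts are assumptions of this example.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged_vocab[key] = merged_vocab.get(key, 0) + freq
    return merged_vocab

def learn_bpe(word_freqs, num_merges):
    """Learn merge operations: repeatedly merge the most frequent adjacent pair."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab
```

When the corpora of related languages are concatenated in a common script before learning merges, shared stems and suffixes tend to surface as shared subword units, which is what lets the model exploit lexical similarity.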
Backtranslation with a high-resource language

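[Figure: two panels. Standard backtranslation: a backward MT system turns monolingual data E’ into synthetic L’, and the forward MT system is trained on L, E together with L’, E’. Modified backtranslation: the backward MT system produces H’ in the related high-resource language instead, and the forward MT system is trained on L, E together with H’, E’.]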
Make Indian Language Representations similar
Marathi, Gujarati → Map Languages → Concatenate Parallel Corpora → Shared Encoder → Shared Attention Mechanism → Shared Decoder → English

Surface form approaches

• Transliteration
• Word-by-word translation
• Word-by-word translation with beam search

Multilingual Embedding approaches

• Multilingual Word Embeddings
• Multilingual Sentence Embeddings
Multilingual BERT (Devlin et al., 2018)

Transformer encoder with masked LM objective – i.e. try to predict masked words
Concat data from all languages
Cross-lingual Language Model Pre-training
(Lample & Conneau, 2019)

• Variant of BERT that adds a translation objective

• Needs parallel corpus
How to make other NLP applications multilingual?

Hindi, Bengali, Telugu → Concatenate training data → Shared Encoder → Application Network → Application Output

• Sentiment Analysis
• Named Entity Recognition
English → Indian Languages
How do we support multiple target languages with a single decoder?
A simple trick: append a special token indicating the target language to the input

Original Input: France and Croatia will play the final on Sunday

Modified Input: France and Croatia will play the final on Sunday <hin>
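The target-language token amounts to a one-line preprocessing step applied to every source sentence before training and inference. The sketch below assumes the `<hin>`-style tag format from the example above; the function names and the corpus layout are made up for illustration.

```python
def tag_for_target(sentence, target_lang):
    """Append a special token telling the shared decoder which language to produce."""
    return f"{sentence} <{target_lang}>"

def build_training_pairs(corpora):
    """Merge per-language parallel corpora into one tagged training set.

    corpora maps a target-language code to a list of (source, target) sentence pairs.
    """
    tagged = []
    for target_lang, sentence_pairs in corpora.items():
        for source, target in sentence_pairs:
            tagged.append((tag_for_target(source, target_lang), target))
    return tagged
```

At inference time the same function selects the output language, so a single model can serve every target language it was trained on.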

[Figure: one forward MT system translating from E into both L and H]

Still an open problem

Backtranslation via Multilingual Model

[Figure: a backward multilingual MT system translates L’ into E’; the forward MT system is trained on E, L together with E’, L’]

Experiment | BLEU
Baseline Bilingual | 19.7
(2) Baseline Multilingual E→X | 22.3
(2) + bilingual backtranslation | 26.1
(2) + multilingual backtranslation | 27.0

English → Spanish with English → French as helper pair

Shared and Private Decoder Hidden Units

• Shared units allow learning common features

• Language-dependent layers capture language
specific information

Experiment | English-Chinese | English-German | English-French
Baseline Bilingual | 44.23 | 27.84 | 41.50
(2) Baseline Multilingual E→X | 44.30 | 26.78 | 41.56
(2) + shared/private units | 45.25 | 27.11 | 41.98
Indian-Indian Language MT
• Syllable as basic translation unit
• Balance between utilizing lexical similarity and word-level information
Results

● Substantial improvement over char-based (46%)

● Significant improvement over strong baselines: WX (10%) & MX (5%)
● Improvement when languages don't belong to same family (contact exists)
● More beneficial when languages are morphologically rich
Pivot Translation for Related Languages

Related languages ⇒ Use subword level translation units

Translation through intermediate language ⇒ Use Pivot based SMT methods
Combine the two approaches
Word-level join results in a sparse table

Why is the syllable-based pivot model better?

• The underlying source-pivot and pivot-target models are better
• Data loss during join is minimized with subword representation

The OS (orthographic syllable) level pivot system outperforms other units
• ~60% improvement over word level
• ~15% improvement over morph level
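The "join" and the data loss it causes can be illustrated with phrase-table triangulation, one common pivot-based SMT method. The slides do not say which pivot method was used, so treat this as an assumed stand-in: it marginalizes over pivot phrases, p(t|s) = Σ_p p(t|p) · p(p|s), and the table layout and names are hypothetical.

```python
from collections import defaultdict

def triangulate(source_pivot, pivot_target):
    """Join a source-pivot table with a pivot-target table on the pivot phrase.

    source_pivot maps (source, pivot) to p(pivot | source).
    pivot_target maps pivot to {target: p(target | pivot)}.
    Source phrases whose pivot phrases are missing from pivot_target are lost.
    """
    table = defaultdict(float)
    for (source, pivot), prob_pivot in source_pivot.items():
        for target, prob_target in pivot_target.get(pivot, {}).items():
            table[(source, target)] += prob_pivot * prob_target
    return dict(table)

# Toy tables (made-up probabilities). Source "b" only translates to pivot "z",
# which the pivot-target table has never seen, so "b" drops out of the join.
source_pivot = {("a", "x"): 0.5, ("a", "y"): 0.5, ("b", "z"): 1.0}
pivot_target = {"x": {"T": 1.0}, "y": {"T": 0.5, "U": 0.5}}
joined = triangulate(source_pivot, pivot_target)
```

With shorter units such as orthographic syllables, the two tables are more likely to share pivot entries, which is one reading of why the slide reports less data loss at the subword level.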
Utilizing Relatedness between
Indian Languages
Orthographic Similarity

Lexical Similarity

Syntactic Similarity
Use Source Re-ordering for Phrase-based SMT
(Kunchukuttan et al., 2014)

Phrase based MT is not good at learning word ordering

Solution: Let’s help PB-SMT with some preprocessing of the input

Change order of words in input sentence to match order of the words in the target
language
Let’s take an example

Bahubali earned more than 1500 crore rupees at the boxoffice

Parse the sentence to understand its syntactic structure

Apply rules to transform the tree
VP → VBD NP PP ⇒ VP → PP NP VBD
PP → IN NP ⇒ PP → NP IN

The new input to the machine translation system is:

Bahubali the boxoffice at 1500 crore rupees earned

Now we can translate with little reordering:
बाहुबली ने बॉक्सऑफिस पर 1500 करोड़ रुपए कमाए
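The two reordering rules from this example can be applied mechanically to a parse tree. The sketch below is a simplified illustration: the tree is hand-written rather than produced by a parser, it is represented as nested `(label, children)` tuples, and the rule lookup assumes the child labels within a rule are distinct. The actual systems cited use larger rule sets.

```python
# Reordering rules: (parent label, child labels) -> new order of child labels.
RULES = {
    ("VP", ("VBD", "NP", "PP")): ("PP", "NP", "VBD"),
    ("PP", ("IN", "NP")): ("NP", "IN"),
}

def reorder(tree):
    """Recursively apply the reordering rules to a (label, children) tree."""
    label, children = tree
    if isinstance(children, str):      # leaf node: children is the word itself
        return tree
    children = [reorder(child) for child in children]
    child_labels = tuple(child[0] for child in children)
    new_order = RULES.get((label, child_labels))
    if new_order is not None:
        by_label = {child[0]: child for child in children}
        children = [by_label[child_label] for child_label in new_order]
    return (label, children)

def leaves(tree):
    """Read the words of the tree from left to right."""
    label, children = tree
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in leaves(child)]

# Hand-built parse of a simplified version of the example sentence.
sentence_tree = ("S", [
    ("NP", "Bahubali"),
    ("VP", [
        ("VBD", "earned"),
        ("NP", "1500 crore rupees"),
        ("PP", [("IN", "at"), ("NP", "the boxoffice")]),
    ]),
])
reordered = " ".join(leaves(reorder(sentence_tree)))
```

Here `reordered` is the pre-ordered English string that is handed to the phrase-based system in place of the original sentence.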
Can we reuse English-Hindi rules for English-Indian languages?

All Indian languages have the same basic word order

(Kunchukuttan et al., 2014)

Generic reordering (Ramanathan et al 2008)
Basic reordering transformation for English→ Indian language translation

Hindi-tuned reordering (Patel et al 2013)

Improvement over the basic rules by analyzing English → Hindi translation output
Bridging Word-order Divergence for low-resource NMT
(Rudramurthy et al., 2019)
[Figure: English and Gujarati → Map Languages → Shared Encoder → Shared Attention Mechanism → Shared Decoder → Hindi]

Goal: Translate Gujarati → Hindi

Given: English → Hindi parallel corpus (E→H)
Little Gujarati → Hindi parallel corpus (G→H)

• Translate English to Gujarati word-by-word ➔ G'→H corpus
• Train on the G'→H corpus
• Fine-tune on the small G→H corpus
Problem: Difference in Gujarati-English word order
Cannot ensure similar Gujarati and English words have similar representations
Solution: Pre-order the English sentence to match Gujarati word order
Same rules work for all Indic languages
Outline

• Motivation

• Relatedness between Indian Languages

• Utilizing Relatedness between Indian Languages

• IndicNLP Library

• Datasets, Services and Standards

• Summary
Indic NLP Library
https://github.com/anoopkunchukuttan/indic_nlp_library
Design Principles
• Design to support maximum number of Indian languages
• Utilize similarity between Indian languages for scaling to multiple
Indian languages
• Modular and Extensible
• Ease of use:
• Installation
• Consistent Use
• Separation between code and data resources
Capabilities
• Text Normalizer
• Sentence Splitter
• Word Tokenizer
• Word Detokenizer
• Word Segmenter
• Syllabification
• Query Script Information
• Phonetic Similarity
• Script Converter
• Romanization
• Indicization
• Transliteration
• Acronym Transliterator
• Statistical Machine Translation
• Lexical Similarity
Library Initialization
Outline

• Motivation

• Relatedness between Indian Languages

• Utilizing Relatedness between Indian Languages

• IndicNLP Library

• Datasets, Services and Standards

• Summary
Indic Standards & Datasets
Enable sharing of data and annotations
Standards
Important to ensure sharing of data and annotations

Necessary to build multilingual NLP systems

• Unicode: codifies Indic script commonalities

• BIS POS Tag Set: hierarchical tagset suitable for Indian languages
• Universal Dependencies: universally accepted tagset for many languages
• IndoWordNet: sense repository for Indian languages
Catalog of Indian Language NLP Resources

https://github.com/indicnlpweb/indicnlp_catalog

Evolving, collaborative catalog of Indian language NLP resources

Please add resources you know of and send a pull request

Commercial Offerings for Indian Languages
Offering | Microsoft | Google | Amazon
Translation | Yes | Yes | Yes
Transliteration | Yes | No | No
Information Extraction | No | No | No

Information Extraction includes entity recognition, intent recognition, sentiment analysis, relation extraction, POS tagging, syntactic parsing, etc.
Outline

• Motivation

• Relatedness between Indian Languages

• Utilizing Relatedness between Indian Languages

• IndicNLP Library

• Datasets, Services and Standards

• Summary
Thank You!
http://www.cse.iitb.ac.in/~anoopk

NLP Unit I Notes-1
75% (4)
NLP Unit I Notes-1
22 pages
L2 Challenges in NLP
No ratings yet
L2 Challenges in NLP
18 pages
Natural Language Processing
100% (6)
Natural Language Processing
49 pages
Language Models and Application of Natural Language Processing
No ratings yet
Language Models and Application of Natural Language Processing
70 pages
Tamil Notes-English
No ratings yet
Tamil Notes-English
58 pages
2023 Dravidianlangtech-1
No ratings yet
2023 Dravidianlangtech-1
330 pages
IndicSpeech Text-To-Speech Corpus For Indian Languages
100% (1)
IndicSpeech Text-To-Speech Corpus For Indian Languages
6 pages
The LTRC Hindi-Telugu Parallel Corpus
No ratings yet
The LTRC Hindi-Telugu Parallel Corpus
8 pages
Indian Language Tools & Resources
No ratings yet
Indian Language Tools & Resources
26 pages
Heritage of Tamils - Unit 1
No ratings yet
Heritage of Tamils - Unit 1
50 pages
Languages of India (Sheet 1)
No ratings yet
Languages of India (Sheet 1)
7 pages
1 2 3 4 5 6 7 8 9 Merged
No ratings yet
1 2 3 4 5 6 7 8 9 Merged
221 pages
Wintersc Iitguwahati Multilingual Model Jan25
No ratings yet
Wintersc Iitguwahati Multilingual Model Jan25
81 pages
Natural Language Processing (NLP) Research at Boston University
No ratings yet
Natural Language Processing (NLP) Research at Boston University
68 pages
Unit 1 Heritage of Tamil English Version
100% (1)
Unit 1 Heritage of Tamil English Version
75 pages
Indo Language
No ratings yet
Indo Language
16 pages
Linguistics History of India
100% (1)
Linguistics History of India
28 pages
Lost in Translation: Large Language Models in Non-English Content Analysis
No ratings yet
Lost in Translation: Large Language Models in Non-English Content Analysis
50 pages
CH 6. Applications of AI-NLP
No ratings yet
CH 6. Applications of AI-NLP
65 pages
The Many Languages of India
No ratings yet
The Many Languages of India
3 pages
An Effective Bi-LSTM Word Embedding System For Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
No ratings yet
An Effective Bi-LSTM Word Embedding System For Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
13 pages
1 cs626 Intro 27jul22
No ratings yet
1 cs626 Intro 27jul22
44 pages
Dravidian Wordnet: S.Arulmozi Dravidian University
No ratings yet
Dravidian Wordnet: S.Arulmozi Dravidian University
26 pages
HoT English Unit - 1
No ratings yet
HoT English Unit - 1
64 pages
Lecture-1-Introduction To Natural Language Processing-2021
No ratings yet
Lecture-1-Introduction To Natural Language Processing-2021
46 pages
23ta 1101 - Heritage of Tamil - Eng Version-1
No ratings yet
23ta 1101 - Heritage of Tamil - Eng Version-1
61 pages
Mod 1
No ratings yet
Mod 1
71 pages
Learning Word Vectors For 157 Languages
No ratings yet
Learning Word Vectors For 157 Languages
5 pages
Deep Learning For Dravidian Codemix Problem
No ratings yet
Deep Learning For Dravidian Codemix Problem
10 pages
(IJCST-V11I6P2) :ms. Madhuri P. Narkhede, Dr. Harshali B Patil
No ratings yet
(IJCST-V11I6P2) :ms. Madhuri P. Narkhede, Dr. Harshali B Patil
5 pages
NLP Unit 1
No ratings yet
NLP Unit 1
56 pages
Ltrc25 Nlpscale Indic Dec24
No ratings yet
Ltrc25 Nlpscale Indic Dec24
18 pages
Unit 1
No ratings yet
Unit 1
23 pages
Case Study
No ratings yet
Case Study
4 pages