Natural Language Processing For Indian Language: NASSCOM 2019
Anoop Kunchukuttan
Microsoft AI & Research
Machine Translation & Speech Group, Hyderabad
• Motivation
• IndicNLP Library
• Summary
Diversity of Indian Languages
• 1600 dialects
• 22 scheduled languages
• 125 million English speakers
• 8 languages in the world’s top 20 languages
• 11 languages with more than 25 million speakers
• 30 languages with more than 1 million speakers
Sources: Wikipedia, Census of India 2011
Indian Languages on the Internet
• Information Extraction & Categorization
• Question & Answering
• Transliteration
• Entity Identification
• Entity Linking
Machine Learning is the dominant NLP Paradigm
An ML Pipeline for Text Classification
Training Pipeline: Training set (Text Instance, Class: Positive / Negative) → Train Classifier → Model f(x)
Test Pipeline: Text Instance → Decision Function sign(f(x)) → Class ?
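The training and test pipelines above can be sketched in a few lines of Python. This is only a minimal illustration: the bag-of-words features, the perceptron update rule, the toy sentences, and the names `featurize`, `train`, and `predict` are all assumptions, since the slides do not specify a particular classifier.

```python
from collections import defaultdict

def featurize(text):
    """Bag-of-words features: token -> count (an assumed feature map)."""
    feats = defaultdict(float)
    for tok in text.lower().split():
        feats[tok] += 1.0
    return feats

def score(weights, feats):
    """f(x): a linear score over the features."""
    return sum(weights.get(tok, 0.0) * v for tok, v in feats.items())

def train(training_set, epochs=10):
    """Perceptron training: update the weights on every misclassified instance."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in training_set:  # label is +1 (positive) or -1 (negative)
            feats = featurize(text)
            if label * score(weights, feats) <= 0:
                for tok, v in feats.items():
                    weights[tok] += label * v
    return weights

def predict(weights, text):
    """Decision function sign(f(x)); a score of zero is treated as negative."""
    return 1 if score(weights, featurize(text)) > 0 else -1

# Hypothetical toy training set.
training_set = [
    ("the movie was great", +1),
    ("a great and moving film", +1),
    ("the movie was terrible", -1),
    ("a dull and terrible film", -1),
]
model = train(training_set)
print(predict(model, "great film"), predict(model, "terrible movie"))
```

Every language needs its own labelled training set in this setup, which is the scalability problem the next slide raises.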
Scalability Challenges in ML solutions
Classification: ? → Positive / Negative / Neutral
Machine Translation System: A boy is sitting in the kitchen → एक लड़का रसोई में बैठा है
Tamil, Punjabi
Unified Approaches to Indic NLP
• Algorithms & Methods
• Services/APIs
• Datasets
A Typical Deep Learning NLP Pipeline
Text → Tokens → Token Embeddings → Text Embedding → Application specific Deep Neural Network layers → Output (text or otherwise)
How do we transfer information across languages?
A Typical Multilingual NLP Pipeline
• Similar tokens across languages should have similar embeddings
• Pre-process to facilitate similar embeddings across languages?
Relatedness between Indian Languages
Why are Indian languages related?
Related Languages
• Language Families: Dravidian, Indo-European, Turkic (Jones, Rasmus, Verner, 18th & 19th centuries, Raymond ed. (2005))
• Linguistic Areas: Indian Subcontinent, Standard Average European (Trubetzkoy, 1923)
Basis of classification
Plus
Andamanese family
Unknown language of the Sentinelese
Cognates in Indian Languages (Indo-Aryan)
English | Vedic Sanskrit | Hindi | Punjabi | Gujarati | Marathi | Odia | Bengali
bread | Rotika | chapātī, roṭī | roṭi | paũ, roṭlā | chapāti, poli, bhākarī | pauruṭi | (pau-)ruṭi
fish | Matsya | Machhlī | machhī | māchhli | māsa | mācha | machh
hunger | bubuksha, kshudhā | Bhūkh | pukh | bhukh | bhūkh | bhoka | khide
language | bhāshā, vāNī | bhāshā, zabān | boli, zabān, pasha | bhāshā | bhāshā | bhāsā | bhasha
ten | Dasha | Das | das, daha | das | dahā | dasa | dôsh
• Not all features may be shared by all languages in the linguistic area
Examples of linguistic areas:
• Indian Subcontinent (Emeneau, 1956; Subbarao, 2012)
• Balkans
Borrowed Words
Marathi (segmented): भारता च्या स्वातंत्र्य दिना निमित्त अमेरिके तील लॉस एन्जल्स शहरा त कार्यक्रम आयोजित करण्यात आला
bhAratA cyA svAta.ntrya dinA nimitta amerike tIla lOsa enjalsa shaharA ta kAryakrama Ayojita karaNyAta AlA
Hindi: भारत के स्वतंत्रता दिवस के अवसर पर अमरीका के लॉस एन्जल्स शहर में कार्यक्रम आयोजित किया गया
bhArata ke svata.ntratA divasa ke avasara para amarIkA ke losa enjalsa shahara me.n kAryakrama Ayojita kiyA gayA
Short words:
jaLa → jAla
$$\mathrm{LCSR}(L_1, L_2) = \frac{1}{|P(L_1, L_2)|} \sum_{(s_1, s_2) \in P(L_1, L_2)} \mathrm{LCSR}(s_1, s_2)$$
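The language-level score above averages a string-level LCSR over paired strings. A minimal sketch follows, assuming the usual definition of the Longest Common Subsequence Ratio for two strings (LCS length divided by the length of the longer string) and assuming that P(L1, L2) is a collection of paired strings in a common romanization; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(s1, s2):
    """String-level score: LCS(s1, s2) / max(|s1|, |s2|)."""
    if not s1 and not s2:
        return 0.0
    return lcs_length(s1, s2) / max(len(s1), len(s2))

def corpus_lcsr(pairs):
    """Language-level score: the mean LCSR over all pairs in P(L1, L2)."""
    pairs = list(pairs)
    return sum(lcsr(s1, s2) for s1, s2 in pairs) / len(pairs)

print(lcsr("jaLa", "jAla"))
```

Short words such as the jaLa / jAla pair are sensitive to single-character differences, because each mismatch removes a large share of the score.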
Syntactic Similarity
• Almost all Indian languages have SOV word order
• SOV word order determines relative order between:
• Noun-adposition
• Noun-genitive
• Noun-Relative clause
• Verb-Auxiliary
• Word order plays a very important role in most NLP applications
• Language Modelling
• Machine Translation
• Relatively Free Word Order
Morphological Similarity
• Inflectionally rich
• Sometimes agglutinative
• Function words with largely 1-1 correspondence
• Similar internal word structure and compositional semantics
• Similar case-marking systems
Orthographic Similarity
(CONSONANT)+ VOWEL
Pseudo-Syllable
True Syllable ⇒ Onset, Nucleus and Coda
Orthographic Syllable ⇒ Onset, Nucleus
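The (CONSONANT)+ VOWEL pattern can be illustrated with a regular expression over Devanagari code points. This is a simplified sketch, not the Indic NLP Library's syllabifier: it ignores the nukta and rarer signs, treats a consonant without a vowel sign as carrying its inherent vowel, and the name `orthographic_syllables` is hypothetical.

```python
import re

# Devanagari code point classes (simplified).
CONSONANT = "[\u0915-\u0939]"          # ka .. ha
VIRAMA = "\u094D"                      # joins consonants into a cluster
VOWEL_SIGN = "[\u093E-\u094C]"         # dependent vowel signs (matras)
MODIFIER = "[\u0901-\u0903]"           # chandrabindu, anusvara, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

ORTHO_SYLLABLE = re.compile(
    # consonant cluster (onset) + optional vowel sign (nucleus) + optional modifier
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}{VOWEL_SIGN}?{MODIFIER}?"
    # or a stand-alone vowel
    f"|{INDEPENDENT_VOWEL}{MODIFIER}?"
    # fallback: any other character is its own unit
    "|.",
    re.DOTALL,
)

def orthographic_syllables(word):
    """Split a Devanagari word into orthographic syllables (onset + nucleus)."""
    return ORTHO_SYLLABLE.findall(word)

print(orthographic_syllables("कार्यक्रम"))
```

Because there is no coda, a consonant that closes a spoken syllable is attached to the onset of the following orthographic syllable.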
Benefits due to script design
• Common design and standardization enables easy conversion from one script to another
• Makes exploiting lexical similarity possible
• Phonetic scripts: help capture similarity between characters
India as a linguistic area gives us robust reasons
for writing a common or core grammar of many of
the languages in contact
~ Anvita Abbi
Utilizing Relatedness between Indian Languages
Orthographic Similarity
Lexical Similarity
Syntactic Similarity
Utilizing Orthographic Similarity
Script Conversion
• Read any script in any script
• Unicode standard enables consistent script conversion
केरला
কেরলা કેરલા
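Script conversion can be sketched as an offset shift between Unicode blocks, since the major Indic scripts are laid out in parallel 128-code-point blocks. This is a simplified illustration, not the Indic NLP Library's converter: the function name `convert_script` and the script keys are assumptions, and script-specific gaps (for example, characters that a target script does not have) are not handled.

```python
# Start of each script's Unicode block.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80
SHARED_OFFSETS = (0x64, 0x65)  # danda and double danda are shared across scripts

def convert_script(text, source, target):
    """Map each character to the same offset in the target script's block."""
    src, tgt = SCRIPT_BASE[source], SCRIPT_BASE[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        if 0 <= offset < BLOCK_SIZE and offset not in SHARED_OFFSETS:
            out.append(chr(tgt + offset))
        else:
            out.append(ch)  # leave spaces, punctuation, and other scripts as-is
    return "".join(out)

print(convert_script("केरला", "devanagari", "bengali"))
print(convert_script("केरला", "devanagari", "gujarati"))
```

A production converter needs extra rules where the blocks are not perfectly parallel; the offset shift only shows why a common design makes conversion easy.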
Multilingual Acronym Generation
Simple application of script conversion
ए सी एल
ACL
এ সী এল
ಏ ಸೀ ಏಲ
Multilingual Transliteration
Train a joint transliteration model for multiple Indian languages to English & vice-versa
Hindi → English corpus
Learn:
Transliteration model (TX) for source
language (F) to target language (E)
Inputs:
• Monolingual word lists (WF and WE)
• Phonetic Representations of words
• Represent each Indic character as a feature vector
• Rule-based transliteration and the use of linguistic priors outperform Ravi & Knight’s (2009) model
• Significant increase in top-1 accuracy over the rule-based system
• Good top-10 accuracy, which the rule-based system cannot provide
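The idea of representing each Indic character as a phonetic feature vector can be illustrated with a toy example. The four binary features and the five Devanagari consonants below are a hand-picked assumption for illustration only; they are not the feature set used in the work summarized above.

```python
import math

# Toy binary features: (velar, labial, voiced, aspirated).
PHONETIC_VECTORS = {
    "\u0915": (1, 0, 0, 0),  # ka:  velar, unvoiced, unaspirated
    "\u0916": (1, 0, 0, 1),  # kha: velar, unvoiced, aspirated
    "\u0917": (1, 0, 1, 0),  # ga:  velar, voiced, unaspirated
    "\u092a": (0, 1, 0, 0),  # pa:  labial, unvoiced, unaspirated
    "\u092c": (0, 1, 1, 0),  # ba:  labial, voiced, unaspirated
}

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def phonetic_similarity(c1, c2):
    """Similarity of two characters according to their phonetic features."""
    return cosine(PHONETIC_VECTORS[c1], PHONETIC_VECTORS[c2])

# ka is closer to kha (same place of articulation) than to pa.
print(phonetic_similarity("\u0915", "\u0916"), phonetic_similarity("\u0915", "\u092a"))
```

Because the features describe sounds rather than glyphs, the same vectors can be reused for the corresponding characters of other Indic scripts.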
Syllable-based Transliteration
(Atreya, et al 2015)
Lexical Similarity
Syntactic Similarity
Multilingual Word Embeddings
water
hydrogen
oxygen
A standard intrinsic evaluation task for judging the quality of cross-lingual embeddings
The case of related languages
Concat
• Concat monolingual corpora and train embeddings
• Same words will have same embeddings
• Subword information in both languages considered by FastText
Identity
• For identical words, just assign corresponding embedding for word in other language
embedding(ghar,marathi) = embedding (ghar,hindi)
ghar
Original embedding Char co-occurrence
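The Identity approach can be sketched as a dictionary copy: any word that appears in both vocabularies receives the other language's vector. The toy two-dimensional vectors, the romanized words, and the function name `share_identical_embeddings` are hypothetical.

```python
def share_identical_embeddings(src_emb, tgt_vocab, tgt_emb=None):
    """Give every target word that also exists in the source vocabulary
    the source-language vector; other target words keep their own vector."""
    shared = dict(tgt_emb or {})
    for word in tgt_vocab:
        if word in src_emb:
            shared[word] = src_emb[word]
    return shared

# Hypothetical toy embeddings.
hindi = {"ghar": [0.2, 0.7], "bazar": [0.9, 0.1]}
marathi_vocab = ["ghar", "bazar", "javal"]
marathi = share_identical_embeddings(hindi, marathi_vocab)
print(marathi["ghar"] == hindi["ghar"])
```

Words with no identical counterpart ("javal" here) get no vector from this step and must be handled by another method.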
Evaluation (Precision@1)
Method | En-German | En-Italian | En-Finnish
Baseline (B) | 40.27 | 39.40 | 26.47
B + identity (I) | 51.73 | 44.07 | 42.63
B + enhanced (E) | 50.33 | 48.40 | 29.63
B + I + E | 55.40 | 47.13 | 43.54
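Precision@1 can be computed as follows, assuming the common setup in which each source word is mapped to its nearest target word by cosine similarity and counted as correct if that word is a gold translation. The toy vectors and the gold dictionary are hypothetical; the function returns a fraction, so multiply by 100 to compare with percentage-style scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def precision_at_1(src_emb, tgt_emb, gold):
    """Fraction of source words whose nearest target word is a gold translation."""
    correct = 0
    for src_word, translations in gold.items():
        nearest = max(tgt_emb, key=lambda w: cosine(src_emb[src_word], tgt_emb[w]))
        correct += nearest in translations
    return correct / len(gold)

# Hypothetical cross-lingual embeddings in a shared space.
english = {"water": (1.0, 0.0), "oxygen": (0.0, 1.0), "hydrogen": (0.5, 0.5)}
german = {"Wasser": (0.9, 0.1), "Sauerstoff": (0.2, 0.8), "Wasserstoff": (0.6, 0.5)}
gold = {"water": {"Wasser"}, "oxygen": {"Sauerstoff"}, "hydrogen": {"Wasserstoff"}}
print(precision_at_1(english, german, gold))
```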
Multilingual Neural Machine Translation
(Zoph et al., 2016; Nguyen et al., 2017; Lee et al., 2017; Dabre et al., 2018)
Marathi
Gujarati
Combine Corpora from different languages
(Nguyen and Chiang, 2017)
Convert Script
Concat Corpora
I am going home | हु घर जव छू
It rained last week | छेल्ला आठवडिया मा वर्साद पाड्यो
It is cold in Pune | पुण्यात थंड आहे
My home is near the market | माझा घर बाजाराजवळ आहे
Zero-shot Translation
Training: Marathi → Model → English
Inference: Konkani → Model → English
Training Multilingual NMT systems
Method 1, Joint Training: Sample from Parallel Corpora C1, C2 to get C1’, C2’ → Combine Parallel Corpora → Train
Method 2, Fine-tuning: Train on C2 (Model for C2) → Finetune on C1 (Model tuned for C1)
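The data preparation for joint training (Method 1) can be sketched as sampling from each parallel corpus and combining the samples. Drawing an equal number of sentence pairs from every corpus is an assumption made for this sketch; the slides do not state the sampling scheme, and the function names are hypothetical.

```python
import random

def sample_corpus(corpus, size, rng):
    """Draw `size` sentence pairs; sample with replacement if the corpus is smaller."""
    if size <= len(corpus):
        return rng.sample(corpus, size)
    return [rng.choice(corpus) for _ in range(size)]

def build_joint_corpus(corpora, size_per_corpus, seed=0):
    """Sample C1', C2', ... from C1, C2, ... and combine them into one training set."""
    rng = random.Random(seed)
    combined = []
    for corpus in corpora:
        combined.extend(sample_corpus(corpus, size_per_corpus, rng))
    rng.shuffle(combined)
    return combined

# Hypothetical toy corpora of (source, target) pairs: C1 is small, C2 is large.
c1 = [("src-a1", "tgt-a1"), ("src-a2", "tgt-a2")]
c2 = [(f"src-b{i}", f"tgt-b{i}") for i in range(10)]
joint = build_joint_corpus([c1, c2], size_per_corpus=4)
print(len(joint))
```

Oversampling the smaller corpus keeps the low-resource language from being swamped by the larger one during training.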
Subword-level Representation of Corpora
I am going home | हु घर जव छू
It rained last week | छे_ ल्ला आठवडि_ या मा वर्सा_ द पाड्यो
It is cold in Pune | पुण्या त थंड आहे
My home is near the market | माझा घर बा_ जारा_ जवळ आहे
• Words don’t match exactly across languages: subwords are needed to utilize lexical similarity
• Possible representations: characters, character n-grams, syllables, morphs, Byte-Pair Encoded (BPE) units
• BPE is very popular:
• unsupervised segmentation
• language-independent
• Identifies frequent substrings
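The way BPE identifies frequent substrings can be sketched with a toy learner that repeatedly merges the most frequent adjacent symbol pair. This is a minimal illustration of the general technique on romanized toy words, not a production tokenizer; the end-of-word marker `</w>` and the function names are assumptions of the sketch.

```python
from collections import Counter

END = "</w>"  # end-of-word marker

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (with repetitions)."""
    vocab = Counter(tuple(word) + (END,) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical romanized toy vocabulary shared by two related languages.
words = ["bajar", "bajarat", "bajarajaval", "ghar", "gharat"]
merges = learn_bpe(words, num_merges=5)
print(segment("bajarat", merges))
```

Learning the merges on the concatenated corpora of related languages lets shared substrings become shared subword units.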
Backtranslation with a high-resource language
Standard backtranslation: Backward MT System (H, E) applied to L’ gives E’ → synthetic pairs (E’, L’)
Modified backtranslation: Backward MT System (H, E) applied to H’ gives E’ → synthetic pairs (E’, H’)
Make Indian Language Representations similar
Marathi
Gujarati
Gujarati
Transformer encoder with masked LM objective – i.e. try to predict masked words
Concat data from all languages
Cross-lingual Language Model Pre-training
(Lample & Conneau, 2019)
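The masked LM objective can be illustrated by the data-side step that hides tokens and records what the model must predict. This is a simplified sketch: it always substitutes a `[MASK]` symbol and uses a 15% masking rate, both assumed here for illustration, and the function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, targets): masked positions carry the original token
    as the prediction target, all other positions have target None."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets

# Sentences from all languages are concatenated into one training stream.
sentence = "France and Croatia will play the final on Sunday".split()
inputs, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(inputs)
```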
Hindi, Bengali, Telugu → Concatenate training data → Shared Encoder → Application Network → Application Output
• Sentiment Analysis
• Named Entity Recognition
English → Indian Languages
How do we support multiple target languages with a single decoder?
A simple trick: append a special token indicating the target language to the input
Original Input: France and Croatia will play the final on Sunday
Modified Input: France and Croatia will play the final on Sunday <hin>
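The target-language token can be added in a small preprocessing step. The sketch below follows the `<hin>` example above; the function names and the other language codes are hypothetical.

```python
def add_target_token(sentence, target_lang):
    """Append a special token that tells the decoder which language to produce."""
    return f"{sentence} <{target_lang}>"

def prepare_examples(sentence, target_langs):
    """Create one tagged copy of the source sentence per target language."""
    return [add_target_token(sentence, lang) for lang in target_langs]

source = "France and Croatia will play the final on Sunday"
for example in prepare_examples(source, ["hin", "ben", "tel"]):
    print(example)
```

The model itself is unchanged: the token is treated as one more input symbol, which is why a single decoder can serve several target languages.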
Multilingual backtranslation: L’ → Backward Multi-lingual MT System → E’; (E’, L’) together with (E, L) → Forward MT System
Experiment | BLEU
Baseline Bilingual | 19.7
(2) Baseline Multilingual E→X | 22.3
(2) + bilingual backtranslation | 26.1
(2) + multilingual backtranslation | 27.0
Lexical Similarity
Syntactic Similarity
Use Source Re-ordering for Phrase-based SMT
(Kunchukuttan et al., 2014)
Change the order of words in the input sentence to match the order of the words in the target language
Let’s take an example
Bahubali earned more than 1500 crore rupees at the boxoffice
Parse the sentence to understand its syntactic structure
Apply rules to transform the tree
VP → VBD NP PP ⇒ VP → PP NP VBD
PP → IN NP ⇒ PP → NP IN
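The two rewrite rules can be applied to a parse tree with a short recursive function. In this sketch the parse of the example sentence is written by hand (no parser is run), the tree is a nested `(label, children)` structure chosen for illustration, and the function names are hypothetical.

```python
# Rule table: (parent label, child labels) -> new order of the children.
RULES = {
    ("VP", ("VBD", "NP", "PP")): (2, 1, 0),  # VP -> VBD NP PP  =>  VP -> PP NP VBD
    ("PP", ("IN", "NP")): (1, 0),            # PP -> IN NP      =>  PP -> NP IN
}

def reorder(tree):
    """Apply the reordering rules bottom-up to a (label, children) tree."""
    label, children = tree
    if isinstance(children, str):  # leaf: (POS tag, word)
        return tree
    children = [reorder(child) for child in children]
    order = RULES.get((label, tuple(child[0] for child in children)))
    if order is not None:
        children = [children[i] for i in order]
    return (label, children)

def words(tree):
    """Read the words off the tree from left to right."""
    label, children = tree
    if isinstance(children, str):
        return [children]
    return [w for child in children for w in words(child)]

# Hand-built parse of the example sentence (an assumption of this sketch).
tree = ("S", [
    ("NP", [("NNP", "Bahubali")]),
    ("VP", [
        ("VBD", "earned"),
        ("NP", [("JJR", "more"), ("IN", "than"), ("CD", "1500"),
                ("NN", "crore"), ("NNS", "rupees")]),
        ("PP", [("IN", "at"),
                ("NP", [("DT", "the"), ("NN", "boxoffice")])]),
    ]),
])
print(" ".join(words(reorder(tree))))
```

The reordered English sentence follows the verb-final, postpositional order of the target language, so the phrase-based system has less long-distance reordering to learn.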
Map Languages → Shared Encoder → Shared Attention Mechanism → Hindi Decoder
Indic NLP Library
https://github.com/anoopkunchukuttan/indic_nlp_library
Design Principles
• Designed to support the maximum number of Indian languages
• Utilize similarity between Indian languages for scaling to multiple Indian languages
• Modular and Extensible
• Ease of use:
• Installation
• Consistent Use
• Separation between code and data resources
Capabilities
• Text Normalizer
• Sentence Splitter
• Word Tokenizer
• Word Detokenizer
• Word Segmenter
• Syllabification
• Query Script Information
• Script Converter
• Romanization
• Indicization
• Transliteration
• Acronym Transliterator
• Statistical Machine Translation
• Lexical Similarity
• Phonetic Similarity
Library Initialization
Indic Standards & Datasets
Enable sharing of data and annotations
Standards
Important to ensure sharing of data and annotations
https://github.com/indicnlpweb/indicnlp_catalog
Transliteration | Yes | No | No
Information Extraction | No | No | No
Thank You!
http://www.cse.iitb.ac.in/~anoopk