AI and Language models: towards a robust Sanskrit LLM

Girish Nath Jha


Professor of Computational Linguistics
School of Sanskrit and Indic Studies, JNU
&
Concurrent Faculty
School of Engineering, Center for Linguistics, Special Center of E-Learning,
JNU

4/4/2025 "AI and Language models: towards a robust Sanskrit LLM", 1


Hansraj College (online) 4 Apr 25
- Reading
- Speaking
- Listening
- Comprehension

- These are essentially bidirectional (parsing/generation) activities
- Very complex cognitive processes are involved
- How can AI help?

What is AI and how is it done today?

- Standards, data collection, annotation
- Training a model (preprocessing, tokenization, model selection, training)
- Cloud or local infrastructure/data center
- Testing (automated, manual)
- Tuning
- Re-training (if needed)
- Distillation (pruning, compacting, etc.)
- Staging
- Deployment (hosting)
- Use/feedback
- Retraining (with additional data if needed)

Central problem in AI

John McCarthy, who coined the term “AI” in 1955 (in the proposal for the 1956 Dartmouth workshop), said “AI can happen only if machines understand natural language texts”

- Natural language text has layers of structure and embedded meanings, to be determined locally or elsewhere

Evolution of Intelligent computing

Background (adapted from Dafydd Gibbon (2013))

40s → encryption, decryption, neural automata, neural networks, neuro-linguistics
50s → Machine Translation, dictionaries, text utilities (concordances)
60s → Theoretical informatics, complexity, natural language parsing, speech
70s → psycholinguistic interpretations of parsers/generators
80s-90s → logic, inference, unification, NLIs, bi/multi-modal interfaces
2000-2010 → Web, resources, big data
Future → ???

Broader areas of study under AI

- CL/NLP
- Inference engines
- Expert Systems
- Intelligent Tutoring Systems
- Vision Machines
- Robotics
- ...

CL/NLP and Human Computer Interaction

Human Computer Interaction (HCI) and NLP
- Conventional HCI
- Intelligent HCI (HCII): a human interacts with the machine through human (read: intelligent) means of communication
- One of the objectives of CL/NLP is to make this happen (if the means of communication is language)

AI in India: language roadblocks

Language families*

- Indo-Aryan: 76.87%
- Dravidian: 20.82%
- Austro-Asiatic: 1.11%
- Tibeto-Burman: 1%
- Andamanese*: 0%

Scheduled languages and scripts

Language Technology
- AI in Indian languages

Our requirements
- Basic I/O tools for all Indian languages
  - OCR, OLHWR, ASR, TTS, smart keyboards
- Real-time MT
  - Language to language (unimodal)
  - Speech to speech (bimodal)
- Text simplification
- Dialogue recognition
- Multimodal technologies
- Resources for creating language technology
- Indigenous algorithms
- Indigenous data centers
- India’s own cost-effective, flexible multilingual AI

How are we going to do this?

AI and Linguistics

- Phonetics
- Phonology
- Morphology
- Syntax
- Semantics

Why is Sanskrit important?

- Panini
- Sanskrit
- Linguistic tradition
- Sanskrit and AI → Rick Briggs
- Sanskrit as foundation of COLING → Nicholas Ostler

Similarity in methods

Methods used in Indian knowledge tradition

- Quantitative (maths, natural sciences, linguistics)
- Experimental (natural sciences)
- Observational (all disciplines)
- Descriptive in general
- Generative in grammar

Theories of knowledge
- Definition and classification of knowledge
  - Prama/aprama
  - Buddhist
  - Vedantist
  - Naiyayika
- AI explanation

Methods of interpretation

- sutra → vrtti, vyakhya, bhashya, shastra
- tantra-yukti
  - compose good texts by removing tantra-doshas
  - obtain the correct, unambiguous meaning of a text
  - connect sentences for clarification of meaning
    - vakya yojana, artha yojana
- shaabda-bodha
  - akanksha, yogyata, asatti, tatparya

Methods of argumentation
- purva paksha
  - knowing the opponent's argument and finding flaws in it
- uttara paksha
  - proposing the new (supposedly flawless) argument
- siddhanta
  - established theory, truth, vaada
- nigraha sthana
  - points of defeat in the debate/argumentation

methods/techniques in NLP/AI

- All relevant methods from linguistics, CS, statistics, probability
- Methods used in programming and databases
- Corpora-based techniques
- Rule-based vs. ML-based techniques
  - Example: formal grammar-lexicon
  - ML-based methods

(the good old) Formal grammar/lexicon-based models

- AD (Ashtadhyayi) → Panini (c. 700 BCE; often dated to the 5th-4th century BCE)
- LFG (Lexical Functional Grammar) → Bresnan (1982)
- FUG (Functional Unification Grammar) → Martin Kay (1984)
- TAG (Tree Adjoining Grammar) → Arvind Joshi and Schabes (1992)
- HPSG (Head-driven Phrase Structure Grammar) → Pollard and Sag (1994)

Until the 90s, the method involved using formal grammar and lexicon…

Today the focus has shifted to corpora and ML-based methods and algorithms

ML for NLP/AI

Conventional and ML-based programs used in AI/CL

CL/NLP/AI – integrated platforms
- OpenNLP (Java-based platform)
- NLTK (Python-based)
- OpenNMT (Python/TensorFlow-based)
- BERT (Bidirectional Encoder Representations from Transformers)
- MuRIL: Multilingual Representations for Indian Languages (TensorFlow)
- IndicNLP (BERT-based platform)
- ILCIANN for Indian languages

Why is Panini important in AI?

Panini’s Grammar – systemic view
- Phonetic component
  - Phonemes: 14 Shivasutras
  - Pratyahara (dynamic sound classes)
- Rulebase → about 4000 grammar rules
- Lexica
  - Verbs database
  - Nominals database
- Lists
  - Affixes
  - Rule-specific entries
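
The pratyahara mechanism listed above can be sketched in a few lines of Python. This is only an illustration, assuming a simplified romanized encoding of the 14 Shivasutras in which each sutra is a list of sounds followed by a marker letter. The names `SHIVA_SUTRAS` and `expand_pratyahara` are hypothetical, and the sketch always stops at the first matching marker, so it does not model the traditional ambiguity of the marker Ṇ, which occurs twice.

```python
# Each entry is (sounds, marker). The marker closes the sutra and is not
# itself a member of any sound class.
SHIVA_SUTRAS = [
    (["a", "i", "u"], "Ṇ"),
    (["ṛ", "ḷ"], "K"),
    (["e", "o"], "Ṅ"),
    (["ai", "au"], "C"),
    (["h", "y", "v", "r"], "Ṭ"),
    (["l"], "Ṇ"),
    (["ñ", "m", "ṅ", "ṇ", "n"], "M"),
    (["jh", "bh"], "Ñ"),
    (["gh", "ḍh", "dh"], "Ṣ"),
    (["j", "b", "g", "ḍ", "d"], "Ś"),
    (["kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t"], "V"),
    (["k", "p"], "Y"),
    (["ś", "ṣ", "s"], "R"),
    (["h"], "L"),
]


def expand_pratyahara(start, marker):
    """Return the sounds from the first `start` up to the first later `marker`."""
    result = []
    collecting = False
    for sounds, end_marker in SHIVA_SUTRAS:
        for sound in sounds:
            if not collecting and sound == start:
                collecting = True
            if collecting:
                result.append(sound)
        if collecting and end_marker == marker:
            return result
    raise ValueError(f"no pratyahara {start}{marker}")


print(expand_pratyahara("a", "C"))  # the vowels: a i u ṛ ḷ e o ai au
print(expand_pratyahara("i", "K"))  # i u ṛ ḷ
```

Two symbols are enough to name a whole class of sounds, which is why the slide calls these classes dynamic: the class is computed from the list rather than stored.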

Panini’s System
- more formal
  - Largely unambiguous procedures → easier programming
  - Structure similar to a program
- Variable instantiation
- Vriddhi (evaluation/expansion)
- PS rules and replacement procedures may have been influenced by Panini
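
One way to read the program analogy: Panini's opening rule (vṛddhir ādaic, 1.1.1) defines the term vṛddhi as the sounds ā, ai, au, much like a named constant that later rules refer to. The sketch below is a hedged illustration of that reading, not the author's implementation. The table `VRDDHI` and the function `vrddhi_first_vowel` are hypothetical names, and the input is assumed to be already split into phonemes.

```python
# Vṛddhi grade of each vowel (a simplified table).
VRDDHI = {
    "a": "ā", "ā": "ā",
    "i": "ai", "ī": "ai", "e": "ai", "ai": "ai",
    "u": "au", "ū": "au", "o": "au", "au": "au",
    "ṛ": "ār",
}


def vrddhi_first_vowel(phonemes):
    """Return a copy of `phonemes` with its first vowel raised to vṛddhi."""
    result = list(phonemes)
    for index, sound in enumerate(result):
        if sound in VRDDHI:
            result[index] = VRDDHI[sound]
            break
    return result


print("".join(vrddhi_first_vowel(["ś", "i", "v", "a"])))        # śaiva
print("".join(vrddhi_first_vowel(["b", "u", "d", "dh", "a"])))  # bauddha
```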

Pāṇini compared with Chomsky’s grammar model

Standard Theory (1957-65)
Syntactic Structures (1957)
Aspects of the Theory of Syntax (1965)
- Base component [PS rules + lexicon] → deep structure (DS) → semantic component
- T-rules → surface structure (SS) → phonological component

Pāṇini compared with Context Free Grammar (CFG)

G = (V, T, P, S)
where
V = non-terminals
    NP, VP, Det, N, V
    Panini → aka, savarna, etc.
T = terminals
    the, dog, chases, cat
    Panini → a, i, u, bhu, ad, etc.
P = productions
    A → α, where A is a non-terminal (variable) and α is a string of symbols from (V ∪ T)*
    Panini: a → b if c
S = start symbol
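
The Paninian rule shape "a → b if c" can be sketched as a context-conditioned rewrite. The example below is a simplified illustration loosely modelled on the sandhi rule iko yaṇ aci (i, u become y, v before a vowel). The class `ContextRule`, the helper names, and the phoneme-list input are assumptions made for this sketch; it ignores rule ordering and leaves similar-vowel sequences untouched, since a different rule handles those.

```python
from dataclasses import dataclass
from typing import Callable

VOWELS = {"a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "e", "ai", "o", "au"}


@dataclass
class ContextRule:
    source: str                       # a
    target: str                       # b
    condition: Callable[[str], bool]  # c, tested on the following symbol


def before_dissimilar_vowel(*similar):
    """Build a condition: the next symbol is a vowel not listed in `similar`."""
    def check(following):
        return following in VOWELS and following not in similar
    return check


RULES = [
    ContextRule("i", "y", before_dissimilar_vowel("i", "ī")),
    ContextRule("u", "v", before_dissimilar_vowel("u", "ū")),
]


def rewrite(symbols, rules):
    """Apply the first matching rule to each symbol, looking one symbol ahead."""
    output = []
    for index, symbol in enumerate(symbols):
        following = symbols[index + 1] if index + 1 < len(symbols) else ""
        for rule in rules:
            if symbol == rule.source and rule.condition(following):
                symbol = rule.target
                break
        output.append(symbol)
    return output


# iti + api, written as one phoneme sequence
print("".join(rewrite(["i", "t", "i", "a", "p", "i"], RULES)))  # ityapi
```

Unlike a context-free production, the rule fires only when its environment matches, which is the main difference the slide points at.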

Pāṇini compared with Backus-Naur Form (BNF) notation for CFG

CFG in BNF
CF notation with minor format changes and some shorthand

<s> → <np> + <vp>
<np> → <det> + <n>
<vp> → <v> + <np>
<n> = dog, cat, it
<v> = chases, bites
<det> = the, a

Panini:
<p> → <sup> + <tin> (suptinantam padam)
<sp> → <….> + <s>
<kp> → <…> + <k> + <…>
<s> = ramah, shyamah, …
<k> = bhu, ad, …

CFG in BNF has simplified the definition of programming languages and compiler design

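The toy English grammar in the BNF slide is small enough to run. The sketch below encodes those productions as a Python dictionary and enumerates every sentence they derive. The names `GRAMMAR` and `expand` are hypothetical, and the approach only works because this grammar is finite (it has no recursion).

```python
from itertools import product

# Non-terminals map to lists of productions; anything else is a terminal.
GRAMMAR = {
    "s": [["np", "vp"]],
    "np": [["det", "n"]],
    "vp": [["v", "np"]],
    "det": [["the"], ["a"]],
    "n": [["dog"], ["cat"], ["it"]],
    "v": [["chases"], ["bites"]],
}


def expand(symbol):
    """Yield every list of words derivable from `symbol`."""
    if symbol not in GRAMMAR:
        yield [symbol]
        return
    for production in GRAMMAR[symbol]:
        choices = [list(expand(part)) for part in production]
        for combination in product(*choices):
            yield [word for words in combination for word in words]


sentences = [" ".join(words) for words in expand("s")]
print(len(sentences))  # 72
print(sentences[0])    # the dog chases the dog
```

The grammar also derives odd strings such as "the it bites a cat", a reminder that a context-free description this small overgenerates.
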
Panini >> Saussure >> Chomsky → Computational Linguistics (COLING) and Artificial Intelligence (AI)

The idea (obviously) came from Pāṇini
- Describing natural language (or programming languages) exactly using rewrite rules
- Panini → Saussure → Bloomfield → Harris → Chomsky → John Backus → Backus Normal Form → Peter Naur → Backus Naur Form (Panini Backus Form, Panini Naur Form)
  - Panini Backus Form suggested (Peter Zilahy Ingerman, 1967)
- Panini’s technique and abstract notation for accurately describing NL structure are very similar to those of John Backus (an IBM designer in the 1950s), who described a new programming language, IAL (International Algebraic Language, later ALGOL)

What is a language model?

- Statistical/ML models which understand/generate natural language texts/speech
- Trained using statistical or neural methods
- Require huge datasets
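
As a minimal sketch of the statistical kind of language model, the code below estimates bigram probabilities from a three-sentence toy corpus. The corpus, the function names, and the sentence-boundary tokens `<s>` and `</s>` are assumptions for illustration. Real models need vastly more data and some form of smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for previous, current in zip(tokens, tokens[1:]):
            counts[previous][current] += 1
    return counts


def bigram_probability(counts, previous, current):
    """Maximum-likelihood estimate of P(current | previous), 0.0 if unseen."""
    total = sum(counts[previous].values())
    return counts[previous][current] / total if total else 0.0


corpus = [
    "ramah vanam gacchati",
    "ramah grham gacchati",
    "sita vanam gacchati",
]
model = train_bigram_model(corpus)

print(bigram_probability(model, "ramah", "vanam"))     # 0.5
print(bigram_probability(model, "vanam", "gacchati"))  # 1.0
```

With only three sentences, any unseen word pair gets probability zero, which is one concrete reason such models require huge datasets.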

Types of language models
- Statistical
- Neural
- Large Language Models
  - massive training and data size
  - for performing more complex tasks
  - Examples
    - BERT (Bidirectional Encoder Representations from Transformers), developed by Google
    - GPT-3, GPT-4 (Generative Pre-trained Transformer), PaLM, PaLM 2… (Pathways LM), Llama 3.1, Llama 3.2, Llama 3.3
    - OpenAI, DeepSeek
- Open source? Free?
  - FOSS

Where is the competition headed?

- faster/real-time response
- better capacity for original writing
- better, human-like reasoning capacity
- creativity
- lower development cost
- lower access cost
- lower power consumption
- easy deployability/accessibility

More importantly

- must scale to newer languages/areas of knowledge
- must scale to resource-poor languages
- must be secure
- must address data security
- must be less prone to misuse/manipulation
- should solve problems rather than merely answer queries

SLM – Small Language Models
Most of the concerns expressed earlier can be addressed by an SLM
- Needs less data than an LLM
- Faster training and deployment
- Can be deployed and accessed comparatively easily
- Expertise in specific tasks rather than “all” tasks
- Better for data security

So what do we do?
- SLMs in each of 22 (or 1369+) languages, in each of the 18 domains, for each of the major tasks?

Or a more standardized LLM which scales?
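
To get a feel for the scale behind that question, here is a back-of-the-envelope count. The language and domain figures come from the slide; the number of major tasks is not given there, so `TASKS = 5` is purely an assumption for illustration.

```python
SCHEDULED_LANGUAGES = 22
ALL_LANGUAGES = 1369
DOMAINS = 18
TASKS = 5  # assumption: the slide does not say how many major tasks there are


def models_needed(languages, domains=DOMAINS, tasks=TASKS):
    """One small model per (language, domain, task) combination."""
    return languages * domains * tasks


print(models_needed(SCHEDULED_LANGUAGES))  # 1980
print(models_needed(ALL_LANGUAGES))        # 123210
```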

Towards a Sanskrit LLM

Why Sanskrit?

- Extraordinary language?
- Spawns or influences most Indian languages
- Genealogically linked to many Indo-European languages
- Common lexicon, grammar, themes, culture?
- Original data (minimum biases?)

What will it take to train a Sanskrit LLM?

- A decent-sized Sanskrit LLM has been unheard of so far.
- MuRIL (Multilingual Representations for Indian Languages) by Google covers 17 Indian languages, including Sanskrit.
- This model had limited scope and usage.
- A major problem encountered by Google NLU was collecting gold-standard data in the desired quantity. A major reason why they could not do so was the very nature of the Sanskrit language and the available text data.
- As a historical language spanning at least 5000 years and heavily synthetic in nature, Sanskrit poses the challenge of pre-processing potentially infinitely long strings into meaningful tokens.
- Without meaningful tokenization and other essential pre-processing tasks, no LLM can be usefully trained.
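
The tokenization problem described above can be illustrated with a toy sandhi splitter. The sketch below tries every position in a fused string, undoes one of four vowel-sandhi fusions, and keeps splits whose two halves are both in a small lexicon. The rule table, the lexicon, and the function name are assumptions made for this example; a real segmenter needs the full rule set, a large lexicon, and a way to rank competing analyses.

```python
# Surface sound -> possible (end of left word, start of right word) sources.
REVERSE_SANDHI = {
    "ā": [("a", "a")],   # na + asti  -> nāsti
    "e": [("a", "i")],   # na + iti   -> neti
    "ai": [("a", "e")],  # tatra + eva -> tatraiva
    "y": [("i", "")],    # iti + api  -> ityapi
}

LEXICON = {"na", "asti", "iti", "api", "tatra", "eva"}


def split_candidates(surface, lexicon):
    """Return every (left, right) word pair that could have fused into `surface`."""
    results = []
    for i in range(1, len(surface)):
        # Plain boundary with no sound change.
        if surface[:i] in lexicon and surface[i:] in lexicon:
            results.append((surface[:i], surface[i:]))
        # Boundary hidden inside a fused sound.
        for fused, sources in REVERSE_SANDHI.items():
            if surface.startswith(fused, i):
                for left_end, right_start in sources:
                    left = surface[:i] + left_end
                    right = right_start + surface[i + len(fused):]
                    if left in lexicon and right in lexicon:
                        results.append((left, right))
    return results


print(split_candidates("nāsti", LEXICON))     # [('na', 'asti')]
print(split_candidates("tatraiva", LEXICON))  # [('tatra', 'eva')]
```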

Data sources

An estimate of tokens from original Sanskrit sources

- Vedas (2.3 lac words; 1 lac = 100,000)
- Upavedas (7 lac words)
- Brāhmaṇa (5 lac words)
- Āraṇyaka (1 lac words)
- Upaniṣads (20 lac words)
- Vedāṅgas (76 lac words)
- Purāṇa (70 lac words)
- Smṛti Grantha (6 lac words)

- Āgama (10 lac words)
- Mahābhārata (1 lac ślokas, 20 lac words)
- Rāmāyana (24K ślokas, 4.8 lac words)
- Kośa (50 lac words)
- Kalā (25 lac words)
- Darśana (18 lac words)
- Sāhitya (415 lac words)

Current Sanskrit (3000 lac words)
- Crawled
- Collected
- Speech-to-text data
- Auto-translated
- Synthesized text
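
Adding up the estimates from these slides gives a rough corpus size. The figures below are the slides' own; the dictionary layout and variable names are assumptions, and word counts are only a proxy for tokens, since a subword tokenizer would produce a different number.

```python
LAC = 100_000  # 1 lac (lakh) = 100,000

# Word estimates for the original sources, as listed on the slides.
CLASSICAL_SOURCES = {
    "Vedas": 23 * LAC // 10,       # 2.3 lac
    "Upavedas": 7 * LAC,
    "Brahmana": 5 * LAC,
    "Aranyaka": 1 * LAC,
    "Upanisads": 20 * LAC,
    "Vedangas": 76 * LAC,
    "Purana": 70 * LAC,
    "Smrti Grantha": 6 * LAC,
    "Agama": 10 * LAC,
    "Mahabharata": 20 * LAC,
    "Ramayana": 48 * LAC // 10,    # 4.8 lac
    "Kosa": 50 * LAC,
    "Kala": 25 * LAC,
    "Darsana": 18 * LAC,
    "Sahitya": 415 * LAC,
}
CURRENT_SANSKRIT = 3000 * LAC

classical_total = sum(CLASSICAL_SOURCES.values())
grand_total = classical_total + CURRENT_SANSKRIT

print(f"{classical_total:,}")  # 73,010,000
print(f"{grand_total:,}")      # 373,010,000
```

On these estimates the original sources contribute about 73 million words, and the current-Sanskrit material raises the total to roughly 373 million.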

Veda-n
(what it can/cannot do)

- Linguistic analysis: analyze the morphological structure of Sanskrit words; facilitate the study of Sanskrit śāstric texts and their correct translation; support studies in linguistic typology, diachronic linguistics, and language evolution.
- Text processing and information retrieval: efficient keyword-based search and information retrieval from large Sanskrit text corpora, helping researchers locate relevant passages.
- Information extraction: extraction of named entities (very useful in historical/cultural research), relationships, and other structured information from Sanskrit documents.
- Machine translation: machine translation models can be trained.

- Text generation: can generate coherent sentences, predict missing words, and perform various language-related tasks. A Sanskrit model could also potentially be used for other Indian languages, as most of them have either evolved from Sanskrit or been heavily influenced by it.
- Topic modeling: helps researchers identify prevalent themes and topics in Sanskrit texts.
- Sentiment analysis: determines the emotional tone or attitude expressed in Sanskrit writings.
- Digital humanities and cultural studies, such as textual criticism: identify textual variants, manuscript differences, and the evolution of words and phrases over time in Sanskrit manuscripts.

- Language revival and preservation: supports reviving Sanskrit as a spoken language by helping learners understand word boundaries and grammatical structures.
- Text digitization: digitize ancient Sanskrit manuscripts, preserving them and making them accessible in digital formats.
- Educational tools: can be used in language-learning apps and tools to give learners accurate segmentation of words and sentences, aiding pronunciation and comprehension.
- Corpus linguistics: statistical analysis of language usage patterns, historical shifts, and linguistic phenomena in Sanskrit over time.

Current developments in India

- Anuvadini MT → MoE, for quick translation of engineering textbooks with CSTT vocabulary
- Current MEITY initiative: Bhashini
- PSA initiative: IC-MATS (Innovation Challenge for Machine Aided TS)

Indian academia (current major players)

IIT Madras (Chennai) → speech
IIT Delhi → OCR
IISc Bangalore → OLHWR
JNU, UoHyd, Jadavpur… → LT resources, tools
CDACs, IIIT Hyderabad, IIITM, some major universities → MT, resource creation
IIT Bombay → MT, wordnet
IIT Patna, etc.

Industry working in diverse areas

Microsoft – search engine, MT and all related tools
Google – search engine, MT and all related tools
SwiftKey – input mechanism
Amazon AI
Samsung
Adobe – document processing
Nuance – input mechanism
ezDI – medical data processing
Startups, SMEs

Google
- Search Engine
- Google Assistant
- Text readers
- Machine Translation (supports 21 Indian languages): Assamese, Bangla, Bhojpuri, Dogri, Gujarati, Hindi, Konkani, Maithili, Marathi, Mizo, Nepali, Odia, Punjabi, Sanskrit, Sindhi, Urdu, Kannada, Malayalam, Tamil, Telugu, Manipuri
- Document recognition
- MuRIL (Multilingual Representations for Indian Languages): a BERT model pre-trained on 17 Indian languages and their transliterated counterparts

Microsoft
- Bing Search
- Bing Translator (supports 16 Indian languages): Assamese, Bangla, Gujarati, Hindi, Konkani, Marathi, Nepali, Odia, Punjabi, Urdu, Malayalam, Tamil, Telugu, Maithili, Kannada, Sindhi
- Microsoft Research
- Microsoft Research India Lab
- Microsoft AI, Cortana

Amazon
- Amazon Alexa
- Amazon AI platform offers
  - Translation
  - Speech transcription
  - Medical data intelligence
  - Voice/text chatbots

Apple
- Siri: personal assistant for
  - iOS
  - Mac
  - other Apple devices using voice recognition

Samsung
- Samsung Research
- Bangalore HQ in India
- Noida, for Indian language embeddings in their devices

Start-ups and SMEs

Active in major areas of development such as:

- Legal domain
- Health
- Education

What does JNU do?

Work done at
School of Sanskrit and Indic Studies
Jawaharlal Nehru University

MEITY funded Projects

- Vidyapati – Hindi-Maithili MT
- Indian Languages Corpora Initiative (ILCI)
- Sanskrit-Hindi Machine Translation (SHMT)
- Shallow Parser Tools for Indian Languages (SPTIL)

Consultancies

Google
- Studying language variations (2022-24)
- Studying bilingual interactions (2020-21)

Nuance Technologies, 2016

Predictive mobile keyboard for Kashmiri

SwiftKey, 2015

Predictive mobile keyboards for lesser-used languages (Sanskrit, Santhali, Manipuri, Maithili, Sindhi)

Microsoft, USA, 2006

Online Handwriting Recognition for Hindi (ink samples, language and usage model, dictionaries)

LDC, University of Pennsylvania, 2011

Multimodal data in 8 languages (Indian English, Hindi, Urdu, Tamil, Bangla, Punjabi, Pushto, Dari)

Microsoft, USA, 2013

Microsoft Translator Hub
http://bing.com/translator
(done in collaboration with us at JNU)
More language pairs in progress: English-Maithili, Sanskrit-Hindi, English-Gujarati, English-Sindhi, Sanskrit-English

Our ‘Monster’ Tools
- Crawler
- Sanitizer
- Lexicographer

Work done by our research students

M.Phil students
Ph.D students
Current research highlights
- Text summarization
- Sanskrit-Java NLI
- ASR
- MT

We showcase our developments on international platforms

WILDRE
Workshop on Indian Language Data Resource & Evaluation
(Partially sponsored by Microsoft Research India - MSRI)
- WILDRE7 – Torino, Italy, May 2024
- WILDRE6 – Marseille, France, 20 June 2022
- WILDRE5 – Marseille, France, 16 May 2020 (held online on 24 May)
- WILDRE4 – Miyazaki, Japan (2018)
- WILDRE3 – Portorož, Slovenia (2016)
- WILDRE2 – Reykjavik (2014)
- WILDRE1 – Istanbul (2012)

Demo
http://sanskrit.jnu.ac.in
https://www.youtube.com/ColingAtJNU

Summarizing…..

Our progress depends on language-technology-driven AI

India’s strength in foundational disciplines (Sanskrit and Linguistics) and manpower capabilities can be used to create our own AI for critical areas

Most critical areas of development
- Governance
- Education
- Health
- Disaster Management
- Languages, Cultures, Knowledge Traditions
- … any other area where language applies

We need our own AI
- Using Sanskrit for the “common core” method
- Resource creation for all Indian languages, including technical vocabulary and content
- Educational technology
- India’s own data centers

But we have our own unique challenges

Diversity
Language variation and mixing
Paucity of Standards
Funding
Casual approach towards our languages
Teamwork
Lack of competition
Complexity of natural languages, more so in multilingual societies like India

And there are challenges in using AI too

Constantly evolving techniques in AI complicate the problem further
AI in the 90s vs AI now vs AI tomorrow

Thanks!

കൂ क କ
ಕ ਕ
क క
ક గ
ক ಕ







Tel: 91-11-26741308