Unit 1

NATURAL LANGUAGE PROCESSING

SYLLABUS
UNIT I: INTRODUCTION

Overview: Origins and challenges of NLP – Theory of Language – Features of Indian Languages – Issues in Font –
Models and Algorithms – NLP Applications.

UNIT II - MORPHOLOGY AND PARTS-OF-SPEECH

Phonology – Computational Phonology – Words and Morphemes – Segmentation – Categorization and Lemmatisation –
Word Form Recognition – Valency – Agreement – Regular Expressions – Finite State Automata – Morphology –
Morphological issues of Indian Languages – Transliteration.

UNIT III - PROBABILISTIC MODELS

Probabilistic Models of Pronunciation and Spelling – Weighted Automata – N-Grams – Corpus Analysis – Smoothing
– Entropy – Parts-of-Speech – Taggers – Rule based – Hidden Markov Models – Speech Recognition.

UNIT IV - SYNTAX

Basic Concepts of Syntax – Parsing Techniques – General Grammar rules for Indian Languages – Context Free
Grammar – Parsing with Context Free Grammars – Top-Down Parser – Earley Algorithm – Features and Unification
– Lexicalised and Probabilistic Parsing.

UNIT V - SEMANTICS AND PRAGMATICS (6 hours)

Representing Meaning – Computational Representation – Meaning Structure of Language – Semantic Analysis –
Lexical Semantics – WordNet – Pragmatics – Discourse – Reference Resolution – Text Coherence – Dialogue
Conversational Agents.

UNIT – 1

Overview: Origins and challenges of NLP – Theory of Language – Features of Indian Languages – Issues in Font –
Models and Algorithms – NLP Applications.

NATURAL LANGUAGE PROCESSING:

Q. Define Natural Language Processing.

• Natural language processing (NLP) is the ability of a computer program to understand human language as it's
spoken and written – referred to as natural language. It's a component of artificial intelligence (AI).
• NLP has its roots in the field of linguistics. It has a variety of real-world applications in numerous fields,
including medical research, search engines and business intelligence.
• It plays a role in chatbots, voice assistants, text-based scanning programs, translation applications and
enterprise software that aids in business operations, increases productivity and simplifies different processes.

OVERVIEW OF NLP TASK:

Q. Give general approaches to natural language processing. (or) Write a short note on NLP.

• Natural language processing (NLP) is the ability of a computer program to understand human speech as it is
spoken. NLP is a component of artificial intelligence (AI).
• The development of NLP applications is challenging because computers traditionally require humans to
“speak” to them in a programming language that is precise, unambiguous and highly structured, or perhaps
through a limited number of clearly enunciated voice commands. Human speech, however, is not always
precise: it is often ambiguous, and its linguistic structure can depend on many complex variables, including
slang, regional dialects and social context.
• Current approaches to NLP are based on machine learning, a type of artificial intelligence that examines and
uses patterns in data to improve a program’s own understanding. Much applied NLP work has focused on
search, especially enterprise search.
• Common NLP tasks in software programs today include:
(1) Sentence segmentation, part-of-speech tagging and parsing.
(2) Deep analytics.
(3) Named entity extraction.
(4) Co-reference resolution.
• The advantage of natural language processing can be seen when considering the following two statements:
“Cloud computing insurance should be part of every service level agreement” and
“A good SLA ensures an easier night's sleep – even in the cloud.”
If you use natural language processing for search, the program will recognize that cloud computing is an
entity, that cloud is an abbreviated form of cloud computing and that SLA is an industry acronym for service
level agreement.
• A long-term aspiration sometimes stated for NLP is to let people instruct computers in natural language
rather than in specialized languages such as Java, Ruby or C.
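
The search example above can be illustrated with a minimal Python sketch. The alias table and the function names (normalize, shared_entities) are assumptions made for this illustration; a real search system would learn or curate such mappings rather than hard-code them.

```python
import re

# Hypothetical alias table: surface form -> canonical entity name.
# The lookahead stops "cloud computing" from being expanded a second time.
ALIASES = [
    (re.compile(r"\bsla\b"), "service level agreement"),
    (re.compile(r"\bcloud\b(?! computing)"), "cloud computing"),
]

def normalize(text):
    """Lowercase the text and expand known abbreviations and acronyms."""
    text = text.lower()
    for pattern, canonical in ALIASES:
        text = pattern.sub(canonical, text)
    return text

def shared_entities(a, b, entities=("cloud computing", "service level agreement")):
    """Return the entities mentioned in both texts after normalization."""
    na, nb = normalize(a), normalize(b)
    return [e for e in entities if e in na and e in nb]

s1 = "Cloud computing insurance should be part of every service level agreement"
s2 = "A good SLA ensures an easier night's sleep - even in the cloud."
print(shared_entities(s1, s2))
```

After normalization, both statements mention the same two entities even though they share almost no surface words, which is the behaviour a keyword-only search would miss.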

EVOLUTION OF NLP SYSTEMS:

Q. Discuss the evolution of NLP systems (or) Give a brief history of NLP.

• History of NLP: Work related to NLP started with machine translation (MT) in the 1950s. In 1950, Alan
Turing proposed what is today called “the Turing test”. It tests the ability of a machine program to hold a
written conversation with a human.
• The program should be written so well that one would find it difficult to determine whether the conversation
is with a machine or with another person. During the same period, work on cryptography and language
translation took place. Later on, syntactic structures came up in linguistics. Further, sentences were
considered with knowledge augmentation and semantics.
• In the 1960s, ELIZA, one of the best-known early NLP systems, was developed and gained popularity. It
simulated a psychotherapist. Later, case grammars came up. More recently, machine learning approaches
have brought about a complete revolution in NLP. Many NLP systems have been developed to date, and
competitions based on the Turing test are still organized.

PRAGMATIC ANALYSIS:

Q. What is pragmatic analysis in natural language processing?

Pragmatic analysis in Natural Language Processing (NLP) focuses on understanding the context and purpose behind
language use, going beyond the literal meaning of words and sentences. It involves interpreting the intended meaning,
implications, and effects of language in a given context, often considering aspects like speaker intention, the
relationship between speakers, and situational factors.

Pragmatics has not been the central concern of most NLP systems. The context and purpose of an utterance are usually
considered only after ambiguities arise at the syntactic or semantic level. One problem in which pragmatics has been
used in this kind of “support” capacity is the resolution of ambiguous noun phrases.

COMPONENTS OF NLP:

Q. What are the two components of NLP?

There are two components of NLP:

• Natural Language Understanding: Mapping the given input in the natural language into a useful representation.
Different levels of analysis are required: morphological analysis, syntactic analysis, semantic analysis and
discourse analysis.
• Natural Language Generation: Producing output in the natural language from some internal representation.
Different levels of synthesis are required: deep planning (what to say) and syntactic generation.
• NL Understanding is much harder than NL Generation, but both of them are hard.
• Planning: Planning problems are hard, nontrivial problems. Methods that focus on ways of decomposing the
original problem into appropriate subparts, and on ways of handling interactions among the subparts during
the problem-solving process, are often called planning. Planning refers to the process of computing several
steps of a problem-solving procedure before executing any of them.

MAJOR METHODS OF NLP ANALYSIS:

There are several main techniques used in analyzing natural language. Some of them can be briefly described as
follows:

1. Pattern matching: In natural language processing, it's often better to understand whole sentences at once rather
than piece by piece. This approach uses patterns of words to interpret sentences. However, analyzing sentences deeply
requires many patterns, even for specific topics. To manage this, we can break down sentences into smaller parts and
match each part step by step (hierarchical pattern matching). Another method to simplify this process is by using basic
meanings (semantic primitives) instead of individual words.
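
As a rough illustration of pattern matching, the sketch below interprets a whole utterance by matching it against word patterns, in the style of ELIZA. The rules, the replies and the function name respond are invented for this example and are not taken from any particular system.

```python
import re

# Hypothetical pattern/response rules, tried in order.
RULES = [
    (re.compile(r"\bi need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.I), "Tell me more about your {0}."),
]
DEFAULT = "Please go on."

def respond(utterance):
    """Return the reply of the first rule whose pattern matches the utterance."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Reuse the captured words, minus any trailing punctuation.
            return template.format(match.group(1).rstrip(".!?"))
    return DEFAULT

print(respond("I need a holiday."))
print(respond("Hello"))
```

Even this small domain needs one pattern per sentence shape, which is why the number of patterns grows quickly for deeper analysis.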

2. Syntactically driven parsing: Syntactically driven parsing is a method in natural language processing (NLP)
where the structure of a sentence is analysed according to the rules of syntax, which are the rules that govern the
structure of sentences. The goal is to understand the grammatical structure of the sentence, which can help in
determining its meaning. Parsing involves breaking down a sentence into its constituent parts and identifying the
grammatical relationships between these parts.
Syntactically driven parsing focuses on the syntax of the sentence, using grammatical rules to guide the parsing
process.

3. Semantic grammars: Semantic grammars combine both the meaning (semantics) and the structure (syntax) of
language in their analysis. While syntactically driven parsing focuses mainly on the grammatical structure of
sentences, semantic grammars also consider the meaning of the words and phrases when defining categories. This
means that the analysis takes into account both how words are put together and what they mean, providing a more
comprehensive understanding of natural language.

4. Case frame instantiation: Case frame instantiation is a parsing technique that focuses on identifying the roles (or
"cases") that words and phrases play in relation to the verb of a sentence, such as agent, object, and instrument. This
method is advantageous because:
(a) Recursive Nature: It can break down sentences into smaller parts repeatedly, making the analysis more
manageable.
(b) Bottom-Up Recognition: It starts by identifying key components of the sentence (like nouns and verbs) from the
bottom up.
(c) Top-Down Instantiation: It then uses these identified components to help fill in and structure the rest of the
sentence from the top down.
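
The two steps of case frame instantiation can be sketched in Python as follows. This is a toy under stated assumptions: the verb lexicon CASE_FRAMES, the preposition table ROLE_MARKERS and the function instantiate are hypothetical, and the sketch only handles simple sentences of the form "agent verb object [preposition filler]".

```python
# Hypothetical mini-lexicon: each verb lists the case roles its frame expects.
CASE_FRAMES = {
    "opened": ["agent", "object", "instrument"],
    "gave": ["agent", "object", "recipient"],
}
# Prepositions assumed to mark a case role in this sketch.
ROLE_MARKERS = {"with": "instrument", "to": "recipient"}
DETERMINERS = {"the", "a", "an"}

def instantiate(sentence):
    """Fill the case frame of the first known verb found in the sentence."""
    words = [w for w in sentence.lower().rstrip(".").split() if w not in DETERMINERS]
    # Bottom-up step: recognise the verb, which selects the frame.
    # (Raises StopIteration if the sentence contains no known verb.)
    verb_idx = next(i for i, w in enumerate(words) if w in CASE_FRAMES)
    verb = words[verb_idx]
    frame = {"verb": verb, **{role: None for role in CASE_FRAMES[verb]}}
    # Top-down step: fill the slots that the chosen frame expects.
    frame["agent"] = " ".join(words[:verb_idx]) or None
    role, filler = "object", []
    for w in words[verb_idx + 1:] + [None]:  # None flushes the last filler
        if w is None or w in ROLE_MARKERS:
            if filler and role in frame:
                frame[role] = " ".join(filler)
            role, filler = ROLE_MARKERS.get(w), []
        else:
            filler.append(w)
    return frame

print(instantiate("John opened the door with a key."))
```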

HISTORY OF NATURAL LANGUAGE PROCESSING: (for 10 marks)

Q. Give a detailed account on the history of NLP.

Natural Language Processing (NLP) has a fascinating history that spans several decades.

THE BIRTH OF NLP (1950S-1960S)

• In 1950, Alan Turing published his famous article “Computing Machinery and Intelligence”, which proposed
what is now called the Turing test as a criterion of intelligence.
• This criterion depends on the ability of a computer program to impersonate a human in a real-time written
conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the
basis of the conversational content alone, between the program and a real human.
• In 1957, Noam Chomsky’s Syntactic Structures revolutionized Linguistics with ‘universal grammar’, a
rule-based system of syntactic structures.
• The roots of NLP can be traced back to the 1950s, when computer scientists first attempted to teach machines
how to understand and generate human language.
• Early efforts involved creating dictionaries and rules to translate languages, as in the Georgetown-IBM
experiment, while IBM’s “Shoebox” recognized a small vocabulary of spoken words and digits. However,
progress was slow due to limited computational power.

THE RULE-BASED APPROACH (1970S-1980S)

• The 1970s and 1980s saw the rise of rule-based systems. Researchers developed elaborate sets of grammatical
rules to analyze and generate text.
• While this approach worked for simple sentences, it struggled with the complexities of natural language,
leading to the famous “knowledge acquisition bottleneck”.

STATISTICAL REVOLUTION (1990S)

• The 1990s brought a paradigm shift with statistical methods and machine learning. Researchers began using
large corpora of text to train models that could infer grammar and semantics.
• Hidden Markov Models (HMMs) and Maximum Entropy Models (MaxEnt) emerged as powerful tools for
tasks like Part-Of-Speech tagging and Named Entity Recognition.

RISE OF MACHINE LEARNING (2000S)

• The 2000s witnessed a proliferation of machine learning techniques in NLP.
• Support Vector Machines (SVMs) and Conditional Random Fields (CRFs) became popular for tasks like
sentiment analysis and machine translation.
• The availability of labeled datasets like the Penn Treebank and the development of open-source libraries like
NLTK and Apache OpenNLP accelerated progress.

DEEP LEARNING REVOLUTION (2010S)

• The 2010s marked a transformative period for NLP, thanks to deep learning.
• Neural networks, particularly Recurrent Neural Networks (RNNs) and Convolutional Neural Networks
(CNNs), demonstrated remarkable performance on tasks like language modeling and text classification.
• The introduction of word embeddings, such as Word2Vec and GloVe, improved the representation of words.

THE LLM AND GPT-3 REVOLUTION (2020)

• In 2020, the NLP landscape saw a major shift with the introduction of large language models (LLMs) such as
GPT-3 (Generative Pre-trained Transformer 3).
• These models, based on the Transformer architecture, exhibited unprecedented language understanding and
generation capabilities.
• GPT-3, developed by OpenAI, became famous for its ability to perform a wide range of NLP tasks, including
text generation, translation, summarization, and even code generation.
• It has 175 billion parameters, making it one of the largest language models at the time.

CHATGPT IN 2022

• Building upon the success of models like GPT-3, OpenAI introduced ChatGPT in 2022. ChatGPT was
fine-tuned from a model in the GPT-3.5 series for conversational AI applications.
• It is designed to have more interactive and dynamic conversations with users, making it a powerful tool for
chatbots and virtual assistants.

HISTORY OF NATURAL LANGUAGE PROCESSING TIMELINE: (for 5 marks)

1950: Alan Turing proposes the imitation game, now known as the Turing test, as a criterion of machine intelligence.

The 1950s: The beginnings of NLP research and development.

1954: The Georgetown-IBM experiment uses an IBM 701 computer for Russian-English translation, one of the
earliest machine translation experiments.

The 1960s: The development of linguistic theories and formal grammar that influence early NLP work.

The 1970s: The shift towards rule-based systems in NLP.

1972: Terry Winograd develops SHRDLU, an effective NLP system that can manipulate blocks in a virtual world
using natural language commands.

The 1980s: Early work on statistical approaches in NLP.

1989: The Hidden Markov Model Toolkit (HTK) development helps researchers build statistical models for speech
recognition.

The 1990s: Continued advancements in statistical approaches and the introduction of probabilistic models such as
probabilistic context-free grammar (PCFG).

The 2000s: Growing interest in machine learning and statistical methods in NLP.

The 2010s: A resurgence of interest in NLP driven by advancements in deep learning and neural networks.

2013: The introduction of Word2Vec, a word embedding technique that represents words as dense vectors, improves
NLP models’ performance.

2016: Google introduces its neural network-based machine translation system, Google Neural Machine
Translation (GNMT), which significantly improves translation quality.

2017: The Transformer model architecture is introduced. It goes on to power models like BERT (Bidirectional Encoder
Representations from Transformers) and GPT (Generative Pretrained Transformer), which achieve state-of-the-art
results in a wide range of NLP tasks.

2020: OpenAI releases GPT-3 (Generative Pretrained Transformer 3), one of the most significant language
models to date, which can generate coherent and contextually relevant text.

2021: Advancements in zero-shot and few-shot learning, enabling models to perform well on tasks without extensive
task-specific training data.

LEVELS AND TASKS OF NLP:

Q. Briefly explain the NLP tasks and write the different levels of NLP. (or) Explain the syntactic and semantic analysis
in NLP.

The NLP problem can be divided into two tasks:

(1) Processing written text, using lexical, syntactic and semantic knowledge of the language as well as the required
real-world information.
(2) Processing spoken language, using all the information needed above plus additional knowledge about phonology, as
well as enough added information to handle the further ambiguities that arise in speech.

LEVELS OF NLP:

(1) Phonology:
It is concerned with the interpretation of speech sounds within and across words.
(2) Morphology:
It deals with how words are constructed from more basic meaning units called morphemes. A morpheme is
the primitive unit of meaning in a language. For example, “truth+ful+ness”.
(3) Syntax:
It concerns how words can be put together to form correct sentences and determines what structural role each
word plays in the sentence and what phrases are subparts of other phrases. For example, “the dog ate my
homework”.
(4) Semantics:
It is the study of the meaning of words and how these meanings combine in sentences to form sentence meaning.
It is the study of context-independent meaning. For example, plant: industrial plant / living organism.
(5) Pragmatics:
It is concerned with how sentences are used in different situations and how use affects the interpretation of
the sentence.
(6) Discourse:
It deals with how the immediately preceding sentences affect the interpretation of the next sentence. For
example, interpreting pronouns and interpreting the temporal aspects of the information.
(7) Reasoning:
To produce an answer to a question that is not explicitly stored in a database, a Natural Language Interface to
Database (NLIDB) carries out reasoning based on data stored in the database. For example, consider a
database that holds student academic information, and a user who poses a query such as: ‘Which student is likely
to fail in the Science subject?’ To answer the query, the NLIDB needs a domain expert to narrow down the
reasoning process.

KNOWLEDGE IN LANGUAGE PROCESSING

A natural language understanding system must have detailed information about what the words mean, how words
combine to form sentences, how word meanings combine to form sentence meanings and so on. What distinguishes
these language processing applications from other data processing systems is their use of knowledge of language.

For example, consider the Unix wc program, which is used to count the total number of bytes, words, and lines in a text
file. When used to count lines and bytes, wc is an ordinary data processing application. However, when it is used to count
the words in a file, it requires information about what it means to be a word, and thus becomes a language processing
system. Of course, wc is an extremely simple system with an extremely limited and impoverished knowledge of language.
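
The point about word counting can be made concrete with a small Python sketch. The function wc below only approximates the Unix utility, and linguistic_word_count is a hypothetical alternative that encodes a slightly richer idea of what a word is; both names are chosen for this illustration.

```python
import re

def wc(text):
    """Return (lines, words, bytes), roughly as the Unix wc utility counts them."""
    lines = text.count("\n")
    words = len(text.split())            # a "word" is any run of non-whitespace
    n_bytes = len(text.encode("utf-8"))
    return lines, words, n_bytes

def linguistic_word_count(text):
    """Count alphabetic tokens only, ignoring numbers and bare punctuation."""
    return len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text))

sample = "Well -- it isn't 42.\n"
print(wc(sample))                     # (1, 5, 21)
print(linguistic_word_count(sample))  # 3
```

The two word counts differ because each function embodies a different definition of "word", which is exactly the knowledge of language the paragraph refers to.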

The different forms of knowledge required for natural language understanding are given below:

Phonetic and Phonological Knowledge:

Phonetics is the study of language at the level of sounds while phonology is the study of combination of sounds into
organized units of speech, the formation of syllables and larger units. Phonetic and phonological knowledge are
necessary for speech-based systems as they are concerned with how words are related to the sounds that realize them.

Morphological Knowledge:

Morphology is concerned with word formation. It is the study of the patterns of formation of words by the combination
of sounds into minimal distinctive units of meaning called morphemes. Morphological knowledge deals with how words
are constructed from morphemes.
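
As a rough sketch of how a word such as “truthfulness” might be broken into morphemes, the Python function below greedily strips suffixes from the right. The suffix list and the min_stem threshold are assumptions for this example; real morphological analyzers rely on much richer rules and lexicons, and this naive approach over-segments words such as "business".

```python
# Tiny, hand-picked suffix list (an assumption for this sketch).
SUFFIXES = ["ness", "ful", "less", "ly", "ing", "ed", "s"]

def segment(word, min_stem=3):
    """Greedily strip known suffixes from the right; return [stem, suffix, ...]."""
    suffixes = []
    stripped = True
    while stripped:
        stripped = False
        for suf in SUFFIXES:
            # Only strip if a stem of at least min_stem letters would remain.
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                suffixes.insert(0, suf)
                word = word[: -len(suf)]
                stripped = True
                break
    return [word] + suffixes

print(segment("truthfulness"))  # ['truth', 'ful', 'ness']
```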

Syntactic Knowledge:

Syntax is the level at which we study how words combine to form phrases; phrases combine to form clauses and
clauses join to make sentences. Syntactic analysis concerns sentence formation. It’s concerned with how words can be
put together to form correct sentences. It also determines what structural role each word plays in the sentence and
what phrases are subparts of what other phrases.

Semantic Knowledge:

It deals with meanings of the words and sentences. This is the study of context independent meaning that is the
meaning a sentence has, no matter in which context it is used. Defining the meaning of a sentence is very difficult due
to the ambiguities involved.

Pragmatic Knowledge:

Pragmatics is an extension of meaning, or semantics. Pragmatics is concerned with the contextual aspects of
meaning in particular situations. It is concerned with how sentences are used in different situations and how use affects
the interpretation of the sentence.

Discourse Knowledge:

Discourse concerns connected sentences. It is the study of chunks of language which are bigger than a single sentence.
Discourse knowledge is concerned with inter-sentential links, that is, how the immediately preceding sentences affect
the interpretation of the next sentence. Discourse knowledge is important for interpreting pronouns and temporal aspects
of the information conveyed.

World Knowledge:

World knowledge is everyday knowledge that all speakers share about the world. It includes general knowledge
about the structure of the world and what each language user must know about the other user’s beliefs and goals. This
is essential for better language understanding.

STAGES IN NLP:

Q. Explain about the different stages of NLP.

There are five phases of NLP:

1. Lexical Analysis
2. Syntactic Analysis
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis

Figure: General five steps of NLP

1. Lexical analysis:

• It is the first stage in NLP. It is also known as morphological analysis. It consists of identifying and analyzing
the structure of words.
• The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides
the whole chunk of text into paragraphs, sentences, and words.
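
The division of text into paragraphs, sentences and words can be sketched with a few regular expressions. The splitting rules and function names below are simplifying assumptions for this example; in particular, the sentence rule is naive and would break on abbreviations such as "Dr.".

```python
import re

def split_paragraphs(text):
    """Paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def split_words(sentence):
    """Tokens are words (with an optional internal apostrophe) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "The dog ate my homework. It wasn't sorry!\n\nI was."
for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        print(split_words(sentence))
```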

2. Syntactic analysis:

• Syntactic analysis consists of analyzing the words in a sentence for grammar and arranging the words in a way
that shows the relationships among them. For example, a sentence such as “The school goes to boy” is
rejected by an English syntactic analyzer.
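
A minimal sketch of how a syntactic analyzer could reject “The school goes to boy” is given below. The toy grammar and lexicon are assumptions made for this example, and the recognizer is a simple top-down one, not a full parser for English.

```python
# Toy lexicon and context-free grammar (assumptions for this sketch).
LEXICON = {
    "the": "Det", "a": "Det", "my": "Det",
    "boy": "N", "school": "N", "dog": "N", "homework": "N",
    "goes": "V", "ate": "V",
    "to": "P",
}
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "PP"], ["V", "NP"], ["V"]],
    "PP": [["P", "NP"]],
}

def parse(symbol, tags, start):
    """Return the set of positions where `symbol` can end if it starts at `start`."""
    if symbol not in GRAMMAR:  # part-of-speech tag: must match the input tag
        return {start + 1} if start < len(tags) and tags[start] == symbol else set()
    ends = set()
    for production in GRAMMAR[symbol]:
        positions = {start}
        for part in production:
            positions = {e for p in positions for e in parse(part, tags, p)}
        ends |= positions
    return ends

def is_grammatical(sentence):
    """True if the whole sentence can be derived from S in the toy grammar."""
    words = sentence.lower().rstrip(".").split()
    if any(w not in LEXICON for w in words):
        return False
    tags = [LEXICON[w] for w in words]
    return len(tags) in parse("S", tags, 0)

print(is_grammatical("The boy goes to the school"))  # True
print(is_grammatical("The school goes to boy"))      # False
```

In this grammar the second sentence fails because a noun phrase needs a determiner, so "to boy" cannot form a prepositional phrase.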

3. Semantic analysis:

• The syntactic analyzer transforms linear sequences of words into structures, and semantic analysis assigns
meanings to those structures. It shows how the words are associated with each other. Semantics focuses only
on the literal meaning of words, phrases, and sentences, that is, the dictionary meaning drawn from the given
text. Each structure produced by the syntactic analyzer is assigned a meaning.
• The text is checked for meaningfulness. This is done by mapping syntactic structures to objects in the task
domain. E.g., “colorless green idea” would be rejected by the semantic analyzer, because something colorless
cannot also be green.

4. Discourse integration:

• The meaning of any sentence depends upon the meaning of the sentence just before it. It can also
influence the meaning of the sentence that immediately follows. For example, in the sentence “He wanted
that”, the word “that” depends upon the prior discourse context.

5. Pragmatic analysis:

• Pragmatic analysis is concerned with the overall communicative and social content and its effect on
interpretation. It means abstracting or deriving the meaningful use of language in situations. In this analysis,
what was said is reinterpreted as what was actually meant. It involves deriving those aspects of language which
require real-world knowledge. E.g., “close the window?” should be interpreted as a request instead of an
order.

CHALLENGES OF NLP

CHALLENGES OF NLP FOR ML:

Below are the steps involved and some challenges that are faced in the machine learning process for NLP:

1. Challenge: Breaking the sentence:

Formally referred to as “sentence boundary disambiguation”, this breaking process is no longer difficult to achieve,
but it is nonetheless a critical process, especially in the case of highly unstructured data that includes structured
information. A breaking application should be intelligent enough to separate paragraphs into their appropriate
sentence units. Highly complex data might not always be available in easily recognizable sentence forms. This data
may exist in the form of tables, graphics, notations, page breaks, etc., which need to be appropriately processed for the
machine to derive meanings in the same way a human would when interpreting text.
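
To show why sentence breaking needs disambiguation, here is a minimal rule-based sketch in Python. The abbreviation list and the boundary rule (end punctuation, whitespace, then an uppercase letter or quote) are assumptions for this example; production systems typically use larger lists or trained models.

```python
import re

# Assumed abbreviation list for this sketch.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "vs"}

def split_sentences(text):
    """Split text at '.', '!' or '?' unless the period belongs to a known abbreviation."""
    sentences, start = [], 0
    # Candidate boundary: end punctuation, whitespace, then an uppercase letter or quote.
    for m in re.finditer(r"[.!?]\s+(?=[A-Z\"'])", text):
        before = text[start:m.start()].split()
        prev = before[-1].lower() if before else ""
        if text[m.start()] == "." and prev in ABBREVIATIONS:
            continue  # the period ends an abbreviation, not a sentence
        sentences.append(text[start:m.start() + 1].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

example = "Dr. Smith paid $3.50 for coffee. He left at 5 p.m. yesterday. Was it good?"
for sentence in split_sentences(example):
    print(sentence)
```

The periods in "Dr.", "$3.50" and "p.m." are not treated as boundaries, so the example yields three sentences.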

Solution: Tagging the parts of speech (POS) and generating dependency graphs

NLP applications employ a set of POS tagging tools that assign a POS tag to each word or symbol in a given text.
Subsequently, the relationship of each word to the other words in the sentence is captured by a dependency graph,
generated in the same procedure. Those POS tags can be further processed to create meaningful single or compound
vocabulary terms.

2. Challenge: Building the appropriate vocabulary:

Using these POS tags and dependency graphs, a powerful vocabulary can be generated and subsequently interpreted
by the machine in a way comparable to human understanding.
Example: Consider the following paragraph:

“All employees are responsible for the management of risk, with the ultimate accountability residing with the Board.
We have a strong risk culture, which is embedded through clear and consistent communication and appropriate
training for all employees. A comprehensive risk management framework is applied throughout the Group, with
governance and corresponding risk management tools. This framework is underpinned by our risk culture and
reinforced by the HSBC Values.” -HSBC annual report 2017

Sentences are generally simple enough to be parsed by a basic NLP program. But to be of real value, an algorithm
should also be able to generate, at a minimum, the following vocabulary terms:

Employees; Management of risk; Ultimate accountability; Board; Strong risk culture; Clear and consistent
communication; Appropriate training for all employees; Comprehensive risk management framework; Governance
and corresponding risk management tools; Framework; Risk culture; HSBC values

Solution: Unfortunately, most NLP software applications do not produce such a sophisticated vocabulary.

3. Challenge: Linking different components of vocabulary:

Recently, new approaches have been developed that can execute the extraction of the linkage between any two
vocabulary terms generated from the document (or “corpus”).

Solution: Word2vec, a vector-space based model, assigns a vector to each word in a corpus; those vectors ultimately
capture each word’s relationship to closely occurring words or sets of words. But statistical methods like Word2vec are
not sufficient to capture either the linguistic or the semantic relationships between pairs of vocabulary terms.

Example: In the above-stated example, “All employees are responsible for the management of risk, with the ultimate
accountability residing with the Board”, two vocabulary terms, “Board” and “management of risk” are connected with
the Board having ultimate accountability, but since these two terms are statistically distant, the extent of the
relationship bond between this pair cannot be ascertained, neither linguistically nor semantically.

A more sophisticated algorithm is needed to capture the relationship bonds that exist between vocabulary terms and
not just words.
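
The vector-space idea can be illustrated without Word2vec itself. The sketch below is a simpler count-based stand-in, not the Word2vec algorithm: it builds co-occurrence vectors from a made-up three-sentence corpus and compares them with cosine similarity. The corpus, the window size and the function names are assumptions for this example.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

corpus = ["the board manages risk", "the board oversees risk", "employees enjoy training"]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["manages"], vecs["oversees"]))  # close to 1.0: same neighbours
print(cosine(vecs["manages"], vecs["enjoy"]))     # 0.0: no shared neighbours
```

Words that never fall inside the same window share no counts, which mirrors the limitation described above for statistically distant terms.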

4. Challenge: Setting the context:

One of the most important and challenging tasks in the entire NLP process is to train a machine to derive context from
a discussion within a document. Consider the following two sentences:

“I enjoy working in a bank.”


“I enjoy working near a river bank.”

The context of these sentences is quite different.

Solution: There are several methods today to help train a machine to understand the differences between the
sentences. Some of the popular methods use custom-made knowledge graphs where, for example, both possibilities
would occur based on statistical calculations. When a new document is under observation, the machine would refer to
the graph to determine the setting before proceeding.

One challenge in building the knowledge graph is domain specificity. Knowledge graphs cannot, in a practical sense,
be made to be universal.

Example: In the example above, “enjoy working in a bank” suggests “work, or job, or profession”, while “enjoy working
near a river bank” could refer to any type of work or activity that can be performed near a river bank.

Two sentences with totally different contexts in different domains might confuse the machine if forced to rely solely
on knowledge graphs. It is therefore critical to enhance the methods used with a probabilistic approach in order to
derive context and proper domain choice.
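
One simple way to combine context clues with a probabilistic fallback is sketched below for the two “bank” sentences. The signature words and the prior probabilities are invented for this illustration; they stand in for what a knowledge graph and corpus statistics would provide in practice.

```python
# Hypothetical sense inventory: signature words that suggest each sense of "bank".
SENSES = {
    "financial institution": {"money", "account", "loan", "teller", "deposit"},
    "river side": {"river", "water", "shore", "fishing"},
}
# Assumed prior probabilities, used when the context gives no evidence.
PRIOR = {"financial institution": 0.7, "river side": 0.3}

def disambiguate(sentence):
    """Pick the sense with the most signature words in the sentence; break ties by prior."""
    context = set(sentence.lower().replace(".", "").split())

    def score(sense):
        return (len(SENSES[sense] & context), PRIOR[sense])

    return max(SENSES, key=score)

print(disambiguate("I enjoy working in a bank."))          # financial institution
print(disambiguate("I enjoy working near a river bank."))  # river side
```

The first sentence contains no signature word, so the prior decides; the second is settled by the word "river".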

5. Challenge: Extracting semantic meanings:

Linguistic analysis of vocabulary terms might not be enough for a machine to correctly apply learned knowledge. To
successfully apply learning, a machine must understand further, the semantics of every vocabulary term within the
context of the documents.

By way of example, consider two sentences:

“Under US GAAP, gains and losses from AFS assets are included in net income.”
“Under IFRS, gains and losses from AFS assets are included in comprehensive income.”

Both sentences have the context of gains and losses in proximity to some form of income, but the resultant
information needed to be understood is entirely different between these sentences due to differing semantics. It is a
combination, encompassing both linguistic and semantic methodologies that would allow the machine to truly
understand the meanings within a selected text.

6. Extracting named entities (often referred to as Named Entity Recognition = NER):

Challenge: The next big challenge is to successfully execute NER, which is essential when training a machine to
distinguish between simple vocabulary and named entities. In many instances, these entities are surrounded by dollar
amounts, places, locations, numbers, time, etc. It is critical to make and express the connections between each of
these elements; only then may a machine fully interpret a given text.

Solution: This problem, however, has been addressed to a large degree by well-known NLP toolkits such as
Stanford CoreNLP and AllenNLP.

7. Use Case: Transforming unstructured data into structured format:

Challenge: Putting the unstructured data into a format that could be reusable for analysis. Historically, the same task
has been done only manually by humans.

Example: Consider the following example that contains a named entity, an event, a financial element and its values
under different time scales.

“The recent developments in technology have enabled the stock price of Apple to rise by 20% to $168 as at Feb 20,
2018 from $140 in Q3 2017.”

Think of this sentence broken down into a structure with the following fields: a named entity (Apple), a financial
element (stock price), an event (a rise of 20%), and values on two time scales ($168 as at Feb 20, 2018 and $140 in
Q3 2017).

This is extremely challenging through linguistics alone. Not all sentences are written in a single fashion, since authors
follow their unique styles. While linguistics is an initial approach toward extracting the data elements from a document,
it doesn’t stop there. The semantic layer that understands the relationship between data elements and their values and
surroundings has to be machine-trained too, to suggest a modular output in a given format.
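
As a rough illustration, the example sentence can be turned into a structured record with a single hand-written pattern. The field names and the regular expression are assumptions made for this sketch, and the pattern is deliberately brittle: a rephrased sentence would not match, which is the limitation of a purely linguistic approach described above.

```python
import re

# Hand-written pattern for this one sentence shape (hypothetical field names).
PATTERN = re.compile(
    r"(?P<element>stock price) of (?P<entity>[A-Z]\w+) to (?P<event>rise|fall) by "
    r"(?P<change>\d+(?:\.\d+)?%) to (?P<new_value>\$\d+(?:\.\d+)?) as at (?P<new_date>.+?) "
    r"from (?P<old_value>\$\d+(?:\.\d+)?) in (?P<old_period>[^.]+)"
)

def extract(sentence):
    """Return a dict of fields if the sentence matches the pattern, else None."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None

sentence = ("The recent developments in technology have enabled the stock price of Apple "
            "to rise by 20% to $168 as at Feb 20, 2018 from $140 in Q3 2017.")
print(extract(sentence))
print(extract("Apple shares climbed a fifth last quarter."))  # None: pattern does not match
```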

CHALLENGES OF NLP FOR AI:

Artificial intelligence has become part of our everyday lives: Alexa and Siri, text and email autocorrect, customer
service chatbots. They all use machine learning algorithms to process and respond to human language. A branch of
AI called Natural Language Processing (NLP) allows machines to “understand” natural human language. A
combination of linguistics and computer science, NLP works to transform regular spoken or written language into
something that can be processed by machines.

NLP is a powerful tool with huge benefits, but there are still a number of Natural Language Processing limitations and
problems:

1. Contextual words and phrases and homonyms
2. Synonyms
3. Irony and sarcasm
4. Ambiguity
5. Errors in text or speech
6. Colloquialisms and slang
7. Domain-specific language
8. Low-resource languages
9. Lack of research and development

1. CONTEXTUAL WORDS AND PHRASES AND HOMONYMS:

The same words and phrases can have different meanings according to the context of a sentence, and many words,
especially in English, have the exact same pronunciation but totally different meanings.

For example:
I ran to the store because we ran out of milk.
Can I run something past you real quick?
The house is looking really run down.

These are easy for humans to understand because we read the context of the sentence and we understand all of the
different definitions. And, while NLP language models may have learned all of the definitions, differentiating between
them in context can present problems. Homonyms – two or more words that are pronounced the same but have
different definitions – can be problematic for question answering and speech-to-text applications because the input is
spoken rather than written, so the spelling cannot be used to tell them apart. Usage of their and there, for example, is a
common problem even for humans.

2. SYNONYMS:

Synonyms can lead to issues similar to contextual understanding because we use many different words to express the
same idea. Furthermore, some of these words may convey exactly the same meaning, while others differ in degree
(small, little, tiny, minute), and different people use synonyms to denote slightly different meanings within
their personal vocabulary.

So, for building NLP systems, it’s important to include all of a word’s possible meanings and all possible synonyms.
Text analysis models may still occasionally make mistakes, but the more relevant training data they receive, the better
they will be able to understand synonyms.

3. IRONY AND SARCASM:

Irony and sarcasm present problems for machine learning models because they generally use words and phrases that,
strictly by definition, may be positive or negative, but actually connote the opposite.

Models can be trained with certain cues that frequently accompany ironic or sarcastic phrases, like “yeah right,”
“whatever,” etc., and word embeddings (where words that have the same meaning have a similar representation), but
it’s still a tricky process.

4. AMBIGUITY:
Q. Differentiate between Syntactic Ambiguity and Lexical Ambiguity.

Q. Explain the ambiguities associated at each level with example for natural language processing.

Ambiguity can occur at all NLP levels. It is a property of linguistic expressions. If an expression
(word/phrase/sentence) has more than one meaning, we refer to it as ambiguous. For example: Consider the sentence,
“The chicken is ready to eat.” The meaning can be that the chicken (food) is ready to be eaten or that the chicken
(bird) is ready to be fed.

NLP has the following types of ambiguities:

(a) Lexical ambiguity:

It is the ambiguity of a single word. A word can be ambiguous with respect to its syntactic class. E.g.: book,
study.

For example: The word "silver" can be used as an adjective, a noun or a verb.

o She made a silver speech.
o She bagged two silver medals.
o His worries had silvered his hair.

Lexical ambiguity can be resolved by lexical category disambiguation, i.e., part-of-speech tagging, since many
words may belong to more than one lexical category. Part-of-speech tagging is the process of assigning a part of
speech or lexical category, such as noun, pronoun, verb, preposition, adverb or adjective, to each word in
a sentence.

(b) Lexical Semantic Ambiguity:

This type of lexical ambiguity occurs when a single word is associated with multiple interpretations.
E.g.: fast, bat, bank, pen, cricket etc.

For example:
1. The tank was full of water.
2. I saw a military tank.

Words in such sentences have multiple meanings. Consider the sentence "I saw a bat." Possible meanings of
the words, which change the reading of the sentence, are:

o bat: flying mammal / wooden club
o saw: past tense of “see” / present tense of “saw” (to cut with a saw)

The occurrence of tank in both sentences corresponds to the syntactic category noun, but their meanings are
different. Lexical semantic ambiguity is resolved using word sense disambiguation (WSD) techniques, where
WSD aims at automatically assigning the meaning of the word in its context in a computational manner.
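
One simple way to picture WSD is a simplified Lesk-style overlap count: pick the sense whose dictionary gloss shares
the most words with the sentence. The sketch below is only an illustration; the sense labels, glosses and stopword
list are hand-written assumptions, not entries from a real lexicon such as WordNet.

```python
# Toy sense inventory: glosses are hand-written for this sketch, not taken from a real lexicon.
SENSES = {
    "tank": {
        "container": "large container for storing liquid such as water or fuel",
        "vehicle": "armoured military combat vehicle with a gun used by an army",
    },
}
STOPWORDS = {"a", "an", "the", "of", "was", "i", "for", "or", "such", "as", "with", "by"}

def tokenize(text):
    """Lower-case the text, strip basic punctuation and drop stopwords."""
    return {word.strip(".,").lower() for word in text.split()} - STOPWORDS

def simplified_lesk(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence context."""
    context = tokenize(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & tokenize(gloss))
        if overlap > best_overlap:  # on a tie, the first listed sense is kept
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("tank", "The tank was full of water."))
print(simplified_lesk("tank", "I saw a military tank."))
```

The first sentence shares "water" with the container gloss and the second shares "military" with the vehicle gloss,
so the two occurrences of tank receive different senses. With no overlapping words the method can only guess, which
is one reason practical WSD needs richer resources and training data.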

(c) Syntactic ambiguity:

A sentence can be parsed in different ways. For example, “He lifted the beetle with red cap”: did he use a
cap to lift the beetle, or did he lift a beetle that had a red cap?
Structural ambiguities are syntactic ambiguities. Structural ambiguity is of two kinds: (i) Scope
Ambiguity and (ii) Attachment Ambiguity.
 Scope Ambiguity: Scope ambiguity involves operators and quantifiers.
For example: Old men and women were taken to safe locations.
The scope of the adjective (i.e., the amount of text it qualifies) is ambiguous. That is, is the
structure (old (men and women)) or ((old men) and women)? The scope of quantifiers is often not clear
and creates ambiguity.
 Attachment Ambiguity: A sentence has attachment ambiguity if a constituent fits more than one
position in a parse tree. Attachment ambiguity arises from uncertainty about where to attach a phrase or
clause in a sentence.
Consider the example: The man saw the girl with the telescope.
It is ambiguous whether the man saw her through his telescope or he saw a girl carrying a telescope.
The meaning depends on whether the prepositional phrase ‘with the telescope’ is attached to the verb
‘saw’ or to the noun phrase ‘the girl’.
Consider the example: Buy books for children.
The prepositional phrase ‘for children’ can be either adverbial, attaching to the verb ‘buy’, or
adjectival, attaching to the object noun ‘books’.
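
Attachment ambiguity can be made concrete by counting parse trees. The sketch below uses a CKY-style chart over a
tiny context-free grammar; the grammar and lexicon are assumptions written only for the telescope sentence and are
not a grammar of English.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for this sketch).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase (adverbial reading)
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase (adjectival reading)
    ("PP", "P", "NP"),
]
LEXICON = {"the": "Det", "man": "N", "girl": "N", "telescope": "N", "saw": "V", "with": "P"}

def count_parses(words, start="S"):
    """Count the distinct parse trees that derive `words` from `start`."""
    n = len(words)
    # chart[i][j][X] = number of ways category X derives words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1][LEXICON[word]] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]

print(count_parses("the man saw the girl with the telescope".split()))
print(count_parses("the man saw the girl".split()))
```

Under this toy grammar the full sentence has two parses, one for each attachment of "with the telescope", while
the sentence without the prepositional phrase has only one.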

(d) Semantic ambiguity:

This type of ambiguity is typically related to the interpretation of a sentence. Even after the syntax and the
meanings of the individual words have been resolved, there are two ways of reading the sentence.
Consider the example: "Seema loves mother and Sriya does too". The interpretations can be that Sriya loves Seema's
mother or that Sriya loves her own mother. Semantic ambiguities arise from the fact that a computer is generally not
in a position to distinguish what is logical from what is not.

Consider the example: “The car hit the pole while it was moving"

The interpretations can be

• The car, while moving, hit the pole.
• The car hit the pole while the pole was moving.

The first interpretation is preferred over the second because we have a model of the world that helps us distinguish
what is logical (or possible) from what is not. Supplying such a model of the world to a computer is not so easy.

Consider the example: "We saw his duck". Duck can refer to the person's bird or to a motion he made. Semantic
ambiguity happens when a sentence contains an ambiguous word or phrase.

(e) Discourse Ambiguity:

Discourse-level processing needs a shared world or shared knowledge, and interpretation is carried out using this
context. Anaphoric ambiguity falls under the discourse level.

 Anaphoric Ambiguity: Anaphors are expressions that refer back to entities that have been previously
introduced into the discourse.
For example, “The horse ran up the hill. It was very steep. It soon got tired.” The anaphoric reference
of ‘it’ in the two sentences causes ambiguity. Steep applies to a surface, hence ‘it’ can be the hill. Tired
applies to an animate object, hence ‘it’ can be the horse.
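
The reasoning in the horse example can be sketched as a lookup of what each predicate requires of its subject. The
feature labels and the two small tables below are hand-written assumptions for this one example; a real system would
need far broader world knowledge.

```python
# Toy world knowledge, written by hand for this sketch.
ENTITY_FEATURES = {"horse": {"animate"}, "hill": {"surface"}}
PREDICATE_REQUIRES = {"steep": "surface", "tired": "animate"}

def resolve_it(predicate, candidates):
    """Keep only the candidate antecedents that satisfy the predicate's requirement."""
    required = PREDICATE_REQUIRES[predicate]
    return [entity for entity in candidates if required in ENTITY_FEATURES[entity]]

candidates = ["horse", "hill"]       # entities introduced by "The horse ran up the hill."
print(resolve_it("steep", candidates))   # antecedent of "It was very steep."
print(resolve_it("tired", candidates))   # antecedent of "It soon got tired."
```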

(f) Pragmatic ambiguity:

Pragmatic ambiguity refers to a situation where the context of a phrase gives it multiple meanings. Resolving it is
one of the hardest tasks in NLP. The problem involves processing user intention, sentiment, belief world etc., all of
which are highly complex tasks.

Consider the example:
Tourist (checking out of the hotel): "Waiter, go upstairs to my room and see that my sandals are there; do not be
late; I have to catch the train in 15 minutes."
Waiter (running upstairs and coming back panting): "Yes sir, they are there."
Clearly, the waiter is falling short of the expectation of the tourist, since he does not understand the pragmatics
of the situation: the tourist wanted the sandals brought down, not merely checked.

Pragmatic ambiguity arises when the statement is not specific, and the context does not provide the
information needed to clarify the statement. Information is missing, and must be inferred.

It is a highly complex task to resolve all these kinds of ambiguities, especially in the upper levels of NLP. The
meaning of a word, phrase, or sentence cannot be understood in isolation: contextual knowledge is needed to
interpret the meaning, and pragmatic and world knowledge is required at higher levels. It is not an easy job to create
a world model for disambiguation tasks. Linguistic tools and lexical resources are needed for the development of
disambiguation techniques. Low-resource languages lag behind resource-rich languages in the implementation of these
techniques. Rule-based methods are language specific, whereas stochastic or statistical methods are language
independent. Automatic resolution of all these ambiguities involves several long-standing problems, but development
towards full-fledged disambiguation techniques that take care of all the ambiguities is still required. It is very
much necessary for the accurate working of NLP applications such as Machine Translation, Information Retrieval,
Question Answering etc.

Statistical Approaches of Ambiguity Resolution in Natural Language Processing are:

[1] Probabilistic model
[2] Part of Speech Tagging
o Rule-Based Approaches
o Markov Model Approaches
o Maximum Entropy Approaches
o HMM-Based Taggers
[3] Machine Learning Approaches

5. SPELLING ERRORS IN TEXT:

Misspelled or misused words can create problems for text analysis. Spelling mistakes can occur for a variety of
reasons, from typing errors to extra spaces between letters or missing letters.

Autocorrect and grammar correction applications can handle common mistakes, but don’t always understand the
writer’s intention.

Cosine similarity is one of the methods used to find the correct word when a spelling mistake has been detected.
Each word is represented as a vector of its letters, and the cosine of the angle between the vector of the dictionary
word and that of the misspelled word measures how many letters the two have in common. This way we can find words
that are close to the misspelled word by setting a threshold on the cosine similarity and identifying all the words
above the set threshold as possible replacement words.

For example, if the misspelled word is “speling,” the system will find the correct word: “spelling.”
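
A minimal sketch of this idea follows, assuming each word is turned into a vector of letter counts. The small
dictionary and the threshold value are assumptions for illustration only.

```python
import math
from collections import Counter

def cosine_similarity(word_a, word_b):
    """Cosine of the angle between the letter-count vectors of two words."""
    a, b = Counter(word_a), Counter(word_b)
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def suggest(misspelled, dictionary, threshold=0.85):
    """Return dictionary words at or above the threshold, best match first."""
    scored = [(cosine_similarity(misspelled, word), word) for word in dictionary]
    return [word for score, word in sorted(scored, reverse=True) if score >= threshold]

# Hypothetical mini-dictionary for the example.
DICTIONARY = ["spelling", "spilling", "sampling", "selling", "language"]
suggestions = suggest("speling", DICTIONARY)
print(suggestions)
```

Here "spelling" scores highest, and unrelated words such as "language" fall below the threshold. Letter-count
vectors ignore letter order, so anagrams score a perfect 1.0; character n-grams or edit distance are common
alternatives when order matters.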

6. COLLOQUIALISMS AND SLANG:

Informal phrases, expressions, idioms, and culture-specific lingo present a number of problems for NLP – especially
for models intended for broad use. Unlike formal language, colloquialisms may have no “dictionary definition” at
all, and these expressions may even have different meanings in different geographic areas.

Furthermore, cultural slang is constantly morphing and expanding, so new words pop up every day.

This is where training and regularly updating custom models can be helpful, although it oftentimes requires quite a lot
of data.

7. DOMAIN-SPECIFIC LANGUAGE:

Different businesses and industries often use very different language. An NLP model needed for healthcare, for
example, would be very different from one used to process legal documents. These days there are a number of analysis
tools trained for specific fields, but extremely niche industries may still need to build or train their
own models.

8. LOW-RESOURCE LANGUAGES:

Machine learning NLP applications have been largely built for the most common, widely used languages.
However, many languages, especially those spoken by people with less access to technology, often go overlooked and
under-processed. For example, by some estimations (depending on language vs. dialect), there are over 3,000
languages in Africa alone. There simply isn’t very much data on many of these languages.
However, new techniques, like multilingual transformers (using Google’s BERT, “Bidirectional Encoder
Representations from Transformers”) and multilingual sentence embeddings, aim to identify and leverage universal
similarities that exist between languages.

9. LACK OF RESEARCH AND DEVELOPMENT:

Machine learning requires A LOT of data to function to its outer limits – billions of pieces of training data. The more
data NLP models are trained on, the smarter they become. That said, data (and human language!) is only growing by
the day, as are new machine learning techniques and custom algorithms. All of the problems above will require more
research and new techniques in order to improve on them.

Advanced practices like artificial neural networks and deep learning allow a multitude of NLP techniques, algorithms,
and models to work progressively, much like the human mind does. As they grow and strengthen, we may have
solutions to some of these challenges in the near future.
(Write either the content above for the 'challenges of NLP' or the content below)

CHALLENGES OF NLP

If we have to progress in terms of the potential applications and overall capabilities of NLP, these are the important
issues we need to resolve:

(1) Language Differences: If we speak English and if we are thinking of reaching an international and/or
multicultural audience, we shall need to provide support for multiple languages. Different languages have not only
vastly different sets of vocabulary, but also different types of phrasing, different modes of inflection and different
cultural expectations. We shall need to spend time retraining our NLP system for each new language.

(2) Training Data: NLP is all about analyzing language to better understand it. A person must spend years reading,
listening to, and using a language to become fluent in it. Similarly, the abilities of an NLP system depend on the
training data provided to it. If questionable data is fed to the system, it is going to learn the wrong things, or
learn in an inefficient manner.

(3) Development Time: One must also consider the development time for an NLP system. With a distributed deep
learning model and multiple GPUs working in coordination, training time can be reduced to just a few hours.

(4) Phrasing Ambiguities: Sometimes, it is hard even for another human being to parse out what someone means
when they say something ambiguous.

There may not be a clear, concise meaning to be found in a strict analysis of their words. In order to resolve this, an
NLP system must be able to seek context that can help it understand the phrasing. It may also need to ask the user for
clarity.

(5) Misspelling: Misspellings are a simple problem for human beings, but for a machine, misspellings can be harder
to identify. One should use an NLP tool with capabilities to recognize common misspellings of words, and move
beyond them.

(6) Innate Biases: In some cases, NLP tools can carry the biases of their programmers as well as biases within the
data sets. Depending on the application, an NLP system could provide a better experience to certain types of users
than to others. It is challenging to make a system that works equally well in all situations, with all people.

(7) Words with Multiple Meaning: Most languages have words that can have multiple meanings depending on the
context.

For example, a user who asks, “how are you” has a completely different intention than a user who asks something
else. Good NLP tools should be able to differentiate between these phrases using context.

(8) Phrases with Multiple Intentions: Some phrases and questions actually have multiple intentions, so the NLP
system cannot oversimplify the situation by interpreting only one of those intentions.
For example, a user may prompt the chatbot with something like, “I need to cancel any previous order and update my
card on file.” The AI needs to be able to recognize each of these intentions separately.
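
As a rough illustration of returning more than one intention for a single utterance, the sketch below matches
keyword sets. The intent names and trigger keywords are hypothetical; a production chatbot would normally use a
trained multi-label classifier instead of keywords.

```python
# Hypothetical intent names and trigger keywords, chosen only for this illustration.
INTENT_KEYWORDS = {
    "cancel_order": {"cancel", "order"},
    "update_payment_card": {"update", "card"},
    "track_order": {"track", "order"},
}

def detect_intents(utterance):
    """Return every intent whose trigger keywords all occur in the utterance."""
    tokens = {word.strip(".,!?").lower() for word in utterance.split()}
    return [intent for intent, keywords in INTENT_KEYWORDS.items() if keywords <= tokens]

intents = detect_intents("I need to cancel any previous order and update my card on file.")
print(intents)
```

The function returns a list rather than a single label, so both the cancellation and the card update are kept.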

(9) False Positives and Uncertainty: A false positive occurs when an NLP system notices a phrase that should be
understandable but cannot be sufficiently answered.

The solution here is to develop an NLP system that can recognize its own limitations, and use questions to clear up the
ambiguity.

(10) Keeping a conversation moving: Many modern NLP applications are built on dialogue between a human and a
machine.

Accordingly, your NLP AI needs to be able to keep the conversation moving, providing additional questions to collect
more information and always pointing towards a solution.

FEATURES OF INDIAN LANGUAGES

[Figure: Features of Indian Languages: Morphological Richness, Script Diversity, Free Word Order, Compound Words
and Agglutination]

1. MORPHOLOGICAL RICHNESS

Definition: This refers to the complexity and variety of word forms in a language, often due to inflection.

 Inflection and Multiple Word Forms: Inflection involves changing the form of a word to express different
grammatical features such as tense, mood, voice, aspect, person, number, gender, and case. For example, in
English, the verb "run" can be inflected to "runs," "ran," and "running" to convey different tenses and aspects.
Similarly, nouns can have different forms to indicate singular and plural (e.g., "cat" vs. "cats") or possessive
(e.g., "cat's" vs. "cats'").
 Languages with rich morphology have extensive inflectional systems, which means a single word can have
numerous forms.

2. SCRIPT DIVERSITY

Definition: The variety of writing systems used in a language or across different languages in a region.
 Devanagari, Tamil, Bengali, etc.: These are different scripts used to write languages in the Indian
subcontinent. For example:
o Devanagari: Used for Hindi, Marathi, Nepali, and Sanskrit.
o Tamil: Used for Tamil.
o Bengali: Used for Bengali and Assamese.

Each script has unique characters and writing rules, contributing to the script diversity of the region. The ability to use
multiple scripts is a significant feature of linguistic diversity.

3. COMPOUND WORDS AND AGGLUTINATION

Definition: The formation of new words by combining two or more existing words (compound words) or adding
prefixes, suffixes, and infixes to a base word (agglutination).

 Examples from Hindi and Tamil:


o Compound Words: These are words formed by combining two or more words. For example, in
Hindi, "राष्ट्रपति" (rāṣṭrapati) is a compound of "राष्ट्र" (nation) and "पति" (leader), meaning
"President."
o Agglutination: This refers to the process of adding affixes to a base word to express grammatical
relations. For example, in Tamil, the word "கல்வி" (kalvi) means "education." By adding the suffix
"-க்காக" (-kkāga), it becomes "கல்விக்காக" (kalvikkāga), meaning "for education."

4. FREE WORD ORDER

Definition: The flexibility in the arrangement of words in a sentence without altering the fundamental meaning.

 Syntax Flexibility: Languages with free word order allow for various word orders in sentences, unlike
English, which primarily follows a Subject-Verb-Object (SVO) order.
 For example, in Hindi (a language with relatively free word order), the sentence "I ate an apple" can be
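expressed in different ways without changing its meaning:
o मैंने सेब खाया। (maine seb khaya: Subject-Object-Verb)
o सेब मैंने खाया। (seb maine khaya: Object-Subject-Verb)
o मैंने खाया सेब। (maine khaya seb: Subject-Verb-Object)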

This flexibility allows speakers to emphasize different parts of the sentence or to conform to poetic and stylistic
requirements.

• Unicode Support: ensuring comprehensive encoding
• Rendering Challenges: proper display of complex scripts
• Input Methods: efficient text input for various scripts

NLP FOR INDIAN REGIONAL LANGUAGES

(1) One might think that people who are acquainted with computers are already familiar with the English interface.
However, it's worth noting that the majority of India's population still lives in rural areas, where teaching
and learning take place in local languages and where communities are literate but not familiar with English. So,
yes, it is a worthwhile effort to scale up NLP research in India.

(2) The dream of an all-inclusive Digital India cannot be realized without bringing NLP research and application in
India at par with that of languages like English. When engaging with smartphones, the language barrier can be a huge
obstacle to many.

(3) Take the case of farmers and agriculture, which has long been considered the backbone of India. Farmers play an
obviously important role in feeding the country. Helping such farmers improve their methods (through precision
agriculture, farmer helplines, chatbots, etc.) has been an aim of development projects and an important part of the fight
against global hunger.
But many small farmers do not know English, so it is difficult for them to share and learn about
new farming practices, since most of the information is in English.

(4) Can you imagine a mobile application like Google Assistant but tailor-made for Indian farmers? It would allow
them to ask their questions in their native tongue; the system would understand their query and suggest relevant
information from around the globe.

(5) Do you think this is possible to do without NLP for Indian regional languages?

And, this is just one possible use-case. From making information more accessible to understanding farmer suicides,
NLP has a huge role to play.

Thus, there is a clear need to bolster NLP research for Indian languages so that such people who don't know
English can get "online" in the true sense of the word, ask questions, in their mother tongue and get answers.

The need also becomes clear when we look at some of the applications of NLP in India.
Applications:

They are:

(1) Smartphone users in India crossed 500 million in 2019. Businesses are feeling a need to increase user
engagement at the local level. NLP can go a long way in achieving that-by improving search accuracy
(Google Assistant now supports multiple Indian Languages), chatbots and virtual agents, etc.
(2) NLP has huge application in helping people with disabilities-interpretation of sign languages, text to speech,
speech to text, etc.
(3) Digitization of Indian Manuscripts to preserve knowledge contained in them.
(4) Signboard Translation from Vernacular Languages to make travel more accessible.
(5) Fonts for Indian Scripts for improving the impact/readability of advertisements, signboards, presentations,
reports, etc.
(6) There are many more. The ideal scenario would be to have corpora and tools available in as good quality as
they are for English to support work in these areas.

TYPES OF TOOLS FOR REGIONAL LANGUAGES:

Various types of tools in Indian regional language are:

(i) Using the Phonetic Keyboard:


o Using Indian languages on a computer is very attractive for a layman.
o Quillpad and Lipikaar are free online typing tools for Indian languages. They support
transliteration according to pre-defined rules.
o A transliteration technology allows users to type words as they normally would (like
‘rashtrabhasha’ instead of 'rAshtrabhAshA') without worrying about case-sensitive typing rules.
o Transliteration tools expect users to type words phonetically in English.
o This enables users to communicate in their regional language of choice.
(ii) Fonts Download:
o The Technology Development for Indian Languages (TDIL) program, initiated by the Department
of Electronics and Information Technology (DEIT) of the Government of India, aims to develop
information processing tools that facilitate human-machine interaction in Indian languages and to
create technologies that provide access to multilingual knowledge resources.
o The fonts are being made available to the public for free through language CDs and web
downloads for the benefit of the masses.
(iii) Padma Plugin:
o Padma is a technology designed to transform Indic text between public and proprietary formats.
o It currently supports Telugu, Malayalam, Tamil, Devanagari (including Marathi), Gujarati,
Bengali, and Gurmukhi.
o Padma's goal is to bridge the gap between closed and open standards until the day Unicode support
is widely available on all platforms.
o Padma automatically transforms Indic text encoded in proprietary formats into Unicode.

ISSUES IN FONT

[Figure: Issues in Font: Unicode Support, Rendering Challenges, Input Methods]

UNICODE SUPPORT:

Definition: Unicode is a standardized encoding system that allows computers to consistently represent and manipulate
text expressed in most of the world's writing systems.

 Ensuring Comprehensive Encoding: To support a wide range of languages and scripts, it's crucial that the
Unicode standard includes comprehensive encoding for all characters. This involves ensuring that every
unique character in a language's script is assigned a specific code point within the Unicode standard.
Comprehensive Unicode support is essential for accurately representing and processing text in different
languages, especially those with diverse and complex scripts.
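
To see what "a specific code point" means in practice, the short sketch below looks up the letter "ka" in three
Indian scripts using Python's standard unicodedata module. The choice of these three letters is only an
illustration.

```python
import unicodedata

# The letter "ka" in three Indian scripts, written as escapes so the code points are explicit.
KA_LETTERS = {
    "Devanagari": "\u0915",
    "Bengali": "\u0995",
    "Tamil": "\u0b95",
}

def describe(ch):
    """Return the code point, official Unicode name and UTF-8 byte length of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8"))

for script, ch in KA_LETTERS.items():
    print(script, ch, *describe(ch))
```

Each script has its own code point for what is phonetically the same letter, and each of these characters takes
three bytes when stored as UTF-8.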

RENDERING CHALLENGES:

Definition: Rendering involves the process of displaying text on a screen or other output device. For languages with
complex scripts, rendering can be challenging due to the need to correctly display various character shapes, ligatures,
and contextual forms.

 Proper Display of Complex Scripts: Complex scripts, such as Devanagari, Tamil, and Arabic, require
advanced rendering techniques to display characters correctly. These scripts often involve:
o Ligatures: Combined characters that are displayed as a single glyph.
o Contextual Forms: Characters that change shape depending on their position in a word or sentence.
o Diacritics: Marks added to characters to alter their pronunciation or meaning.
Proper rendering ensures that these elements are displayed accurately and legibly, maintaining the readability and
aesthetics of the text.
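
The gap between stored characters and displayed glyphs can be inspected directly. The sketch below, using only the
standard library, breaks the Devanagari conjunct "ksha" into its code points; whether it appears as a single
ligature depends on the font and rendering engine, which Python does not control.

```python
import unicodedata

# Devanagari conjunct "ksha": KA + VIRAMA + SSA, usually drawn as one ligature glyph.
KSHA = "\u0915\u094d\u0937"

def code_points(text):
    """List (code point, Unicode name, general category) for each stored character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)) for ch in text]

print(KSHA, "is stored as", len(KSHA), "code points:")
for entry in code_points(KSHA):
    print(entry)
```

The middle character is a non-spacing mark (category Mn) that tells the renderer to join the two consonants, so a
string of length three is normally shown as one visual unit.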

INPUT METHODS:

Definition: Input methods refer to the tools and techniques used to enter text into a computer or other devices. For
languages with diverse scripts, developing efficient and user-friendly input methods is essential.

 Efficient Text Input for Various Scripts: Different scripts require different input methods. For example:
o Phonetic Keyboards: These allow users to type phonetically, where the keyboard layout corresponds
to the sounds of the language rather than the actual script. This is common for typing languages like
Hindi or Tamil using a standard QWERTY keyboard.
o Script-Specific Keyboards: These keyboards have layouts specifically designed for a particular
script, such as Devanagari or Bengali keyboards.
o Input Method Editors (IMEs): Software tools that convert keystrokes into complex characters. For
instance, typing "ka" in an IME might produce the Devanagari character "क".
o Predictive Text: Software that predicts and suggests words as users type, improving typing speed and
accuracy.
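
The way an IME or phonetic keyboard turns keystrokes into script can be sketched as a greedy longest-match lookup.
The four-entry table below is a hypothetical fragment; real input methods cover the full script and handle vowel
signs and conjuncts with many more rules.

```python
# Hypothetical fragment of a Latin-to-Devanagari syllable table (escapes keep the code points explicit).
SYLLABLES = {
    "kaa": "\u0915\u093e",  # KA + vowel sign AA
    "ka": "\u0915",         # KA
    "ma": "\u092e",         # MA
    "la": "\u0932",         # LA
}

def transliterate(latin):
    """Convert typed Latin text to Devanagari by greedy longest-match lookup."""
    output, i = [], 0
    max_len = max(len(key) for key in SYLLABLES)
    while i < len(latin):
        for size in range(min(max_len, len(latin) - i), 0, -1):
            chunk = latin[i:i + size]
            if chunk in SYLLABLES:
                output.append(SYLLABLES[chunk])
                i += size
                break
        else:
            output.append(latin[i])  # unknown keystroke: pass it through unchanged
            i += 1
    return "".join(output)

print(transliterate("ka"))
print(transliterate("kamala"))
print(transliterate("kaala"))
```

Trying the longest key first is what lets "kaa" win over "ka" when the user types a long vowel.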

MODELS IN NLP
RULE-BASED MODELS IN NLP

 Rule-based models in NLP rely on a set of handcrafted rules to process and analyze text. These models were
among the earliest approaches to NLP and continue to be useful for certain tasks.
 Rule-based approach is one of the oldest NLP methods in which predefined linguistic rules are used to
analyze and process textual data.
 Rule-based approach involves applying a particular set of rules or patterns to capture specific structures,
extract information, or perform tasks such as text classification and so on. Some common rule-based
techniques include regular expressions and pattern matches.

STEPS IN RULE-BASED APPROACH IN NLP:

1. Rule Creation: Based on the desired tasks, domain-specific linguistic rules are created such as grammar
rules, syntax patterns, semantic rules or regular expressions.
2. Rule Application: The predefined rules are applied to the input data to capture matched patterns.
3. Rule Processing: The text data is processed in accordance with the results of the matched rules to extract
information, make decisions or other tasks.
4. Rule refinement: The created rules are iteratively refined by repetitive processing to improve accuracy and
performance. Based on previous feedback, the rules are modified and updated when needed.
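
The steps above can be sketched with regular expressions. The three rules and the sample sentence are assumptions
for illustration; refinement (step 4) would mean editing these patterns after reviewing what they miss or wrongly
match.

```python
import re

# Step 1, rule creation: hypothetical hand-written patterns for three kinds of information.
RULES = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.]+@[\w.]+\.\w+\b")),
    ("MONEY", re.compile(r"\$\d+(?:\.\d{2})?")),
]

def apply_rules(text):
    """Steps 2 and 3: apply every rule, then return (label, value) pairs in text order."""
    matches = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            matches.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(matches)]

extracted = apply_rules("Invoice dated 12/03/2024 for $45.50 was sent to billing@example.com.")
print(extracted)
```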

KEY FEATURES OF RULE-BASED MODELS:

1. Handcrafted Rules: Linguistic experts manually create rules based on language syntax, morphology, and
semantics.
2. Pattern Matching: These models often use regular expressions or similar pattern matching techniques to
identify structures in the text.
3. Transparency: The rules and their applications are transparent, making it easy to understand how decisions
are made.
4. Efficiency: For specific tasks with well-defined patterns, rule-based models can be very efficient and
accurate.

APPLICATIONS OF RULE-BASED MODELS IN NLP:

1. Part-of-Speech Tagging: Assigning parts of speech to each word in a sentence based on rules related to word
morphology and context.
2. Named Entity Recognition (NER): Identifying proper nouns, such as names of people, organizations, and
locations, using rules related to capitalization and context.
3. Text Normalization: Converting text into a standardized format, such as expanding abbreviations or
correcting spelling errors.
4. Information Extraction: Extracting specific information from texts, such as dates, addresses, or specific
phrases.
5. Grammar Checking: Identifying and correcting grammatical errors based on syntactic rules.

ADVANTAGES:

1. Precision: For tasks with well-defined patterns, rule-based systems can be highly precise.
2. Explainability: The decision-making process is transparent and understandable.
3. Domain Specificity: Rules can be tailored to specific domains or applications, leading to high performance in
those areas.

LIMITATIONS:

1. Scalability: Creating and maintaining a large set of rules can be labor-intensive and time-consuming.
2. Flexibility: Rule-based models struggle with linguistic variability and ambiguity.
3. Coverage: They may fail to handle cases not covered by the predefined rules.
4. Adaptability: Adapting to new languages or domains requires significant manual effort.

STATISTICAL MODELS IN NLP

Statistical models in NLP rely on probabilistic methods and statistical techniques to process and analyze text. Unlike
rule-based models, which use handcrafted rules, statistical models learn patterns and relationships from large datasets.

KEY FEATURES OF STATISTICAL MODELS:

1. Data-Driven: Statistical models are trained on large corpora of text data, allowing them to learn patterns and
make predictions based on observed data.
2. Probabilistic Nature: These models often use probabilities to handle uncertainty and variability in language.
3. Scalability: They can handle large datasets and complex tasks more effectively than rule-based models.
4. Generalization: Statistical models can generalize to new, unseen data by learning underlying patterns from
the training data.

COMMON STATISTICAL TECHNIQUES IN NLP:

1. N-grams: Sequences of 'n' words used to predict the next word or to model the probability of a sentence.
2. Hidden Markov Models (HMMs): Used for sequence labelling tasks like part-of-speech tagging and named
entity recognition by modelling the sequence of states (e.g., tags) as a Markov process.
3. Naive Bayes Classifier: A probabilistic classifier based on Bayes' theorem with strong (naive) independence
assumptions between features.
4. Log-Linear Models: Generalize linear models to predict probabilities by applying a linear function followed
by a softmax function.
5. Maximum Entropy Models: A type of log-linear model used for classification tasks that aims to model
distributions with maximum entropy subject to certain constraints.
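
As an illustration of the first technique in the list, the sketch below estimates bigram (n = 2) probabilities by
simple counting, i.e. maximum likelihood estimation. The three-sentence corpus is a made-up assumption, and no
smoothing is applied.

```python
from collections import Counter

# Hypothetical toy corpus.
CORPUS = [
    "i like natural language processing",
    "i like machine learning",
    "i study natural language",
]

def train_bigrams(sentences):
    """Count context words and bigrams, with <s> and </s> marking sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # every token that can act as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sentence, unigrams, bigrams):
    """Probability of a sentence as the product of its bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams)
    return prob

unigrams, bigrams = train_bigrams(CORPUS)
print(bigram_prob("i", "like", unigrams, bigrams))
print(sentence_prob("i like machine learning", unigrams, bigrams))
print(sentence_prob("i like language", unigrams, bigrams))
```

The last sentence gets probability zero because the bigram "like language" never occurs in the corpus. Unseen
word pairs like this are the reason smoothing techniques are used with n-gram models.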

APPLICATIONS:

1. Language Modelling: Predicting the next word in a sentence or the likelihood of a sequence of words.
2. Part-of-Speech Tagging: Assigning parts of speech to words in a sentence using statistical methods to
determine the most likely tag sequence.
3. Named Entity Recognition (NER): Identifying entities like names, dates, and locations within text using
statistical techniques to predict entity boundaries and types.
4. Machine Translation: Translating text from one language to another by modelling the probabilities of word
and phrase correspondences.
5. Speech Recognition: Converting spoken language into text by modelling the probability of sequences of
sounds corresponding to words.

ADVANTAGES:

1. Robustness: Can handle variability and ambiguity in natural language better than rule-based models.
2. Adaptability: Can be retrained or fine-tuned with new data to adapt to different languages or domains.
3. Performance: Often achieve higher accuracy on complex tasks due to their ability to learn from large
amounts of data.
4. Automation: Reduce the need for manual rule creation and maintenance.

LIMITATIONS:

1. Data Dependency: Require large annotated datasets for training, which can be expensive and time-consuming
to obtain.
2. Complexity: Can be computationally intensive, requiring significant resources for training and inference.
3. Interpretability: Often seen as “black boxes” with less transparent decision-making processes compared to
rule-based models.
4. Overfitting: Risk of overfitting to the training data, especially with smaller datasets or highly complex
models.

MACHINE LEARNING MODELS IN NLP

Machine learning models have revolutionized NLP by allowing systems to automatically learn patterns and make
decisions based on large amounts of text data.

KEY MACHINE LEARNING APPROACHES:

1. Supervised Learning: Models are trained on labelled data, where the input data is paired with the correct
output. Common algorithms include:
o Support Vector Machines (SVMs): Effective for text classification tasks like spam detection.
o Decision Trees and Random Forests: Used for classification and regression tasks.
o Logistic Regression: Often used for binary classification tasks, such as sentiment analysis.
2. Unsupervised Learning: Models learn from unlabeled data by identifying patterns and structures. Key
techniques include:
o Clustering: Grouping similar texts together, e.g., using k-means or hierarchical clustering.
o Topic Modelling: Discovering abstract topics within a collection of documents, using methods like
Latent Dirichlet Allocation (LDA).
3. Semi-Supervised Learning: Combines both labelled and unlabeled data for training, leveraging large
amounts of unlabeled data to improve model performance.
4. Reinforcement Learning: Models learn by interacting with an environment and receiving feedback, used in
applications like dialogue systems and chatbots.

DEEP LEARNING MODELS

Deep learning, a subset of machine learning, has achieved state-of-the-art results in many NLP tasks. Key deep
learning architectures include:

1. Feedforward Neural Networks: Basic neural networks used for simple tasks like sentiment analysis.
2. Recurrent Neural Networks (RNNs): Suitable for sequential data, capturing temporal dependencies in text.
Variants include:
o Long Short-Term Memory (LSTM): Designed to overcome the limitations of traditional RNNs by
handling long-range dependencies.
o Gated Recurrent Units (GRUs): Simplified version of LSTMs with fewer parameters.
3. Convolutional Neural Networks (CNNs): Typically used for image processing but also effective for text
classification tasks.
4. Transformer Models: Revolutionized NLP with their ability to capture contextual information across the
entire text. Notable models include:
o BERT (Bidirectional Encoder Representations from Transformers): Pre-trained on a large corpus
and fine-tuned for specific tasks.
o GPT (Generative Pre-trained Transformer): Excellent for generating coherent and contextually
relevant text.
o T5 (Text-To-Text Transfer Transformer): Converts all NLP tasks into a text-to-text format,
simplifying the model architecture.

APPLICATIONS:

1. Text Classification: Assigning categories to text, such as spam detection, sentiment analysis, and news
categorization.
2. Named Entity Recognition (NER): Identifying entities like names, dates, and locations within text.
3. Machine Translation: Translating text from one language to another, e.g., Google Translate.
4. Question Answering: Building systems that can answer questions based on a given context.
5. Summarization: Generating concise summaries of long texts.
6. Speech Recognition: Converting spoken language into text.
7. Text Generation: Creating new text based on a given input, such as writing assistance and content
generation.

ADVANTAGES

1. Performance: Machine learning models, especially deep learning models, have achieved high accuracy on
many NLP tasks.
2. Adaptability: Can be fine-tuned to different languages and domains with additional training data.
3. Automation: Reduce the need for manual feature extraction and rule creation.

LIMITATIONS

1. Data Requirements: Require large amounts of labelled data for training, which can be expensive and time-
consuming to obtain.
2. Computational Resources: Deep learning models, in particular, demand significant computational power and
memory.
3. Interpretability: Often seen as black boxes with complex decision-making processes that are difficult to
understand.

ALGORITHMS IN NLP

NLP is a dynamic technology that uses different methodologies to translate complex human language for machines. It
mainly utilizes artificial intelligence to process and translate written or spoken words so they can be understood by
computers.

NLP ALGORITHMS CATEGORIES:

NLP algorithms are ML-based algorithms or instructions that are used while processing natural languages. They are
concerned with the development of protocols and models that enable a machine to interpret human languages.

NLP algorithms vary with the AI approach taken and with the training data they have been given. Their main job is
to use different techniques to transform ambiguous or unstructured input into structured information that the
machine can learn from.

Alongside these techniques, NLP algorithms apply natural language principles to make the input easier for the
machine to understand. They help the machine work out the contextual meaning of a given input; without this, the
machine cannot carry out the request.

NLP algorithms are segregated into three different core categories, and AI models choose any one of the categories
depending on the data scientist’s approach. These categories are:

(1) SYMBOLIC ALGORITHMS:

Symbolic algorithms serve as one of the backbones of NLP algorithms. These are responsible for analyzing the
meaning of each input text and then utilizing it to establish a relationship between different concepts.

Symbolic algorithms leverage symbols to represent knowledge and also the relation between concepts. Since these
algorithms utilize logic and assign meanings to words based on context, you can achieve high accuracy.

Knowledge graphs also play a crucial role in defining the concepts of an input language along with the relationships
between those concepts. Because this approach defines concepts explicitly and makes word contexts easy to follow, it
helps build explainable AI (XAI).

However, it is difficult to expand the rule set of a symbolic algorithm, owing to various limitations.

(2) STATISTICAL ALGORITHMS:

Statistical algorithms make the job easier for machines by going through texts, analyzing each of them, and
retrieving the meaning. They are efficient because they help machines learn about human language by recognizing
patterns and trends across the input texts. This analysis lets a machine predict, in real time, which word is likely
to follow the current word.

From speech recognition, sentiment analysis, and machine translation to text suggestion, statistical algorithms are used
for many applications. The main reason behind their widespread usage is that they can work on large data sets.

Moreover, statistical algorithms can detect whether two sentences in a paragraph are similar in meaning and which
one to use. However, the major downside of these algorithms is that they partly depend on complex feature
engineering.

(3) HYBRID ALGORITHMS:

This type of NLP algorithm combines symbolic and statistical algorithms to produce an effective result. By building
on the main strengths of each, it can offset the biggest weaknesses of either approach, which is essential for high
accuracy.

There are several ways in which the two approaches can be combined:

(a) Symbolic supporting machine learning
(b) Machine learning supporting symbolic
(c) Symbolic and machine learning working in parallel

Symbolic algorithms can support machine learning by helping to train the model so that it needs less effort to learn
the language on its own. Conversely, when machine learning supports the symbolic approach, the machine learning
model can create an initial rule set for the symbolic algorithm and spare the data scientist from building it manually.

When symbolic and machine learning algorithms work in parallel, they lead to better results because together they
can ensure that models correctly understand a specific passage.

BEST NLP ALGORITHMS:

There are numerous NLP algorithms that help a computer to understand and emulate human language. Some of the most
widely used are:

 TOPIC MODELING:

Topic modeling uses statistical NLP techniques to find the themes or main topics in a large collection of text
documents.

Basically, it helps machines find the subject that characterizes a particular set of texts.

As each corpus of text documents contains numerous topics, the algorithm finds each topic by assessing particular
sets of words in the vocabulary.

Latent Dirichlet Allocation (LDA) is a popular technique for topic modeling.

It is an unsupervised ML algorithm that helps organize large archives of data that would be impractical to annotate
by hand.

 TEXT SUMMARIZATION:

In this approach, algorithms or programs are built that reduce the size of a text and create a summary of it. This is
called automatic text summarization in machine learning. Text summarization is the process of creating a shorter
text while preserving the meaning of the original.

There are two approaches to text summarization:

(I) Extractive approaches:

The machine extracts only the main words and phrases from the document, without modifying the original content.

(II) Abstractive approaches:

In this process, new words and phrases are generated that convey all the information and intent of the text
document.
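
The extractive approach can be sketched in a few lines. This is a minimal illustration, assuming a simple scoring rule (average corpus frequency of a sentence's non-stop words), a tiny hand-written stop-word list, and regex-based sentence splitting; production summarizers use more careful methods.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that", "on", "for"}

def content_words(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, unmodified and in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

text = (
    "NLP helps computers process language. "
    "Language models predict the next word in language data. "
    "The weather was pleasant yesterday."
)
summary = summarize(text, num_sentences=1)
```

Because the selected sentences are copied verbatim, this is extractive; an abstractive system would instead generate new sentences.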

 SENTIMENT ANALYSIS:

Sentiment analysis is the NLP algorithm that helps a machine work out the opinion or intent behind a user's text. It
is widely used in business AI models because it helps companies understand what customers think about their
products or services.

By understanding the intent of a customer’s text or voice data on different platforms, AI models can tell us about a
customer’s sentiments and help us approach them accordingly.
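
A very simple form of sentiment analysis can be sketched with word lists. The example below assumes tiny hand-made positive and negative lexicons and a one-word negation rule; these lists and the function name are illustrative only, and real systems usually learn sentiment from labelled data.

```python
import re

# Hypothetical mini-lexicons for illustration.
POSITIVE = {"good", "great", "love", "excellent", "happy", "fast"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow", "broken"}
NEGATIONS = {"not", "never", "no"}

def sentiment(text):
    """Classify text as positive, negative, or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)  # +1, -1, or 0
        # Flip the polarity when the previous word is a negation ("not good").
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = sentiment("I love this phone, the battery is great")
```

The negation rule shows why pure word counting is fragile: "not good" contains a positive word but expresses a negative opinion.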

 KEYWORD EXTRACTION:

Keyword extraction is another popular NLP algorithm that extracts targeted words and phrases from a large set of
text-based data.

Several keyword extraction algorithms are available, including popular names like TextRank, Term Frequency, and
RAKE.

Some of the algorithms might use extra words, while some of them might help in extracting keywords based on the
content of a given text.

Each keyword extraction algorithm rests on its own theoretical foundations. Keyword extraction benefits many
organizations because it helps in storing, searching, and retrieving content from large unstructured data sets.

[Figure: Keyword extraction cycle. 1. Document of Interest; 2. Remove Stop Words; 3. Term Frequency; 4. Inverse
Document Frequency; 5. TF * IDF; 6. Top N keywords based on score]

 KNOWLEDGE GRAPHS:

When it comes to choosing the best NLP algorithm, many consider knowledge graph algorithms. A knowledge graph is a
technique that stores information as triples.

Each triple is a blend of three things: subject, predicate, and object. However, the creation of a knowledge graph
isn’t restricted to one technique; instead, it requires multiple NLP techniques to be effective and detailed. This
approach is used for extracting ordered information from a heap of unstructured texts.

For example, the following two sentences "My cat eats fish on Saturday", "His dog eats turkey on Tuesday" can be
expressed as:
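
As a minimal sketch, the two sentences could be written as (subject, predicate, object) triples like this. The split into triples is done by hand, and the predicate name `eats_on` and the helpers `build_graph` and `query` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hand-written triples for the two example sentences (one possible encoding).
triples = [
    ("my cat", "eats", "fish"),
    ("my cat", "eats_on", "Saturday"),
    ("his dog", "eats", "turkey"),
    ("his dog", "eats_on", "Tuesday"),
]

def build_graph(triples):
    """Index the triples by subject so that each node lists its outgoing edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

def query(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [obj for pred, obj in graph.get(subject, []) if pred == predicate]

graph = build_graph(triples)
cat_food = query(graph, "my cat", "eats")      # ["fish"]
dog_day = query(graph, "his dog", "eats_on")   # ["Tuesday"]
```

In a real system the triples would be produced by NLP steps such as entity recognition and relation extraction rather than typed in by hand.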
 TF-IDF:

TF-IDF is a statistical NLP algorithm for evaluating how important a word is to a particular document within a large
collection. The technique multiplies two values:

(a) Term frequency: The term frequency gives the total number of times a word occurs in a particular document.
Stop words generally have a high term frequency in a document.
(b) Inverse document frequency: Inverse document frequency, on the other hand, highlights terms that are
highly specific to a document, that is, words that occur rarely across the whole corpus of documents.
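
The multiplication can be sketched as follows. This minimal example assumes whitespace tokenization, raw counts for term frequency, and log(N / document frequency) for inverse document frequency, which is one common variant among several; the function names and the three toy documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score each term in each document as term frequency * inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores

def top_keywords(doc_scores, n=3):
    """Return the n highest-scoring terms of one document."""
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]

documents = [
    "the cat eats fish",
    "the dog eats turkey",
    "the cat sleeps",
]
scores = tf_idf(documents)
best = top_keywords(scores[0], n=1)  # ["fish"]
```

The word "the" appears in every document, so its inverse document frequency is log(1) = 0 and its score is 0, even though its term frequency is as high as any other word's.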

 WORD CLOUD:

A word cloud is an NLP technique for data visualization. The important words in a text are highlighted and
displayed together in a single image.

The essential words in the document are printed in larger letters, whereas the least important words are shown in small
fonts. Sometimes the less important words are not visible at all.

An example of a word cloud:


APPLICATIONS OF NLP

APPLICATIONS OF NLP:

 The terms natural language processing, machine learning and artificial intelligence are often used
interchangeably. AI is regarded as an umbrella term for machines that can simulate human intelligence, while
NLP and ML are regarded as subsets of AI.
 Natural language processing is a form of AI that gives machines the ability not just to read, but to understand
and interpret human language.
 With NLP, machines can make sense of written or spoken text and perform tasks including speech
recognition, sentiment analysis, and automatic text summarization.
 Thus, NLP and ML are both parts of AI, and the two subsets share techniques, algorithms and
knowledge.
 Some NLP-based solutions include translation, speech recognition, sentiment analysis, question-answer
systems, chatbots, automatic text summarization, market intelligence, automatic text classification, and
automatic grammar checking.
 These technologies help organizations to analyze data, discover insights, automate time-consuming processes,
and/or gain competitive advantages.

(1) Translation:

 Translating languages is a more complex task than simple word-for-word replacement. Since each
language has its own grammar rules, the challenge is to translate a text without changing its meaning and
style.
 Since computers do not understand grammar, they need a process by which they can deconstruct a sentence
and then reconstruct it in another language in a way that makes sense.
 Google Translate is one of the most well-known online translation tools. Google Translate once used
phrase-based machine translation (PBMT), which looks for similar phrases between different languages.
 At present Google uses Google Neural Machine Translation (GNMT), which uses ML with NLP to look for
patterns in languages.
 Some other translation tools are DeepL and Reverso.

(2) Speech Recognition:

 Speech recognition is a machine’s ability to identify and interpret phrases and words from spoken language
and convert them into a machine-readable format.
 It uses NLP to allow computers to collect human interaction, and ML to respond in a way that simulates
human responses.
 Google Now, Alexa, and Siri are some of the most popular examples of speech recognition. When the user
simply says ‘call Ravi’, the phone recognizes what the command means and calls the contact saved as
‘Ravi’.

(3) Sentiment Analysis:

 Sentiment analysis uses NLP to interpret and analyze emotions in subjective data like news articles and
tweets.
 Positive, negative and neutral opinions can be identified to determine a customer's sentiment towards a brand,
product, or service.
 Sentiment analysis is used to measure public opinion, monitor brand reputation, and better understand
customer experiences.
 The stock market is a sensitive field that can be heavily influenced by human emotion. Negative sentiment can
lead stock prices to drop, while positive sentiment may trigger people to buy more of the company’s stock,
causing stock prices to increase.

(4) Chatbots:

 Chatbots are programs used to provide automated answers to common customer queries.
 They have pattern recognition systems with heuristic responses, which are used to hold conversations with
humans.
 Initially, chatbots were used to answer basic questions, relieving high-volume call centers and offering quick
customer support services. AI-powered chatbots are designed to handle more complicated requests, making
conversational experiences increasingly natural.
 Chatbots in healthcare can collect intake data, help patients assess their symptoms, and determine the next
steps. These chatbots can set up appointments with the right doctor and even recommend treatments.

(5) Question-Answer Systems:

 Question-answer systems are intelligent systems that can provide answers to customer queries.
 Unlike chatbots, question-answer systems draw on a huge array of knowledge and good language
understanding rather than canned answers. They can answer questions like “When was Indira Gandhi
assassinated?” or “How do I go to the Airport?”, and they can be built to deal with textual data as well as
audio, images and videos.
 Question-answer systems can be found in social media chats and in tools such as Siri and IBM's Watson.
 In 2011, IBM's Watson computer competed on Jeopardy!, a game show in which answers are given first
and the contestants supply the questions. The computer competed against the show's two biggest all-time
champions and astounded the tech industry when it won first place.

(6) Automatic Text Summarization:

 Automatic text summarization is the task of condensing a piece of text into a shorter version, extracting its
main ideas while preserving the content's meaning.
 This application of NLP is used in news headlines, result snippets in web search, and bulletins of market
reports.

(7) Market Intelligence:

 Market intelligence is the gathering of valuable insights surrounding trends, consumers, products and
competitors. It extracts actionable information that can be used for strategic decision-making.
 Market intelligence can analyze topics, sentiment, keywords, and intent in unstructured data and is less time
consuming than traditional desk research.
 Using Market intelligence, organizations can pick up on search queries and add relevant synonyms to search
results.
 It can also help organizations to decide which products or services to discontinue or whom to target as
customers.

(8) Automatic Text Classification:

 Automatic text classification is another fundamental solution of NLP. It is the process of assigning tags to text
according to its content and semantics. It allows for rapid, easy collection of information in the search phase.
 This NLP application can differentiate spam from non-spam based on its content.

(9) Automatic Grammar Checking:

 Automatic grammar checking, the task of detecting and correcting grammatical errors and spelling mistakes
in text depending on context, is another major part of NLP.
 A grammar checker alerts the user to a possible error, for example by underlining the word in red.

(10) Spam Detection:

 Spam detection is used to detect unwanted e-mails before they reach a user’s inbox.

(11) Information extraction:

 Information extraction is one of the most important applications of NLP.
 It is used for extracting structured information from unstructured or semi-structured machine-readable
documents.

(12) Natural Language Understanding (NLU):

 It converts a large set of text into more formal representations, such as first-order logic structures, that are
easier for computer programs to manipulate than the notations of natural language.

TWO COMPONENTS OF NATURAL LANGUAGE PROCESSING:

a. Natural Language Understanding (NLU)
 NLU takes a spoken or typed sentence and works out what it means.
 Different levels of analysis are required here, such as morphological analysis, syntactic analysis,
semantic analysis, discourse analysis, etc.
b. Natural Language Generation (NLG)
 NLG takes a formal representation of what you want to say and works out a way to express it in
a natural (human) language (e.g., English).
 Different levels of synthesis are required here: deep planning (what to say) and syntactic generation.

DIFFERENCE BETWEEN NLU AND NLG:

NLU | NLG
It is the process of reading and interpreting language. | It is the process of writing or generating language.
NLU explains the meaning behind the written text or speech in natural language. | NLG generates the natural language using machines.
NLU draws facts from the natural language using various tools and technologies such as parsers, POS taggers, etc. | NLG uses the insights generated from parsers, POS tags, etc. to generate the natural language.
NLU understands the human language and converts it into data. | NLG uses the structured data and generates meaningful narratives out of it.

GENERIC NLP SYSTEM:

1. Data Collection:
 Collect raw text data from various sources such as websites, social media, books, etc.
 Optionally, gather pre-labeled datasets for supervised learning tasks.

2. Data Preprocessing:

 Tokenization: Split text into tokens (words, sentences).
 Normalization: Convert text to lowercase, remove punctuation and stop words, apply
stemming/lemmatization.
 Vectorization: Convert text into numerical vectors using methods like N-grams, Bag-of-Words (BoW),
TF-IDF, or embeddings (Word2Vec, GloVe, BERT).

3. Model Selection:

 Classical Models:
o N-grams: Frequency counts of sequences of words.
o Bag-of-Words: Word frequency vectors.
o Naive Bayes: Probabilistic classifier based on Bayes' theorem.
 Advanced Models:
o Word Embeddings: Word2Vec, GloVe.
o Contextual Embeddings: BERT, GPT.
o Deep Learning Models: RNN, LSTM, GRU, Transformers.

4. Model Training:

 Split data into training and validation sets.
 Train the selected model on the labeled training data.
 Perform hyperparameter tuning to optimize performance.

5. Evaluation:

 Evaluate model performance using metrics like accuracy, precision, recall, F1-score, etc.
 Use cross-validation to validate the model on different subsets of data.

6. Deployment:

 Develop APIs to enable the model to interact with other systems.
 Integrate the model into applications, websites, or other platforms.

7. Monitoring and Maintenance:

 Continuously monitor the model’s performance.
 Update the model with new data to maintain accuracy.
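
Steps 2 to 5 above can be sketched end to end on a toy spam/ham task. This is a minimal illustration under several assumptions: a hand-made data set and stop-word list, regex tokenization, bag-of-words counts, and a multinomial Naive Bayes classifier with add-one (Laplace) smoothing written from scratch. All function names are chosen for this example.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "you", "your", "for", "at"}

def preprocess(text):
    """Step 2: tokenize, lowercase, and drop punctuation and stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def train_naive_bayes(examples):
    """Step 4: collect per-class bag-of-words counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = preprocess(text)
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    tokens = [t for t in preprocess(text) if t in vocabulary]  # ignore unseen words
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def precision_recall_f1(gold, predicted, positive):
    """Step 5: precision, recall, and F1-score for the `positive` class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labelled data, already split into training and validation sets.
training = [
    ("Win a free prize now", "spam"),
    ("Claim your free money today", "spam"),
    ("Free offer, click now", "spam"),
    ("Meeting at noon tomorrow", "ham"),
    ("Can you review the report today", "ham"),
    ("Lunch with the project team tomorrow", "ham"),
]
validation = [("Free prize money", "spam"), ("Project meeting tomorrow", "ham")]

model = train_naive_bayes(training)
gold = [label for _, label in validation]
predicted = [predict(model, text) for text, _ in validation]
metrics = precision_recall_f1(gold, predicted, positive="spam")
```

With only eight sentences the scores say nothing about real-world accuracy; the point is the shape of the pipeline, in which each function maps to one stage of the list above.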

GENERIC NLP SYSTEM

A natural language processing system starts with some input and ends with effective and accurate output. The
possible inputs to an NLP system are quite broad. Language can come in a variety of forms, such as paragraphs of text
or commands typed directly into a computer system. The input might be given to the system one sentence at a time or
as multiple sentences all at once. The input to a natural language processor can be typed input, message text or
speech. NLP systems have some kind of preprocessor. Data preprocessing involves preparing and cleaning text data
so that machines can analyze it. Preprocessing puts data into a workable form and highlights features in the text that
an algorithm can work with. Preprocessing performs dictionary lookup, morphological analysis, lexical substitutions,
and part-of-speech assignment.

The system can generate a variety of outputs. The output from a system that incorporates an NLP component might be
an answer from a database, a command to change some data in a database, a spoken response, the part of speech,
morphology or semantics of a word, or some other action on the part of the system. Remember that these are the
outputs of the system as a whole, not the output of the NLP component of the system.

Generic NLP System:


01. Typed Input/Speech: Input can be typed text or spoken language.
02. Speech Recognizer: If the input is speech, it is converted to text.
03. Preprocessor: Prepares the input text for processing, which may include tasks like tokenization,
normalization, etc.
04. Natural Language Processing: Involves various NLP techniques to understand and derive meaning from the
input text.
05. Meaning: The result of the NLP process, representing the understood meaning of the input.
06. Output Prep: Prepares the output in the required form (e.g., text, speech).
07. Answer Output: The system provides a response.
08. Database Update: If necessary, updates any relevant database with new information derived from the input.
09. Spoken Response: Converts the response back to speech if required.
10. Other: Represents any other actions the system might take based on the input.

[Figure: Generic NLP system. Pipeline view of the components of a generic NLP system]


SAMPLE QUESTIONS

1. What is NLP? Discuss the different stages in NLP. (5/10)
2. What are the different types of ambiguity present in NLP? How can they be removed? (5/10)
3. Explain generic NLP system with a diagram. (5)
4. State and explain the different applications and challenges in NLP. (5/10)
5. Explain Components of NLP. (2)
6. For each sentence, identify whether the different meanings arise from structural ambiguity, semantic ambiguity or
pragmatic ambiguity. (a) Time flies like an arrow (b) He crushed the key to my heart. (3)
7. Describe augmented grammar in syntactic analysis. (3)
8. Identify the morphological type (Noun Phrase, Verb Phrase, Adjective Phrase) of the following sentence segments:
(a) important to Bill (b) looked up the tree. (3)
9. Distinguish between semantics, pragmatics and discourse. (5)
