
Artificial Intelligence

Unit-4
Natural Language Processing (NLP)

What is Natural Language Processing (NLP)?


Natural Language Processing (NLP) is a field of computer science and artificial intelligence (AI) concerned
with the interaction between computers and human language. Its core objective is to enable computers
to understand, analyze, and generate human language in a way that is similar to how humans do. This
includes tasks like:

 Understanding the meaning: Being able to extract the meaning from text, speech, or other
forms of human language.

 Analyzing structure: Recognizing the grammatical structure and syntax of language, including
parts of speech and sentence construction.

 Generating human-like language: Creating text or speech that is natural, coherent, and
grammatically correct.

Ultimately, NLP aims to bridge the gap between human communication and machine comprehension,
fostering seamless interaction between us and technology.

History of Natural Language Processing (NLP)


The history of Natural Language Processing (NLP) can be divided into three phases, as follows:

The Dawn of NLP (1950s-1970s)

In the 1950s, the dream of effortless communication across languages fueled the birth of NLP. Machine
translation (MT) was the driving force, and rule-based systems emerged as the initial approach.

How Rule-Based Systems Worked:

These systems functioned like complex translation dictionaries on steroids. Linguists meticulously
crafted a massive set of rules that captured the grammatical structure (syntax) and vocabulary of
specific languages.

Imagine the rules as a recipe for translation. Here's a simplified breakdown:

1. Sentence Breakdown: The system would first analyze the source language sentence and break it
down into its parts of speech (nouns, verbs, adjectives, etc.).
2. Matching Rules: Each word or phrase would be matched against the rule base to find its
equivalent in the target language, considering grammatical roles and sentence structure.

3. Rearrangement: Finally, the system would use the rules to rearrange the translated words and
phrases to form a grammatically correct sentence in the target language.

Limitations of Rule-Based Systems:

While offering a foundation for MT, this approach had several limitations:

 Inflexibility: Languages are full of nuances and exceptions. Rule-based systems struggled to
handle idioms, slang, and variations in sentence structure. A slight deviation from the expected
format could throw the entire translation off.

 Scalability Issues: Creating and maintaining a vast rule base for every language pair was a time-
consuming and laborious task. Imagine the immense effort required for just a handful of
languages!

 Limited Scope: These systems primarily focused on syntax and vocabulary, often failing to
capture the deeper meaning and context of the text. This resulted in translations that sounded
grammatically correct but unnatural or even nonsensical.

Despite these limitations, rule-based systems laid the groundwork for future NLP advancements. They
demonstrated the potential for computers to understand and manipulate human language, paving the
way for more sophisticated approaches that would emerge later.

The Statistical Revolution (1980s-1990s)

 A Shift Towards Statistics: The 1980s saw a paradigm shift towards statistical NLP approaches.
Machine learning algorithms emerged as powerful tools for NLP tasks.

 The Power of Data: Large collections of text data (corpora) became crucial for training these
statistical models.

 Learning from Patterns: Unlike rule-based systems, statistical models learn patterns from data,
allowing them to handle variations and complexities of natural language.

The Deep Learning Era (2000s-Present)

 The Deep Learning Revolution: The 2000s ushered in the era of deep learning, significantly
impacting NLP.

 Artificial Neural Networks (ANNs): These complex algorithms, inspired by the human brain,
became the foundation of deep learning advancements in NLP.

 Advanced Architectures: Deep learning architectures like recurrent neural networks and
transformers further enhanced NLP capabilities.
The Advent of Rule-Based Systems

The 1960s and 1970s witnessed the emergence of rule-based systems in the realm of NLP. Collaborations
between linguists and computer scientists led to the development of systems that relied on predefined
rules to analyze and understand human language.

The aim was to codify linguistic knowledge, including syntax and grammar, into algorithms that could be
executed by computers to process and generate human-like text.

During this period, the General Problem Solver (GPS), developed by Allen Newell and Herbert A. Simon in
1957, gained prominence. Although GPS was not explicitly designed for language processing, it
demonstrated the potential of rule-based systems by showing how computers could solve problems using
predefined rules and heuristics.

Challenges of Natural Language Processing (NLP)


Natural Language Processing (NLP) is a powerful tool in artificial intelligence that enables computers to
understand, interpret, and generate human-readable text. Despite its potential, NLP faces several
significant challenges due to the complexity and diversity of human language.

1. Language Differences
Human language is rich and intricate, with thousands of languages spoken worldwide, each
having its own grammar, vocabulary, and cultural nuances. This diversity creates ambiguity, as
the same words and phrases can have different meanings in different contexts. Additionally, the
complex syntactic structures and grammatical rules of natural languages add to the difficulty.
2. Training Data
High-quality annotated data is essential for training NLP models. Collecting, annotating, and
preprocessing large text datasets can be time-consuming and resource-intensive, especially for
tasks requiring specialized domain knowledge.
3. Development Time and Resource Requirements
The development time and resources required for NLP projects depend on various factors,
including task complexity, data quality, and the availability of existing tools and libraries.
Complex tasks like machine translation or question answering require more time and
computational resources compared to simpler tasks like text classification.
4. Phrasing Ambiguities
Phrasing ambiguities arise when a phrase can be interpreted in multiple ways, leading to
uncertainty in understanding the meaning. Contextual understanding, semantic analysis,
syntactic analysis, and pragmatic analysis are crucial for resolving these ambiguities.
5. Misspellings and Grammatical Errors
Misspellings and grammatical errors introduce linguistic noise that can impact the accuracy of
NLP models. Techniques like spell checking, text normalization, tokenization, and language
models trained on large corpora can help address these issues.
6. Mitigating Innate Biases
Ensuring fairness, equity, and inclusivity in NLP applications requires mitigating innate biases in
algorithms. This involves collecting diverse and representative training data, applying bias
detection methods, and evaluating models for fairness.
7. Words with Multiple Meanings
Words with multiple meanings, known as polysemous or homonymous words, pose a lexical
challenge in NLP. Semantic analysis, domain-specific knowledge, multi-word expressions, and
knowledge graphs can help disambiguate these words.
8. Addressing Multilingualism
Handling text data in multiple languages is crucial for NLP systems. Techniques like multilingual
corpora, cross-lingual transfer learning, language identification, and machine translation are
essential for addressing language diversity.
9. Reducing Uncertainty and False Positives
Reducing uncertainty and false positives in NLP models improves their accuracy and reliability.
Probabilistic models, confidence scores, threshold tuning, and ensemble methods can help
achieve this.
10. Facilitating Continuous Conversations
Building NLP models that can maintain context throughout a conversation is essential for
seamless interaction between users and machines. Real-time processing, context
understanding, and intent recognition are key components for facilitating continuous
conversations.
11. Overcoming NLP Challenges
Overcoming these challenges requires a combination of innovative technologies, domain
expertise, and methodological approaches. Techniques like data augmentation, transfer
learning, and pre-training can help address data scarcity and improve model performance.

NLP Techniques
NLP encompasses a wide array of techniques aimed at enabling computers to process and
understand human language. These tasks can be categorized into several broad areas, each addressing
different aspects of language processing. Here are some of the key NLP techniques:

1. Text Processing and Preprocessing

 Tokenization: Dividing text into smaller units, such as words or sentences.

 Stemming and Lemmatization: Reducing words to their base or root forms.

 Stopword Removal: Removing common words (like “and”, “the”, “is”) that may not carry
significant meaning.

 Text Normalization: Standardizing text, including case normalization, removing punctuation and
correcting spelling errors.

2. Syntax and Parsing


 Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence (e.g., noun,
verb, adjective).

 Dependency Parsing: Analyzing the grammatical structure of a sentence to identify relationships
between words.

 Constituency Parsing: Breaking down a sentence into its constituent parts or phrases (e.g., noun
phrases, verb phrases).
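
As a brief illustration of POS tagging and dependency parsing, the sketch below uses the spaCy library; it is a minimal example that assumes spaCy and its small English model en_core_web_sm are installed, and the sample sentence is invented:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    # token.pos_ is the coarse part of speech (NOUN, VERB, ...),
    # token.dep_ is the dependency relation linking the token to its head word
    print(f"{token.text:<6} {token.pos_:<6} {token.dep_:<10} head={token.head.text}")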

3. Semantic Analysis

 Named Entity Recognition (NER): Identifying and classifying entities in text, such as names of
people, organizations, locations, dates, etc.

 Word Sense Disambiguation (WSD): Determining which meaning of a word is used in a given
context.

 Coreference Resolution: Identifying when different words refer to the same entity in a text (e.g.,
“he” refers to “John”).
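
As an example of word sense disambiguation, NLTK ships the classic Lesk algorithm (nltk.wsd.lesk), which picks the WordNet sense whose dictionary gloss overlaps most with the surrounding context. The sketch below is a rough illustration; it assumes NLTK with its punkt and wordnet resources is installed, the two sentences are invented, and Lesk's choices are heuristic rather than perfect:

from nltk import word_tokenize
from nltk.wsd import lesk

sent1 = word_tokenize("I went to the bank to deposit my money")
sent2 = word_tokenize("The fisherman sat on the bank of the river")

# Each call returns a WordNet Synset chosen by gloss overlap with the context
print(lesk(sent1, "bank"))  # typically a financial-institution sense
print(lesk(sent2, "bank"))  # may resolve to a sloping-land/river sense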

4. Information Extraction

 Entity Extraction: Identifying specific entities and their relationships within the text.

 Relation Extraction: Identifying and categorizing the relationships between entities in a text.

5. Text Classification in NLP

 Sentiment Analysis: Determining the sentiment or emotional tone expressed in a text (e.g.,
positive, negative, neutral).

 Topic Modeling: Identifying topics or themes within a large collection of documents.

 Spam Detection: Classifying text as spam or not spam.
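
As a small illustration of sentiment analysis, the sketch below uses NLTK's rule-based VADER analyzer; it assumes NLTK is installed along with the vader_lexicon resource, and the two reviews are invented:

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for review in ["The battery life is fantastic!", "Terrible service, never again."]:
    scores = sia.polarity_scores(review)          # neg / neu / pos / compound scores
    label = "positive" if scores["compound"] > 0 else "negative"
    print(review, "->", label, scores)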

6. Language Generation

 Machine Translation: Translating text from one language to another.

 Text Summarization: Producing a concise summary of a larger text.

 Text Generation: Automatically generating coherent and contextually relevant text.

7. Speech Processing

 Speech Recognition: Converting spoken language into text.

 Text-to-Speech (TTS) Synthesis: Converting written text into spoken language.

8. Question Answering
 Retrieval-Based QA: Finding and returning the most relevant text passage in response to a
query.

 Generative QA: Generating an answer based on the information available in a text corpus.

9. Dialogue Systems

 Chatbots and Virtual Assistants: Enabling systems to engage in conversations with users,
providing responses and performing tasks based on user input.

10. Sentiment and Emotion Analysis in NLP

 Emotion Detection: Identifying and categorizing emotions expressed in text.

 Opinion Mining: Analyzing opinions or reviews to understand public sentiment toward products,
services or topics.

Natural Language Processing (NLP) Working/Steps:

Working in natural language processing (NLP) typically involves using computational techniques to
analyze and understand human language. This can include tasks such as language understanding,
language generation and language interaction.
1. Text Input and Data Collection
 Data Collection: Gathering text data from various sources such as websites, books, social media
or proprietary databases.
 Data Storage: Storing the collected text data in a structured format, such as a database or a
collection of documents.
2. Text Preprocessing
Preprocessing is crucial to clean and prepare the raw text data for analysis. Common preprocessing
steps include:
 Tokenization: Splitting text into smaller units like words or sentences.
 Lowercasing: Converting all text to lowercase to ensure uniformity.
 Stopword Removal: Removing common words that do not contribute significant meaning, such
as “and,” “the,” “is.”
 Punctuation Removal: Removing punctuation marks.
 Stemming and Lemmatization: Reducing words to their base or root forms. Stemming cuts off
suffixes, while lemmatization considers the context and converts words to their meaningful base
form.
 Text Normalization: Standardizing text format, including correcting spelling errors, expanding
contractions and handling special characters.
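
As an illustration, these preprocessing steps can be chained together with a library such as NLTK. The snippet below is a minimal sketch: the sample sentence is invented, and it assumes NLTK is installed along with its punkt, stopwords and wordnet resources:

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The striped bats were hanging on their feet and eating best bananas."

tokens = nltk.word_tokenize(text.lower())                    # tokenization + lowercasing
tokens = [t for t in tokens if t not in string.punctuation]  # punctuation removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]          # stopword removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                     # crude suffix stripping
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])    # context-aware base forms (verbs)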
3. Text Representation
 Bag of Words (BoW): Representing text as a collection of words, ignoring grammar and word
order but keeping track of word frequency.
 Term Frequency-Inverse Document Frequency (TF-IDF): A statistic that reflects the importance
of a word in a document relative to a collection of documents.
 Word Embeddings: Using dense vector representations of words where semantically similar
words are closer together in the vector space (e.g., Word2Vec, GloVe).
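
In its simplest form, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many of them contain term t (libraries usually apply a smoothed, normalized variant). The sketch below builds BoW and TF-IDF matrices with scikit-learn; it assumes a recent scikit-learn installation, and the three toy documents are invented:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the dog chased the cat",
        "the cat sat on the mat",
        "dogs and cats are pets"]

bow = CountVectorizer()
X_counts = bow.fit_transform(docs)       # document-term count matrix (Bag of Words)
print(bow.get_feature_names_out())       # learned vocabulary
print(X_counts.toarray())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)      # counts re-weighted by inverse document frequency
print(X_tfidf.shape)                     # (3 documents, vocabulary-size features)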
4. Feature Extraction
Extracting meaningful features from the text data that can be used for various NLP tasks.
 N-grams: Capturing sequences of N words to preserve some context and word order.
 Syntactic Features: Using parts of speech tags, syntactic dependencies and parse trees.
 Semantic Features: Leveraging word embeddings and other representations to capture word
meaning and context.
5. Model Selection and Training
Selecting and training a machine learning or deep learning model to perform specific NLP tasks.
 Supervised Learning: Using labeled data to train models like Support Vector Machines (SVM),
Random Forests or deep learning models like Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs).
 Unsupervised Learning: Applying techniques like clustering or topic modeling (e.g., Latent
Dirichlet Allocation) on unlabeled data.
 Pre-trained Models: Utilizing pre-trained language models such as BERT, GPT or transformer-
based models that have been trained on large corpora.
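
To make the supervised case concrete, the sketch below trains a tiny spam classifier with scikit-learn using TF-IDF features and Multinomial Naive Bayes. It is only a sketch: the six labelled messages are invented, and a real system would need far more data and a proper train/test split:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "cheap loans click here", "limited offer win cash",
         "meeting at 10am tomorrow", "please review the attached report", "lunch with the team today"]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Vectorizer and classifier chained into one model: fit() learns both steps
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free cash offer", "see you at the meeting"]))
# expected to print something like: ['spam' 'ham']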
6. Model Deployment and Inference
Deploying the trained model and using it to make predictions or extract insights from new text data.
 Text Classification: Categorizing text into predefined classes (e.g., spam detection, sentiment
analysis).
 Named Entity Recognition (NER): Identifying and classifying entities in the text.
 Machine Translation: Translating text from one language to another.
 Question Answering: Providing answers to questions based on the context provided by text
data.
7. Evaluation and Optimization
Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision, recall, F1-
score and others.
 Hyperparameter Tuning: Adjusting model parameters to improve performance.
 Error Analysis: Analyzing errors to understand model weaknesses and improve robustness.
Technologies related to Natural Language Processing
There are a variety of technologies related to natural language processing (NLP) that are used to analyze
and understand human language. Some of the most common include:
1. Machine learning: NLP relies heavily on machine learning techniques such as supervised and
unsupervised learning, deep learning and reinforcement learning to train models to understand
and generate human language.

2. Natural Language Toolkit (NLTK) and other libraries: NLTK is a popular open-source library in
Python that provides tools for NLP tasks such as tokenization, stemming and part-of-speech
tagging. Other popular libraries include spaCy, OpenNLP and CoreNLP.

3. Parsers: Parsers are used to analyze the syntactic structure of sentences, such as dependency
parsing and constituency parsing.

4. Text-to-Speech (TTS) and Speech-to-Text (STT) systems: TTS systems convert written text into
spoken words, while STT systems convert spoken words into written text.

5. Named Entity Recognition (NER) systems: NER systems identify and extract named entities such
as people, places and organizations from the text.

6. Sentiment Analysis: A technique to understand the emotions or opinions expressed in a piece of
text, using approaches such as lexicon-based, machine learning-based and deep learning-based methods.

7. Machine Translation: NLP is used for language translation from one language to another
through a computer.

8. Chatbots: NLP is used for chatbots that communicate with other chatbots or humans through
auditory or textual methods.

9. AI Software: NLP is used in question-answering software for knowledge representation,
analytical reasoning as well as information retrieval.

Applications of Natural Language Processing (NLP)


 Spam Filters: One of the most irritating things about email is spam. Gmail uses natural language
processing (NLP) to discern which emails are legitimate and which are spam. These spam filters
look at the text in all the emails you receive and try to figure out what it means to see if it’s
spam or not.

 Algorithmic Trading: Algorithmic trading is used for predicting stock market conditions. Using
NLP, this technology examines news headlines about companies and stocks and attempts to
comprehend their meaning in order to determine if you should buy, sell or hold certain stocks.

 Question Answering: NLP can be seen in action by using Google Search or Siri Services. A major
use of NLP is to make search engines understand the meaning of what we are asking and
generate natural language in return to give us the answers.

 Summarizing Information: On the internet, there is a lot of information and a lot of it comes in
the form of long documents or articles. NLP is used to decipher the meaning of the data and
then provides shorter summaries of the data so that humans can comprehend it more
quickly.

Future Scope
NLP is shaping the future of technology in several ways:

 Chatbots and Virtual Assistants: NLP enables chatbots to quickly understand and respond to
user queries, providing 24/7 assistance across text or voice interactions.

 Invisible User Interfaces (UI): With NLP, devices like Amazon Echo allow for seamless
communication through voice or text, making technology more accessible without traditional
interfaces.

 Smarter Search: NLP is improving search by allowing users to ask questions in natural language,
as seen with Google Drive’s recent update, making it easier to find documents.

 Multilingual NLP: Expanding NLP to support more languages, including regional and minority
languages, broadens accessibility.

Future Enhancements: NLP is evolving with the use of Deep Neural Networks (DNNs) to make human-
machine interactions more natural. Future advancements include improved semantics for word
understanding and broader language support, enabling accurate translations and better NLP models for
languages not yet supported.

Goals of NLP:
1. Understanding Human Language

 Goal: Make machines understand the meaning and structure of human language (both written
and spoken).

 Why it matters: This is the foundation for everything else—whether it’s translating text or
answering a question, the machine has to “get” what we’re saying.

2. Natural Language Generation (NLG)

 Goal: Enable machines to produce human-like language in response to input.

 Examples: Chatbots, content generation, text summarization, report writing.

3. Language Translation

 Goal: Automatically translate from one language to another while preserving meaning and tone.

 Example: Google Translate, multilingual customer support bots.

4. Information Extraction
 Goal: Pull out relevant data from large amounts of text.

 Example: Extracting names, places, dates, and relationships from news articles or legal
documents.

5. Information Retrieval

 Goal: Help find relevant information from large text collections (like search engines).

 Example: Ranking relevant documents in response to a user query.

6. Sentiment Analysis

 Goal: Determine the emotional tone or opinion expressed in text.

 Example: Analyzing tweets or reviews to understand public sentiment about a product or event.

7. Speech Recognition & Understanding

 Goal: Convert spoken language into text, and then understand it.

 Example: Voice assistants like Siri, Alexa, or Google Assistant.

8. Text Classification

 Goal: Automatically assign categories or labels to text.

 Example: Spam detection in emails, topic tagging in news articles.

9. Summarization

 Goal: Generate concise summaries of larger texts without losing key information.

 Example: News digests, executive summaries of reports.

10. Question Answering

 Goal: Let machines answer questions using natural language.

 Example: ChatGPT 😉, search engine snippets, customer support bots.

Discourse Knowledge in NLP


Definition:

Discourse knowledge in NLP is the understanding of how sentences connect and make sense together in
a larger context — such as conversations, documents, or multi-turn dialogues.

NLP Focus:
 Understand how ideas flow across sentences.

 Identify reference chains, topic continuity, and coherence.

Used in NLP tasks like:

 Coreference Resolution: Identifying when two expressions refer to the same thing.
E.g., "Lisa went home. She was tired." — "She" = "Lisa".

 Dialogue Systems/Chatbots: Maintaining context in multi-turn conversations.
E.g., remembering earlier user inputs in a session.

 Summarization: Identifying main ideas across multiple sentences.

 Question Answering: Using context beyond just one sentence to generate answers.

Example:

User: “I watched Interstellar yesterday. It was amazing.”

 The system must know that “it” refers to Interstellar, and understand that both sentences are
related.

Pragmatic Knowledge in NLP

Definition:

Pragmatic knowledge involves understanding what is meant, not just what is said — taking into account
context, tone, real-world knowledge, and speaker intention.

NLP Focus:

 Interpret indirect meaning, sarcasm, politeness, requests, and implicature.

 Understand emotion, tone, and social cues.

Used in NLP tasks like:

 Sentiment Analysis: Detecting true emotions or opinions behind words.

 Intent Recognition: Understanding the user's real purpose behind a statement (important in
virtual assistants).

 Chatbots/Conversational AI: Responding appropriately to indirect requests or implied meanings.

 Sarcasm Detection: Recognizing non-literal or ironic language.

Example:
User says: “Yeah, right, this app totally works…”

 Literal words are positive, but pragmatically (via sarcasm), the sentiment is negative.

Languages and grammars


Languages and grammars are among the most important concepts in this area. Grammars are fundamental
to both human languages and computer languages.

The Language in Formal Language Theory

A language can be defined as a set of strings over a given alphabet. This definition is true for human
languages as well as computer languages. There are several components –

 Alphabet − A finite set of symbols

 String − A finite sequence of symbols from the alphabet

 Language − A (possibly infinite) set of strings over an alphabet

The Concept of Formal Languages

In computer science, formal languages provide a rigorous framework for studying the properties and
structures of languages. They are essential for defining the syntax of programming languages and for
analyzing the capabilities of different computational models.

Alphabets and strings are important components of formal languages. Let us discuss them briefly:

 Alphabet (Σ) − A non-empty finite set of symbols

 String − A finite sequence of symbols from Σ

 Empty string (ε) − The string containing no symbols (this is important in formal languages)

 Length of a string − The number of symbols in the string

Formal languages can be classified into four types: regular, context-free, context-sensitive and
recursively enumerable languages. But knowledge of languages alone is not enough; we also need to
understand grammars.
Grammars in Automata Theory
In automata theory, grammars are formal systems for describing the structure of languages. A grammar
consists of a set of rules for generating the valid strings of a language.

Formally, we can define grammar like this. A grammar is a tuple G = (V, Σ, R, S), where:

 V is a finite set of variables (non-terminal symbols)

 Σ is a finite set of terminal symbols (the alphabet)

 R is a finite set of production rules

 S is the start symbol (S ∈ V)

A grammar is used to generate all valid strings in a language; it also provides a structural
description of the language and serves as a basis for parsing and syntax analysis. The following
table shows the different components of a grammar.

Component | Description | Example
Variables | Non-terminal symbols | A, B, C
Terminals | Symbols in the alphabet | a, b, c, 0, 1
Production rules | Rules for string generation | A → aB, B → bC
Start symbol | Initial variable for derivations | S

While we talk about grammars, it is necessary to understand two important concepts related to
grammars –

 Derivation − A sequence of rule applications that transform the start symbol into a string of
terminal symbols

 Parse tree − A graphical representation of a derivation, showing the hierarchical structure of the
generated string

Let us understand them through an example. Consider the production rules {S → aB, B → bC, C → c} with
start symbol S. The string "abc" is derived as follows:

S ⇒ aB ⇒ abC ⇒ abc

Each step applies one production rule, and the parse tree for this derivation records the resulting
hierarchical structure.

Applications of Languages and Grammars


The study of languages and grammars has many practical applications in computer science and
linguistics.

Programming Languages:
 Syntax − Formal grammars specify the structure of programming languages
 Parser generation − Grammars are used to automatically generate parsers for compilers
 Code analysis − Static analysis tools use grammars to understand program structure

Natural Language Processing:
 Syntactic parsing − Grammars model the structure of human languages
 Machine translation − Formal language theory underpins translation algorithms
 Speech recognition − Language models based on grammars improve accuracy

Compiler Design:
 Lexical analysis − Regular expressions (Type 3) define tokens
 Syntax analysis − Context-free grammars (Type 2) define language syntax
 Semantic analysis − Attribute grammars extend CFGs for semantic checks

What is Grammar in Computation?


In the literary sense of the term, grammars denote syntactical rules for conversation in natural
languages. Linguists have attempted to define grammars since the inception of natural languages
like English, Sanskrit, Mandarin, etc.

The theory of formal languages finds its applicability extensively in the fields of Computer Science.
Noam Chomsky gave a mathematical model of grammar in 1956 which is effective for writing
computer languages.

Representation of Grammar
A grammar G can be formally written as a 4-tuple (N, T, S, P) where −

 N or VN is a set of variables or non-terminal symbols.

 T or ∑ is a set of Terminal symbols.

 S is a special variable called the Start symbol, S ∈ N

 P is a set of Production rules for Terminals and Non-terminals. A production rule has the form α → β,
where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.

Example 1

Grammar G1 −

({S, A, B}, {a, b}, S, {S → AB, A → a, B → b})

Here,

 S, A, and B are Non-terminal symbols;

 a and b are Terminal symbols

 S is the Start symbol, S ∈ N

 Productions, P : S → AB, A → a, B → b

Example 2

Grammar G2 −

({S, A}, {a, b}, S, {S → aAb, aA → aaAb, A → ε})

Here,

 S and A are Non-terminal symbols.

 a and b are Terminal symbols.

 ε is an empty string.

 S is the Start symbol, S ∈ N

 Production P : S → aAb, aA → aaAb, A → ε

Basic Elements of Grammar

Grammar is composed of two basic elements


Terminal Symbols - Terminal symbols are the components of the sentences that are generated using
grammar and are denoted using small case letters like a, b, c etc.

Non-Terminal Symbols - Non-Terminal Symbols take part in the generation of the sentence but are
not the component of the sentence. These types of symbols are also called Auxiliary Symbols and
Variables. They are represented using a capital letter like A, B, C, etc.

Example 1

Consider a grammar

G = (V, T, P, S)

Where,

 V = { S , A , B } Non-Terminal symbols

 T = { a , b } Terminal symbols

 P = { S → ABa , A → BB , B → ab , AA → b } Production rules

 S = { S } Start symbol

Example 2

Consider a grammar

G = (V, T, P, S)

Where,

 V = {S, A, B} non terminal symbols

 T = { 0,1} terminal symbols

 P = { S → A1B, A → 0A | ε, B → 0B | 1B | ε } Production rules

 S = {S} start symbol.


Types of grammar

The different types of grammar −

Grammar | Language | Automata | Production rules
Type-0 | Recursively enumerable | Turing machine | No restriction
Type-1 | Context-sensitive | Linear-bounded non-deterministic machine | αAβ → αγβ
Type-2 | Context-free | Non-deterministic push-down automata | A → γ
Type-3 | Regular | Finite state automata | A → αB, A → α

Derivations from a Grammar
Strings may be derived from other strings using the productions in a grammar. If a grammar G has a
production α → β, we can say that x α y derives x β y in G. This derivation is written as −

x α y ⇒G x β y

Example

Let us consider the grammar −

G2 = ({S, A}, {a, b}, S, {S → aAb, aA → aaAb, A → ε } )

Some of the strings that can be derived are −

S ⇒ aAb using production S → aAb

⇒ aaAbb using production aA → aaAb

⇒ aaaAbbb using production aA → aaAb

⇒ aaabbb using production A → ε

Language Generated by a Grammar


The set of all strings that can be derived from a grammar is said to be the language generated from that
grammar. A language generated by a grammar G is a subset formally defined by

L(G)={W|W ∈ ∑*, S ⇒G W}

If L(G1) = L(G2), the Grammar G1 is equivalent to the Grammar G2.

Example

If there is a grammar

G: N = {S, A, B} T = {a, b} P = {S → AB, A → a, B → b}

Here S produces AB, and we can replace A by a, and B by b. Here, the only accepted string is ab, i.e.,

L(G) = {ab}

Example

Suppose we have the following grammar −

G: N = {S, A, B} T = {a, b} P = {S → AB, A → aA|a, B → bB|b}

The language generated by this grammar −

L(G) = {ab, a^2b, ab^2, a^2b^2, …}

= {a^m b^n | m ≥ 1 and n ≥ 1}

Construction of a Grammar Generating a Language


We'll consider some languages and construct a grammar G that produces each of them.

Example

Problem − Suppose L(G) = {a^m b^n | m ≥ 0 and n > 0}. We have to find out the grammar G which
produces L(G).

Solution

Since L(G) = {a^m b^n | m ≥ 0 and n > 0}

the set of strings accepted can be rewritten as −

L(G) = {b, ab, bb, aab, abb, …}

Here, the start symbol has to take at least one b preceded by any number of a including null.

To accept the string set {b, ab, bb, aab, abb, …}, we have taken the productions −

S → aS , S → B, B → b and B → bB

S → B → b (Accepted)

S → B → bB → bb (Accepted)

S → aS → aB → ab (Accepted)

S → aS → aaS → aaB → aab(Accepted)

S → aS → aB → abB → abb (Accepted)

Thus, we can prove every single string in L(G) is accepted by the language generated by the production
set.

Hence the grammar −

G: ({S, A, B}, {a, b}, S, { S → aS | B , B → b | bB })

Example

Problem − Suppose L(G) = {a^m b^n | m > 0 and n ≥ 0}. We have to find out the grammar G which produces
L(G).

Solution −
Since L(G) = {a^m b^n | m > 0 and n ≥ 0}, the set of strings accepted can be rewritten as −

L(G) = {a, aa, ab, aaa, aab, abb, …}

Here, the start symbol has to take at least one a followed by any number of b including null.

To accept the string set {a, aa, ab, aaa, aab, abb, …}, we have taken the productions −

S → aA, A → aA, A → B, B → bB, B → λ

S → aA → aB → aλ → a (Accepted)

S → aA → aaA → aaB → aaλ → aa (Accepted)

S → aA → aB → abB → abλ → ab (Accepted)

S → aA → aaA → aaaA → aaaB → aaaλ → aaa (Accepted)

S → aA → aaA → aaB → aabB → aabλ → aab (Accepted)

S → aA → aB → abB → abbB → abbλ → abb (Accepted)

Thus, we can prove every single string in L(G) is accepted by the language generated by the production
set.

Hence the grammar −

G: ({S, A, B}, {a, b}, S, {S → aA, A → aA | B, B → λ | bB })

Chomsky Classification of Grammars


According to Noam Chomsky, there are four types of grammars − Type 0, Type 1, Type 2, and Type 3.
The following table shows how they differ from each other −

Grammar Type | Grammar Accepted | Language Accepted | Automaton
Type 0 | Unrestricted grammar | Recursively enumerable language | Turing Machine
Type 1 | Context-sensitive grammar | Context-sensitive language | Linear-bounded automaton
Type 2 | Context-free grammar | Context-free language | Pushdown automaton
Type 3 | Regular grammar | Regular language | Finite state automaton


Each type of grammar properly contains the next: every regular language is context-free, every
context-free language is context-sensitive, and every context-sensitive language is recursively
enumerable.

Type - 3 Grammar

Type-3 grammars generate regular languages. Type-3 grammars must have a single non-terminal on the
left-hand side and a right-hand side consisting of a single terminal or single terminal followed by a single
non-terminal.

The productions must be in the form X → a or X → aY

where X, Y ∈ N (Non terminal)

and a ∈ T (Terminal)

The rule S → ε is allowed if S does not appear on the right side of any rule.

Example

X→ε

X → a | aY

Y→b
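
Since the grammar above is regular, it generates the finite language {ε, a, ab} and has an equivalent regular expression. The small sketch below checks that correspondence in Python; the regex (ab?)? is our assumed equivalent, not part of the original example:

import re

pattern = re.compile(r"(ab?)?")   # assumed regex equivalent of X → ε | a | aY, Y → b
for s in ["", "a", "ab", "b", "aab"]:
    # fullmatch succeeds only for strings the grammar can derive
    print(repr(s), bool(pattern.fullmatch(s)))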

Type - 2 Grammar

Type-2 grammars generate context-free languages.

The productions must be in the form A → γ

where A ∈ N (Non-terminal)

and γ ∈ (T ∪ N)* (string of terminals and non-terminals).

The languages generated by these grammars are recognized by a non-deterministic pushdown automaton.

Example

S→Xa

X→a

X → aX

X → abc

X→ε

Type - 1 Grammar

Type-1 grammars generate context-sensitive languages. The productions must be in the form

αAβ → αγβ

where A ∈ N (Non-terminal)

and α, β, γ ∈ (T ∪ N)* (strings of terminals and non-terminals)

The strings α and β may be empty, but γ must be non-empty.

The rule S → ε is allowed if S does not appear on the right side of any rule. The languages generated by
these grammars are recognized by a linear bounded automaton.
Example

AB → AbBc

A → bcA

B→b

Type - 0 Grammar

Type-0 grammars generate recursively enumerable languages. The productions have no restrictions.
They are phrase structure grammars without any constraints and include all formal grammars.

They generate the languages that are recognized by a Turing machine.

The productions can be in the form of α → β where α is a string of terminals and nonterminals with at
least one non-terminal and α cannot be null. β is a string of terminals and non-terminals.

Example

S → ACaB

Bc → acB

CB → DB

aD → Db

Context-Sensitive Languages
Context-sensitive languages (CSLs) bridge the gap between the well-studied context-free languages and
the more complex recursively enumerable languages. In this chapter, we will cover the concepts of
context-sensitive languages, exploring their definitions, properties, and relationships to other language
classes in the Chomsky hierarchy.

What is Context-Free Grammar?

A context-free language is generated by a context-free grammar (CFG). In a CFG, production rules have
the form: A → X, Where −

 A is a variable (non-terminal)

 X is any string of terminals or variables

Properties of Context-Free Languages

The key characteristic of CFLs is that the replacement of A with X is independent of the surrounding
context. This property gives CFLs their name: they are "free" of context constraints.
CFLs correspond to pushdown automata (PDAs) in the Chomsky hierarchy, which are more powerful
than finite automata but less powerful than linear-bounded automata.

Context-Sensitive Languages: More Powerful CFLs

The context-sensitive languages extend the concept of CFLs by allowing production rules to depend on
the context in which variables appear. This seemingly small change leads to a significant increase in
expressive power. Let us understand CSG in greater detail.

What is Context-Sensitive Grammars (CSGs)?

A context-sensitive grammar has production rules of the form: αAβ → αXβ, where −

 α, β are strings of terminals and/or variables (can be empty)

 A is a variable

 X is a non-empty string of terminals or variables

Properties of Context-Sensitive Languages

Listed below are some of the important properties of context-sensitive languages −

 Context Preservation − The production process maintains the same context (α and β) on both
sides, ensuring that the replacement of A with X only occurs within the defined context.

 Non-Contracting − The string X cannot be empty, so no derivation step reduces the length of the
string. However, the start variable S can generate the empty string if it is part of the language.

 Increased Expressive Power − CSLs can describe patterns that CFLs cannot, such as matching
multiple repeated substrings.

The Chomsky Hierarchy and CSLs

Context-sensitive languages are located at the third level of the Chomsky hierarchy, between context-
free and recursively enumerable languages.

The relationship between different classes of languages is as follows −

 Regular Languages ⊂ Context-Free Languages ⊂ Context-Sensitive Languages ⊂ Recursively


Enumerable Languages.

 CSLs have the added power to describe patterns that CFLs cannot.

Now let us see CSL through an example.

Example of a Language That Is Context-Sensitive but Not Context-Free

A classic example of a language that is context-sensitive but not context-free is: L = {a^n b^n c^n | n ≥ 0}
The language is composed of strings with equal numbers of a's, b's, and c's, which cannot be generated
by context-free grammars due to their inability to count and ensure equal numbers of three different
symbols.

To illustrate the power of context-sensitive grammars, let's construct a CSG that generates the
language L = {a^n b^n c^n | n ≥ 0}.

The production rules are as follows −

 S → ε (to handle the case n = 0)

 S → S'

 S' → aS'BC

 S' → aBC

 CB → BC

 aB → ab

 bB → bb

 bC → bc

 cC → cc

This grammar works through a series of transformations −

 Rule 1 handles the empty string case (n = 0); the auxiliary variable S' keeps S off the right-hand
side of every other rule.

 Rules 3-4 generate one a on the left for every pair of variables B and C, so the numbers of a's,
B's and C's always match.

 Rule 5 "bubbles" the C's to the right past the B's, rearranging the variables into the order
B…BC…C.

 Rules 6-7 convert the B's into lowercase b's, but only from left to right starting next to the a's,
which prevents premature conversion and maintains the structure of the string.

 Rules 8-9 likewise convert the C's into lowercase c's once all of the b's to their left are in
place.

For example, for n = 2:

S ⇒ S' ⇒ aS'BC ⇒ aaBCBC ⇒ aaBBCC ⇒ aabBCC ⇒ aabbCC ⇒ aabbcC ⇒ aabbcc

The grammar maintains equal numbers of a's, b's, and c's while rearranging them into the correct
order. The context-sensitive nature of the rules allows for this precise control over the string's structure.
1. Chomsky Hierarchy of Grammars

This is a classification of formal grammars proposed by Noam Chomsky, used in both linguistics and
computer science.

Grammar Type | Power (Most → Least) | Description
Type 0: Unrestricted Grammar | Most powerful | No restrictions. Can describe any computable language.
Type 1: Context-Sensitive Grammar (CSG) | Powerful | Rules depend on the context of non-terminals.
Type 2: Context-Free Grammar (CFG) | Common in NLP | Rules are of the form A → α (non-terminal → string of symbols). Used in parsing.
Type 3: Regular Grammar | Least powerful | Can be represented by finite automata. Used in regex, tokenization, and lexical analysis.

In NLP:

 Most parsers use CFGs or probabilistic CFGs.

 Lower levels (like regular grammars) are used for tokenization, while higher levels help with
syntax and semantics.

2. Transformational Grammar (Chomsky)

What is it?

A theory of syntax that describes how deep structures (underlying meaning) can be transformed into
surface structures (actual sentences) via rules.

🔁Example:

Deep structure: "John eats an apple."


Passive transformation: "An apple is eaten by John."

In NLP:

 Important for parsing, machine translation, and syntax-based models.

 Influenced many modern grammar formalisms.


3. Case Grammar (Fillmore’s Grammar)

What is it?

Developed by Charles Fillmore, this grammar focuses on semantic roles (called cases) like agent, object,
instrument, etc., rather than just syntax.

Example:

Sentence: “Mary opened the door with a key.”

 Agent: Mary

 Object: the door

 Instrument: a key

In NLP:

 Used in Semantic Role Labeling (SRL) — identifying "who did what to whom with what".

 Useful for information extraction, question answering, etc.

4. Semantic Grammars

What is it?

A grammar where rules are based on semantic categories, not just syntactic ones. Words are grouped
based on meaning.

Example Rule:

Instead of:


VP → V NP

A semantic grammar might say:


Action → Buy Person Object

In NLP:
 Used in domain-specific systems (e.g., travel booking, customer support).

 Helps machines interpret user intents more directly.

5. Context-Free Grammar (CFG)

What is it?

A formal grammar where each rule replaces a single non-terminal with a string of terminals and/or non-
terminals.

Rule Format:


A→α

Where A is a non-terminal and α is a string of terminals/non-terminals.

In NLP:

 Core of many parsing algorithms (e.g., CKY, Earley).

 Used to build parse trees of sentences.

Example CFG:


S → NP VP

NP → Det N

VP → V NP

Det → "the" | "a"

N → "dog" | "cat"

V → "chased" | "saw"

Summary Table
Concept | Focus | Key Use in NLP
Chomsky Hierarchy | Classification of grammars | Formal language theory, parsing
Transformational Grammar | Deep vs surface structure | Syntax, translation
Case Grammar | Semantic roles (agent, object) | Semantic role labeling
Semantic Grammar | Rules based on meaning | Domain-specific NLP systems
Context-Free Grammar (CFG) | Syntax rules (non-terminal → symbols) | Parsing, syntax trees

Parsing – Introduction to Parsers


Parsing, also known as syntactic analysis, is the process of analyzing a sequence of tokens to determine
the grammatical structure of a program. It takes the stream of tokens, which are generated by a lexical
analyzer or tokenizer, and organizes them into a parse tree or syntax tree.

The parse tree visually represents how the tokens fit together according to the rules of the language’s
syntax. This tree structure is crucial for understanding the program’s structure and helps in the next
stages of processing, such as code generation or execution. Additionally, parsing ensures that the
sequence of tokens follows the syntactic rules of the programming language, making the program valid
and ready for further analysis or execution.

What is the Role of Parser?

A parser performs syntactic and semantic analysis of source code, converting it into an intermediate
representation while detecting and handling errors.
1. Context-free syntax analysis: The parser checks if the structure of the code follows the basic
rules of the programming language (like grammar rules). It looks at how words and symbols are
arranged.

2. Guides context-sensitive analysis: It helps with deeper checks that depend on the meaning of
the code, like making sure variables are used correctly. For example, it ensures that a variable
used in a mathematical operation, like x + 2, is a number and not text.

3. Constructs an intermediate representation: The parser creates a simpler version of your code
that’s easier for the computer to understand and work with.

4. Produces meaningful error messages: If there’s something wrong in your code, the parser tries
to explain the problem clearly so you can fix it.

5. Attempts error correction: Sometimes, the parser tries to fix small mistakes in your code so it
can keep working without breaking completely.

Types of Parsing

The parsing is divided into two types, which are as follows:

 Top-down Parsing

 Bottom-up Parsing

Top-Down Parsing
Top-down parsing is a method of building a parse tree from the start symbol (root) down to the leaves
(end symbols). The parser begins with the highest-level rule and works its way down, trying to match the
input string step by step.

 Process: The parser starts with the start symbol and looks for rules that can help it rewrite this
symbol. It keeps breaking down the symbols (non-terminals) into smaller parts until it matches
the input string.

 Leftmost Derivation: In top-down parsing, the parser always chooses the leftmost non-terminal
to expand first, following what is called leftmost derivation. This means the parser works on the
left side of the string before moving to the right.

 Other Names: Top-down parsing is sometimes called recursive parsing or predictive parsing. It is
called recursive because it often uses recursive functions to process the symbols.

Top-down parsing is useful for simple languages and is often easier to implement. However, it can have
trouble with more complex or ambiguous grammars.

Top-down parsers can be classified into two types based on whether they use backtracking or not:

1. Top-down Parsing with Backtracking

In this approach, the parser tries different possibilities when it encounters a choice. If one possibility
doesn't work (i.e., it doesn't match the input string), the parser backtracks to the previous decision point
and tries another possibility.

Example: If the parser chooses a rule to expand a non-terminal, and it doesn’t work, it will go back, undo
the choice, and try a different rule.

Advantage: It can handle grammars where there are multiple possible ways to expand a non-terminal.

Disadvantage: Backtracking can be slow and inefficient because the parser might have to try many
possibilities before finding the correct one.

2. Top-down Parsing without Backtracking

In this approach, the parser does not backtrack. It tries to find a match with the input using only the first
choice it makes. If it doesn't match the input, it fails immediately instead of going back to try another
option.

Example: The parser will always stick with its first decision and will not reconsider other rules once it
starts parsing.

Advantage: It is faster because it doesn’t waste time going back to previous steps.

Disadvantage: It can only handle simpler grammars that don’t require trying multiple choices.
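
To make the idea concrete, the sketch below is a minimal recursive-descent (top-down, no backtracking) parser for the toy grammar S → NP VP, NP → Det N, VP → V NP; the word lists are invented for illustration:

DET, NOUN, VERB = {"the", "a"}, {"dog", "cat"}, {"chased", "saw"}

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def expect(self, word_class, label):
        # consume one token if it belongs to the expected word class, otherwise fail
        if self.pos < len(self.tokens) and self.tokens[self.pos] in word_class:
            word = self.tokens[self.pos]
            self.pos += 1
            return (label, word)
        raise SyntaxError(f"expected {label} at position {self.pos}")

    def np(self):   # NP -> Det N
        return ("NP", self.expect(DET, "Det"), self.expect(NOUN, "N"))

    def vp(self):   # VP -> V NP
        return ("VP", self.expect(VERB, "V"), self.np())

    def s(self):    # S -> NP VP, then require that all input has been consumed
        tree = ("S", self.np(), self.vp())
        if self.pos != len(self.tokens):
            raise SyntaxError("trailing input")
        return tree

print(Parser("the dog chased a cat".split()).s())
# Because there is no backtracking, the first failed expectation aborts the parse.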



Bottom-Up Parsing

Bottom-up parsing is a method of building a parse tree starting from the leaf nodes (the input symbols)
and working towards the root node (the start symbol). The goal is to reduce the input string step by step
until we reach the start symbol, which represents the entire language.

 Process: The parser begins with the input symbols and looks for patterns that can be reduced to
non-terminals based on the grammar rules. It keeps reducing parts of the string until it forms
the start symbol.

 Rightmost Derivation in Reverse: In bottom-up parsing, the parser traces the rightmost
derivation of the string but works backwards, starting from the input string and moving towards
the start symbol.

 Shift-Reduce Parsing: Bottom-up parsers are often called shift-reduce parsers because they shift
(move symbols) and reduce (apply rules to replace symbols) to build the parse tree.

Bottom-up parsing is efficient for handling more complex grammars and is commonly used in compilers.
However, it can be more challenging to implement compared to top-down parsing.

Generally, bottom-up parsing is categorized into the following types:

1. LR parsing/Shift Reduce Parsing: Shift reduce Parsing is a process of parsing a string to obtain the
start symbol of the grammar.

 LR(0)

 SLR(1)

 LALR

 CLR

2. Operator Precedence Parsing: Parsing driven by an operator grammar is known as operator
precedence parsing. In an operator grammar there are no null productions, and no two non-terminals
are adjacent to each other.

Difference between Bottom-Up and Top-Down Parser

Feature | Top-down Parsing | Bottom-up Parsing
Direction | Builds tree from root to leaves. | Builds tree from leaves to root.
Derivation | Uses leftmost derivation. | Uses rightmost derivation in reverse.
Efficiency | Can be slower, especially with backtracking. | More efficient for complex grammars.
Example Parsers | Recursive descent, LL parser. | Shift-reduce, LR parser.

Augmented Transition Networks (ATNs)


Transition Network: Overview

A Transition Network is a graph-based structure used to represent possible sequences of words or
syntactic structures in a language. Each node represents a state, and each arc (edge) shows a transition
labeled with grammar rules or word classes (like noun, verb, etc.).

Think of it like a flowchart for parsing sentences.

Types of Transition Networks

Here are the main types, progressing from simple to more powerful:

1. Finite State Transition Network (FSTN)

 A basic model that uses finite automata to parse text.

 Can represent regular grammars.

 Works well for simple languages or token-level processing.

Used for: Tokenization, Part-of-Speech tagging, and simple pattern recognition.

Example:

Recognizing a noun phrase (NP):


Start → Det → Adj* → Noun → Accept


2. Recursive Transition Network (RTN)

 Extension of FSTNs.

 Allows recursion by having transitions that can call sub-networks (like functions in
programming).

 Can model context-free grammars.

Used for: Syntactic parsing — recognizing more complex sentence structures.

Example:

A sentence network might call a NounPhrase or VerbPhrase sub-network, and those can call others
recursively.

3. Augmented Transition Network (ATN)

 Most powerful form of transition networks.

 Like RTNs, but augmented with memory, registers, and conditions.

 Can include semantic actions, such as storing values or features during parsing.

Used for:

 Deep syntactic and semantic parsing

 Early NLP systems like ELIZA and SHRDLU used ATNs

Features:

 Tests and actions can be placed on arcs.

 Can simulate full natural language understanding with enough augmentation.

Comparison Table

Type | Power | Handles | Used In
FSTN | Regular Grammars | Simple, linear language patterns | Tokenization, POS tagging
RTN | Context-Free Grammars | Recursive sentence structures | Syntax parsing
ATN | Context-Sensitive (with augmentation) | Complex syntax + semantics | Advanced NLP parsing systems

Example: Sentence Parsing with ATN

A simplified ATN might process this sentence:


"The dog chased the cat."

1. Enter S (Sentence)

2. Call NP (Noun Phrase) → parses “The dog”

3. Call VP (Verb Phrase) → parses “chased the cat”

4. Store semantic roles:

o Agent = “dog”

o Action = “chased”

o Object = “cat”
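
Modern toolkits do not implement ATNs directly, but the register-filling step can be roughly approximated with a dependency parser. The sketch below uses spaCy (assuming the en_core_web_sm model is installed); mapping the nsubj and dobj relations to Agent and Object registers is our simplification, not the original ATN machinery:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog chased the cat.")

roles = {}
for token in doc:
    if token.dep_ == "nsubj":                           # grammatical subject ~ Agent register
        roles["Agent"] = token.text
    elif token.dep_ == "ROOT" and token.pos_ == "VERB": # main verb ~ Action register
        roles["Action"] = token.lemma_
    elif token.dep_ == "dobj":                          # direct object ~ Object register
        roles["Object"] = token.text

print(roles)   # expected: {'Agent': 'dog', 'Action': 'chase', 'Object': 'cat'}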

Case Studies: ELIZA System and LUNAR System


1. ELIZA System (1960s)

Created by: Joseph Weizenbaum at MIT

Purpose: Simulate a psychotherapist using pattern matching.

How it worked:

 Keyword-based pattern matching.

 No real understanding — just rearranged user input into pre-defined templates.

 Used scripts, like the famous DOCTOR script, which mimicked a Rogerian therapist.

Example Interaction:

User: "I'm feeling sad today."


ELIZA: "Why do you say you are feeling sad?"
(Just picks up on the word "sad" and turns the sentence around)

Techniques Used:

 Pattern-action rules (like regular expressions)

 Simple memory (context-free)

 Used a kind of Finite State Machine / Rule-based Transition Network
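
A minimal sketch of such pattern-action rules in Python follows; the two rules and the fallback response are invented to illustrate the mechanism and are not taken from the original DOCTOR script:

import re

RULES = [
    (re.compile(r"i'?m feeling (.*)", re.I), "Why do you say you are feeling {0}?"),
    (re.compile(r"i (?:want|need) (.*)", re.I), "What would it mean to you if you got {0}?"),
]
FALLBACK = "Please tell me more."

def respond(user_input):
    # try each pattern in order; the first match fills its response template
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return FALLBACK

print(respond("I'm feeling sad today."))   # -> Why do you say you are feeling sad today?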

Limitations:

 No actual understanding of semantics or meaning.

 Easily tricked with unexpected input.


 Gave the illusion of intelligence — a classic example of shallow NLP.

Impact:

 First "chatbot" — inspired tons of research in conversational agents.

 Showed how humans project intelligence onto machines with minimal cues (known as the ELIZA
effect).

2. LUNAR System (1970s)

Created by: William A. Woods

Purpose: Answer questions about moon rock samples from the Apollo missions.

How it worked:

 Aimed at natural language question answering in a restricted domain (moon geology).

 Parsed user queries, mapped them to formal database queries, and retrieved answers.

Example Query:

User: "What is the aluminum content of the rock sample 10017?"


System: Looks up in its database and returns the aluminum percentage.

Techniques Used:

 Augmented Transition Networks (ATNs) for syntactic parsing.

 Semantic grammar rules for interpreting the meaning.

 Database querying for factual answers.

Strengths:

 Very accurate within its domain.

 Combined syntax, semantics, and knowledge base access.

Limitations:

 Couldn't generalize beyond its narrow field (moon rocks).

 Required lots of hand-crafted grammar and domain knowledge.

Impact:

 One of the earliest domain-specific question answering systems.

 Paved the way for modern QA systems and semantic parsing techniques.
ELIZA vs. LUNAR – Quick Comparison

Feature | ELIZA | LUNAR
Purpose | Simulate therapist | Answer geology-related questions
Domain | Open, generic | Narrow, domain-specific
Understanding | None (surface-level) | Deep (syntax + semantics)
Techniques | Pattern matching | ATN, semantic grammar, DB access
NLP Depth | Shallow | Deep
Impact | First chatbot | Early QA system
