Artificial Intelligence-UNIT-4
Natural Language Processing (NLP)
Understanding the meaning: Being able to extract the meaning from text, speech, or other
forms of human language.
Analyzing structure: Recognizing the grammatical structure and syntax of language, including
parts of speech and sentence construction.
Generating human-like language: Creating text or speech that is natural, coherent, and
grammatically correct.
Ultimately, NLP aims to bridge the gap between human communication and machine comprehension,
fostering seamless interaction between us and technology.
In the 1950s, the dream of effortless communication across languages fueled the birth of NLP. Machine
translation (MT) was the driving force, and rule-based systems emerged as the initial approach.
These systems functioned like complex translation dictionaries on steroids. Linguists meticulously
crafted a massive set of rules that captured the grammatical structure (syntax) and vocabulary of
specific languages.
1. Sentence Breakdown: The system would first analyze the source language sentence and break it
down into its parts of speech (nouns, verbs, adjectives, etc.).
2. Matching Rules: Each word or phrase would be matched against the rule base to find its
equivalent in the target language, considering grammatical roles and sentence structure.
3. Rearrangement: Finally, the system would use the rules to rearrange the translated words and
phrases to form a grammatically correct sentence in the target language.
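The three steps above can be sketched in a few lines of Python. This is a minimal illustration only, with a hypothetical three-word English-to-French dictionary and a single adjective-noun reordering rule; it is not how production MT systems were built.

# Toy rule-based translation: breakdown, rule matching, rearrangement.
# The lexicon, tags and reordering rule below are illustrative assumptions.
lexicon = {"the": "la", "red": "rouge", "car": "voiture"}   # word-for-word rules
pos_tags = {"the": "DET", "red": "ADJ", "car": "NOUN"}      # crude part-of-speech lookup

def translate(sentence):
    # 1. Sentence breakdown: split into words and tag their parts of speech
    tagged = [(w, pos_tags.get(w, "UNK")) for w in sentence.lower().split()]
    # 2. Matching rules: look each word up in the bilingual rule base
    translated = [(lexicon.get(w, w), tag) for w, tag in tagged]
    # 3. Rearrangement: place adjectives after the noun (DET ADJ NOUN -> DET NOUN ADJ)
    out, i = [], 0
    while i < len(translated):
        if i + 1 < len(translated) and translated[i][1] == "ADJ" and translated[i + 1][1] == "NOUN":
            out += [translated[i + 1][0], translated[i][0]]
            i += 2
        else:
            out.append(translated[i][0])
            i += 1
    return " ".join(out)

print(translate("the red car"))   # -> "la voiture rouge"

Even this toy version hints at the scalability problem: every new word and every word-order difference needs another hand-written rule.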
While offering a foundation for MT, this approach had several limitations:
Inflexibility: Languages are full of nuances and exceptions. Rule-based systems struggled to
handle idioms, slang, and variations in sentence structure. A slight deviation from the expected
format could throw the entire translation off.
Scalability Issues: Creating and maintaining a vast rule base for every language pair was a time-
consuming and laborious task. Imagine the immense effort required for just a handful of
languages!
Limited Scope: These systems primarily focused on syntax and vocabulary, often failing to
capture the deeper meaning and context of the text. This resulted in translations that sounded
grammatically correct but unnatural or even nonsensical.
Despite these limitations, rule-based systems laid the groundwork for future NLP advancements. They
demonstrated the potential for computers to understand and manipulate human language, paving the
way for more sophisticated approaches that would emerge later.
A Shift Towards Statistics: The 1980s saw a paradigm shift towards statistical NLP approaches.
Machine learning algorithms emerged as powerful tools for NLP tasks.
The Power of Data: Large collections of text data (corpora) became crucial for training these
statistical models.
Learning from Patterns: Unlike rule-based systems, statistical models learn patterns from data,
allowing them to handle variations and complexities of natural language.
The Deep Learning Revolution: The 2000s ushered in the era of deep learning, significantly
impacting NLP.
Artificial Neural Networks (ANNs): These complex algorithms, inspired by the human brain,
became the foundation of deep learning advancements in NLP.
Advanced Architectures: Deep learning architectures such as recurrent neural networks (RNNs) and transformers further enhanced NLP capabilities.
The Advent of Rule-Based Systems
The 1960s and 1970s witnessed the emergence of rule-based systems in NLP. Collaborations between linguists and computer scientists led to systems that relied on predefined rules to analyze and understand human language.
The aim was to codify linguistic knowledge, such as syntax and grammar, into algorithms that computers could execute to process and generate human-like text.
During this period, the General Problem Solver (GPS), developed by Allen Newell and Herbert A. Simon in 1957, gained prominence. Although GPS was not explicitly designed for language processing, it demonstrated the potential of rule-based systems by showing how computers could solve problems using predefined rules and heuristics.
1. Language Differences
Human language is rich and intricate, with thousands of languages spoken worldwide, each
having its own grammar, vocabulary, and cultural nuances. This diversity creates ambiguity, as
the same words and phrases can have different meanings in different contexts. Additionally, the
complex syntactic structures and grammatical rules of natural languages add to the difficulty.
2. Training Data
High-quality annotated data is essential for training NLP models. Collecting, annotating, and
preprocessing large text datasets can be time-consuming and resource-intensive, especially for
tasks requiring specialized domain knowledge.
3. Development Time and Resource Requirements
The development time and resources required for NLP projects depend on various factors,
including task complexity, data quality, and the availability of existing tools and libraries.
Complex tasks like machine translation or question answering require more time and
computational resources compared to simpler tasks like text classification.
4. Phrasing Ambiguities
Phrasing ambiguities arise when a phrase can be interpreted in multiple ways, leading to
uncertainty in understanding the meaning. Contextual understanding, semantic analysis,
syntactic analysis, and pragmatic analysis are crucial for resolving these ambiguities.
5. Misspellings and Grammatical Errors
Misspellings and grammatical errors introduce linguistic noise that can impact the accuracy of
NLP models. Techniques like spell checking, text normalization, tokenization, and language
models trained on large corpora can help address these issues.
6. Mitigating Innate Biases
Ensuring fairness, equity, and inclusivity in NLP applications requires mitigating innate biases in
algorithms. This involves collecting diverse and representative training data, applying bias
detection methods, and evaluating models for fairness.
7. Words with Multiple Meanings
Words with multiple meanings, known as polysemous or homonymous words, pose a lexical
challenge in NLP. Semantic analysis, domain-specific knowledge, multi-word expressions, and
knowledge graphs can help disambiguate these words.
8. Addressing Multilingualism
Handling text data in multiple languages is crucial for NLP systems. Techniques like multilingual
corpora, cross-lingual transfer learning, language identification, and machine translation are
essential for addressing language diversity.
9. Reducing Uncertainty and False Positives
Reducing uncertainty and false positives in NLP models improves their accuracy and reliability.
Probabilistic models, confidence scores, threshold tuning, and ensemble methods can help
achieve this.
10. Facilitating Continuous Conversations
Building NLP models that can maintain context throughout a conversation is essential for
seamless interaction between users and machines. Real-time processing, context
understanding, and intent recognition are key components for facilitating continuous
conversations.
11. Overcoming NLP Challenges
Overcoming these challenges requires a combination of innovative technologies, domain
expertise, and methodological approaches. Techniques like data augmentation, transfer
learning, and pre-training can help address data scarcity and improve model performance.
NLP Techniques
NLP encompasses a wide array of techniques aimed at enabling computers to process and understand human language. These techniques can be categorized into several broad areas, each addressing different aspects of language processing. Here are some of the key NLP techniques:
1. Text Preprocessing
Stopword Removal: Removing common words (like “and”, “the”, “is”) that may not carry significant meaning.
Text Normalization: Standardizing text, including case normalization, removing punctuation and
correcting spelling errors.
2. Syntactic Analysis
Constituency Parsing: Breaking down a sentence into its constituent parts or phrases (e.g., noun phrases, verb phrases).
3. Semantic Analysis
Named Entity Recognition (NER): Identifying and classifying entities in text, such as names of people, organizations, locations, dates, etc.
Word Sense Disambiguation (WSD): Determining which meaning of a word is used in a given
context.
Coreference Resolution: Identifying when different words refer to the same entity in a text (e.g.,
“he” refers to “John”).
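As a concrete illustration of NER, here is a minimal sketch using the open-source spaCy library (one common option, not the only one; it assumes spaCy and its small English model en_core_web_sm are installed, and the example sentence is invented):

# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")                 # small pre-trained English pipeline
doc = nlp("Barack Obama visited Microsoft in Seattle on Monday.")

for ent in doc.ents:
    # ent.text is the entity span, ent.label_ its predicted class (PERSON, ORG, GPE, DATE, ...)
    print(ent.text, ent.label_)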
4. Information Extraction
Entity Extraction: Identifying specific entities and their relationships within the text.
Relation Extraction: Identifying and categorizing the relationships between entities in a text.
5. Sentiment Analysis
Determining the sentiment or emotional tone expressed in a text (e.g., positive, negative, neutral).
6. Language Generation
Text Generation: Producing coherent, natural-language text such as summaries or chatbot responses.
7. Speech Processing
Speech Recognition (STT) and Text-to-Speech (TTS): Converting spoken language into text and written text into speech.
8. Question Answering
Retrieval-Based QA: Finding and returning the most relevant text passage in response to a
query.
Generative QA: Generating an answer based on the information available in a text corpus.
9. Dialogue Systems
Chatbots and Virtual Assistants: Enabling systems to engage in conversations with users,
providing responses and performing tasks based on user input.
Opinion Mining: Analyzing opinions or reviews to understand public sentiment toward products,
services or topics.
Working in natural language processing (NLP) typically involves using computational techniques to
analyze and understand human language. This can include tasks such as language understanding,
language generation and language interaction.
1. Text Input and Data Collection
Data Collection: Gathering text data from various sources such as websites, books, social media
or proprietary databases.
Data Storage: Storing the collected text data in a structured format, such as a database or a
collection of documents.
2. Text Preprocessing
Preprocessing is crucial to clean and prepare the raw text data for analysis. Common preprocessing
steps include:
Tokenization: Splitting text into smaller units like words or sentences.
Lowercasing: Converting all text to lowercase to ensure uniformity.
Stopword Removal: Removing common words that do not contribute significant meaning, such
as “and,” “the,” “is.”
Punctuation Removal: Removing punctuation marks.
Stemming and Lemmatization: Reducing words to their base or root forms. Stemming cuts off
suffixes, while lemmatization considers the context and converts words to their meaningful base
form.
Text Normalization: Standardizing text format, including correcting spelling errors, expanding
contractions and handling special characters.
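A minimal sketch of these preprocessing steps using NLTK (one common choice; it assumes the nltk package is installed and the listed resources can be downloaded, and the exact resource names can vary between NLTK versions):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")        # tokenizer models
nltk.download("stopwords")    # stopword lists
nltk.download("wordnet")      # lemmatizer dictionary

text = "The striped bats were hanging on their feet, and ate best fishes!"

tokens = nltk.word_tokenize(text.lower())                             # tokenization + lowercasing
tokens = [t for t in tokens if t.isalpha()]                           # punctuation removal
tokens = [t for t in tokens if t not in stopwords.words("english")]   # stopword removal

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # stemming: crude suffix stripping
print([lemmatizer.lemmatize(t) for t in tokens])  # lemmatization: dictionary-based base forms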
3. Text Representation
Bag of Words (BoW): Representing text as a collection of words, ignoring grammar and word
order but keeping track of word frequency.
Term Frequency-Inverse Document Frequency (TF-IDF): A statistic that reflects the importance
of a word in a document relative to a collection of documents.
Word Embeddings: Using dense vector representations of words where semantically similar
words are closer together in the vector space (e.g., Word2Vec, GloVe).
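The two count-based representations above can be produced with scikit-learn in a few lines (a sketch assuming scikit-learn is installed; the two example documents are made up):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

bow = CountVectorizer()                        # Bag of Words: raw term counts
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer()                      # TF-IDF: counts reweighted by document rarity
print(tfidf.fit_transform(docs).toarray().round(2))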
4. Feature Extraction
Extracting meaningful features from the text data that can be used for various NLP tasks.
N-grams: Capturing sequences of N words to preserve some context and word order.
Syntactic Features: Using parts of speech tags, syntactic dependencies and parse trees.
Semantic Features: Leveraging word embeddings and other representations to capture word
meaning and context.
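N-gram features can be extracted with the same kind of vectorizer by widening its ngram_range (again a small sketch on an invented sentence):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))      # unigrams and bigrams preserve some word order
vec.fit(["the dog chased the cat"])
print(vec.get_feature_names_out())
# ['cat' 'chased' 'chased the' 'dog' 'dog chased' 'the' 'the cat' 'the dog']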
5. Model Selection and Training
Selecting and training a machine learning or deep learning model to perform specific NLP tasks.
Supervised Learning: Using labeled data to train models like Support Vector Machines (SVM),
Random Forests or deep learning models like Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs).
Unsupervised Learning: Applying techniques like clustering or topic modeling (e.g., Latent
Dirichlet Allocation) on unlabeled data.
Pre-trained Models: Utilizing pre-trained language models such as BERT, GPT or transformer-
based models that have been trained on large corpora.
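A minimal supervised-learning sketch that ties representation and training together, using a scikit-learn pipeline on a tiny invented sentiment dataset (any of the models named above could stand in for the classifier):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great movie, loved it", "terrible plot and bad acting",
               "what a wonderful film", "worst movie ever"]
train_labels = ["pos", "neg", "pos", "neg"]        # labeled data for supervised learning

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
print(model.predict(["a truly wonderful experience"]))   # expected: ['pos']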
6. Model Deployment and Inference
Deploying the trained model and using it to make predictions or extract insights from new text data.
Text Classification: Categorizing text into predefined classes (e.g., spam detection, sentiment
analysis).
Named Entity Recognition (NER): Identifying and classifying entities in the text.
Machine Translation: Translating text from one language to another.
Question Answering: Providing answers to questions based on the context provided by text
data.
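For inference with a pre-trained transformer, one common (but not the only) option is the Hugging Face transformers package; the sketch below assumes it is installed and can download its default sentiment model:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")    # loads a default pre-trained text-classification model
print(classifier("NLP makes search engines much smarter."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]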
7. Evaluation and Optimization
Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision, recall, F1-
score and others.
Hyperparameter Tuning: Adjusting model parameters to improve performance.
Error Analysis: Analyzing errors to understand model weaknesses and improve robustness.
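These metrics are typically computed by comparing predictions against gold labels; a small sketch with scikit-learn and invented spam/ham labels:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["spam", "ham", "spam", "ham", "spam"]    # gold labels
y_pred = ["spam", "ham", "ham", "ham", "spam"]     # model predictions

print(accuracy_score(y_true, y_pred))                                    # 0.8
print(precision_recall_fscore_support(y_true, y_pred, average="macro"))  # precision, recall, F1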
Technologies related to Natural Language Processing
There are a variety of technologies related to natural language processing (NLP) that are used to analyze
and understand human language. Some of the most common include:
1. Machine learning: NLP relies heavily on machine learning techniques such as supervised and
unsupervised learning, deep learning and reinforcement learning to train models to understand
and generate human language.
2. Natural Language Toolkit (NLTK) and other libraries: NLTK is a popular open-source library in Python that provides tools for NLP tasks such as tokenization, stemming and part-of-speech tagging. Other popular libraries include spaCy, OpenNLP and CoreNLP.
3. Parsers: Parsers are used to analyze the syntactic structure of sentences, such as dependency
parsing and constituency parsing.
4. Text-to-Speech (TTS) and Speech-to-Text (STT) systems: TTS systems convert written text into
spoken words, while STT systems convert spoken words into written text.
5. Named Entity Recognition (NER) systems: NER systems identify and extract named entities such
as people, places and organizations from the text.
6. Machine Translation: NLP is used for language translation from one language to another through a computer.
7. Chatbots: NLP is used for chatbots that communicate with other chatbots or humans through auditory or textual methods.
Algorithmic Trading: Algorithmic trading is used for predicting stock market conditions. Using
NLP, this technology examines news headlines about companies and stocks and attempts to
comprehend their meaning in order to determine if you should buy, sell or hold certain stocks.
Question Answering: NLP can be seen in action by using Google Search or Siri. A major use of NLP is to make search engines understand the meaning of what we are asking and generate natural language in return to give us the answers.
Summarizing Information: On the internet, there is a lot of information and a lot of it comes in
the form of long documents or articles. NLP is used to decipher the meaning of the data and
then provides shorter summaries of the data so that humans can comprehend it more
quickly.
Future Scope
NLP is shaping the future of technology in several ways:
Chatbots and Virtual Assistants: NLP enables chatbots to quickly understand and respond to
user queries, providing 24/7 assistance across text or voice interactions.
Invisible User Interfaces (UI): With NLP, devices like Amazon Echo allow for seamless
communication through voice or text, making technology more accessible without traditional
interfaces.
Smarter Search: NLP is improving search by allowing users to ask questions in natural language,
as seen with Google Drive’s recent update, making it easier to find documents.
Multilingual NLP: Expanding NLP to support more languages, including regional and minority
languages, broadens accessibility.
Future Enhancements: NLP is evolving with the use of Deep Neural Networks (DNNs) to make human-
machine interactions more natural. Future advancements include improved semantics for word
understanding and broader language support, enabling accurate translations and better NLP models for
languages not yet supported.
Goals of NLP:
1. Understanding Human Language
Goal: Make machines understand the meaning and structure of human language (both written
and spoken).
Why it matters: This is the foundation for everything else—whether it’s translating text or
answering a question, the machine has to “get” what we’re saying.
3. Language Translation
Goal: Automatically translate from one language to another while preserving meaning and tone.
4. Information Extraction
Goal: Pull out relevant data from large amounts of text.
Example: Extracting names, places, dates, and relationships from news articles or legal
documents.
5. Information Retrieval
Goal: Help find relevant information from large text collections (like search engines).
6. Sentiment Analysis
Goal: Determine the emotional tone or opinion (positive, negative, neutral) expressed in text.
Example: Analyzing tweets or reviews to understand public sentiment about a product or event.
7. Speech Recognition
Goal: Convert spoken language into text, and then understand it.
8. Text Classification
Goal: Automatically assign text to predefined categories (e.g., spam detection, topic labeling).
9. Summarization
Goal: Generate concise summaries of larger texts without losing key information.
Discourse Knowledge
Definition:
Discourse knowledge in NLP is the understanding of how sentences connect and make sense together in a larger context, such as conversations, documents, or multi-turn dialogues.
NLP Focus:
Understand how ideas flow across sentences.
Coreference Resolution: Identifying when two expressions refer to the same thing.
E.g., "Lisa went home. She was tired." — "She" = "Lisa".
Question Answering: Using context beyond just one sentence to generate answers.
Example:
"I watched Interstellar last night. It was amazing."
The system must know that "it" refers to Interstellar, and understand that both sentences are related.
Pragmatic Knowledge
Definition:
Pragmatic knowledge involves understanding what is meant, not just what is said — taking into account
context, tone, real-world knowledge, and speaker intention.
NLP Focus:
Intent Recognition: Understanding the user's real purpose behind a statement (important in
virtual assistants).
Example:
User says: “Yeah, right, this app totally works…”
Literal words are positive, but pragmatically (via sarcasm), the sentiment is negative.
A language can be defined as a set of strings over a given alphabet. This definition is true for human
languages as well as computer languages. There are several components –
In computer science, we talk about formal languages. These provide a rigorous framework for studying the properties and structures of languages. They are essential for defining the syntax of programming languages and for analyzing the capabilities of different computational models.
Alphabets and strings are important components of formal languages. Let us discuss them briefly:
Alphabet (Σ) − A finite, non-empty set of symbols.
String − A finite sequence of symbols drawn from an alphabet.
Empty string (ε) − The string containing no symbols (an important concept in formal languages).
Formal languages can be classified into four types: regular, context-free, context-sensitive and recursively enumerable languages. But knowing about languages alone is not enough; we also need to understand grammars.
Grammars in Automata Theory
In automata theory, grammars are formal systems for describing the structure of languages. A grammar is a set of rules for generating the valid strings of a language.
Formally, a grammar is a tuple G = (V, Σ, R, S), where:
V − a finite set of variables (non-terminals)
Σ − a finite set of terminal symbols (the alphabet)
R − a finite set of production rules
S − the start symbol, S ∈ V
Grammars are used to generate all valid strings in a language; they also provide a structural description of the language and serve as a basis for parsing and syntax analysis.
While we talk about grammars, it is necessary to understand two important concepts related to
grammars –
Derivation − A sequence of rule applications that transform the start symbol into a string of
terminal symbols
Parse tree − A graphical representation of a derivation, showing the hierarchical structure of the
generated string
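To make the idea of a grammar and a derivation concrete, here is a small sketch that encodes the four components as plain Python data and derives a string by repeatedly rewriting the leftmost non-terminal (the grammar shown is the simple G1 used later in this unit):

import random

V = {"S", "A", "B"}                         # variables (non-terminals)
Sigma = {"a", "b"}                          # terminal alphabet
R = {"S": ["AB"], "A": ["a"], "B": ["b"]}   # production rules
S = "S"                                     # start symbol

def derive(start):
    # Keep rewriting the leftmost non-terminal until only terminals remain
    string = start
    while any(sym in V for sym in string):
        for i, sym in enumerate(string):
            if sym in V:
                string = string[:i] + random.choice(R[sym]) + string[i + 1:]
                break
    return string

print(derive(S))   # "ab", via the derivation S => AB => aB => ab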
The theory of formal languages finds its applicability extensively in the fields of Computer Science.
Noam Chomsky gave a mathematical model of grammar in 1956 which is effective for writing
computer languages.
Representation of Grammar
A grammar G can be formally written as a 4-tuple (N, T, S, P) where −
N or VN is a set of variables or non-terminal symbols.
T or ∑ is a set of terminal symbols.
S is a special variable called the start symbol, S ∈ N.
P is the set of production rules for terminals and non-terminals. A production rule has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.
Example 1
Grammar G1 − ({S, A, B}, {a, b}, S, {S → AB, A → a, B → b})
Here,
S, A and B are Non-terminal symbols;
a and b are Terminal symbols;
S is the Start symbol;
Productions, P : S → AB, A → a, B → b
Example 2
Grammar G2 − ({S, A}, {a, b}, S, {S → aAb, aA → aaAb, A → ε})
Here,
S and A are Non-terminal symbols;
a and b are Terminal symbols;
ε is an empty string;
S is the Start symbol;
Productions, P : S → aAb, aA → aaAb, A → ε
Non-Terminal Symbols - Non-terminal symbols take part in the generation of the sentence but are not components of the sentence itself. These symbols are also called Auxiliary Symbols and Variables. They are represented using capital letters like A, B, C, etc.
Example 1
Consider a grammar
G = (V, T, P, S)
Where,
V = { S , A , B } Non-Terminal symbols
T = { a , b } Terminal symbols
S = { S } Start symbol
Example 2
Consider a grammar
G = (V, T, P, S)
Where,
Grammar | Language | Automata | Production rules
Type-0 | Recursively enumerable | Turing machine | No restriction
Type-1 | Context-sensitive | Linear-bounded non-deterministic machine | αAβ → αγβ
Type-2 | Context-free | Pushdown automaton | A → γ
Type-3 | Regular | Finite state automaton | A → αB, A → α
Derivations from a Grammar
Strings may be derived from other strings using the productions in a grammar. If a grammar G has a
production α → β, we can say that x α y derives x β y in G. This derivation is written as −
x α y ⇒G x β y
The language generated by a grammar G is the set of all terminal strings derivable from its start symbol:
L(G) = {W | W ∈ ∑*, S ⇒G W}
Example
If there is a grammar with productions S → AB, A → a and B → b (Grammar G1 above), then S produces AB, A can be replaced by a, and B by b. Here, the only accepted string is ab, i.e.,
L(G) = {ab}
Example
Problem − Suppose L(G) = {a^m b^n | m ≥ 0 and n > 0}. We have to find out the grammar G which produces L(G).
Solution
Here, the start symbol has to take at least one b preceded by any number of a including null.
To accept the string set {b, ab, bb, aab, abb, …}, we have taken the productions −
S → aS , S → B, B → b and B → bB
S → B → b (Accepted)
S → B → bB → bb (Accepted)
S → aS → aB → ab (Accepted)
Thus, we can prove every single string in L(G) is accepted by the language generated by the production
set.
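As a quick sanity check of this argument, the sketch below enumerates the short strings generated by the productions S → aS | B, B → b | bB (up to a small derivation depth, purely for illustration) and verifies that each one matches the pattern a^m b^n with m ≥ 0 and n > 0:

import re

rules = {"S": ["aS", "B"], "B": ["b", "bB"]}

def generate(symbol, depth):
    # All terminal strings derivable from `symbol` within `depth` rule applications
    if symbol not in rules:
        return {symbol}
    if depth == 0:
        return set()
    results = set()
    for rhs in rules[symbol]:
        partials = {""}
        for sym in rhs:
            partials = {p + s for p in partials for s in generate(sym, depth - 1)}
        results |= partials
    return results

strings = generate("S", 5)
print(sorted(strings, key=lambda s: (len(s), s))[:6])  # ['b', 'ab', 'bb', 'aab', 'abb', 'bbb']
print(all(re.fullmatch(r"a*b+", s) for s in strings))  # True: every string is a^m b^n with n > 0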
Example
Problem − Suppose L(G) = {a^m b^n | m > 0 and n ≥ 0}. We have to find out the grammar G which produces L(G).
Solution −
Since L(G) = {a^m b^n | m > 0 and n ≥ 0}, the set of strings accepted can be written as {a, aa, ab, aaa, aab, abb, …}.
Here, the start symbol has to take at least one a followed by any number of b including null.
To accept this string set, we have taken the productions −
S → aA, A → aA , A → B, B → bB ,B → λ
S → aA → aB → aλ → a (Accepted)
Thus, we can prove every single string in L(G) is accepted by the language generated by the production
set.
Grammar Type | Grammar Accepted | Language Accepted | Automaton
Type 0 | Unrestricted grammar | Recursively enumerable language | Turing Machine
Type 1 | Context-sensitive grammar | Context-sensitive language | Linear-bounded automaton
Type 2 | Context-free grammar | Context-free language | Pushdown automaton
Type 3 | Regular grammar | Regular language | Finite state automaton
Type - 3 Grammar
Type-3 grammars generate regular languages. Type-3 grammars must have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or a single terminal followed by a single non-terminal.
The productions must be in the form X → a or X → aY, where X, Y ∈ N (Non-terminal) and a ∈ T (Terminal).
The rule S → ε is allowed if S does not appear on the right side of any rule.
Example
X→ε
X → a | aY
Y→b
Type - 2 Grammar
Type-2 grammars generate context-free languages. The productions must be in the form A → γ, where A ∈ N (Non-terminal) and γ is a string of terminals and non-terminals. The languages generated by these grammars are recognized by a pushdown automaton.
Example
S→Xa
X→a
X → aX
X → abc
X→ε
Type - 1 Grammar
Type-1 grammars generate context-sensitive languages. The productions must be in the form
αAβ → αγβ
where A ∈ N (Non-terminal), α, β and γ are strings of terminals and non-terminals, and γ is non-empty.
The rule S → ε is allowed if S does not appear on the right side of any rule. The languages generated by
these grammars are recognized by a linear bounded automaton.
Example
AB → AbBc
A → bcA
B→b
Type - 0 Grammar
Type-0 grammars generate recursively enumerable languages. The productions have no restrictions. They are unrestricted phrase structure grammars and include all formal grammars.
The productions can be in the form of α → β where α is a string of terminals and nonterminals with at
least one non-terminal and α cannot be null. β is a string of terminals and non-terminals.
Example
S → ACaB
Bc → acB
CB → DB
aD → Db
Context-Sensitive Languages
Context-sensitive languages (CSLs) bridge the gap between the well-studied context-free languages and
the more complex recursively enumerable languages. In this chapter, we will cover the concepts of
context-sensitive languages, exploring their definitions, properties, and relationships to other language
classes in the Chomsky hierarchy.
A context-free language is generated by a context-free grammar (CFG). In a CFG, production rules have the form: A → X, where −
A is a variable (non-terminal)
X is a string of variables and terminals (possibly empty)
The key characteristic of CFLs is that the replacement of A with X is independent of the surrounding context. This property gives CFLs their name: they are "free" of context constraints.
CFLs correspond to pushdown automata (PDAs) in the Chomsky hierarchy, which are more powerful
than finite automata but less powerful than linear-bounded automata.
The context-sensitive languages extend the concept of CFLs by allowing production rules to depend on
the context in which variables appear. This seemingly small change leads to a significant increase in
expressive power. Let us understand CSG in greater detail.
A context-sensitive grammar has production rules of the form: αAβ → αXβ, where −
A is a variable
α and β are strings of variables and terminals (the surrounding context, possibly empty)
X is a non-empty string of variables and terminals
Context Preservation − Each production keeps the same context (α and β) on both sides, ensuring that the replacement of A with X only occurs within the defined context.
Non-Contracting − Because X cannot be empty, derivations never reduce the string length. However, the start variable S can generate an empty string if ε is part of the language.
Increased Expressive Power − CSLs can describe patterns that CFLs cannot, such as matching
multiple repeated substrings.
Context-sensitive languages are located at the third level of the Chomsky hierarchy, between context-
free and recursively enumerable languages.
CSLs have the added power to describe patterns that CFLs cannot.
A classic example of a language that is context-sensitive but not context-free is: L = {a^n b^n c^n | n ≥ 1}
The language is composed of strings with equal numbers of a's, b's, and c's, which cannot be generated
by context-free grammars due to their inability to count and ensure equal numbers of three different
symbols.
To illustrate the power of context-sensitive grammars, let's construct a CSG that generates the language L = {a^n b^n c^n | n ≥ 1}.
1. S → aSBC
2. S → aBC
3. CB → BC
4. aB → ab
5. bB → bb
6. bC → bc
7. cC → cc
Rules 1-2 generate one a for every pair of variables B and C, so the numbers of a's, B's and C's always stay equal.
Rule 3 reorders the variables so that all B's end up before all C's.
Rules 4-5 are crucial: a B is replaced by a lowercase b only when it is adjacent to an existing a or b. This prevents premature conversion and keeps all the b's in one block.
Rules 6-7 do the same for the C's, which can only become lowercase c's after a b or another c.
The grammar maintains equal numbers of a's, b's and c's while rearranging the variables into the correct order. The context-sensitive nature of the rules allows this precise control over the string's structure.
1. Chomsky Hierarchy of Grammars
This is a classification of formal grammars proposed by Noam Chomsky, used in both linguistics and
computer science.
Grammar Type | Power (Most → Least) | Description
Type 0: Unrestricted Grammar | Most powerful | No restrictions. Can describe any computable language.
Type 1: Context-Sensitive Grammar (CSG) | Powerful | Rules depend on the context of non-terminals.
Type 2: Context-Free Grammar (CFG) | Less powerful | Each rule rewrites a single non-terminal, regardless of context.
Type 3: Regular Grammar | Least powerful | Simple rules, equivalent to finite state automata and regular expressions.
In NLP:
Lower levels (like regular grammars) are used for tokenization, while higher levels help with
syntax and semantics.
2. Transformational Grammar
What is it?
A theory of syntax (due to Noam Chomsky) that describes how deep structures (underlying meaning) can be transformed into surface structures (actual sentences) via rules.
Example:
The deep structure behind "The cat chased the mouse" can also surface as the passive sentence "The mouse was chased by the cat."
In NLP:
Influenced early work on parsing and natural language generation.
3. Case Grammar
What is it?
Developed by Charles Fillmore, this grammar focuses on semantic roles (called cases) like agent, object,
instrument, etc., rather than just syntax.
Example:
"Mary opened the door with a key."
Agent: Mary
Object: the door
Instrument: a key
In NLP:
Used in Semantic Role Labeling (SRL) — identifying "who did what to whom with what".
4. Semantic Grammars
What is it?
A grammar where rules are based on semantic categories, not just syntactic ones. Words are grouped
based on meaning.
Example Rule:
Instead of a purely syntactic rule such as
VP → V NP
a semantic grammar writes rules over domain categories, for example a travel-booking rule built from categories like FLIGHT and DESTINATION.
In NLP:
Used in domain-specific systems (e.g., travel booking, customer support).
5. Context-Free Grammar (CFG)
What is it?
A formal grammar where each rule replaces a single non-terminal with a string of terminals and/or non-
terminals.
Rule Format:
A→α
In NLP:
Used for parsing sentences and building syntax trees.
Example CFG:
S → NP VP
NP → Det N
VP → V NP
Det → "the" | "a"
N → "dog" | "cat"
V → "chased" | "saw"
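This toy grammar can be tried directly with NLTK's chart parser. The sketch below assumes the nltk package is installed and uses the same rules as above, including the small Det rule added so that the noun phrases have determiners:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased a cat".split()):
    tree.pretty_print()   # draws the parse (syntax) tree as text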
Summary Table
Concept | Focus | Key Use in NLP
Chomsky Hierarchy | Classification of grammars by generative power | Choosing the right level of grammar for a task
Transformational Grammar | Deep structure vs. surface structure | Early parsing and generation research
Case Grammar | Semantic roles (agent, object, instrument) | Semantic Role Labeling (SRL)
Semantic Grammars | Meaning-based categories | Domain-specific systems (e.g., travel booking)
Context-Free Grammar (CFG) | Syntax rules (non-terminal → symbols) | Parsing, syntax trees
The parse tree visually represents how the tokens fit together according to the rules of the language’s
syntax. This tree structure is crucial for understanding the program’s structure and helps in the next
stages of processing, such as code generation or execution. Additionally, parsing ensures that the
sequence of tokens follows the syntactic rules of the programming language, making the program valid
and ready for further analysis or execution.
A parser performs syntactic and semantic analysis of source code, converting it into an intermediate
representation while detecting and handling errors.
1. Context-free syntax analysis: The parser checks if the structure of the code follows the basic
rules of the programming language (like grammar rules). It looks at how words and symbols are
arranged.
2. Guides context-sensitive analysis: It helps with deeper checks that depend on the meaning of
the code, like making sure variables are used correctly. For example, it ensures that a variable
used in a mathematical operation, like x + 2, is a number and not text.
3. Constructs an intermediate representation: The parser creates a simpler version of your code
that’s easier for the computer to understand and work with.
4. Produces meaningful error messages: If there’s something wrong in your code, the parser tries
to explain the problem clearly so you can fix it.
5. Attempts error correction: Sometimes, the parser tries to fix small mistakes in your code so it
can keep working without breaking completely.
Types of Parsing
Top-down Parsing
Bottom-up Parsing
Top-Down Parsing
Top-down parsing is a method of building a parse tree from the start symbol (root) down to the leaves
(end symbols). The parser begins with the highest-level rule and works its way down, trying to match the
input string step by step.
Process: The parser starts with the start symbol and looks for rules that can help it rewrite this
symbol. It keeps breaking down the symbols (non-terminals) into smaller parts until it matches
the input string.
Leftmost Derivation: In top-down parsing, the parser always chooses the leftmost non-terminal
to expand first, following what is called leftmost derivation. This means the parser works on the
left side of the string before moving to the right.
Other Names: Top-down parsing is sometimes called recursive parsing or predictive parsing. It is
called recursive because it often uses recursive functions to process the symbols.
Top-down parsing is useful for simple languages and is often easier to implement. However, it can have
trouble with more complex or ambiguous grammars.
Top-down parsers can be classified into two types based on whether they use backtracking or not:
1. Top-down parsing with backtracking
In this approach, the parser tries different possibilities when it encounters a choice. If one possibility doesn't work (i.e., it doesn't match the input string), the parser backtracks to the previous decision point and tries another possibility.
Example: If the parser chooses a rule to expand a non-terminal, and it doesn’t work, it will go back, undo
the choice, and try a different rule.
Advantage: It can handle grammars where there are multiple possible ways to expand a non-terminal.
Disadvantage: Backtracking can be slow and inefficient because the parser might have to try many
possibilities before finding the correct one.
2. Top-down parsing without backtracking (predictive parsing)
In this approach, the parser does not backtrack. It tries to find a match with the input using only the first choice it makes. If it doesn't match the input, it fails immediately instead of going back to try another option.
Example: The parser will always stick with its first decision and will not reconsider other rules once it
starts parsing.
Advantage: It is faster because it doesn’t waste time going back to previous steps.
Disadvantage: It can only handle simpler grammars that don’t require trying multiple choices.
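To make top-down parsing concrete, here is a tiny backtracking recursive-descent recognizer for the toy CFG used earlier in this unit (S → NP VP, NP → Det N, VP → V NP, plus a few words). It is a sketch for illustration, not a production parser: it expands the leftmost symbol first and tries every alternative, which is exactly the backtracking behaviour described above.

GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["chased"], ["saw"]],
}

def parse(symbol, tokens, pos):
    # Try to match `symbol` starting at tokens[pos]; return every position where it can end
    if symbol not in GRAMMAR:                         # terminal word
        return [pos + 1] if pos < len(tokens) and tokens[pos] == symbol else []
    ends = []
    for production in GRAMMAR[symbol]:                # try each alternative (backtracking)
        positions = [pos]
        for sym in production:                        # expand left to right (leftmost derivation)
            positions = [e for p in positions for e in parse(sym, tokens, p)]
        ends.extend(positions)
    return ends

tokens = "the dog chased a cat".split()
print(len(tokens) in parse("S", tokens, 0))           # True: the whole sentence is accepted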
Bottom-up parsing is a method of building a parse tree starting from the leaf nodes (the input symbols)
and working towards the root node (the start symbol). The goal is to reduce the input string step by step
until we reach the start symbol, which represents the entire language.
Process: The parser begins with the input symbols and looks for patterns that can be reduced to
non-terminals based on the grammar rules. It keeps reducing parts of the string until it forms
the start symbol.
Rightmost Derivation in Reverse: In bottom-up parsing, the parser traces the rightmost
derivation of the string but works backwards, starting from the input string and moving towards
the start symbol.
Shift-Reduce Parsing: Bottom-up parsers are often called shift-reduce parsers because they shift
(move symbols) and reduce (apply rules to replace symbols) to build the parse tree.
Bottom-up parsing is efficient for handling more complex grammars and is commonly used in compilers.
However, it can be more challenging to implement compared to top-down parsing.
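The shift-reduce idea can be sketched very naively for the small grammar S → AB, A → a, B → b from earlier in this unit. Real LR parsers consult parse tables to decide between shifting and reducing; this toy version simply reduces greedily whenever the top of the stack matches a rule, which happens to work for this grammar:

PRODUCTIONS = [("S", ["A", "B"]), ("A", ["a"]), ("B", ["b"])]

def shift_reduce(tokens):
    stack, remaining = [], list(tokens)
    while remaining or stack != ["S"]:
        for lhs, rhs in PRODUCTIONS:
            if stack[-len(rhs):] == rhs:              # Reduce: top of stack matches a right-hand side
                stack[-len(rhs):] = [lhs]
                print("reduce:", stack)
                break
        else:
            if not remaining:
                return False                          # nothing to shift or reduce: reject
            stack.append(remaining.pop(0))            # Shift: move next input symbol onto the stack
            print("shift:  ", stack)
    return True                                       # stack holds only the start symbol: accept

print(shift_reduce(["a", "b"]))                       # shifts and reduces "ab" down to S -> True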
1. LR parsing/Shift Reduce Parsing: Shift reduce Parsing is a process of parsing a string to obtain the
start symbol of the grammar.
LR(0)
SLR(1)
LALR
CLR
2. Operator Precedence Parsing: Parsing based on an operator grammar. In an operator grammar there are no null (ε) productions and no two non-terminals are adjacent to each other.
Top-Down vs. Bottom-Up Parsing
Aspect | Top-Down Parsing | Bottom-Up Parsing
Direction | Builds tree from root to leaves. | Builds tree from leaves to root.
Derivation | Leftmost derivation. | Rightmost derivation in reverse.
Example | Recursive descent, LL parser. | Shift-reduce, LR parser.
Parsers
Here are the main types of transition-network parsers, progressing from simple to more powerful:
1. Finite State Transition Networks (FSTNs)
Model a sentence as a path through a network of states, where each transition consumes a word of a particular category.
2. Recursive Transition Networks (RTNs)
Extension of FSTNs.
Allows recursion by having transitions that can call sub-networks (like functions in programming).
Example:
A sentence network might call a NounPhrase or VerbPhrase sub-network, and those can call others recursively.
3. Augmented Transition Networks (ATNs)
Extension of RTNs.
Can include semantic actions, such as storing values or features during parsing.
Used for: early question-answering systems and natural language interfaces (LUNAR, described below, used an ATN parser).
Example: an ATN parse of a sentence like "The dog chased the cat" fills registers as it goes −
1. Enter S (Sentence)
Agent = "dog"
Action = "chased"
Object = "cat"
ELIZA (Joseph Weizenbaum, MIT, mid-1960s)
Purpose: Simulate a conversation, most famously in the style of a Rogerian psychotherapist.
How it worked:
Used scripts, like the famous DOCTOR script, which mimicked a Rogerian therapist.
Example Interaction:
User: "I am feeling sad." ELIZA: "Why do you say you are feeling sad?"
Techniques Used:
Keyword spotting, pattern matching and simple substitution rules that turn the user's own words back into questions.
Limitations:
No real understanding of meaning; responses are canned templates, and the illusion breaks down quickly outside the script.
Impact:
Showed how humans project intelligence onto machines with minimal cues (known as the ELIZA
effect).
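The pattern-matching idea behind ELIZA can be sketched in a few lines of Python. The rules below are invented for illustration and are far simpler than the real DOCTOR script:

import re

RULES = [
    (r"i am (.*)", "Why do you say you are {0}?"),
    (r"i feel (.*)", "How long have you felt {0}?"),
    (r".*\bmother\b.*", "Tell me more about your family."),
    (r".*", "Please go on."),                 # fallback prompt when no keyword matches
]

def respond(utterance):
    text = utterance.lower().strip(" .!?")
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return template.format(*match.groups())   # reflect the user's own words back

print(respond("I am unhappy"))       # -> "Why do you say you are unhappy?"
print(respond("My mother cooks."))   # -> "Tell me more about your family."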
LUNAR (William Woods, early 1970s)
Purpose: Answer questions about moon rock samples from the Apollo missions.
How it worked:
Parsed user queries, mapped them to formal database queries, and retrieved answers.
Example Query:
"What is the average concentration of aluminum in high-alkali rocks?"
Techniques Used:
An Augmented Transition Network (ATN) parser combined with procedural semantics that translated parsed questions into formal queries over the chemical-analysis database.
Strengths:
Could handle genuinely complex, domain-specific scientific questions with high accuracy.
Limitations:
Restricted to its narrow lunar-geology domain; it could not generalize to other topics or open-ended conversation.
Impact:
Paved the way for modern QA systems and semantic parsing techniques.
ELIZA vs. LUNAR – Quick Comparison
Aspect | ELIZA | LUNAR
Purpose | Simulate conversation (Rogerian therapist) | Answer questions about Apollo moon rock samples
Technique | Keyword spotting and pattern-based substitution | ATN parsing mapped to formal database queries
Understanding | None; relies on the ELIZA effect | Domain-specific semantic interpretation
Legacy | Showed how easily humans project intelligence onto machines | Paved the way for modern QA and semantic parsing