Natural Language Processing
Natural Language Understanding (NLU) helps the machine to understand and analyse human language by
extracting the metadata from content such as concepts, entities, keywords, emotion, relations, and semantic
roles.
NLU is mainly used in business applications to understand the customer's problem in both spoken and written
language.
Natural Language Generation (NLG) acts as a translator that converts the computerized data into natural
language representation. It mainly involves Text planning, Sentence planning, and Text Realization.
o Breaking language barriers: NLP helps bridge the gap between languages, enabling people to
communicate across languages and cultures.
o Improving human-computer interaction: NLP enables computers to understand human language,
making it easier for people to interact with technology.
o Automating tasks: NLP enables automation of tasks such as data entry, customer service, and language
translation.
o Enhancing productivity: NLP helps streamline processes, freeing up time for more strategic and creative
work.
These are just a few examples of why NLP is essential in today's world. As technology continues to evolve, the
importance of NLP will only continue to grow.
Virtual assistants like Siri, Alexa, and Google Assistant use NLP to understand voice commands and respond
accordingly. They can perform tasks such as:
o Answering questions
o Setting reminders
o Sending messages
o Making calls
o Controlling smart home devices
Language Translation
Language translation software like Google Translate uses NLP to translate text and speech from one language to
another. This enables people to communicate across language barriers.
Sentiment Analysis
Sentiment analysis tools use NLP to analyze customer feedback and sentiment on social media, reviews, and
other online platforms. This helps businesses to:
Chatbots use NLP to understand and respond to customer inquiries, providing 24/7 customer support. They can:
Speech recognition systems use NLP to transcribe spoken words into text, enabling applications like:
o Voice-to-text messaging
o Voice-controlled interfaces
o Transcription services
Information Retrieval
Information retrieval systems use NLP to improve search engine results, enabling users to:
Text summarization tools use NLP to automatically summarize long documents, articles, and other texts into
shorter, more digestible versions.
Social media monitoring tools use NLP to track social media conversations, identifying trends and potential
issues.
Customer service automation uses NLP to automate customer support tasks, such as:
Language learning platforms use NLP to provide personalized language instruction, such as:
o Grammar correction
o Vocabulary building
o Pronunciation practice
Content Generation
Content generation tools use NLP to automatically generate content, such as:
o News articles
o Social media posts
o Product descriptions
Healthcare
o Tokenization: breaking down text into individual words or tokens. For example, 'The cat sat on the mat'
is tokenized into ['The', 'cat', 'sat', 'on', 'the', 'mat'].
o Stopword Removal: removing common words like "the", "and", etc. that don't add much value.
o Stemming or Lemmatization: reducing words to their base form (e.g., "running" becomes "run").
o Part-of-Speech Tagging: identifying the grammatical category of each word (e.g., noun, verb, adjective).
o Named Entity Recognition (NER): identifying named entities like people, organizations, locations.
o Entity Disambiguation: identifying the correct meaning of entities with multiple possible meanings.
o Text Classification: classifying text into predefined categories (e.g., spam/not spam, positive/negative
sentiment).
o Topic Modeling: identifying the underlying topics in a large corpus of text.
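To make these techniques concrete, the short sketch below uses the spaCy library (assuming it is installed together with its small English model, en_core_web_sm) to show tokens, lemmas, part-of-speech tags, stop-word flags, and named entities for one sentence.

    import spacy

    # Assumes the model has been installed with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The cats are sitting on the mat in London.")

    # One row per token: surface form, lemma, POS tag, and whether it is a stop word.
    for token in doc:
        print(f"{token.text:<10} {token.lemma_:<10} {token.pos_:<6} stopword={token.is_stop}")

    # Named entities detected in the sentence (e.g., London as a location).
    print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])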
One of the biggest challenges in NLP is ambiguity and uncertainty. Natural language is inherently ambiguous,
with words and phrases often having multiple meanings. For example, the word "bank" can refer to a financial
institution or the side of a river. This ambiguity makes it difficult for NLP systems to accurately understand the
meaning of text. Additionally, uncertainty in language can arise from incomplete or noisy data, making it
challenging for NLP systems to make accurate predictions.
Contextual Understanding
Another challenge in NLP is contextual understanding. Natural language is often dependent on the context in
which it is used. For example
“The bat flew across the field." (In one context, this sentence might refer to a flying mammal, while in another
context, it might refer to a sports equipment used in cricket or baseball.)
"The professor is going to kill us tomorrow." (In one context, this sentence might be a literal statement about
violence, while in another context, it might be an idiomatic expression meaning "the professor is going to give
us a very difficult exam tomorrow.")
These examples illustrate how the meaning of a sentence can change depending on the context in which it is
used. NLP systems need to be able to understand the context in order to accurately interpret the meaning of
text. Without this awareness, NLP systems struggle to capture the nuances of language and the context in which it
is used, leading to misinterpretations and inaccuracies.
Sarcasm and idioms are another challenge in NLP. Sarcasm, in particular, can be difficult to detect, as it often
involves saying the opposite of what you mean. Idioms, on the other hand, are phrases or expressions that have
a different meaning than the literal meaning of the individual words. For example, "kick the bucket" means to
die, not to physically kick a bucket. NLP systems struggle to understand sarcasm and idioms, leading to
misinterpretations and inaccuracies.
Language Evolution
Language is constantly evolving, with new words, phrases, and grammar emerging all the time. This makes it
challenging for NLP systems to keep up with the latest language trends and nuances. Additionally, language
evolution can lead to changes in the meaning of words and phrases over time, making it difficult for NLP
systems to accurately understand the meaning of text.
Linguistic Diversity
The world has over 7,000 languages, each with its unique grammar, syntax, and vocabulary. This linguistic
diversity poses a significant challenge for NLP systems, as they often struggle to handle languages with non-
Latin scripts, tonal languages, or languages with complex grammar systems. Furthermore, many languages lack
sufficient resources, such as labeled training data, which makes it difficult to develop accurate NLP models. As
a result, NLP systems often perpetuate linguistic biases, favoring languages with more resources and attention.
Many NLP systems rely on large amounts of training data to learn patterns and relationships in language.
However, obtaining large amounts of high-quality training data can be challenging, particularly for low-
resource languages or specialized domains. Limited training data can lead to biased or inaccurate NLP models.
NLP systems can be complex and difficult to interpret, making it challenging to understand why they make
certain predictions or decisions. Ensuring that NLP systems are explainable and transparent is essential for
building trust and accountability.
NLP systems often focus on text-based input, but many applications involve multimodal input, such as speech,
images, or video. Handling multimodal input requires integrating multiple AI modalities and developing new
architectures and algorithms.
Real-World Applications
Finally, deploying NLP systems in real-world applications can be challenging. NLP systems often require
significant computational resources, and ensuring that they are scalable, efficient, and reliable in real-world
settings can be difficult. Additionally, integrating NLP systems with other AI modalities and developing user-
friendly interfaces can be a challenge.
The NLP pipeline is a series of stages that enable computers to process, analyze, and understand human
language.
Data Collection
Gather relevant text data from various sources (e.g., web scraping, APIs, databases). In Natural Language
Processing (NLP), various types of data are used to train and evaluate models such as text, speech, multimodal
and time series.
Text Cleaning
Text Analysis
o Lexical Analysis: Examines the meaning of individual words and phrases, including their syntax,
semantics, and morphology.
o Syntax analysis: Examines the grammatical structure of text, including part-of-speech tagging, named
entity recognition, and dependency parsing.
o Semantic analysis: Focuses on the meaning of text, including sentiment analysis, topic modeling, and
entity disambiguation.
o Pragmatic analysis: Considers the context and purpose of text, including discourse analysis, dialogue
analysis, and intent detection.
Feature Extraction
o Bag-of-Words: Represent text as a bag of words, ignoring word order.
o Term Frequency-Inverse Document Frequency (TF-IDF): Weight word importance based on frequency
and rarity.
o Word Embeddings: Represent words as vectors in a high-dimensional space, capturing semantic
relationships.
o Choose an NLP Task: Determine the specific NLP task (e.g., sentiment analysis, text classification).
o Select a Model: Choose a suitable NLP model (e.g., Naive Bayes, Support Vector Machines, Recurrent
Neural Networks).
o Train the Model: Train the model using the labeled training data.
o Tune Hyperparameters: Optimize model hyperparameters for better performance.
o Evaluate Model Performance: Assess the model's performance on the testing data using metrics like
accuracy, precision, recall.
o Refine the Model: Refine the model based on evaluation results and iterate until satisfactory
performance.
o Deploy the Model: Deploy the trained model in a production-ready environment, such as a web
application or API.
o Monitor Model Performance: Continuously monitor the model's performance on new, unseen data.
o Update the Model: Update the model periodically to adapt to changing language patterns, new data, or
emerging trends.
o Re-train the Model: Re-train the model as necessary to maintain its performance and accuracy.
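As a rough illustration of these steps (choose a task, extract features, train, evaluate, predict), here is a minimal sentiment-classification sketch using scikit-learn, assuming it is installed; the tiny in-line dataset is purely hypothetical.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import classification_report

    # Tiny hypothetical dataset; a real project would use a labeled corpus.
    texts = ["I love this product", "Terrible service", "Great experience",
             "Worst purchase ever", "Absolutely fantastic", "Not worth the money"]
    labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.33, random_state=42, stratify=labels)

    # Feature extraction (TF-IDF) and the model (Naive Bayes) chained into one pipeline.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)

    # Evaluate on the held-out data, then use the model on new text.
    print(classification_report(y_test, model.predict(X_test)))
    print(model.predict(["the service was fantastic"]))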
3 Data Collection
In Natural Language Processing (NLP), various types of data are used to train and evaluate models. Here are
some of the most common types of data:
Text Data: This is the most common type of data used in NLP. Text data can be sourced from various places,
such as:
- Books and articles
- Social media platforms (e.g., Twitter, Facebook)
- Online forums and discussions
- Product reviews and feedback
Speech Data: This type of data is used for speech recognition, speech synthesis, and other speech-related tasks.
Speech data can be sourced from:
- Call-center and customer-service recordings
- Podcasts, interviews, and audiobooks
- Voice assistant interactions
These are just a few examples of the types of data used in NLP. The choice of data depends on the specific task,
model, and application.
3.1 Corpus
A corpus is a significant collection of texts written in everyday language that computers can read. When you
have more than one, they’re called ‘corpora.’ Corpora are the backbone of NLP systems. People make them
from things like digital text, audio transcripts, and even scanned documents. Corpora are really important for
studying and understanding how language is used in real life, just like people talk and write every day.
A corpus is an essential tool for Natural Language Processing (NLP), serving as a fundamental resource. A
corpus is a significant, organized collection of text or audio data that often includes a wide range of documents,
texts, or voices in one or more specific languages.
Monolingual Corpora
Contain text data in a single language, such as English, Spanish, or Chinese. It's useful in studying language
patterns, structures, and usage within that particular language.
Multilingual Corpora
Contain text data in multiple languages, often used for machine translation, cross-lingual information retrieval,
and language modeling.
Parallel Corpora
A parallel corpus is a collection of texts in two or more languages, where each text in one language is translated
into the other languages.
Characteristics:
Comparable Corpora
A comparable corpus is a collection of texts in two or more languages, where each text is not a direct translation
of the other, but shares similar characteristics, such as topic, genre, or style.
Characteristics:
Examples: Wikipedia articles on the same topic in different languages, news articles from different countries
on the same topic
Specialized Corpora
o Medical Corpora: contain medical texts, used for medical language processing and information
retrieval.
o Financial Corpora: contain financial texts, used for financial sentiment analysis and forecasting.
o Literary Corpora: contain literary texts, used for literary analysis and stylometry.
Spoken Corpora
Contain spoken language data, such as transcripts of conversations, speeches, or interviews, used for speech
recognition, spoken language understanding, and dialogue systems.
Multimodal Corpora
Contain multiple forms of data, such as text, images, audio, and video, used for multimodal processing,
sentiment analysis, and multimedia information retrieval.
Time-Series Corpora
o Historical Corpora: These corpora, which include writings from many historical periods, allow scholars
to look at the evolution of language and historical patterns.
o Temporal Corpora: They preserve texts over time, which makes them valuable for observing linguistic
evolution and researching the current state of the language.
Annotated Corpora
o Linguistically Annotated Corpora: These corpora contain linguistic annotations such as part-of-speech tags,
grammatical parses, and named-entity annotations, often added by hand. They are necessary for developing and
testing NLP models.
o Sentiment-Annotated Corpora: These corpora’s texts have sentiment or emotion information labeled,
which makes sentiment analysis and emotion detection tasks easier.
Web Corpora
Built by crawling and indexing web pages, used for web search, information retrieval, and language modeling.
These categories are not mutually exclusive, and many corpora can be classified under multiple categories.
Large Corpus Size: In general, a corpus size should be as large as possible. Large-scale specialized datasets
are essential for the training of algorithms that carry out sentiment analysis.
High-Quality Data: When it comes to the data in a corpus, high quality is essential. Even the smallest
inaccuracies in the training data might result in significant faults in the output of the machine learning system.
Clean Data: Building and maintaining a high-quality corpus depends on clean data. To produce a more reliable
corpus for NLP, data purification is essential, as it locates and eliminates any errors or duplicate data.
Diversity: Diverse categories, records, languages, and themes are all part of the wide range of linguistic
diversity that corpora attempt to represent. Because of this variability, NLP models and algorithms are capable
of handling a wide range of linguistic variants.
Annotation: Language-specific annotations, such as part-of-speech tags, grammatical parses, named entities,
sentiment labels, or semantic annotations, are included in many corpora. These annotations help supervise
machine learning and particular NLP tasks.
Metadata: Header information about the texts, such as author names, publication dates, source details, and
document names, is often present in corpora. To provide context and origin, metadata is essential.
3.1.3 Challenges Regarding Creating a Corpus
Creating a corpus, a large collection of text or speech data used for linguistic analysis, comes with several
challenges. Here are some of the key ones:
Data Collection
One challenge in creating a corpus is collecting a representative and diverse set of data. The data must
accurately encompass the target domain and be large enough to support thorough analysis, which may require
overcoming copyright limitations, negotiating agreements, and addressing privacy concerns.
Data Annotation
Annotating data in a corpus can be labor-intensive and time-consuming, especially when it comes to labeling
large amounts of data. The quality of annotations may also vary depending on human factors or agreement on
annotation guidelines.
Maintaining Consistency
Maintaining consistency in the structure and labeling of a corpus is vital for accurate analysis. Developing
standard guidelines and establishing best practices for data organization helps, but ensuring consistency across
all parts of the corpus can be challenging.
Language-specific features and domain-specific jargon can present difficulties in creating a corpus.
Understanding the unique characteristics of the language or domain is crucial in constructing a representative
and useful corpus.
Additionally, creating corpora for less-studied languages may involve addressing fewer resources and limited
existing research.
To remain relevant and useful over time, a corpus may need regular updates and expansion. Keeping up with
evolving language use or domain-specific developments may pose a challenge. This requires ongoing data
collection, annotation, and quality control to ensure the corpus stays updated and accurate.
4 Text Cleaning
Punctuation removal is a text preprocessing step where you remove punctuation marks and symbols (such as periods,
commas, exclamation marks, and emojis) from the text to simplify it and focus on the words themselves.
Lowercasing is a text preprocessing step where all letters in the text are converted to lowercase. This step is
implemented so that the algorithm does not treat the same words differently in different situations.
Removing digits
It is important to remove all numerical digits from the text dataset. This is because, in most cases, numerical
values do not provide any significant meaning to the text analysis process.
Moreover, they can interfere with natural language processing algorithms, which are designed to understand and
process text-based information.
Stop-word Removal
Stopwords are words that don't contribute to the meaning of a sentence, such as "the", "is", "in", and "and".
They can be removed without causing any change in the meaning of the sentence, which helps the analysis
focus on the important words.
Removal of URLs
When building a model, URLs are typically not relevant and can be removed from the text data.
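The cleaning steps above can be combined into one small function. The sketch below uses only the Python standard library; the stop-word list is a deliberately tiny, hypothetical one (libraries such as NLTK or spaCy provide full lists).

    import re
    import string

    # Hypothetical, deliberately small stop-word list for illustration only.
    STOPWORDS = {"the", "is", "a", "an", "and", "in", "on", "to"}

    def clean_text(text: str) -> str:
        text = text.lower()                                  # lowercasing
        text = re.sub(r"https?://\S+|www\.\S+", "", text)    # remove URLs
        text = re.sub(r"\d+", "", text)                      # remove digits
        text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
        words = [w for w in text.split() if w not in STOPWORDS]           # remove stop words
        return " ".join(words)

    print(clean_text("Visit https://example.com! The price is 100 dollars, ONLY today."))
    # -> visit price dollars only today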
5 Text Analysis
Text analysis, also known as text mining, is the process of extracting insights, patterns, and meaningful
information from text data. It involves using various NLP techniques and algorithms to analyze and interpret the
content, structure, and context of text.
Lexical analysis, also known as scanning, is the process of breaking down text into its constituent words or
tokens. This is the first step in the compilation process of programming languages and a fundamental step in
Natural Language Processing (NLP).
Types of Tokenization:
Example:
📄 Input Text: "Natural Language Processing is amazing!"
Tokens: ["Natural", "Language", "Processing", "is", "amazing", "!"]
📌 Real-World Uses:
Search engines break queries into words to match them with relevant documents.
Chatbots process user messages word by word to understand intent.
Stopwords are commonly used words (e.g., is, the, in, and) that do not carry much meaning. Removing them
helps in improving NLP model performance by reducing noise in the text.
Example:
📄 Input: "The quick brown fox jumps over the lazy dog."
After Stopword Removal: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
📌 Real-World Uses:
Stemming is the process of removing prefixes and suffixes to get the root word (stem). It helps reduce word
variations and improves NLP model generalization.
Example:
📄 Input Words: ["playing", "played", "plays"]
Stemmed Output: ["play", "play", "play"]
📌 Real-World Uses:
Search engines use stemming to return relevant results (e.g., a search for "running" also shows results for
"run").
Chatbots group similar words to recognize different user inputs.
Lemmatization is an advanced version of stemming where words are converted into their dictionary form
(lemma) while maintaining correct meaning. It relies on linguistic rules rather than just chopping off endings.
Example:
📄 Input Words: ["running", "better", "flies"]
Lemmatized Output: ["run", "good", "fly"]
📌 Real-World Uses:
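A small sketch of both operations with NLTK (assuming the package and its WordNet data are installed) is shown below; the comments give the outputs these calls typically produce.

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Assumes the WordNet data has been downloaded, e.g. nltk.download('wordnet').
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    words = ["running", "played", "flies"]

    print([stemmer.stem(w) for w in words])                   # ['run', 'play', 'fli']  (crude suffix stripping)
    print([lemmatizer.lemmatize(w, pos="v") for w in words])  # ['run', 'play', 'fly']  (dictionary forms)
    print(lemmatizer.lemmatize("better", pos="a"))            # 'good'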
Text normalization involves correcting spelling mistakes and converting text to a standard format (e.g.,
handling abbreviations, special characters, and informal text).
Example:
📄 Input: "Ths is an exmpl of txt normlztion."
📌 Real-World Uses:
Auto-correct in smartphones.
Grammar-checking tools like Grammarly.
Example:
📝 Sentence: "The quick brown fox jumps over the lazy dog."
POS Tags: The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN ./.
POS tags help parsers build sentence structures for syntax analysis.
Example: "He eats quickly."
o "eats" → Verb
o "quickly" → Adverb (modifies verb)
Syntax refers to the rules and structures that govern the way words are combined to form phrases, clauses, and
sentences.
Syntax analysis, also known as parsing, is the process of analyzing the grammatical structure of text, including
the relationships between words, phrases, and sentences. It involves breaking down text into its constituent
parts, such as words, phrases, and clauses, and identifying their syntactic roles.
Grammar checking ensures that sentences follow correct syntactic rules and flags errors. It detects incorrect
verb tense, word order, and subject-verb agreement mistakes.
🔹 Example:
📄 Incorrect Sentence: "She go to school yesterday."
✅ Corrected Sentence: "She went to school yesterday."
📌 Real-World Uses:
Used in writing assistants like Grammarly to detect incorrect grammar.
Helps in automatic essay scoring systems to evaluate sentence structure.
Syntactic ambiguity arises when a sentence can be interpreted in multiple ways due to its structure. This occurs
when the arrangement of words allows for more than one grammatical parsing, leading to different meanings.
Example:
One effective technique for resolving syntactic ambiguity in Natural Language Processing (NLP) is
Probabilistic Parsing. This approach utilizes statistical models to determine the most likely syntactic
structure of a sentence based on probabilities derived from large annotated corpora.
Example:
Consider the sentence: "I saw the man with the telescope."
Possible Interpretations:
1. I used a telescope to see the man.
2. I saw a man who had a telescope.
Probabilistic Parsing Application:
o The parser evaluates the likelihood of each interpretation based on the frequency of similar
structures in the training corpus.
o If the structure corresponding to the first interpretation ("using a telescope to see") is more
common in the corpus, the parser will assign it a higher probability and select it as the preferred
interpretation.
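A minimal sketch of probabilistic parsing with NLTK's ViterbiParser is shown below; the toy grammar and its rule probabilities are assumptions made up for this example, chosen so that attaching the prepositional phrase to the verb (interpretation 1) is the more likely parse.

    import nltk

    # Toy PCFG; the probabilities are illustrative assumptions, not corpus estimates.
    grammar = nltk.PCFG.fromstring("""
        S    -> NP VP        [1.0]
        NP   -> Pron         [0.4]
        NP   -> Det N        [0.4]
        NP   -> Det N PP     [0.2]
        VP   -> V NP         [0.4]
        VP   -> V NP PP      [0.6]
        PP   -> P NP         [1.0]
        Pron -> 'I'          [1.0]
        Det  -> 'the'        [1.0]
        N    -> 'man'        [0.5]
        N    -> 'telescope'  [0.5]
        V    -> 'saw'        [1.0]
        P    -> 'with'       [1.0]
    """)

    parser = nltk.ViterbiParser(grammar)
    tokens = "I saw the man with the telescope".split()

    # ViterbiParser returns the single most probable parse tree.
    for tree in parser.parse(tokens):
        tree.pretty_print()
        print("Probability:", tree.prob())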
Another effective technique for resolving syntactic ambiguity in Natural Language Processing (NLP) is
Contextual Disambiguation. This approach involves analyzing the context surrounding an ambiguous
sentence to determine the most appropriate syntactic structure.
1. Contextual Analysis:
o Examine the surrounding sentences or paragraphs to gather additional information that can
clarify the ambiguous structure.
2. Semantic Constraints:
o Utilize knowledge about the meanings of words and their typical usage to rule out implausible
interpretations.
3. World Knowledge:
o Apply general knowledge about the world to assess the plausibility of different interpretations.
Example: Consider the sentence: "The boy saw the man with the telescope."
Ambiguity:
o Did the boy use a telescope to see the man?
o Or did the boy see a man who had a telescope?
Contextual Disambiguation Application:
o If the preceding sentences discuss the boy's interest in stargazing, it is more likely that "with the
telescope" describes how the boy saw the man (i.e., using a telescope).
o Conversely, if the context talks about a man known for carrying a telescope, the phrase likely
describes the man.
Parsing is concerned with the rules governing the structure of sentences, including word order, phrase structure,
and grammatical relationships. Grammar is a fundamental aspect of syntactic analysis. The goal is to determine
how words are organized and related to each other, ensuring that the sentence adheres to the rules of formal
grammar.
Grammar
Grammar in formal languages consists of a finite set of rules (productions) that specify how valid strings
(sentences) are formed in a language. These rules are defined in terms of terminals, non-terminals, a start
symbol, and production rules.
Example:
S → aSa
S → bSb
S → ε
o N = {S} (non-terminal symbols)
o Σ = {a, b} (terminal symbols)
o P = {S → aSa, S → bSb, S → ε} (production rules)
o S = S (start symbol)
This grammar generates all even-length palindromes over {a, b}, such as "abba" and "baab".
1. Top-Down Parsing:
Begins with the highest-level rule and works down to the input tokens.
Attempts to match the input sentence with the start symbol of the grammar and recursively applies
production rules to generate the sentence.
If a parsing path fails (i.e., the current rule does not match the input), the parser backtracks and tries
alternative rules (backtracking).
Example:
Grammar:
S → NP VP
NP → Det N
VP → V PP
PP → P NP
Det → "the"
N → "cat" | "mat"
V → "sat"
P → "on"

Parse tree for "The cat sat on the mat":

            S
          /   \
        NP      VP
       /  \    /  \
     Det   N  V    PP
      |    |  |   /  \
     The  cat sat P    NP
                  |   /  \
                 on  Det   N
                      |    |
                     the  mat

Top-Down Parsing Steps:
1. Start with S.
2. Expand S to NP VP.
3. Expand NP to Det N.
   o Match Det to "the" and N to "cat".
4. Expand VP to V PP.
   o Match V to "sat".
5. Expand PP to P NP.
   o Match P to "on".
6. Expand NP to Det N.
   o Match Det to "the" and N to "mat".
2. Bottom-Up Parsing:
Begins with individual words (tokens) and gradually builds the parse tree upward by applying grammar rules.
Types of Bottom-Up Parsing:
(Shift-Reduce Parsing)
Uses two main operations:
• Shift: Reads a word from input.
• Reduce: Combines words into phrases according to grammar rules.
Example: parsing "The cat sat on the mat" with the same grammar as above:
o Shift 'The'
o Reduce 'The' to Det
o Shift 'cat'
o Reduce 'cat' to N
o Reduce 'Det N' to NP
o Shift 'sat'
o Reduce 'sat' to V
o Shift 'on'
o Reduce 'on' to P
o Shift 'the'
o Reduce 'the' to Det
o Shift 'mat'
o Reduce 'mat' to N
o Reduce 'Det N' to NP
o Reduce 'P NP' to PP
o Reduce 'V PP' to VP
o Reduce 'NP VP' to S
The resulting parse tree is the same as the one shown in the top-down example above.
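NLTK ships simple reference implementations of both strategies (a recursive-descent parser for top-down and a shift-reduce parser for bottom-up); the sketch below applies them to the toy grammar above, with the terminals lowercased for simplicity, and assumes NLTK is installed.

    import nltk

    # The toy grammar from the examples above (terminals lowercased).
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N
        VP -> V PP
        PP -> P NP
        Det -> 'the'
        N  -> 'cat' | 'mat'
        V  -> 'sat'
        P  -> 'on'
    """)

    tokens = "the cat sat on the mat".split()

    # Top-down (recursive descent) parsing.
    for tree in nltk.RecursiveDescentParser(grammar).parse(tokens):
        tree.pretty_print()

    # Bottom-up (shift-reduce) parsing.
    for tree in nltk.ShiftReduceParser(grammar).parse(tokens):
        tree.pretty_print()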
3. Hybrid Parsing:
a. Dependency Parsing
b. Constituency Parsing
c. Earley Parsing
Dependency Parsing
Dependency parsing is a technique in natural language processing (NLP) that focuses on analyzing the
grammatical structure of a sentence by identifying relationships between words. In this framework, each word is
connected to another word, forming a hierarchical structure that reveals how words depend on each other to
convey meaning.
Key Concepts:
Head: In a dependency relationship, the head is the central word that governs the relationship.
For instance, in the sentence "The dog sat on the mat", “sat” is the head because it is the main verb that governs
the sentence's structure.
Dependent: The words that modify or depend on the head.
For instance, in the sentence "The dog sat on the mat", "The," "dog," "on," "the," and "mat" are dependents, as
they depend on the head word "sat" to complete their syntactic relationships.
Dependency tags/labels: They represent the nature of the dependency between the head and
the dependent (e.g., nsubj for a nominal subject, det for a determiner).
Dependency Tree: Once the parsing algorithm is applied, it starts building the dependency tree
incrementally. The process begins with a root node, typically representing the sentence’s main verb.
As the algorithm progresses, it adds directed edges to connect words in the sentence, establishing
grammatical relationships. This tree illustrates the syntactic structure of the sentence, highlighting how
words are related. You can understand the concept through the following sentence: "The dog sat on the
mat".
o Nodes: These represent individual words in the sentence. For example, each word in "The dog
sat" would be a node.
o Edges: These are directed links between words, showing which word governs another. For
instance, an edge from “sat” to “dog” indicates that “sat” is the head of the subject “dog.”
o Root: The root is the topmost word in the tree, often representing the main verb or the core of
the sentence. In simple sentences, the root is usually the main verb, like “sat” in the example
sentence.
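A short dependency-parsing sketch with spaCy (assuming the en_core_web_sm model is installed) prints, for each word, its dependency label and its head; for this sentence the root is the verb "sat".

    import spacy

    # Assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The dog sat on the mat")

    # Each token is linked to its head word by a labeled dependency edge.
    for token in doc:
        print(f"{token.text:<5} --{token.dep_}--> {token.head.text}")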
A dependency tree is a directed graph that satisfies the following constraints:
Constituency parsing
Constituency parsing is a fundamental task in natural language processing (NLP) that involves analyzing the
syntactic structure of a sentence according to a phrase structure grammar. The goal is to break down a sentence
into its constituent parts (phrases or sub-phrases) and organize them into a hierarchical structure, typically
represented as a tree. This tree is called a constituency parse tree or phrase structure tree.
3. Parse Tree:
o The output of constituency parsing is a tree structure where:
The root node represents the entire sentence (S).
Example:
            S
          /   \
        NP     VP
     The cat   sat on the mat
4. Grammar Rules:
o Constituency parsing uses a set of grammar rules to define valid structures. For example:
S → NP VP (A sentence consists of a noun phrase followed by a verb phrase.)
2. Part-of-Speech Tagging:
o Assign a POS tag to each token (e.g., "The" → Det, "cat" → N).
3. Parsing:
o Apply grammar rules to group tokens into constituents and build the parse tree. This can be done
using algorithms like:
Top-down parsing: Starts with the root (S) and recursively applies rules to break it into
smaller constituents.
Bottom-up parsing: Starts with the words and combines them into larger constituents.
4. Output:
o The final output is a parse tree that represents the syntactic structure of the sentence.
3. Parsing:
o Apply grammar rules to build the tree:
S → NP VP
4. Parse Tree:
            S
          /   \
        NP      VP
       /  \    /  \
     Det   N  V    PP
      |    |  |   /  \
     The  cat sat P    NP
                  |   /  \
                 on  Det   N
                      |    |
                     the  mat
Ambiguity: Sentences can often have multiple valid parse trees (e.g., "I saw the man with the
telescope").
Complexity: Parsing long or complex sentences can be computationally expensive.
Grammar Coverage: Hand-written grammars may not cover all possible sentence structures.
Conclusion
Constituency parsing is a powerful technique for understanding the syntactic structure of sentences. By
breaking down sentences into their constituent parts and organizing them into a hierarchical tree, it provides
valuable insights for various NLP tasks. While it has its challenges, advancements in machine learning and
parsing algorithms continue to improve its accuracy and efficiency.
Earley Parsing
The Earley parser is a dynamic programming algorithm used in natural language processing (NLP) for parsing
sentences according to a context-free grammar (CFG). It is particularly well-suited for handling ambiguous
grammars and can efficiently parse sentences in O(n³) time for the worst case, where n is the length of the input
sentence. The Earley parser is named after its inventor, Jay Earley, who introduced it in 1970.
The Earley parser is widely used in NLP because it can handle left-recursive grammars (which defeat naive
top-down parsers) and, unlike the CKY algorithm, it does not require the grammar to be converted to Chomsky
Normal Form, so it works with a wide range of grammars. It is also used in compilers and other applications
where parsing is required.
o Example rules:
S → NP VP
NP → Det N
VP → V NP
2. Chart:
o The Earley parser uses a chart (a data structure) to store partial parse results. The chart is
divided into states, each representing a possible step in the parsing process.
3. State:
o A state in the Earley parser is represented as a tuple: (X → α • β, i, j), where:
X → α β is a production rule from the grammar; the dot (•) separates the part already recognized (α) from the part still expected (β).
i is the position in the input where recognition of this rule began, and j is the current position of the dot in the input.
The Earley parser processes the input sentence from left to right, building a chart of possible parses. It consists
of the following steps:
1. Initialization:
o Start with the initial state (S → • α, 0, 0) in the chart, where S is the start symbol of the grammar,
and the dot is at the beginning of the rule.
2. Prediction:
o For every state (X → α • Y β, i, j) in the chart, if Y is a non-terminal, add new states for all
productions of Y (i.e., (Y → • γ, j, j)).
3. Scanning:
o For every state (X → α • a β, i, j) in the chart, if the next input token a matches the terminal after
the dot, advance the dot and add the new state (X → α a • β, i, j+1) to the next chart entry.
4. Completion:
o For every completed state (X → γ •, i, j) in the chart, find all states that were waiting for X (i.e.,
states of the form (Y → α • X β, k, i)) and add the new state (Y → α X • β, k, j).
5. Termination:
o The parse is successful if the final chart entry contains a state of the form (S → α •, 0, n),
where S is the start symbol and n is the length of the input sentence.
Grammar:
S → NP VP
NP → Det N
VP → V NP
Det → "the"
N → "cat" | "mat"
V → "sat"
Step-by-Step Parsing:
1. Initialization:
o Add (S → • NP VP, 0, 0) to the chart.
2. Prediction:
o From (S → • NP VP, 0, 0), predict (NP → • Det N, 0, 0).
3. Scanning:
o Match "the" with Det → "the" and add (Det → "the" •, 0, 1).
4. Completion:
o From (Det → "the" •, 0, 1), advance the dot in the waiting NP rule to get (NP → Det • N, 0, 1); scanning "cat"
then gives (N → "cat" •, 1, 2), and completing it yields (NP → Det N •, 0, 2).
5. Prediction:
o From (S → NP • VP, 0, 2), predict (VP → • V NP, 2, 2).
6. Scanning:
o Match "sat" with V → "sat" and add `(V → "sat" •,
Phrase Chunking:
Phrase chunking is widely used in NLP tasks because it strikes a balance between
simplicity and usefulness, providing meaningful syntactic information without the
complexity of full parsing.
Prepositional Phrases (PP): e.g., "on the table," "in the park"
3. IOB Format:
o Chunks are often represented using the Inside-Outside-Beginning
(IOB) format:
B-{CHUNK}: Beginning of a chunk.
I-{CHUNK}: Inside a chunk.
O: Outside any chunk.
o Example
The DT B-NP
cat NN I-NP
sat VBD B-VP
on IN B-PP
the DT B-NP
mat NN I-NP
4. Chunking vs. Parsing:
o Chunking: Identifies non-overlapping phrases without hierarchical structure.
Input Sentence: "The quick brown fox jumps over the lazy dog."
1. Tokenization:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
2. POS Tagging:
The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN ./.
3. Chunking:
[NP The/DT quick/JJ brown/JJ fox/NN]
[VP jumps/VBZ]
[PP over/IN]
[NP the/DT lazy/JJ dog/NN]
4. IOB Format:
The DT B-NP
quick JJ I-NP
brown JJ I-NP
fox NN I-NP
jumps VBZ B-VP
over IN B-PP
the DT B-NP
lazy JJ I-NP
dog NN I-NP
. . O
Phrase chunking is a fundamental NLP technique that identifies meaningful phrases in text without constructing
a full parse tree. It is widely used in applications like information extraction, question answering, and sentiment
analysis. While rule-based methods are simple and interpretable, machine learning-based approaches offer
higher accuracy and flexibility. With the availability of powerful libraries like NLTK, spaCy, and Hugging
Face, implementing phrase chunking has become accessible for both researchers and practitioners.
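As a small illustration, NLTK's RegexpParser can chunk the example sentence above with a hand-written chunk grammar (assuming NLTK and its tokenizer/tagger data are installed); the chunk patterns here are simple assumptions chosen for this one sentence.

    import nltk
    from nltk.chunk import tree2conlltags

    # Assumes: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
    tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog."))

    # NP = optional determiner, any adjectives, then a noun; one-word VP and PP chunks.
    chunk_grammar = r"""
      NP: {<DT>?<JJ>*<NN.*>}
      VP: {<VB.*>}
      PP: {<IN>}
    """
    chunker = nltk.RegexpParser(chunk_grammar)
    chunk_tree = chunker.parse(tagged)

    print(chunk_tree)                    # bracketed chunks, e.g. (NP The/DT quick/JJ brown/JJ fox/NN)
    print(tree2conlltags(chunk_tree))    # (word, POS tag, IOB tag) triples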
Word Sense Disambiguation (WSD) is the process of identifying the correct meaning (sense) of a word in a
given context. Many words have multiple meanings, and choosing the right one is crucial for accurate language
understanding in applications like machine translation, information retrieval, and chatbots.
There are several techniques used to perform WSD, broadly categorized into:
1. Knowledge-Based Methods
o Use lexical resources like WordNet, thesauruses, and ontologies to determine the correct sense of a
word.
o Example: Lesk Algorithm (overlap-based approach).
4. Hybrid Approaches
o Combine knowledge-based and machine learning methods for improved accuracy.
Examples of WSD
Lesk Algorithm: Compares dictionary definitions and looks for overlapping words.
Contextual Word Embeddings (BERT): Uses deep learning to predict the correct sense based on surrounding
words.
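NLTK provides a simplified Lesk implementation; the sketch below (assuming NLTK and its WordNet data are installed) disambiguates "bank" in a deposit-money context. The simplified algorithm can sometimes pick an unexpected synset, so the output is indicative rather than guaranteed.

    from nltk.wsd import lesk

    # Assumes: nltk.download('wordnet')
    context = "I went to the bank to deposit money".split()

    sense = lesk(context, "bank")
    if sense is not None:
        print(sense.name(), "-", sense.definition())
    # Expected: a financial-institution sense of "bank" rather than the river-bank sense.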
Lexical Semantics in NLP focuses on the meaning of individual words and how they contribute
to the overall meaning of a sentence or text. It involves understanding the relationships between
words, their senses, and their usage in different contexts. Lexical semantics is crucial for tasks
like word sense disambiguation, synonym detection, and semantic similarity analysis.
o Example: The word "bank" can mean a financial institution or the side of a
river.
2. Word Senses:
o A single word can have different senses depending on the context.
3. Lexical Relations:
o Words can have specific relationships with other words, such as:
Synonyms: Words with similar meanings (e.g., "happy" and "joyful").
4. Word Embeddings:
o Words are represented as vectors in a high-dimensional space to capture
their meanings and relationships.
o Example: In word2vec or GloVe, the vectors for "king" and "queen" are close
to each other because they share similar contexts.
o Example: In the sentence "I went to the bank to deposit money," the word
"bank" refers to a financial institution, not a riverbank.
3. Semantic Similarity:
o Words: "king" and "queen"
5. Antonym Detection:
o Input: "hot"
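To make the word-embedding and semantic-similarity points above concrete, here is a small sketch using spaCy's medium English model (en_core_web_md), which ships with word vectors; the cosine similarity between related words such as "king" and "queen" comes out higher than between unrelated ones.

    import spacy

    # Assumes a model with vectors: python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md")

    king, queen, banana = nlp("king"), nlp("queen"), nlp("banana")

    # Cosine similarity of the underlying word vectors.
    print("king vs queen :", round(king.similarity(queen), 3))
    print("king vs banana:", round(king.similarity(banana), 3))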
Named Entity Recognition (NER) is an NLP task that identifies and classifies entities in text into predefined
categories such as:
Example of NER
Input Sentence:
"Elon Musk, the CEO of Tesla, announced a new AI project in California on January 15, 2025."
NER Output:
Token                Entity Type
Elon Musk            PERSON
Tesla                ORGANIZATION
AI project           MISCELLANEOUS
California           LOCATION
January 15, 2025     DATE
NER Approaches
Uses predefined lists of names and rules (e.g., "Mr." followed by a capitalized word is likely a person).
Limitations: Cannot handle new or unseen names.
Uses Recurrent Neural Networks (RNN), LSTMs, Transformers (BERT, spaCy, RoBERTa, GPT-3+ etc.).
Can recognize new and complex entity names based on context.
Example: Google's BERT model can capture meaning from context better than traditional methods.
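A short NER sketch with spaCy (assuming the en_core_web_sm model is installed) applied to the example sentence; note that spaCy's own label set uses names such as PERSON, ORG, GPE, and DATE, so the exact labels differ slightly from the table above.

    import spacy

    # Assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    text = ("Elon Musk, the CEO of Tesla, announced a new AI project "
            "in California on January 15, 2025.")

    for ent in nlp(text).ents:
        print(f"{ent.text:<20} {ent.label_}")
    # Typical output: Elon Musk (PERSON), Tesla (ORG), California (GPE), January 15, 2025 (DATE).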
Semantic Role Labeling (SRL), also known as shallow semantic parsing, is the task of determining the
semantic roles of words in a sentence. It helps answer "Who did what to whom, when, where, and how?"
SRL is used to analyze sentence structure beyond syntax by identifying arguments associated with a verb
(predicate) and labeling their roles.
The participants in the event described by the predicate. Common arguments include:
o Agent (A0): Who performs the action.
o Patient/Theme (A1): Who/what is affected by the action.
o Recipient (A2): Who receives the action.
o Instrument (A3): The tool or means used to perform the action.
o Location (AM-LOC): Where the action happens.
o Time (AM-TMP): When the action happens.
Example of SRL
Sentence:
SRL Breakdown:
Phrase Role (Label) Description
Coreference resolution is an NLP task that identifies and links expressions that refer to the same entity in a
text. It helps resolve pronouns, definite noun phrases, and other referring expressions to their
corresponding entities.
3. Bridging Coreference
4. Cataphora Resolution
Possible Resolutions:
Without context, it's ambiguous. Coreference resolution techniques help determine the correct mapping.
6 Feature Extraction
Feature extraction is the process of selecting and transforming raw data into features that are more suitable for
modeling. In NLP, feature extraction involves converting text data into numerical representations that can be
fed into machine learning algorithms.
a) Bag of words
b) TF-IDF (Term Frequency-Inverse Document Frequency)
c) N-grams
Bag of Words:
The bag-of-words model is a simple way to represent a document in numerical form before we can feed
it into a machine learning algorithm. For any natural language processing task, we need a way to
accomplish this before any further processing. Machine learning algorithms can’t operate on raw text;
we need to convert the text to some sort of numerical representation. This process is also known as
embedding the text.
How does the bag-of-words model work?
Bag-of-words featurization may, at times, be considered a beginner-level form of text processing, given its
ostensible conceptual simplicity in counting words across a given text set. Bag-of-words models are more
involved than that, however.
Understanding bag-of-words featurization demands at least a beginner's understanding of vector spaces. A
vector space is a multi-dimensional space in which points are plotted. In a bag of words approach, each
individual word becomes a separate dimension (or axis) of the vector space. If a text set has n number of words,
the resulting vector space has n dimensions, one dimension for each unique word in the text set. The model then
plots each separate text document as a point in the vector space. A point’s position along a certain dimension is
determined by the number of times that dimension’s word appears within the point’s document.
For example, assume we have a text set in which the contents of two separate documents are respectively:
Because it is difficult to imagine anything beyond a three-dimensional space, we will limit ourselves to just that.
A vector space for a corpus containing these two documents would therefore have three dimensions: one each for
red, rose, and violet.
Since red, rose, and violet all occur once in Document 1, the vector for that document in this space is (1,1,1).
In Document 2, red appears twice, rose once, and violet not at all, so the vector point for Document 2 is (2,1,0).
Both of these document-points can then be plotted in the three-dimensional vector space.
This representation treats text documents as data vectors in a three-dimensional feature space. But bag of
words can also represent words as feature vectors in a data space. A feature vector signifies the value
(occurrence) of a given feature (word) in a specific data point (document). So, across Documents 1 and 2, the
feature vector for red is (1,2), for rose it is (1,1), and for violet it is (1,0).
Note that the order of words in the original documents is irrelevant. For a bag of words model, all that matters is
each word’s number of occurrences across the text set.
Let us see an example of how the bag-of-words technique converts text into vectors.
Example (1), without preprocessing:
Sentence 1: "Welcome to Great Learning, Now start learning"
Sentence 2: "Learning is a good practice"
The words in each sentence are:
Sentence 1     Sentence 2
Welcome        Learning
to             is
Great          a
Learning       good
,              practice
Now
start
learning
Step 1: Go through all the words in the above text and make a list of all of the words in our model vocabulary.
Welcome
To
Great
Learning
,
Now
start
learning
is
a
good
practice
Note that the words ‘Learning’ and ‘ learning’ are not the same here because of the difference in their cases and
hence are repeated. Also, note that a comma ‘ , ’ is also taken in the list.
Because we know the vocabulary has 12 words, we can use a fixed-length document-representation of 12, with
one position in the vector to score each word.
The scoring method we use here is to count the presence of each word and mark 0 for absence. This scoring
method is used more generally.
Word Frequency
Welcome 1
to 1
Great 1
Learning 1
, 1
Now 1
start 1
learning 1
is 0
a 0
good 0
practice 0
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Word Frequency
Welcome 0
to 0
Great 0
Learning 1
, 0
Now 0
start 0
learning 0
is 1
a 1
good 1
practice 1
Sentence 2 ➝ [ 0,0,0,1,0,0,0,0,1,1,1,1 ]
            Welcome  to  Great  Learning  ,  Now  start  learning  is  a  good  practice
Sentence 1     1     1     1       1      1   1     1       1      0   0    0       0
Sentence 2     0     0     0       1      0   0     0       0      1   1    1       1
But is this the best way to perform a bag of words? The above example was not the best illustration of how to
use a bag of words. The words "Learning" and "learning", although having the same meaning, are counted as two
separate words. Also, a comma ",", which does not convey any information, is included in the vocabulary.
Let us make some changes and see how we can use bag of words in a more effective way.
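A minimal sketch of this more effective version, assuming scikit-learn is installed: CountVectorizer lowercases the text and strips punctuation while tokenizing, so "Learning" and "learning" collapse into one feature and the comma disappears (its default tokenizer also drops single-character words such as "a").

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "Welcome to Great Learning, Now start learning",
        "Learning is a good practice",
    ]

    vectorizer = CountVectorizer()          # lowercases and removes punctuation by default
    bow = vectorizer.fit_transform(sentences)

    print(vectorizer.get_feature_names_out())   # the learned vocabulary
    print(bow.toarray())                        # one count vector per sentence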
Limitations of Bag-of-Words
Although Bag-of-Words is quite efficient and easy to implement, there are still some disadvantages to this
technique, which are given below:
1. The model ignores the location information of the word. The location information is a piece of very
important information in the text. For example “today is off” and “Is today off”, have the exact same
vector representation in the BoW model.
2. Bag-of-words models don't respect the semantics of words. For example, the words 'soccer' and
'football' are often used in the same context, yet the vectors corresponding to these words are
quite different in the bag-of-words model. The problem becomes more serious when modeling
sentences: "Buy used cars" and "Purchase old automobiles" are represented by totally different
vectors in the Bag-of-words model.
3. Vocabulary coverage is a big issue faced by the Bag-of-Words model. For example, if the model
comes across a new word it has not seen before, say a rare but informative word like
'biblioklept' (one who steals books), the BoW model will simply ignore it, because the word is
not in the vocabulary the model was built on.
TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural
language processing and information retrieval. It measures how important a term is within a document relative
to a collection of documents (i.e., relative to a corpus).
Words within a text document are transformed into importance numbers by a text vectorization process. There
are many different text vectorization scoring schemes, with TF-IDF being one of the most common.
As its name implies, TF-IDF vectorizes/scores a word by multiplying the word’s Term Frequency (TF) with the
Inverse Document Frequency (IDF).
Term Frequency: The TF of a term or word is the number of times the term appears in a document divided by the
total number of words in the document.
TF = (number of times the term appears in the document) / (total number of terms in the document)
Inverse Document Frequency: The IDF of a term reflects the proportion of documents in the corpus that contain the
term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance
values than words common across all documents (e.g., a, the, and). A common formulation is
IDF = log(N / df), where N is the total number of documents in the corpus and df is the number of documents
that contain the term.
TF-IDF = TF × IDF
Translated into plain English, the importance of a term is high when it occurs a lot in a given document and rarely
in others. In short, commonality within a document measured by TF is balanced by rarity between documents
measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the corpus.
TF-IDF is useful in many natural language processing applications. For example, Search Engines use TF-IDF to
rank the relevance of a document for a query. TF-IDF is also employed in text classification, text
summarization, and topic modeling.
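A minimal TF-IDF sketch with scikit-learn (assuming it is installed) on three made-up documents; words that appear in every document, such as "the", receive low weights, while document-specific words score higher. Note that scikit-learn uses a smoothed variant of the IDF formula given above.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are popular pets",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)
    terms = vectorizer.get_feature_names_out()

    # Show the three highest-weighted terms in each document.
    for i, row in enumerate(tfidf.toarray()):
        top = sorted(zip(terms, row), key=lambda pair: -pair[1])[:3]
        print(f"Document {i + 1}:", [(term, round(score, 2)) for term, score in top])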