
Natural Language Processing

1 What is Natural Language Processing (NLP)?


Natural Language Processing (NLP) is a sub-field of Computer Science, and in particular of Artificial Intelligence (AI), concerned with enabling computers to understand and process human language. Technically, the main task of NLP is to program computers to analyze and process large amounts of natural language data. There are the following two components of NLP -

1. Natural Language Understanding (NLU)

Natural Language Understanding (NLU) helps the machine understand and analyse human language by extracting metadata from content, such as concepts, entities, keywords, emotions, relations, and semantic roles.

NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.

NLU involves the following tasks -

o Mapping the given input into a useful representation.
o Analyzing different aspects of the language.
2. Natural Language Generation (NLG)

Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural language representation. It mainly involves text planning, sentence planning, and text realization.

1.1 Why do we need NLP?


We need NLP for many reasons:

o Breaking language barriers: NLP helps bridge the gap between languages, enabling people to
communicate across languages and cultures.
o Improving human-computer interaction: NLP enables computers to understand human language,
making it easier for people to interact with technology.
o Automating tasks: NLP enables automation of tasks such as data entry, customer service, and language
translation.
o Enhancing productivity: NLP helps streamline processes, freeing up time for more strategic and creative
work.

These are just a few examples of why NLP is essential in today's world. As technology continues to evolve, the
importance of NLP will only continue to grow.

1.2 Real World Uses of NLP


Virtual Assistants

Virtual assistants like Siri, Alexa, and Google Assistant use NLP to understand voice commands and respond
accordingly. They can perform tasks such as:

o - Answering questions
o - Setting reminders
o - Sending messages
o - Making calls
o - Controlling smart home devices
Language Translation

Language translation software like Google Translate uses NLP to translate text and speech from one language to
another. This enables people to communicate across language barriers.

Sentiment Analysis

Sentiment analysis tools use NLP to analyze customer feedback and sentiment on social media, reviews, and
other online platforms. This helps businesses to:

o Understand customer opinions


o Identify areas for improvement
o Monitor brand reputation
Chatbots

Chatbots use NLP to understand and respond to customer inquiries, providing 24/7 customer support. They can:

o - Answer frequently asked questions


o - Help with transactions
o - Provide product information
o - Route complex issues to human support agents
Speech Recognition

Speech recognition systems use NLP to transcribe spoken words into text, enabling applications like:

o - Voice-to-text messaging
o - Voice-controlled interfaces
o - Transcription services

Information Retrieval

Information retrieval systems use NLP to improve search engine results, enabling users to:

o - Find relevant information quickly


o - Filter out irrelevant results
o - Get personalized search results
Text Summarization

Text summarization tools use NLP to automatically summarize long documents, articles, and other texts into
shorter, more digestible versions.

Social Media Monitoring

Social media monitoring tools use NLP to track social media conversations, identifying trends and potential
issues.

Customer Service Automation

Customer service automation uses NLP to automate customer support tasks, such as:

o - Answering frequently asked questions


o - Routing complex issues to human support agents
o - Providing personalized support
Language Learning

Language learning platforms use NLP to provide personalized language instruction, such as:

o - Grammar correction
o - Vocabulary building
o - Pronunciation practice
Content Generation

Content generation tools use NLP to automatically generate content, such as:

o - News articles
o - Social media posts
o - Product descriptions
Healthcare

NLP is used in healthcare to:

o - Analyze medical texts and reports


o - Identify potential health risks
o - Develop personalized treatment plans
Finance

NLP is used in finance to:

o - Analyze financial texts and reports


o - Identify potential investment opportunities
o - Develop personalized investment plans
These are just a few examples of the many uses of NLP. The field is constantly evolving, and new applications
are being developed all the time.

1.3 Tasks of NLP


Tasks of NLP refer to the specific functions or operations that NLP systems perform on language data, such as:

Text Analysis Tasks

o Tokenization: breaking down text into individual words or tokens. For example, 'The cat sat on the mat'
is tokenized into ['The', 'cat', 'sat', 'on', 'the', 'mat'].
o Stopword Removal: removing common words like "the", "and", etc. that don't add much value.
o Stemming or Lemmatization: reducing words to their base form (e.g., "running" becomes "run").
o Part-of-Speech Tagging: identifying the grammatical category of each word (e.g., noun, verb, adjective).

Entity Recognition Tasks

o Named Entity Recognition (NER): identifying named entities like people, organizations, locations.
o Entity Disambiguation: identifying the correct meaning of entities with multiple possible meanings.

Sentiment and Opinion Analysis Tasks


o Sentiment Analysis: determining the sentiment or emotional tone of text (e.g., positive, negative,
neutral).
o Opinion Mining: extracting opinions or sentiment from text.

Language Translation Tasks

o Machine Translation: translating text from one language to another.


o Language Detection: detecting the language of a piece of text.

Text Classification Tasks

o Text Classification: classifying text into predefined categories (e.g., spam/not spam, positive/negative
sentiment).
o Topic Modeling: identifying the underlying topics in a large corpus of text.

Information Retrieval Tasks


o Information Retrieval: retrieving relevant documents or information from a large corpus of text.
o Question Answering: answering questions based on the content of a text or document.

Dialogue and Conversation Tasks

o Dialogue Management: managing conversations between humans and machines.


o Chatbot Development: developing chatbots that can engage in natural-sounding conversations.

Speech Recognition Tasks

o Speech Recognition: transcribing spoken words into text.


o Speech Synthesis: generating spoken words from text.

1.4 Challenges in NLP


Ambiguity and Uncertainty

One of the biggest challenges in NLP is ambiguity and uncertainty. Natural language is inherently ambiguous,
with words and phrases often having multiple meanings. For example, the word "bank" can refer to a financial
institution or the side of a river. This ambiguity makes it difficult for NLP systems to accurately understand the
meaning of text. Additionally, uncertainty in language can arise from incomplete or noisy data, making it
challenging for NLP systems to make accurate predictions.

Contextual Understanding

Another challenge in NLP is contextual understanding. Natural language is often dependent on the context in
which it is used. For example

"The bat flew across the field." (In one context, this sentence might refer to a flying mammal, while in another context, it might refer to a piece of sports equipment used in cricket or baseball.)

"The professor is going to kill us tomorrow." (In one context, this sentence might be a literal statement about
violence, while in another context, it might be an idiomatic expression meaning "the professor is going to give
us a very difficult exam tomorrow.")
These examples illustrate how the meaning of a sentence can change depending on the context in which it is used. NLP systems need to be able to understand the context in order to accurately interpret the meaning of text. Without that context, they struggle with the nuances of language, leading to misinterpretations and inaccuracies.

Sarcasm and Idioms

Sarcasm and idioms are another challenge in NLP. Sarcasm, in particular, can be difficult to detect, as it often
involves saying the opposite of what you mean. Idioms, on the other hand, are phrases or expressions that have
a different meaning than the literal meaning of the individual words. For example, "kick the bucket" means to
die, not to physically kick a bucket. NLP systems struggle to understand sarcasm and idioms, leading to
misinterpretations and inaccuracies.

Language Evolution

Language is constantly evolving, with new words, phrases, and grammar emerging all the time. This makes it
challenging for NLP systems to keep up with the latest language trends and nuances. Additionally, language
evolution can lead to changes in the meaning of words and phrases over time, making it difficult for NLP
systems to accurately understand the meaning of text.

Linguistic Diversity

The world has over 7,000 languages, each with its unique grammar, syntax, and vocabulary. This linguistic
diversity poses a significant challenge for NLP systems, as they often struggle to handle languages with non-
Latin scripts, tonal languages, or languages with complex grammar systems. Furthermore, many languages lack
sufficient resources, such as labeled training data, which makes it difficult to develop accurate NLP models. As
a result, NLP systems often perpetuate linguistic biases, favoring languages with more resources and attention.

Limited Training Data

Many NLP systems rely on large amounts of training data to learn patterns and relationships in language.
However, obtaining large amounts of high-quality training data can be challenging, particularly for low-
resource languages or specialized domains. Limited training data can lead to biased or inaccurate NLP models.

Explainability and Transparency

NLP systems can be complex and difficult to interpret, making it challenging to understand why they make
certain predictions or decisions. Ensuring that NLP systems are explainable and transparent is essential for
building trust and accountability.

Handling Multimodal Input

NLP systems often focus on text-based input, but many applications involve multimodal input, such as speech,
images, or video. Handling multimodal input requires integrating multiple AI modalities and developing new
architectures and algorithms.

Real-World Applications

Finally, deploying NLP systems in real-world applications can be challenging. NLP systems often require
significant computational resources, and ensuring that they are scalable, efficient, and reliable in real-world
settings can be difficult. Additionally, integrating NLP systems with other AI modalities and developing user-
friendly interfaces can be a challenge.

2 NLP Pipeline: Key Stages and Components

The NLP pipeline is a series of stages that enable computers to process, analyze, and understand human
language.

Data Collection

Gather relevant text data from various sources (e.g., web scraping, APIs, databases). In Natural Language
Processing (NLP), various types of data are used to train and evaluate models such as text, speech, multimodal
and time series.

Text Cleaning

Clean and preprocess the data by removing noise such as:

o Removing Punctuation & Special Characters


o Changing text into lowercase
o Removing digits
o Stop-word Removal
o Removal of URLs

Text Analysis

o Lexical Analysis: Examines the meaning of individual words and phrases, including their syntax,
semantics, and morphology.
o Syntax analysis: Examines the grammatical structure of text, including part-of-speech tagging, named
entity recognition, and dependency parsing.
o Semantic analysis: Focuses on the meaning of text, including sentiment analysis, topic modeling, and
entity disambiguation.
o Pragmatic analysis: Considers the context and purpose of text, including discourse analysis, dialogue
analysis, and intent detection.

Feature Extraction
o Bag-of-Words: Represent text as a bag of words, ignoring word order.
o Term Frequency-Inverse Document Frequency (TF-IDF): Weight word importance based on frequency
and rarity.
o Word Embeddings: Represent words as vectors in a high-dimensional space, capturing semantic
relationships.
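As a rough illustration of these representations, the sketch below (scikit-learn assumed installed; the two toy sentences are invented for demonstration) builds Bag-of-Words and TF-IDF vectors:

# Minimal feature-extraction sketch with scikit-learn (assumed installed).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The cat sat on the mat.",      # toy documents for illustration only
        "The dog chased the cat."]

bow = CountVectorizer()                  # Bag-of-Words: raw term counts, word order ignored
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())       # the learned vocabulary

tfidf = TfidfVectorizer()                # TF-IDF: counts weighted by how rare a term is
print(tfidf.fit_transform(docs).toarray())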

Model Selection and Training

o Choose an NLP Task: Determine the specific NLP task (e.g., sentiment analysis, text classification).
o Select a Model: Choose a suitable NLP model (e.g., Naive Bayes, Support Vector Machines, Recurrent
Neural Networks).
o Train the Model: Train the model using the labeled training data.
o Tune Hyperparameters: Optimize model hyperparameters for better performance.

Model Evaluation and Deployment

o Evaluate Model Performance: Assess the model's performance on the testing data using metrics like
accuracy, precision, recall.
o Refine the Model: Refine the model based on evaluation results and iterate until satisfactory
performance.
o Deploy the Model: Deploy the trained model in a production-ready environment, such as a web
application or API.
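The following sketch ties these stages together for a toy sentiment-classification task; the tiny labeled dataset and the Naive Bayes choice are assumptions made purely for illustration, with scikit-learn assumed installed.

# Train, evaluate, and reuse a simple text classifier (a sketch, not a full pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

train_texts  = ["I love this movie", "Great product", "Terrible service", "I hate waiting"]
train_labels = ["positive", "positive", "negative", "negative"]
test_texts   = ["Great movie", "Terrible product"]
test_labels  = ["positive", "negative"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())   # features + classifier in one object
model.fit(train_texts, train_labels)                        # train on the labeled data

print(model.predict(test_texts))                            # predictions for unseen text
print(classification_report(test_labels, model.predict(test_texts)))   # precision, recall, etc.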

Maintenance and Updates

o Monitor Model Performance: Continuously monitor the model's performance on new, unseen data.
o Update the Model: Update the model periodically to adapt to changing language patterns, new data, or
emerging trends.
o Re-train the Model: Re-train the model as necessary to maintain its performance and accuracy.

3 Data Collection
In Natural Language Processing (NLP), various types of data are used to train and evaluate models. Here are
some of the most common types of data:

Text Data: This is the most common type of data used in NLP. Text data can be sourced from various places,
such as:
- Books and articles
- Social media platforms (e.g., Twitter, Facebook)
- Online forums and discussions
- Product reviews and feedback
Speech Data: This type of data is used for speech recognition, speech synthesis, and other speech-related tasks.
Speech data can be sourced from:

- Audio recordings (e.g., podcasts, audiobooks)


- Speech databases (e.g., LibriSpeech, TED-LIUM)
Multimodal Data: This type of data combines text, images, and/or audio to provide a richer understanding of
the context. Multimodal data can be sourced from:

- Social media platforms (e.g., Instagram, TikTok)


- Multimedia databases (e.g., YouTube, Flickr)
Time-Series Data: This type of data is used for tasks like language modeling, sentiment analysis, and topic
modeling, where the order of the data matters. Time-series data can be sourced from:

- Social media platforms (e.g., Twitter, Facebook)


- Online forums and discussions
Labeled Data: This type of data is used for supervised learning tasks, where the data is annotated with labels or
tags. Labeled data can be sourced from:

- Manually annotated datasets (e.g., IMDB, 20 Newsgroups)


- Crowdsourced labeling platforms (e.g., Amazon Mechanical Turk)
Unlabeled Data: This type of data is used for unsupervised learning tasks, where the data is not annotated with
labels or tags. Unlabeled data can be sourced from:

- Web scraping
- Social media platforms (e.g., Twitter, Facebook)
- Online forums and discussions
These are just a few examples of the types of data used in NLP. The choice of data depends on the specific task,
model, and application.

3.1 Corpus
A corpus is a significant collection of texts written in everyday language that computers can read. When you
have more than one, they’re called ‘corpora.’ Corpora are the backbone of NLP systems. People make them
from things like digital text, audio transcripts, and even scanned documents. Corpora are really important for
studying and understanding how language is used in real life, just like people talk and write every day.

Why Do We Need Corpus in NLP?

A corpus is an essential tool for Natural Language Processing (NLP), serving as a fundamental resource. A
corpus is a significant, organized collection of text or audio data that often includes a wide range of documents,
texts, or voices in one or more specific languages.

3.1.1 Types of Corpora in NLP


Corpora, the plural of corpus, are large collections of written or spoken language that can be categorized based on their characteristics, composition, and purpose. Let's have a look at different types of corpora:

Monolingual Corpora

Contain text data in a single language, such as English, Spanish, or Chinese. They are useful for studying language patterns, structures, and usage within that particular language.

Multilingual Corpora

Contain text data in multiple languages, often used for machine translation, cross-lingual information retrieval,
and language modeling.
Parallel Corpora

A parallel corpus is a collection of texts in two or more languages, where each text in one language is translated
into the other languages.

Characteristics:

o Same content, different languages


o Sentence-aligned or paragraph-aligned
o Used for machine translation, bilingual lexicon extraction, and contrastive linguistics

Comparable Corpora

A comparable corpus is a collection of texts in two or more languages, where each text is not a direct translation
of the other, but shares similar characteristics, such as topic, genre, or style.

Characteristics:

o Similar content, different languages
o Not sentence-aligned or paragraph-aligned
o Used for cross-lingual information retrieval, topic modeling, and sentiment analysis

Examples: Wikipedia articles on the same topic in different languages, news articles from different countries
on the same topic

Specialized Corpora

Focus on specific domains, genres, or topics, such as:

o Medical Corpora: contain medical texts, used for medical language processing and information
retrieval.
o Financial Corpora: contain financial texts, used for financial sentiment analysis and forecasting.
o Literary Corpora: contain literary texts, used for literary analysis and stylometry.

Spoken Corpora

Contain spoken language data, such as transcripts of conversations, speeches, or interviews, used for speech
recognition, spoken language understanding, and dialogue systems.

Multimodal Corpora

Contain multiple forms of data, such as text, images, audio, and video, used for multimodal processing,
sentiment analysis, and multimedia information retrieval.

Time-Series Corpora

o Historical Corpora: These corpora, which include writings from many historical periods, allow scholars
to look at the evolution of language and historical patterns.
o Temporal Corpora: They preserve texts over time, which makes them valuable for observing linguistic
evolution and researching the current state of the language.
Annotated Corpora

o Linguistically Annotated Corpora: These corpora contain linguistic annotations such as part-of-speech tags, grammatical parses, and named entity annotations that are added by hand. They are necessary for developing and testing NLP models.
o Sentiment-Annotated Corpora: These corpora’s texts have sentiment or emotion information labeled,
which makes sentiment analysis and emotion detection tasks easier.

Web Corpora

Built by crawling and indexing web pages, used for web search, information retrieval, and language modeling.

These categories are not mutually exclusive, and many corpora can be classified under multiple categories.

3.1.2 Features of Corpus in NLP


The features of a corpus in NLP make it super useful for all sorts of language-related tasks and research. Here
are some of the important features of an NLP corpus:

Large Corpus Size: In general, a corpus size should be as large as possible. Large-scale specialized datasets
are essential for the training of algorithms that carry out sentiment analysis.

High-Quality Data: When it comes to the data in a corpus, high quality is essential. Even the smallest
inaccuracies in the training data might result in significant faults in the output of the machine learning system.

Clean Data: Building and maintaining a high-quality corpus depends on clean data. To produce a more reliable
corpus for NLP, data purification is essential, as it locates and eliminates any errors or duplicate data.

Diversity: Diverse categories, records, languages, and themes are all part of the wide range of linguistic
diversity that corpora attempt to represent. Because of this variability, NLP models and algorithms are capable
of handling a wide range of linguistic variants.

Annotation: Language-specific annotations, such as part-of-speech tags, grammatical parses, named entities,
sentiment labels, or semantic annotations, are included in many corpora. These annotations help supervise
machine learning and particular NLP tasks.

Metadata: Header information about the texts, such as author names, publication dates, source details, and
document names, is often present in corpora. To provide context and origin, metadata is essential.
3.1.3 Challenges Regarding Creating a Corpus
Creating a corpus, a large collection of text or speech data used for linguistic analysis, comes with several
challenges. Here are some of the key ones:

Data Collection

One challenge in creating a corpus is collecting a representative and diverse set of data. The data must
accurately encompass the target domain and be large enough to support thorough analysis, which may require
overcoming copyright limitations, negotiating agreements, and addressing privacy concerns.

Data Annotation

Annotating data in a corpus can be labor-intensive and time-consuming, especially when it comes to labeling
large amounts of data. The quality of annotations may also vary depending on human factors or agreement on
annotation guidelines.

Maintaining Consistency

Maintaining consistency in the structure and labeling of a corpus is vital for accurate analysis. Developing
standard guidelines and establishing best practices for data organization helps, but ensuring consistency across
all parts of the corpus can be challenging.

Language and Domain Specificity

Language-specific features and domain-specific jargon can present difficulties in creating a corpus.
Understanding the unique characteristics of the language or domain is crucial in constructing a representative
and useful corpus.

Additionally, creating corpora for less-studied languages may involve addressing fewer resources and limited
existing research.

Corpus Updates and Expansion

To remain relevant and useful over time, a corpus may need regular updates and expansion. Keeping up with
evolving language use or domain-specific developments may pose a challenge. This requires ongoing data
collection, annotation, and quality control to ensure the corpus stays updated and accurate.

4 Text Cleaning

Clean and preprocess the data by removing noise

Removing Punctuation & Special Characters

Punctuation removal is a text preprocessing step where you remove all punctuation marks (such as periods,
commas, exclamation marks, emojis etc.) from the text to simplify it and focus on the words themselves.

Changing text into lowercase

Lowercasing is a text preprocessing step where all letters in the text are converted to lowercase. This step is
implemented so that the algorithm does not treat the same words differently in different situations.
Removing digits

It is important to remove all numerical digits from the text dataset. This is because, in most cases, numerical
values do not provide any significant meaning to the text analysis process.

Moreover, they can interfere with natural language processing algorithms, which are designed to understand and
process text-based information.

Stop-word Removal

Stopwords are words that don’t contribute to the meaning of a sentence such as:

o Articles (the, a, an)


o Prepositions (of, in, on, at)
o Conjunctions (and, but, or)
o Pronouns (I, you, he, she)
o Auxiliary verbs (is, are, am)
o Other common words (that, this, these, those)

So they can be removed without causing any change in the meaning of the sentence. Removing these can help
focus on the important words.

Removal of URLs

When building a model, URLs are typically not relevant and can be removed from the text data.
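A rough sketch of these cleaning steps is shown below; it assumes NLTK is installed for the stopword list, and the sample string is invented for demonstration.

# Simple text-cleaning function covering the steps above (a sketch).
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)            # one-time download of the stopword list
STOPWORDS = set(stopwords.words("english"))

def clean_text(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # removal of URLs
    text = text.lower()                            # changing text into lowercase
    text = re.sub(r"\d+", " ", text)               # removing digits
    text = re.sub(r"[^a-z\s]", " ", text)          # removing punctuation & special characters
    words = [w for w in text.split() if w not in STOPWORDS]   # stop-word removal
    return " ".join(words)

print(clean_text("Check https://example.com: 3 NEW offers!!!"))   # -> "check new offers"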

5 Text Analysis

Text analysis, also known as text mining, is the process of extracting insights, patterns, and meaningful
information from text data. It involves using various NLP techniques and algorithms to analyze and interpret the
content, structure, and context of text.

5.1 Types of Text Analysis

o Lexical Analysis (Word-Level Processing)


o Syntax Analysis (Sentence Structure Processing)
o Semantic Analysis (Meaning Extraction)
o Pragmatic Analysis (Context Understanding)

5.1.1 Lexical Analysis (Word-Level Processing)

Lexical analysis, also known as scanning, is the process of breaking down text into its constituent words or
tokens. This is the first step in the compilation process of programming languages and a fundamental step in
Natural Language Processing (NLP).

5.1.1.1 Key Tasks in Lexical Analysis

o Tokenization (Breaking Text into Words or Sentences)


o Stopword Removal (Filtering Out Common Words)
o Stemming (Reducing Words to Their Root Form)
o Lemmatization (Converting Words to Base Form Using Linguistic Rules)
o Spelling Correction & Normalization
o Parts-of-Speech Tagging

Tokenization (Breaking Text into Words or Sentences)


Tokenization is the process of splitting text into smaller units (words, subwords, or sentences). It is the first step in NLP
text processing because computers cannot process entire paragraphs efficiently without breaking them down.

Types of Tokenization:

o Word Tokenization – Splits text into individual words.


o Sentence Tokenization – Splits text into complete sentences.
o Subword Tokenization – Breaks words into smaller components (useful for languages with complex words).

Example:
📄 Input Text: "Natural Language Processing is amazing!"

 Word Tokenization: ["Natural", "Language", "Processing", "is", "amazing", "!"]


 Sentence Tokenization: ["Natural Language Processing is amazing!"]

📌 Real-World Uses:

 Search engines break queries into words to match them with relevant documents.
 Chatbots process user messages word by word to understand intent.
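A minimal tokenization sketch with NLTK (the library and its 'punkt' tokenizer data are assumed installed):

# Word and sentence tokenization with NLTK (a sketch).
import nltk
nltk.download("punkt", quiet=True)      # tokenizer models, downloaded once

text = "Natural Language Processing is amazing! It has many applications."
print(nltk.word_tokenize(text))         # word tokenization
print(nltk.sent_tokenize(text))         # sentence tokenization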

Stopword Removal (Filtering Out Common Words)

Stopwords are commonly used words (e.g., is, the, in, and) that do not carry much meaning. Removing them
helps in improving NLP model performance by reducing noise in the text.

Example:
📄 Input: "The quick brown fox jumps over the lazy dog."
 After Stopword Removal: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
📌 Real-World Uses:

 Search engines ignore stopwords to improve search efficiency.


 Sentiment analysis models focus on meaningful words to detect emotions.
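A small stopword-removal sketch with NLTK (library and data assumed installed):

# Removing stopwords from a sentence (a sketch).
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stop_words = set(stopwords.words("english"))
tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.".lower())
print([t for t in tokens if t.isalpha() and t not in stop_words])
# -> ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']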

Stemming (Reducing Words to Their Root Form)

Stemming is the process of removing prefixes and suffixes to get the root word (stem). It helps reduce word
variations and improves NLP model generalization.

Example:
📄 Input Words: ["playing", "played", "plays"]

 After Stemming: ["play", "play", "play"]


Limitations:
Stemming can sometimes produce incorrect root words. For example, "better" → "bet", which is incorrect.

📌 Real-World Uses:

 Search engines use stemming to return relevant results (e.g., a search for "running" also shows results for
"run").
 Chatbots group similar words to recognize different user inputs.
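A quick stemming sketch using NLTK's PorterStemmer (library assumed installed):

# Reducing words to their stems with the Porter stemmer (a sketch).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["playing", "played", "plays", "studies"]])
# -> ['play', 'play', 'play', 'studi']   (note the crude stem 'studi')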

Lemmatization (Converting Words to Base Form Using Linguistic Rules)

Lemmatization is an advanced version of stemming where words are converted into their dictionary form
(lemma) while maintaining correct meaning. It relies on linguistic rules rather than just chopping off endings.

Example:
📄 Input Words: ["running", "better", "flies"]

 After Lemmatization: ["run", "good", "fly"] (Correct form)

📌 Real-World Uses:

 Machine translation ensures grammatically correct output.


 Text summarization systems maintain linguistic accuracy.
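A lemmatization sketch with NLTK's WordNetLemmatizer (WordNet data assumed downloaded); note that the correct part of speech must be supplied to get results like "better" → "good":

# Lemmatizing words to their dictionary form (a sketch).
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))   # -> 'run'  (verb)
print(lemmatizer.lemmatize("better", pos="a"))    # -> 'good' (adjective)
print(lemmatizer.lemmatize("flies", pos="n"))     # -> 'fly'  (noun)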

Spelling Correction & Normalization

Text normalization involves correcting spelling mistakes and converting text to a standard format (e.g.,
handling abbreviations, special characters, and informal text).

Example:
📄 Input: "Ths is an exmpl of txt normlztion."

 After Normalization: "This is an example of text normalization."

📌 Real-World Uses:

 Auto-correct in smartphones.
 Grammar-checking tools like Grammarly.
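One way to sketch spelling correction is with the TextBlob library (assumed installed); statistical spell-checkers do not always recover every word, so the output is only approximately the intended sentence.

# Spelling correction with TextBlob (a sketch).
from textblob import TextBlob

print(TextBlob("Ths is an exmpl of txt normlztion.").correct())
# Output should be close to "This is an example of text normalization.",
# but heavily distorted words may not be fully recovered.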

Part-of-Speech (POS) Tagging


Part-of-Speech (POS) tagging is the process of assigning a grammatical category (such as noun, verb,
adjective, etc.) to each word in a sentence. It helps machines understand sentence structure and word
meanings in context.

Example:

📝 Sentence: "The quick brown fox jumps over the lazy dog."

💡 POS Tagged Output:

The → DET (Determiner)


quick → ADJ (Adjective)
brown → ADJ (Adjective)
fox → NOUN (Noun)
jumps → VERB (Verb)
over → ADP (Adposition/Preposition)
the → DET (Determiner)
lazy → ADJ (Adjective)
dog → NOUN (Noun)
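A POS-tagging sketch with NLTK is shown below (tagger data assumed downloaded). NLTK returns Penn Treebank tags (DT, JJ, NN, VBZ, IN, ...) rather than the coarse labels used above, but they express the same categories:

# Part-of-speech tagging with NLTK (a sketch).
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]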
Why is POS Tagging Important?

✅ 1. Disambiguates Word Meanings

 The same word can have multiple meanings (homonyms).


 Example:
o "She will book a flight." (book → verb)
o "The book is on the table." (book → noun)

✅ 2. Helps in Parsing & Grammar Checking

 POS tags help parsers build sentence structures for syntax analysis.
 Example: "He eats quickly."
o "eats" → Verb
o "quickly" → Adverb (modifies verb)

✅ 3. Essential for Named Entity Recognition (NER)

 Helps identify proper nouns (e.g., "Paris" as a place, "Apple" as a company).

✅ 4. Improves Sentiment Analysis

 Adjectives ("good", "bad", "amazing") indicate sentiment polarity in text.

✅ 5. Enhances Machine Translation

 Ensures accurate word order across different languages.

5.1.2 Syntax Analysis

Syntax refers to the rules and structures that govern the way words are combined to form phrases, clauses, and
sentences.

Syntax analysis, also known as parsing, is the process of analyzing the grammatical structure of text, including
the relationships between words, phrases, and sentences. It involves breaking down text into its constituent
parts, such as words, phrases, and clauses, and identifying their syntactic roles.

Key Tasks in Syntax Analysis


The primary tasks in Syntax Analysis include:

o Grammar Checking & Sentence Validation


o Syntactic Ambiguity Resolution
o Parsing (Analyzing Sentence Structure)
o Dependency Parsing (Understanding Relationships Between Words)
o Phrase Chunking (Identifying Phrases in a Sentence)

Grammar Checking & Sentence Validation

Grammar checking ensures that sentences follow correct syntactic rules and flags errors. It detects incorrect
verb tense, word order, and subject-verb agreement mistakes.

🔹 Example:
📄 Incorrect Sentence: "She go to school yesterday."
✅ Corrected Sentence: "She went to school yesterday."

📌 Real-World Uses:
 Used in writing assistants like Grammarly to detect incorrect grammar.
 Helps in automatic essay scoring systems to evaluate sentence structure.

Syntactic Ambiguity Resolution

Syntactic ambiguity arises when a sentence can be interpreted in multiple ways due to its structure. This occurs
when the arrangement of words allows for more than one grammatical parsing, leading to different meanings

Example:

 "I saw the man with the telescope."


o Interpretation 1: I used a telescope to see the man.
o Interpretation 2: I saw a man who was holding a telescope.

One effective technique for resolving syntactic ambiguity in Natural Language Processing (NLP) is
Probabilistic Parsing. This approach utilizes statistical models to determine the most likely syntactic
structure of a sentence based on probabilities derived from large annotated corpora.

How Probabilistic Parsing Works:


1. Training Phase:
o A large corpus of text is annotated with syntactic structures (parse trees).
o The parser analyzes these structures to calculate the probabilities of various grammatical
constructions and word sequences.
2. Parsing Phase:
o When a new sentence is encountered, the parser generates all possible parse trees.
o It assigns probabilities to each parse tree based on the learned statistical model.
o The parser selects the parse tree with the highest probability as the most likely interpretation.

Example:

Consider the sentence: "I saw the man with the telescope."

 Possible Interpretations:
1. I used a telescope to see the man.
2. I saw a man who had a telescope.
 Probabilistic Parsing Application:

o The parser evaluates the likelihood of each interpretation based on the frequency of similar
structures in the training corpus.
o If the structure corresponding to the first interpretation ("using a telescope to see") is more
common in the corpus, the parser will assign it a higher probability and select it as the preferred
interpretation.
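The sketch below illustrates probabilistic parsing with NLTK's ViterbiParser on a toy PCFG; the grammar, its rule probabilities, and the slightly simplified sentence are all invented for demonstration, so the numbers do not come from any real corpus.

# Probabilistic parsing sketch: the parser returns the most probable parse tree.
import nltk

grammar = nltk.PCFG.fromstring("""
    S   -> NP VP                 [1.0]
    NP  -> Det N [0.6] | NP PP   [0.4]
    VP  -> V NP PP [0.7] | V NP  [0.3]
    PP  -> P NP                  [1.0]
    Det -> 'the'                 [1.0]
    N   -> 'man' [0.5] | 'telescope' [0.5]
    V   -> 'saw'                 [1.0]
    P   -> 'with'                [1.0]
""")

parser = nltk.ViterbiParser(grammar)
tokens = "the man saw the man with the telescope".split()
for tree in parser.parse(tokens):
    print(tree)    # the single highest-probability tree, annotated with its probability

Because the invented rule VP → V NP PP carries more probability mass than the NP → NP PP attachment, the parser prefers the reading in which the telescope is the instrument of seeing.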

Another effective technique for resolving syntactic ambiguity in Natural Language Processing (NLP) is
Contextual Disambiguation. This approach involves analyzing the context surrounding an ambiguous
sentence to determine the most appropriate syntactic structure.

How Contextual Disambiguation Works:

1. Contextual Analysis:
o Examine the surrounding sentences or paragraphs to gather additional information that can
clarify the ambiguous structure.
2. Semantic Constraints:
o Utilize knowledge about the meanings of words and their typical usage to rule out implausible
interpretations.
3. World Knowledge:
o Apply general knowledge about the world to assess the plausibility of different interpretations.

Example: Consider the sentence: "The boy saw the man with the telescope."

 Ambiguity:
o Did the boy use a telescope to see the man?
o Or did the boy see a man who had a telescope?
 Contextual Disambiguation Application:
o If the preceding sentences discuss the boy's interest in stargazing, it is more likely that "with the
telescope" describes how the boy saw the man (i.e., using a telescope).
o Conversely, if the context talks about a man known for carrying a telescope, the phrase likely
describes the man.

Parsing (Analyzing Sentence Structure)

Parsing is concerned with the rules governing the structure of sentences, including word order, phrase structure,
and grammatical relationships. Grammar is a fundamental aspect of syntactic analysis. The goal is to determine
how words are organized and related to each other, ensuring that the sentence adheres to the rules of formal
grammar.

Grammar
Grammar in formal languages consists of a finite set of rules (productions) that specify how valid strings
(sentences) are formed in a language. These rules are defined in terms of terminals, non-terminals, a start
symbol, and production rules.

A formal grammar is represented as:


G = (N,Σ,P,S) Where:

 N = Non-terminal symbols (variables)


 Σ = Terminal symbols (actual words/symbols in the language)
 P = Production rules (rules for forming valid strings)
 S = Start symbol (starting point of the language)

Example:

S → aSa
S → bSb
S→ε

o N = {S}
o Σ = {a, b}
o P = S → aSa, S → bSb, S → ε
o S=S

Approaches to Parsing Techniques:

1. Top-Down Parsing:

 Begins with the highest-level rule and works down to the input tokens.
 Attempts to match the input sentence with the start symbol of the grammar and recursively applies
production rules to generate the sentence.
 If a parsing path fails (i.e., the current rule does not match the input), the parser backtracks and tries
alternative rules (backtracking)

Types of Top-Down Parsing: Recursive Descent Parsing

Example:

Sentence: "The cat sat on the mat."

Grammar Rules:

S → NP VP
NP → Det N
VP → V PP
PP → P NP
Det → "the"
N → "cat" | "mat"
V → "sat"
P → "on"

Top-Down Parsing Steps:

1. Start with S.
2. Expand S to NP VP.
3. Expand NP to Det N.
   o Match Det to "the" and N to "cat".
4. Expand VP to V PP.
   o Match V to "sat".
5. Expand PP to P NP.
   o Match P to "on".
6. Expand NP to Det N.
   o Match Det to "the" and N to "mat".

Parse Tree:

          S
        /   \
      NP     VP
     /  \   /  \
   Det   N  V   PP
    |    |  |  /  \
   The cat sat P   NP
               |  /  \
              on Det  N
                  |   |
                 the mat
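For illustration, the top-down example above can be reproduced with NLTK's RecursiveDescentParser (library assumed installed); the grammar string mirrors the rules listed above.

# Top-down (recursive descent) parsing of the example sentence (a sketch).
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V PP
    PP -> P NP
    Det -> 'the'
    N  -> 'cat' | 'mat'
    V  -> 'sat'
    P  -> 'on'
""")

parser = nltk.RecursiveDescentParser(grammar)     # expands rules from S downward, with backtracking
for tree in parser.parse("the cat sat on the mat".split()):
    tree.pretty_print()                           # draws the parse tree as ASCII art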

2. Bottom-Up Parsing:

Begins with individual words (tokens) and gradually builds the parse tree upward by applying grammar rules.
Types of Bottom-Up Parsing:

(Shift-Reduce Parsing)
Uses two main operations:
• Shift: Reads a word from input.
• Reduce: Combines words into phrases according to grammar rules.

Sentence: "The cat sat on the mat."

Bottom-Up Parsing Steps:

o Shift 'The'
o Reduce 'The' to Det
o Shift 'cat'
o Reduce 'cat' to N
o Reduce 'Det N' to NP
o Shift 'sat'
o Reduce 'sat' to V
o Shift 'on'
o Reduce 'on' to P
o Shift 'the'
o Reduce 'the' to Det
o Shift 'mat'
o Reduce 'mat' to N
o Reduce 'Det N' to NP
o Reduce 'P NP' to PP
o Reduce 'V PP' to VP
o Reduce 'NP VP' to S

Parse Tree:

          S
        /   \
      NP     VP
     /  \   /  \
   Det   N  V   PP
    |    |  |  /  \
   The cat sat P   NP
               |  /  \
              on Det  N
                  |   |
                 the mat
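The same toy grammar can be run through NLTK's ShiftReduceParser to mimic the bottom-up steps above (a sketch; this parser is a simple non-backtracking implementation).

# Bottom-up (shift-reduce) parsing of the example sentence (a sketch).
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V PP
    PP -> P NP
    Det -> 'the'
    N  -> 'cat' | 'mat'
    V  -> 'sat'
    P  -> 'on'
""")

parser = nltk.ShiftReduceParser(grammar, trace=2)   # trace=2 prints each shift and reduce step
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree)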

3. Hybrid Parsing:
a. Dependency Parsing
b. Constituency Parsing
c. Earley Parsing
Dependency Parsing

Dependency parsing is a technique in natural language processing (NLP) that focuses on analyzing the
grammatical structure of a sentence by identifying relationships between words. In this framework, each word is
connected to another word, forming a hierarchical structure that reveals how words depend on each other to
convey meaning.

Key Concepts:

 Head: In a dependency relationship, the head is the central word that governs the relationship.

For instance, in the sentence "The dog sat on the mat", “sat” is the head because it is the main verb that governs
the sentence's structure.

 Dependent: It is the word that modifies or complements the head.

For instance, in the sentence "The dog sat on the mat", “The,” “dog,” “on,” “the,” and “mat” are dependents, as
they depend on the head word “sat” to complete their syntactic relationships.

 Dependency tags/labels: Thy represent the nature of the dependency between the head and
the dependent.
 Dependency Tree: Once the parsing algorithm is applied, it starts building the dependency tree
incrementally. The process begins with a root node, typically representing the sentence’s main verb.
As the algorithm progresses, it adds directed edges to connect words in the sentence, establishing
grammatical relationships. This tree illustrates the syntactic structure of the sentence, highlighting how
words are related. You can understand the concept through the following sentence: "The dog sat on the
mat".
o Nodes: These represent individual words in the sentence. For example, each word in "The dog
sat" would be a node.
o Edges: These are directed links between words, showing which word governs another. For
instance, an edge from “sat” to “dog” indicates that “sat” is the head of the subject “dog.”
o Root: The root is the topmost word in the tree, often representing the main verb or the core of
the sentence. In simple sentences, the root is usually the main verb, like “sat” in the example
sentence.

A dependency tree is a directed graph that satisfies the following constraints:

o There is a single designated root node that has no incoming arcs.


o With the exception of the root node, each vertex has exactly one incoming arc.
o There is a unique path from the root node to each vertex in V.
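A short dependency-parsing sketch with spaCy (assuming the library and its small English model, installed via "python -m spacy download en_core_web_sm"):

# Printing the head and dependency label of every word (a sketch).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog sat on the mat.")

for token in doc:
    # word, its dependency label, and the head word it depends on
    print(f"{token.text:<6} --{token.dep_:>6}--> {token.head.text}")
# The token whose label is 'ROOT' (here the main verb "sat") is the root of the tree.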

Constituency parsing

Constituency parsing is a fundamental task in natural language processing (NLP) that involves analyzing the
syntactic structure of a sentence according to a phrase structure grammar. The goal is to break down a sentence
into its constituent parts (phrases or sub-phrases) and organize them into a hierarchical structure, typically
represented as a tree. This tree is called a constituency parse tree or phrase structure tree.

Key Concepts in Constituency Parsing


1. Constituents:
o A constituent is a group of words that function as a single unit within a sentence. For example, in
the sentence "The cat sat on the mat," "The cat" is a noun phrase (NP), and "on the mat" is a
prepositional phrase (PP).

2. Phrase Structure Grammar:


o Constituency parsing relies on phrase structure grammar, which defines how sentences can be
broken down into constituents. These grammars are often represented using context-free
grammars (CFGs), where each rule defines how a higher-level constituent (e.g., a sentence, S)
can be decomposed into smaller constituents (e.g., NP and VP).

3. Parse Tree:
o The output of constituency parsing is a tree structure where:
 The root node represents the entire sentence (S).

 Intermediate nodes represent phrases (e.g., NP, VP, PP).

 Leaf nodes represent individual words or tokens.

Example:

              S
            /   \
          NP     VP
         /  \     |
       The  cat   sat on the mat
4. Grammar Rules:
o Constituency parsing uses a set of grammar rules to define valid structures. For example:
 S → NP VP (A sentence consists of a noun phrase followed by a verb phrase.)

 NP → Det N (A noun phrase consists of a determiner followed by a noun.)

 VP → V PP (A verb phrase consists of a verb followed by a prepositional phrase.)

 PP → P NP (A prepositional phrase consists of a preposition followed by a noun phrase.)

5. Part-of-Speech (POS) Tags:


o Before parsing, words in a sentence are often tagged with their part of speech (e.g., noun, verb,
adjective). These tags help the parser apply grammar rules correctly.

Steps in Constituency Parsing


1. Tokenization:
o Split the input sentence into individual tokens (words and punctuation).

2. Part-of-Speech Tagging:
o Assign a POS tag to each token (e.g., "The" → Det, "cat" → N).

3. Parsing:
o Apply grammar rules to group tokens into constituents and build the parse tree. This can be done
using algorithms like:
 Top-down parsing: Starts with the root (S) and recursively applies rules to break it into
smaller constituents.
 Bottom-up parsing: Starts with the words and combines them into larger constituents.

 CKY Algorithm (Cocke-Kasami-Younger): A dynamic programming algorithm for


parsing CFGs.

 Earley Parser: Another efficient algorithm for parsing CFGs.

4. Output:
o The final output is a parse tree that represents the syntactic structure of the sentence.

Example of Constituency Parsing

Sentence: "The cat sat on the mat."

1. Tokenization: ["The", "cat", "sat", "on", "the", "mat", "."]

2. POS Tagging: [Det, N, V, P, Det, N, .]

3. Parsing:
o Apply grammar rules to build the tree:
 S → NP VP

 NP → Det N ("The cat")

 VP → V PP ("sat on the mat")

 PP → P NP ("on the mat")

 NP → Det N ("the mat")

4. Parse Tree:

          S
        /   \
      NP     VP
     /  \   /  \
   Det   N  V   PP
    |    |  |  /  \
   The cat sat P   NP
               |  /  \
              on Det  N
                  |   |
                 the mat

Challenges in Constituency Parsing

 Ambiguity: Sentences can often have multiple valid parse trees (e.g., "I saw the man with the
telescope").
 Complexity: Parsing long or complex sentences can be computationally expensive.

 Grammar Coverage: Hand-written grammars may not cover all possible sentence structures.

Conclusion

Constituency parsing is a powerful technique for understanding the syntactic structure of sentences. By
breaking down sentences into their constituent parts and organizing them into a hierarchical tree, it provides
valuable insights for various NLP tasks. While it has its challenges, advancements in machine learning and
parsing algorithms continue to improve its accuracy and efficiency.

Earley Parsing

The Earley parser is a dynamic programming algorithm used in natural language processing (NLP) for parsing
sentences according to a context-free grammar (CFG). It is particularly well-suited for handling ambiguous
grammars and can efficiently parse sentences in O(n³) time for the worst case, where n is the length of the input
sentence. The Earley parser is named after its inventor, Jay Earley, who introduced it in 1970.

The Earley parser is widely used in NLP because it can handle left-recursive grammars (unlike the CKY
algorithm) and is flexible enough to work with a wide range of grammars. It is also used in compilers and other
applications where parsing is required.

Key Concepts in the Earley Parser

1. Context-Free Grammar (CFG):


o A CFG consists of a set of production rules that define how symbols (non-terminals) can be
expanded into sequences of other symbols (terminals or non-terminals).

o Example rules:
 S → NP VP

 NP → Det N

 VP → V NP

2. Chart:
o The Earley parser uses a chart (a data structure) to store partial parse results. The chart is
divided into states, each representing a possible step in the parsing process.

3. State:
o A state in the Earley parser is represented as a tuple: (X → α • β, i, j), where:
 X → α β is a production rule from the grammar.

 The dot (•) indicates the current position in the rule.

 i is the starting position of the state in the input sentence.

 j is the current position in the input sentence.


4. Operations:
o The Earley parser uses three main operations to build the chart:
 Predict: Adds new states based on grammar rules.

 Scan: Advances the dot when a terminal matches the input.

 Complete: Combines completed states to form larger constituents.

How the Earley Parser Works

The Earley parser processes the input sentence from left to right, building a chart of possible parses. It consists
of the following steps:

1. Initialization:
o Start with the initial state (S → • α, 0, 0) in the chart, where S is the start symbol of the grammar,
and the dot is at the beginning of the rule.

2. Prediction:
o For every state (X → α • Y β, i, j) in the chart, if Y is a non-terminal, add new states for all
productions of Y (i.e., (Y → • γ, j, j)).

3. Scanning:
o For every state (X → α • a β, i, j) in the chart, if the next input token a matches the terminal after
the dot, advance the dot and add the new state (X → α a • β, i, j+1) to the next chart entry.

4. Completion:
o For every completed state (X → γ •, i, j) in the chart, find all states that were waiting for X (i.e.,
states of the form (Y → α • X β, k, i)) and add the new state (Y → α X • β, k, j).

5. Termination:
o The parse is successful if the final chart entry contains a state of the form (S → α •, 0, n),
where S is the start symbol and n is the length of the input sentence.

Example of Earley Parsing

Grammar:

S → NP VP
NP → Det N
VP → V NP
Det → "the"
N → "cat" | "mat"
V → "sat"

Input Sentence: "the cat sat the mat"

Step-by-Step Parsing:
1. Initialization:
o Add (S → • NP VP, 0, 0) to the chart.

2. Prediction:
o From (S → • NP VP, 0, 0), predict (NP → • Det N, 0, 0) and (Det → • "the", 0, 0).

3. Scanning:
o Match "the" with Det → "the" and add (Det → "the" •, 0, 1).

4. Completion:
o From (Det → "the" •, 0, 1), advance the waiting NP state to (NP → Det • N, 0, 1) and predict (N → • "cat", 1, 1).

5. Scanning and Completion:
o Match "cat" with N → "cat", giving (N → "cat" •, 1, 2); complete (NP → Det N •, 0, 2) and then (S → NP • VP, 0, 2).

6. Prediction and Scanning:
o From (S → NP • VP, 0, 2), predict (VP → • V NP, 2, 2) and (V → • "sat", 2, 2); match "sat" to get (V → "sat" •, 2, 3) and complete (VP → V • NP, 2, 3).

7. The same predict-scan-complete cycle processes "the mat", producing (NP → Det N •, 3, 5) and (VP → V NP •, 2, 5).

8. Termination:
o The final chart entry contains (S → NP VP •, 0, 5), so the parse succeeds.

Phrase Chunking:

Phrase chunking, also known as shallow parsing, is a natural language processing


(NLP) technique used to identify and segment text into syntactically correlated phrases,
such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). Unlike full
parsing, which constructs a complete syntactic tree, phrase chunking focuses on
identifying non-overlapping phrases without delving into the internal structure of the
phrases or their relationships.

Phrase chunking is widely used in NLP tasks because it strikes a balance between
simplicity and usefulness, providing meaningful syntactic information without the
complexity of full parsing.

5.1.3 Key Concepts in Phrase Chunking


1. Chunks:
o A chunk is a group of words that form a meaningful unit in a sentence.
Common types of chunks include:
 Noun Phrases (NP): e.g., "the cat," "a big house"

 Verb Phrases (VP): e.g., "is running," "has eaten"

 Prepositional Phrases (PP): e.g., "on the table," "in the park"

 Adjective Phrases (ADJP): e.g., "very happy," "extremely tall"

 Adverb Phrases (ADVP): e.g., "very quickly," "quite slowly"


2. Tags:
o Each word in a sentence is tagged with its part of speech (POS), and chunks
are identified based on these tags. For example:
 "The cat sat on the mat" → (The/DT cat/NN) (sat/VBD) (on/IN the/DT mat/NN)

 Here, DT = determiner, NN = noun, VBD = past tense verb, IN =


preposition.

3. IOB Format:
o Chunks are often represented using the Inside-Outside-Beginning
(IOB) format:
 B-{CHUNK}: Beginning of a chunk.

 I-{CHUNK}: Inside a chunk.

 O: Outside any chunk.

o Example

The DT B-NP
cat NN I-NP
sat VBD B-VP
on IN B-PP
the DT B-NP
mat NN I-NP
4. Chunking vs. Parsing:
o Chunking: Identifies non-overlapping phrases without hierarchical structure.

o Parsing: Constructs a complete syntactic tree with hierarchical relationships


between phrases.

5.1.4 Example of Phrase Chunking

Input Sentence: "The quick brown fox jumps over the lazy dog."

1. Tokenization:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

2. POS Tagging:
The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN ./.
3. Chunking:
[NP The/DT quick/JJ brown/JJ fox/NN]
[VP jumps/VBZ]
[PP over/IN]
[NP the/DT lazy/JJ dog/NN]
4. IOB Format:
The DT B-NP
quick JJ I-NP
brown JJ I-NP
fox NN I-NP
jumps VBZ B-VP
over IN B-PP
the DT B-NP
lazy JJ I-NP
dog NN I-NP
. . O

Phrase chunking is a fundamental NLP technique that identifies meaningful phrases in text without constructing
a full parse tree. It is widely used in applications like information extraction, question answering, and sentiment
analysis. While rule-based methods are simple and interpretable, machine learning-based approaches offer
higher accuracy and flexibility. With the availability of powerful libraries like NLTK, spaCy, and Hugging
Face, implementing phrase chunking has become accessible for both researchers and practitioners.
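For instance, a rule-based chunking sketch with NLTK's RegexpParser (tagger and tokenizer data assumed downloaded); the chunk grammar below is an illustrative assumption that groups determiner-adjective-noun sequences into NPs.

# Shallow parsing (chunking) with a regular-expression chunk grammar (a sketch).
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog."))

chunk_grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}    # noun phrase: optional determiner, adjectives, nouns
  VP: {<VB.*>}               # verb phrase: any verb form
  PP: {<IN>}                 # prepositional marker
"""
chunker = nltk.RegexpParser(chunk_grammar)
print(chunker.parse(tagged))
# -> (S (NP The/DT quick/JJ brown/JJ fox/NN) (VP jumps/VBZ) (PP over/IN) (NP the/DT lazy/JJ dog/NN) ./.)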

5.1.5 Semantic Analysis


Semantic analysis in Natural Language Processing (NLP) focuses on understanding the meaning of words, phrases, and
sentences. It consists of several key steps:

1. Word Sense Disambiguation (WSD)


2. Lexical Semantics
3. Named Entity Recognition (NER)
4. Semantic Role Labeling (SRL)
5. Coreference Resolution

5.1.5.1 Word Sense Disambiguation (WSD)

Word Sense Disambiguation (WSD) is the process of identifying the correct meaning (sense) of a word in a
given context. Many words have multiple meanings, and choosing the right one is crucial for accurate language
understanding in applications like machine translation, information retrieval, and chatbots.

Key Approaches to WSD

There are several techniques used to perform WSD, broadly categorized into:

1. Knowledge-Based Methods
o Use lexical resources like WordNet, thesauruses, and ontologies to determine the correct sense of a
word.
o Example: Lesk Algorithm (overlap-based approach).

2. Supervised Machine Learning Methods


o Train models using labeled datasets where each word is annotated with its correct sense.
o Requires a large amount of labeled data.
o Example: Naïve Bayes, Decision Trees, Neural Networks.

3. Unsupervised Methods (Clustering-Based)


o Uses word usage patterns to group words into senses without pre-labeled data.
o Example: Word Embeddings (Word2Vec, BERT).

4. Hybrid Approaches
o Combine knowledge-based and machine learning methods for improved accuracy.
Examples of WSD

Example 1: "Bank" (Financial vs. Riverbank)

 Sentence 1: I deposited my paycheck at the bank.


o "Bank" refers to a financial institution.
 Sentence 2: The boat was anchored near the river bank.
o "Bank" refers to the side of a river.

➡ How WSD Works?

 Knowledge-Based Approach: Finds that “deposit” is more related to financial institutions.


 Supervised Learning Approach: Trained model identifies the sense based on word embeddings.

Example 2: "Bat" (Animal vs. Sports Equipment)

 Sentence 1: The bat flew out of the cave at night.


o "Bat" refers to a mammal (animal).
 Sentence 2: He hit the ball with a bat.
o "Bat" refers to a cricket/baseball bat.

➡ How WSD Works?

 Lesk Algorithm: Compares dictionary definitions and looks for overlapping words.
 Contextual Word Embeddings (BERT): Uses deep learning to predict the correct sense based on surrounding
words.
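A minimal WSD sketch using NLTK's implementation of the Lesk algorithm (WordNet data assumed downloaded):

# Disambiguating "bank" in two different contexts with the Lesk algorithm (a sketch).
import nltk
from nltk.wsd import lesk
nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

sent1 = nltk.word_tokenize("I deposited my paycheck at the bank")
sent2 = nltk.word_tokenize("The boat was anchored near the river bank")

print(lesk(sent1, "bank"), "-", lesk(sent1, "bank").definition())
print(lesk(sent2, "bank"), "-", lesk(sent2, "bank").definition())
# Lesk chooses the WordNet sense whose dictionary gloss overlaps most with the context,
# so the two sentences can resolve to different senses of "bank".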

5.1.5.2 Lexical Semantics

Lexical Semantics in NLP focuses on the meaning of individual words and how they contribute
to the overall meaning of a sentence or text. It involves understanding the relationships between
words, their senses, and their usage in different contexts. Lexical semantics is crucial for tasks
like word sense disambiguation, synonym detection, and semantic similarity analysis.

Key Concepts in Lexical Semantics


1. Word Meaning:
o Words can have multiple related senses (polysemy) or unrelated meanings that happen to share the same form (homonymy).

o Example: The word "bank" can mean a financial institution or the side of a
river.

2. Word Senses:
o A single word can have different senses depending on the context.

o Example: The word "mouse" can refer to an animal or a computer device.

3. Lexical Relations:
o Words can have specific relationships with other words, such as:
 Synonyms: Words with similar meanings (e.g., "happy" and "joyful").

 Antonyms: Words with opposite meanings (e.g., "hot" and "cold").

 Hyponyms: Specific instances of a general category (e.g., "rose" is a


hyponym of "flower").

 Hypernyms: General categories for specific words (e.g., "fruit" is a


hypernym of "apple").

 Meronyms: Part-whole relationships (e.g., "wheel" is a meronym of


"car").

 Holonyms: Whole-part relationships (e.g., "car" is a holonym of


"wheel").

4. Word Embeddings:
o Words are represented as vectors in a high-dimensional space to capture
their meanings and relationships.

o Example: In word2vec or GloVe, the vectors for "king" and "queen" are close
to each other because they share similar contexts.

5. Word Sense Disambiguation (WSD):


o Determining the correct sense of a word in a given context.

o Example: In the sentence "I went to the bank to deposit money," the word
"bank" refers to a financial institution, not a riverbank.

Examples of Lexical Semantics in Action


1. Synonym Detection:
o Input: "buy"

o Output: Synonyms like "purchase," "acquire," "obtain."

2. Word Sense Disambiguation:


o Sentence: "The bat flew out of the cave."

o Disambiguation: "bat" refers to the animal, not the sports equipment.

3. Semantic Similarity:
o Words: "king" and "queen"

o Similarity: High, as they are both royalty-related terms.

4. Hyponymy and Hypernymy:


o Sentence: "A rose is a type of flower."
o Analysis: "rose" is a hyponym of "flower," and "flower" is a hypernym of
"rose."

5. Antonym Detection:
o Input: "hot"

o Output: Antonym "cold."
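Many of these lexical relations can be explored programmatically through WordNet via NLTK (a sketch; WordNet data assumed downloaded):

# Looking up synonyms, antonyms, hypernyms, and hyponyms in WordNet (a sketch).
import nltk
from nltk.corpus import wordnet as wn
nltk.download("wordnet", quiet=True)

# Synonyms: lemma names from all synsets of "buy"
print({lemma.name() for syn in wn.synsets("buy") for lemma in syn.lemmas()})

# Antonyms: follow the antonym pointer of a lemma of "hot"
print([ant.name() for ant in wn.synsets("hot", pos="a")[0].lemmas()[0].antonyms()])  # e.g. ['cold']

# Hypernyms (more general) and hyponyms (more specific) of "flower"
flower = wn.synsets("flower", pos="n")[0]
print(flower.hypernyms())
print(flower.hyponyms()[:5])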

In summary, lexical semantics is a foundational aspect of NLP that enables machines to understand and process the meanings of words, their relationships, and their roles in language. It plays a critical role in improving the accuracy and effectiveness of NLP systems.

5.1.5.3 Named Entity Recognition (NER)

Named Entity Recognition (NER) is an NLP task that identifies and classifies entities in text into predefined
categories such as:

 Person Names (e.g., "Elon Musk")


 Organizations (e.g., "Google", "United Nations")
 Locations (e.g., "New York", "Amazon Rainforest")
 Dates and Time (e.g., "March 21, 2024", "5 PM")
 Monetary Values (e.g., "$1000", "500 Euros")
 Percentages (e.g., "30%")
 Miscellaneous Entities (e.g., product names, works of art)

How NER Works?

NER is typically performed in three steps:

1. Tokenization: Breaking text into words or subwords.


2. Part-of-Speech (POS) Tagging: Identifying the role of each word in the sentence.
3. Entity Recognition & Classification: Assigning words to predefined categories.

Example of NER

Input Sentence:

"Elon Musk, the CEO of Tesla, announced a new AI project in California on January 15, 2025."

NER Output:
Token               Entity Type
Elon Musk           PERSON
Tesla               ORGANIZATION
AI project          MISCELLANEOUS
California          LOCATION
January 15, 2025    DATE

NER Approaches

There are different methods used to perform NER:

1. Rule-Based & Dictionary-Based Methods

 Uses predefined lists of names and rules (e.g., "Mr." followed by a capitalized word is likely a person); a toy sketch of this idea appears below.
 Limitations: Cannot handle new or unseen names.
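
A toy illustration of the rule-based idea in Python (a deliberately naive sketch; real systems combine many such rules with dictionaries/gazetteers):

import re

# Naive rule: an honorific followed by one or more capitalized words is likely a person.
pattern = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+((?:[A-Z][a-z]+\s?)+)")

text = "Dr. Jane Smith met Mr. Patel at the conference."
names = [match.group(1).strip() for match in pattern.finditer(text)]
print(names)  # ['Jane Smith', 'Patel']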

2. Machine Learning-Based Methods

 Uses supervised learning with labeled datasets.


 Common algorithms: CRF (Conditional Random Fields), HMM (Hidden Markov Model), Decision Trees.
 Limitations: Requires a large annotated dataset.

3. Deep Learning-Based Methods

 Uses Recurrent Neural Networks (RNNs), LSTMs, and Transformer models (e.g., BERT, RoBERTa, GPT), often through toolkits such as spaCy.
 Can recognize new and complex entity names based on context.
 Example: Google's BERT model can capture meaning from context better than traditional methods.
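
For illustration, a minimal sketch using spaCy's pretrained pipeline (assumes spaCy and the en_core_web_sm model are installed; labels and spans can differ between model versions):

import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk, the CEO of Tesla, announced a new AI project "
          "in California on January 15, 2025.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output (spaCy uses GPE for geopolitical locations):
# Elon Musk -> PERSON
# Tesla -> ORG
# California -> GPE
# January 15, 2025 -> DATE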

5.1.5.4 Semantic Role Labelling

Semantic Role Labeling (SRL), also known as shallow semantic parsing, is the task of determining the
semantic roles of words in a sentence. It helps answer "Who did what to whom, when, where, and how?"

SRL is used to analyze sentence structure beyond syntax by identifying arguments associated with a verb
(predicate) and labeling their roles.

Key Components of SRL


1. Predicate (Verb Identification)

 The main verb (action) in a sentence that defines the event.

2. Arguments (Semantic Roles)

 The participants in the event described by the predicate. Common arguments include:
o Agent (A0): Who performs the action.
o Patient/Theme (A1): Who/what is affected by the action.
o Recipient (A2): Who receives the action.
o Instrument (A3): The tool or means used to perform the action.
o Location (AM-LOC): Where the action happens.
o Time (AM-TMP): When the action happens.
Example of SRL
Sentence:

"John gave Mary a book yesterday at the library."

SRL Breakdown:
Phrase            Role (Label)          Description
gave              Predicate (Verb)      The main action
John              A0 (Agent)            The giver (who gave)
Mary              A2 (Recipient)        The receiver of the book
a book            A1 (Theme)            The object being given
yesterday         AM-TMP (Time)         When the action happened
at the library    AM-LOC (Location)     Where the action happened

➡ Answering "Who did what to whom?"

 Who? → John (A0 - Agent)


 Did what? → gave (Predicate)
 What? → a book (A1 - Theme)
 To whom? → Mary (A2 - Recipient)
 When? → yesterday (AM-TMP - Time)
 Where? → at the library (AM-LOC - Location)
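
In code, an SRL analysis is often stored as a predicate with labeled argument spans. The structure below is a hypothetical illustration (not tied to any particular SRL library):

# Hypothetical representation of the SRL analysis of the example sentence.
srl_frame = {
    "predicate": "gave",
    "arguments": {
        "A0": "John",               # Agent: who performs the action
        "A1": "a book",             # Theme: what is given
        "A2": "Mary",               # Recipient: who receives it
        "AM-TMP": "yesterday",      # Time
        "AM-LOC": "at the library", # Location
    },
}

args = srl_frame["arguments"]
print(f"{args['A0']} {srl_frame['predicate']} {args['A1']} to {args['A2']} "
      f"{args['AM-TMP']} {args['AM-LOC']}.")
# John gave a book to Mary yesterday at the library.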

5.1.5.5 Coreference Resolution

What is Coreference Resolution?

Coreference resolution is an NLP task that identifies and links expressions that refer to the same entity in a
text. It helps resolve pronouns, definite noun phrases, and other referring expressions to their
corresponding entities.

Why is Coreference Resolution Important?

 Enhances text understanding for AI models.


 Improves machine translation, question answering, and text summarization.
 Helps in document analysis and knowledge graph construction.
Types of Coreference Resolution
1. Pronominal Coreference

 Resolves pronouns to their respective entities.


 Example:
o "John said he would arrive soon."
o Resolution: "he" → "John"

2. Noun Phrase Coreference (Anaphora Resolution)

 Resolves definite noun phrases referring to previously mentioned entities.


 Example:
o "Barack Obama was the 44th president of the US. The president introduced new policies."
o Resolution: "The president" → "Barack Obama"

3. Bridging Coreference

 Links indirect references to entities based on context.


 Example:
o "I bought a book yesterday. The cover was beautiful."
o Resolution: "The cover" → "the book"

4. Cataphora Resolution

 When a pronoun refers to an entity mentioned later in the sentence.


 Example:
o "Before he could enter, John knocked on the door."
o Resolution: "he" → "John"

Example of Coreference Resolution


Input Sentence:

"Mary told Alice that she would visit her tomorrow."

Possible Resolutions:

1. "She" → Mary OR Alice?


2. "Her" → Mary OR Alice?

Without context, it's ambiguous. Coreference resolution techniques help determine the correct mapping.

Resolved Sentence (After Coreference Resolution):

"Mary told Alice that Mary would visit Alice tomorrow."


Approaches to Coreference Resolution
1. Rule-Based Approaches

 Uses grammar rules, heuristics, and linguistic knowledge.


 Example: Hobbs' Algorithm (Syntax-Based Approach).
 Limitation: Does not generalize well for complex cases.

2. Machine Learning-Based Approaches

 Uses supervised learning with labeled datasets (e.g., OntoNotes).


 Features used:
o Syntactic features (POS, dependency parsing).
o Distance between mentions.
o Gender/Number agreement.
 Common Models: Decision Trees, SVMs, CRF.
 Limitation: Requires large amounts of annotated data.

3. Deep Learning-Based Approaches

 Uses Neural Networks and Transformers (e.g., BERT, SpanBERT, CorefQA).


 Advantages:
o Handles long-range dependencies better.
o Generalizes well across domains.
 Popular Models:
o AllenNLP’s SpanBERT-based Coreference Model.
o CorefQA (a question answering-based coreference model).

6 Feature Extraction

Feature extraction is the process of selecting and transforming raw data into features that are more suitable for
modeling. In NLP, feature extraction involves converting text data into numerical representations that can be
fed into machine learning algorithms.

Here are the main techniques:

a) Bag of words
b) TF-IDF (Term Frequency - Inverse Document Frequency)
c) N-grams

Bag of Words:
The bag-of-words model is a simple way to represent a document in numerical form before we can feed
it into a machine learning algorithm. For any natural language processing task, we need a way to
accomplish this before any further processing. Machine learning algorithms can’t operate on raw text;
we need to convert the text to some sort of numerical representation. This process is also known as
embedding the text.
How does the bag of words model work?

Bag of words featurization may, at times, be considered a beginner-level form of text processing, given its
ostensible conceptual simplicity in counting words across a given text set. Bag of words models are more
involved than they first appear, however.

Understanding bag of words featurization demands at least a basic understanding of vector spaces. A
vector space is a multi-dimensional space in which points are plotted. In a bag of words approach, each
individual word becomes a separate dimension (or axis) of the vector space. If a text set has n number of words,
the resulting vector space has n dimensions, one dimension for each unique word in the text set. The model then
plots each separate text document as a point in the vector space. A point’s position along a certain dimension is
determined by the number of times that dimension’s word appears within the point’s document.

For example, assume we have a text set in which the contents of two separate documents are respectively:

Document 1: A rose is red, a violet is blue

Document 2: My love is like a red, red rose

Because it is difficult to imagine anything beyond a three-dimensional space, we will limit ourselves to just that.
A vector space for a corpus containing these two documents would have separate dimensions for red, rose,
and violet: one axis per word.

Since red, rose, and violet all occur once in Document 1, the vector for that document in this space will be
(1,1,1). In Document 2, red appears twice, rose once, and violet not at all. Thus, the vector point for Document
2 is (2,1,0). Both documents are plotted as points in this three-dimensional vector space.

Note that this representation treats text documents as data vectors in a three-dimensional feature space. But bag of
words can also represent words as feature vectors in a data space. A feature vector signifies the value
(occurrence) of a given feature (word) in a specific data point (document). So the feature vectors for red, rose,
and violet across Documents 1 and 2 are (1,2), (1,1), and (1,0) respectively, with one component per document.

Note that the order of words in the original documents is irrelevant. For a bag of words model, all that matters is
each word’s number of occurrences across the text set.
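
The two document vectors above can be reproduced with a few lines of Python (a minimal sketch using only the standard library):

from collections import Counter

documents = {
    "Document 1": "A rose is red, a violet is blue",
    "Document 2": "My love is like a red, red rose",
}

vocab = ["red", "rose", "violet"]  # the three dimensions used above

for name, text in documents.items():
    counts = Counter(text.lower().replace(",", "").split())
    print(name, [counts[word] for word in vocab])
# Document 1 [1, 1, 1]
# Document 2 [2, 1, 0]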

Understanding Bag of Words with an example

Let us see an example of how the bag of words technique converts text into vectors.

Example (1), without preprocessing:

Sentence 1: "Welcome to Great Learning, Now start learning"

Sentence 2: "Learning is a good practice"

Tokens in Sentence 1: Welcome, to, Great, Learning, ",", Now, start, learning

Tokens in Sentence 2: Learning, is, a, good, practice

Step 1: Go through all the words in the above text and make a list of all of the words in our model vocabulary.

 Welcome
 to
 Great
 Learning
 ,
 Now
 start
 learning
 is
 a
 good
 practice
Note that the words 'Learning' and 'learning' are treated as different tokens here because of the difference in
their cases, and hence both appear in the list. Also note that the comma ',' is included in the vocabulary.

Because we know the vocabulary has 12 words, we can use a fixed-length document-representation of 12, with
one position in the vector to score each word.

The scoring method used here is to count the number of occurrences of each word, marking 0 when a word is
absent. This simple counting scheme is the most commonly used one.

The scoring of sentence 1 would look as follows:

Word Frequency

Welcome 1

to 1

Great 1

Learning 1

, 1

Now 1

start 1

learning 1

is 0

a 0

good 0
practice 0

Writing the above frequencies in vector form:

Sentence 1 ➝ [ 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0 ]

Now for Sentence 2, the scoring would look like this:

Word Frequency

Welcome 0

to 0

Great 0

Learning 1

, 0

Now 0

start 0

learning 0

is 1

a 1

good 1
practice 1

Similarly, writing the above frequencies in vector form:

Sentence 2 ➝ [ 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1 ]

Sentence      Welcome  to  Great  Learning  ,  Now  start  learning  is  a  good  practice

Sentence 1       1      1    1       1      1   1     1       1      0   0    0       0

Sentence 2       0      0    0       1      0   0     0       0      1   1    1       1

But is this the best way to build a bag of words? The example above is not ideal: the words "Learning" and
"learning", although they have the same meaning, are counted as two separate tokens, and the comma ",",
which conveys no information, is included in the vocabulary. Preprocessing steps such as lowercasing the text
and removing punctuation make the bag of words representation more effective.
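
In practice, this counting is rarely done by hand. Here is a minimal sketch using scikit-learn's CountVectorizer (which, unlike the hand-worked example above, lowercases the text and drops punctuation and single-character tokens by default):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Welcome to Great Learning, Now start learning",
    "Learning is a good practice",
]

# Default preprocessing lowercases the text, so "Learning" and "learning"
# collapse into one feature; the comma and the one-letter token "a" are dropped.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())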

Limitations of Bag-of-Words

Although Bag-of-Words is quite efficient and easy to implement, the technique has some disadvantages, which
are given below:

1. The model ignores word order. Word order carries very important information in text: for example,
"today is off" and "is today off" have exactly the same vector representation in the BoW model.
2. Bag-of-words models do not capture the semantics of words. For example, the words 'soccer' and
'football' are often used in the same context, yet the vectors corresponding to these words are
quite different in the bag of words model. The problem becomes more serious when modeling
sentences: "Buy used cars" and "Purchase old automobiles" are represented by totally different
vectors in the Bag-of-words model.
3. Vocabulary size is a big issue for the Bag-of-Words model. If the model comes across a new word it
has not seen before, even a rare but informative word like "biblioklept" (one who steals books), the
BoW model will simply ignore it because the word is not in its vocabulary.

TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural
language processing and information retrieval. It measures how important a term is within a document relative
to a collection of documents (i.e., relative to a corpus).

Words within a text document are transformed into importance numbers by a text vectorization process. There
are many different text vectorization scoring schemes, with TF-IDF being one of the most common.

As its name implies, TF-IDF vectorizes/scores a word by multiplying the word’s Term Frequency (TF) with the
Inverse Document Frequency (IDF).

Term Frequency: The TF of a term is the number of times the term appears in a document divided by the
total number of terms in that document:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)

Inverse Document Frequency: The IDF of a term reflects the proportion of documents in the corpus that contain the
term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance
values than words common across all documents (e.g., a, the, and). A common (unsmoothed) formulation is:

IDF(t) = log( N / df(t) ), where N is the number of documents in the corpus and df(t) is the number of documents containing t

The TF-IDF of a term is calculated by multiplying its TF and IDF scores:

TF-IDF(t, d) = TF(t, d) * IDF(t)

Translated into plain English, the importance of a term is high when it occurs frequently in a given document and rarely
in others. In short, commonality within a document measured by TF is balanced by rarity between documents
measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the corpus.
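
A small worked computation of these formulas (a minimal sketch; production libraries such as scikit-learn use slightly smoothed variants of IDF, so their numbers differ a little):

import math

corpus = [
    "a rose is red a violet is blue".split(),
    "my love is like a red red rose".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)  # unsmoothed variant of IDF

for term in ("violet", "is"):
    for i, doc in enumerate(corpus, start=1):
        print(f"TF-IDF of '{term}' in Document {i}: {tf(term, doc) * idf(term, corpus):.3f}")
# "violet" occurs only in Document 1, so it gets a positive weight there;
# "is" occurs in every document, so its IDF (and hence its TF-IDF) is 0 everywhere.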

TF-IDF is useful in many natural language processing applications. For example, Search Engines use TF-IDF to
rank the relevance of a document for a query. TF-IDF is also employed in text classification, text
summarization, and topic modeling.
