NLP Notes
● Communication is crucial for exchanging information between agents and their environments.
● It involves producing and interpreting signs from a shared system of symbols.
● Effective communication allows agents to acquire and utilize information observed or inferred
by others, enhancing their decision-making and success.
Purpose of NLP:
● The field of NLP focuses on enabling computers to process and perform tasks using natural
human languages.
● NLP systems work with inputs like spoken language and written text.
● A key subfield, Natural Language Understanding (NLU), is concerned with machine reading
comprehension, interpreting the meaning from language input.
Goals of NLP:
● The main objective is to create software capable of analyzing, understanding, and generating
human-like language.
● The ultimate aim is for users to communicate with computers as naturally as they would with
another human being.
● Natural Language Processing (NLP) enables computer programs to understand and process
human speech in its natural form.
● It is a subset of artificial intelligence focused on interpreting complex and ambiguous human
language, including slang, dialects, and contextual factors.
● Traditional programming languages are structured and precise, whereas human language can
be ambiguous and context-dependent, posing a challenge for NLP development.
Approaches to NLP:
● Modern NLP relies heavily on machine learning, a subset of AI that identifies patterns in data
to improve understanding and performance.
● Machine learning helps in handling the unpredictability of human language by adapting to
diverse linguistic patterns and improving accuracy over time.
● NLP can accurately interpret complex sentences, understanding abbreviations, context, and
industry-specific terms. For instance:
○ Recognizing that "cloud" refers to "cloud computing."
○ Identifying "SLA" as an acronym for "Service Level Agreement."
● The ultimate aim is to eliminate the need for traditional programming languages.
● In the future, all computer interactions may rely solely on natural human language, making
communication with computers as intuitive as speaking with another person.
History of NLP:
● 1950s: NLP research began with Machine Translation (MT), focusing on converting text from
one language to another.
● Turing Test: Introduced by Alan Turing in the 1950s to evaluate a machine's ability to mimic
human conversation indistinguishably from a human.
● Linguistics and Cryptography: Early research included syntactic structures and language
translation.
● 1960s: Introduction of ELIZA, a popular NLP system simulating a psychotherapist's responses.
● Over time, NLP evolved from basic syntax analysis to include knowledge augmentation and
semantics, paving the way for machine learning-based approaches.
● Recent advancements involve multiple NLP systems driven by machine learning, with
competitions centered around the Turing Test.
Pragmatic Analysis in NLP:
● Pragmatics involves analyzing context and purpose, especially when resolving ambiguities that
arise at syntactic or semantic levels.
● Pragmatic analysis supports the interpretation of ambiguous phrases by considering the
context of the utterance.
Components of NLP
1. Natural Language Understanding:
○ Involves converting input in natural language to a meaningful internal representation.
○ Requires multiple levels of analysis:
■ Morphological Analysis: Study of word forms.
■ Syntactic Analysis: Structure of sentences.
■ Semantic Analysis: Meaning of sentences.
■ Discourse Analysis: Context of sentences in a conversation.
2. Natural Language Generation:
○ Producing natural language output from an internal representation.
○ Involves:
■ Deep Planning: Deciding what to communicate.
■ Syntactic Generation: Structuring sentences.
○ Natural Language Understanding is generally more complex than generation.
3. Planning in NLP
○ Involves breaking down complex problems into manageable subparts.
○ Refers to computing steps for problem-solving before execution.
1. Pattern Matching:
○ Utilizes predefined patterns to interpret input as a whole rather than breaking it down.
○ Hierarchical pattern matching can reduce complexity by matching sub-phrases
gradually.
○ Semantic primitives (core concepts) can be used instead of specific words to simplify
the matching process.
2. Syntactically Driven Parsing:
○ Focuses on combining words into larger syntactic units like phrases or sentences.
○ Uses grammar rules to interpret sentence structure, contrasting with pattern matching
by starting with smaller components and building up.
3. Semantic Grammars:
○ Combines both syntactic and semantic elements for analysis.
○ Categories in a semantic grammar are defined by their meaning, making it more
flexible.
4. Case Frame Instantiation:
○ An advanced technique that uses a recursive structure for interpretation.
○ Combines bottom-up (starting from small units) and top-down (starting from larger
context) approaches for analysis.
Levels and Tasks of NLP
NLP analysis is carried out at the following levels:
Morphological Analysis:
Syntactic Analysis:
Semantic Analysis:
Discourse Integration:
Pragmatic Analysis:
Prosody:
Stages in NLP
● Lexical Analysis is the first phase of NLP. It involves scanning the input text as a stream of
characters and converting it into meaningful lexemes (basic units of meaning).
● This phase divides the text into paragraphs, sentences, and words.
● Morphological Analysis examines the structure and formation of words, combining sounds
into minimal units of meaning (morphemes).
Semantic Analysis:
● Concerned with understanding the literal meaning of words, phrases, and sentences,
regardless of context.
● It focuses on what the words actually mean, leading to the creation of a meaningful
representation.
● Ambiguities may arise during this phase, as words can have multiple meanings.
Pragmatic Analysis:
● This is the final phase of NLP, dealing with intended effects and the inner meaning behind a
sentence.
● Pragmatic analysis is concerned with how sentences are used in different contexts.
● For example, the command "Open the door" can be interpreted as a request rather than an
order.
Discourse Integration:
World Knowledge:
Factual Knowledge:
Conceptual Knowledge:
Procedural Knowledge:
● Refers to the skills or processes necessary to carry out tasks or activities within a
domain.
● Often called "know-how," it’s about knowing the “how” to do something, including
techniques, methods, and steps.
● Examples: Solving equations, using software tools, or performing a scientific
experiment.
Meta-cognitive Knowledge:
Phonetic Knowledge
● This refers to the understanding of sound-symbol relationships and how sounds are
represented in a language.
● As children learn to talk, they develop phonemic awareness, which is recognizing the distinct
sounds (phonemes) in language.
● Phonemes are the smallest units of sound that can differentiate words (e.g., the difference
between the sounds /b/ and /p/ in "bat" and "pat").
● Example: When a child learns that the sounds /k/, /a/, and /t/ together form the word "cat."
Phonological Knowledge
● This involves the broader ability to recognize and manipulate the sound structure of language,
including words, syllables, and rhymes.
● Phonological awareness includes skills like counting syllables, segmenting words, and
recognizing patterns.
● It encompasses phonemic awareness, but also includes understanding how larger sound units
like syllables and rhymes work together in language.
● Example: Counting the number of syllables in "elephant" or segmenting the sentence "The cat
sleeps" into individual words.
● Phonological Awareness: Ability to recognize that words are made of different sounds, which
includes tasks like syllable counting, rhyming, and breaking down sentences into words.
● Phonemic Awareness: Focuses specifically on understanding and manipulating phonemes,
like identifying the number of sounds in a word.
Examples:
Lexical Ambiguity
● Definition: Ambiguity that arises from a single word having multiple meanings.
● Example: The word "silver" can be interpreted as:
○ A noun (a metal or color)
○ An adjective (describing color)
○ A verb (to coat with silver)
Syntactic Ambiguity
● Definition: Ambiguity that occurs when a sentence can be parsed in different ways due to
word arrangement.
● Example: "The man saw the girl with the telescope."
○ Did the man use a telescope to see the girl?
○ Or, was the girl holding the telescope?
Semantic Ambiguity
● Definition: Ambiguity that arises when the meaning of words or phrases is unclear, leading to
multiple interpretations.
● Example: "The car hit the pole while it was moving."
○ Interpretation 1: The car, while moving, hit the pole.
○ Interpretation 2: The pole was moving when the car hit it.
Anaphoric Ambiguity
● Definition: Ambiguity caused by the use of pronouns or other referring expressions that are
unclear.
● Example: "The horse ran up the hill. It was very steep. It soon got tired."
○ Does "it" refer to the hill (steep) or the horse (tired)?
Pragmatic Ambiguity
● Definition: Ambiguity that arises from the context of a phrase, leading to multiple
interpretations based on social or conversational context.
● Example: "I like you too."
○ Interpretation 1: "I like you just as much as you like me."
○ Interpretation 2: "I like you, just like I like someone else."
Importance of NLP for Indian Languages
● A significant portion of India's population, especially in rural areas, is literate in local languages
rather than English.
● Enhancing NLP for Indian languages can help bridge the digital divide and ensure wider
accessibility.
Digital Inclusion:
● The goal of a truly inclusive Digital India hinges on providing language support beyond English.
● The language barrier remains a challenge for smartphone usage, which is critical for accessing
information and digital services.
Applications in Agriculture:
● Farmers, who form a substantial part of India's economy, often lack English proficiency,
making it challenging to access modern agricultural knowledge.
● A voice-based application similar to Google Assistant but designed for Indian farmers could
significantly enhance their ability to access relevant information in their native language.
● Effective NLP for Indian languages is crucial for initiatives like precision agriculture, farmer
helplines, and knowledge sharing.
● Understanding farmer issues, including sensitive topics like farmer suicides, also requires
nuanced language processing capabilities.
● NLP can play a crucial role in enabling interpretation of sign languages and facilitating
communication through text-to-speech and speech-to-text technologies.
● This makes information more accessible to individuals with hearing or speech impairments.
Translation of Signboards:
● Translating signboards from local languages to English and other widely spoken languages can
make travel and navigation easier for non-native speakers and tourists.
● This helps create a more inclusive environment for both domestic and international travelers.
● Developing high-quality fonts for Indian scripts can significantly enhance the readability and
visual impact of advertisements, signboards, presentations, and reports.
● This ensures that written communication in local languages is clear and effective.
● For optimal results, there is a need for high-quality corpora and tools for Indian languages that
match the resources available for English.
● This includes comprehensive datasets, linguistic tools, and robust language models to support
diverse NLP applications.
Challenges to NLP
Language Differences:
Quality of Training Data:
● The performance of an NLP system depends heavily on the quality of training data.
● Poor-quality or biased data can lead to inaccurate or skewed results, impacting the system's
overall understanding of language.
Development Time:
Phrasing Ambiguities:
● Natural language often contains ambiguous phrasing that even humans struggle to interpret.
● NLP systems must be adept at understanding context and should be capable of seeking
clarification if needed.
Handling Misspellings:
Innate Biases:
● NLP systems can inherit biases from the programmers and the datasets used.
● Eliminating biases to ensure fairness and reliability across diverse user groups is a significant
challenge.
Words with Multiple Meanings (Polysemy):
● Many words have multiple meanings depending on the context, making interpretation
complex.
● Contextual understanding is crucial for accurately deciphering the intended meaning.
Multiple Intentions:
● Some user inputs have more than one intention, requiring the NLP system to handle each aspect
without oversimplification.
● For example, distinguishing between canceling an order and updating payment details in a
single query is essential.
Applications of NLP
Translation:
Speech Recognition:
● Speech recognition enables machines to understand spoken language and convert it into text.
It allows for hands-free interaction, such as voice commands.
● Example: Google Now, Siri, and Alexa recognize speech commands like "call Ravi" and
respond accordingly.
Sentiment Analysis:
● NLP is used to analyze emotions in text data (like social media posts or reviews). It can classify
opinions as positive, negative, or neutral, helping companies understand public sentiment
about their products or services.
● Sentiment analysis is particularly important in fields like the stock market, where public
sentiment can impact stock prices.
Chatbots:
● Chatbots are AI-powered tools designed to interact with users and answer queries
automatically. They can range from basic customer support systems to more advanced ones
capable of handling complex requests.
● In healthcare, chatbots can assess symptoms, schedule appointments, and recommend
treatments.
Question-Answer Systems:
● These systems use NLP to answer user queries by understanding context and providing
accurate responses.
● IBM’s Watson famously competed on the quiz show Jeopardy!, showcasing advanced NLP
and AI capabilities by answering complex questions accurately.
Text Summarization:
● This application condenses large amounts of text into shorter summaries while retaining the
key information.
● It is useful for generating news headlines, search results snippets, and summarizing long
reports.
Market Intelligence:
● NLP helps businesses analyze unstructured data to gain insights into market trends, consumer
behavior, and competitor activities.
● Market intelligence tools can track sentiment, keywords, and intent in data, aiding in strategic
decision-making.
Text Classification:
● This involves categorizing text based on its content, helping with tasks like organizing
information or filtering spam emails.
● NLP applications are used to classify spam vs non-spam emails or to tag content for
searchability.
Spelling and Grammar Correction:
● NLP tools can automatically detect and correct spelling and grammar errors in text, improving
writing quality.
● Example: Tools like Grammarly use NLP to highlight errors and suggest improvements.
Spam Detection:
● NLP and machine learning models are used to detect unwanted emails and classify them as
spam or not.
● This is crucial for managing email inboxes efficiently and preventing malicious content from
reaching users.
Information Extraction:
● This involves extracting structured data from unstructured documents. NLP helps convert
large amounts of unstructured text into a usable format for analysis.
● Example: Extracting data from financial reports or legal documents to facilitate quick
decision-making.
Natural Language Understanding (NLU):
● NLU converts human language into formal representations (e.g., logical structures) that are
easier for computers to process and manipulate.
● This allows machines to better understand complex language constructs and make decisions
based on them.
Advantages of NLP
Disadvantages of NLP
Tokenization
Tokenization is a foundational task in Natural Language Processing (NLP). It involves splitting a piece
of text into smaller units called tokens, which can be words, characters, or subwords. Tokenization
types include:
● Word Tokenization: Splits text by words (e.g., "Never give up" → "Never", "give", "up").
● Character Tokenization: Breaks text into individual characters (e.g., "smarter" → "s", "m", "a",
"r", "t", "e", "r").
● Subword Tokenization: Splits words into meaningful parts (e.g., "smarter" → "smart", "er").
● Tokens are essential for processing text in NLP models like Transformers, RNNs, GRUs, and
LSTMs.
● Tokenization is used to process sensitive data, allowing for security in credit card processing,
e-commerce, and more by replacing sensitive info with tokens.
Tokens in Security
Token Vault: Stores sensitive information securely. Some tokenization systems, however, use a
vault-less method, deriving tokens algorithmically rather than storing a mapping.
Token
Tokenization substitutes sensitive information with equivalent non-sensitive information. The
non-sensitive replacement information is called a token.
Word Tokenization
Word Tokenization uses delimiters to split text into words and underpins Word2Vec and GloVe
embeddings. Issues include:
● Out of Vocabulary (OOV): Words not in the training data vocabulary are unrecognized.
○ Solution: Replace rare words with an unknown token (UNK) to manage OOV.
● Vocabulary Size: Large corpora create extensive vocabularies, making memory management
challenging.
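A minimal sketch of the word tokenization and UNK handling described above; the tokenizer, toy vocabulary, and the <UNK> symbol are illustrative assumptions, not part of any particular library.

```python
# Minimal sketch of word tokenization with an UNK token for out-of-vocabulary words.
# The vocabulary below is a toy assumption, not built from a real corpus.

def word_tokenize(text):
    # Split on whitespace after stripping basic punctuation.
    return [tok.strip(".,!?").lower() for tok in text.split() if tok.strip(".,!?")]

vocabulary = {"never", "give", "up", "the", "cat", "sleeps"}

def map_to_vocab(tokens, vocab, unk="<UNK>"):
    # Replace any token not present in the training vocabulary with UNK.
    return [tok if tok in vocab else unk for tok in tokens]

tokens = word_tokenize("Never give up, keep going!")
print(tokens)                            # ['never', 'give', 'up', 'keep', 'going']
print(map_to_vocab(tokens, vocabulary))  # ['never', 'give', 'up', '<UNK>', '<UNK>']
```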
Character Tokenization
Character Tokenization represents text as characters, reducing OOV issues and limiting vocabulary
size (e.g., English has 26 letters). Drawbacks:
● Lengthy Sequences: Increases input and output sentence lengths, complicating learning.
Subword Tokenization
Subword Tokenization breaks down words using linguistic rules, capturing affixes that alter meanings
(e.g., "machinating" → "machinat", "ing"). Benefits:
● Manages OOV words by segmenting unknown words and retaining meaning through affixes.
Importance of Tokenization
Tokenization converts unstructured data into numerical vectors for machine learning. Tokenization is
the first step in any NLP pipeline. It has an important effect on the rest of your pipeline. A tokenizer
breaks unstructured data and natural language text into chunks of information that can be
considered as discrete elements. The token occurrences in a document can be used directly as a
vector representing that document.
This immediately turns an unstructured string (text document) into a numerical data structure
suitable for machine learning. They can also be used directly by a computer to trigger useful actions
and responses. Or they might be used in a machine learning pipeline as features that trigger more
complex decisions or behavior.
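A small illustration of the point above, turning token occurrences into a count vector using only Python's standard library; the example sentence is assumed for illustration.

```python
# Sketch: turning token occurrences into a simple count vector for a document.
from collections import Counter

document = "the cat sat on the mat"
tokens = document.split()              # word tokenization by whitespace
counts = Counter(tokens)               # token occurrence counts

vocab = sorted(counts)                 # toy vocabulary built from this document alone
vector = [counts[w] for w in vocab]
print(vocab)    # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)   # [1, 1, 1, 1, 2]
```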
Tokenization can separate sentences, words, characters, or subwords. When we split the text into
sentences, we call it sentence tokenization.
It enables:
Benefits of Tokenization
● Tokenization makes it more difficult for hackers to gain access to cardholder data. In older
systems, credit card numbers were stored in databases and exchanged freely over networks.
● It is more compatible with legacy systems than encryption.
● It is a less resource-intensive process than encryption.
● The risk of the fallout in a data breach is reduced.
● The payment industry is made more convenient by allowing new technologies like
Subword Tokenization
Sub-word tokenization is a more granular approach to breaking down text than standard word
tokenization. It involves breaking individual words into smaller units, often using linguistic rules like
affixes (prefixes, suffixes, and infixes). This allows the model to understand how parts of words
function, which is especially useful for handling out-of-vocabulary (OOV) words.
Key Concepts:
1. Affixes: Affixes are parts of words that modify their meaning. They include:
○ Prefixes (e.g., "un-" in "undo"),
○ Suffixes (e.g., "-ing" in "running"),
○ Infixes (less common, inserted within words).
2. Breaking Words into Sub-words: In sub-word tokenization, words are split into smaller
meaningful units. For example, the sentence "What is the tallest building?" might be tokenized
into:
○ 'what', 'is', 'the', 'tall', 'est', 'build', 'ing'.
3. Handling Out-of-Vocabulary (OOV) Words:
○ If a word is not in the model's vocabulary (OOV), it is still tokenized into smaller
subunits.
○ For example, the word "machinating" might be broken down into the unknown token
'machin' and the suffix 'ing'. While 'machin' might not be recognized, 'ing' can
provide valuable information.
4. Inferences from Suffixes:
○ Suffixes like -ing can indicate:
■ Present participle (e.g., "running" from "run"),
■ Noun form (e.g., "building" from "build").
○ The NLP model can infer that "machinating" might function as a verb in its present
participle form, which aids in understanding the word's role in a sentence.
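A rough sketch of suffix-based sub-word splitting in the spirit of the list above; the suffix list is hand-picked for illustration, whereas real systems (e.g., BPE or WordPiece) learn sub-word vocabularies from data.

```python
# Naive subword split using a hand-picked suffix list (an assumption for illustration);
# production tokenizers learn subword vocabularies (e.g., BPE/WordPiece) from data instead.

SUFFIXES = ("ing", "est", "ed", "er", "s")

def subword_split(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[: -len(suf)], suf]
    return [word]

for w in ["tallest", "building", "machinating", "what"]:
    print(w, "->", subword_split(w))
# tallest -> ['tall', 'est'], building -> ['build', 'ing'],
# machinating -> ['machinat', 'ing'], what -> ['what']
```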
Stemming
Stemming is a technique in Natural Language Processing (NLP) that reduces inflected words to their
root forms. It simplifies the words by removing their inflections (e.g., tense, gender, or mood) to make
them uniform and easier to process.
Key Points:
1. Inflection:
○ Inflection involves modifying a word to express different grammatical categories, such
as tense or gender.
○ For example, the word “connect” can have various forms like “connections”,
“connected”, and “connects”.
2. Stemming Process:
○ Stemming involves reducing words to their base or root form.
○ For instance, “connections”, “connected”, and “connects” all stem to "connect".
○ In some cases, the result might not be a valid word in itself, such as “troubl” from
"trouble", "troubled", and "troubles", which is not a recognized word but serves as the
stem.
3. Purpose of Stemming:
○ Stemming helps in normalizing text, reducing redundancy, and preventing models
from overfitting due to variations of the same word.
○ It simplifies the words into their basic form, reducing the complexity for NLP models,
especially when analyzing large datasets.
4. Importance:
○ Data Reduction: Stemming reduces the number of unique terms in a dataset by
consolidating different forms of a word into one.
○ Improved Performance: By reducing words to their root form, stemming helps to
avoid redundancy and improves the efficiency of text processing, making NLP models
more effective.
○ Normalization: It ensures that different forms of the same word are treated as the
same, which improves model generalization and understanding of the data.
Challenges in Stemming
Stemming, while useful, has two primary challenges:
1. Overstemming:
○ Occurs when a word is truncated too much, leading to a nonsensical stem.
○ Example: "universal", "university", and "universe" are all reduced to "univers", which can
create confusion, as these words have distinct meanings in modern contexts. This can
negatively affect search results or understanding in NLP applications.
2. Understemming:
○ Occurs when related words are not reduced to the same stem due to linguistic
variations or complexity.
○ Example: "alumnus", "alumni", "alumna", and "alumnae" are all forms of the same word
in Latin, but they are not treated as equivalents in the stemming process, leading to
inconsistent results in NLP tasks.
Text Stemming
Stemming is a process in Natural Language Processing (NLP) where inflected or derived words are
reduced to their base or root form. This helps in treating different forms of a word as the same, thus
simplifying analysis and improving the effectiveness of NLP models. The process of stemming involves
removing prefixes and suffixes added to words, leading to their root form.
1. Root Form: The basic version of a word, from which other forms or variations are derived.
○ Example: The root of "walking," "walks," and "walked" is "walk."
2. Suffixes and Prefixes: These are added to words to change their meaning or grammatical
form.
○ Example: "Consult" can become "consultant," "consulting," "consultative," and
"consultants," but the stem remains "consult."
3. Stemming Algorithm: NLP algorithms called stemmers are used to remove suffixes and
prefixes from words, reducing them to their root form.
○ For example, a stemming algorithm would take words like "walking," "walked," and
"walks" and convert them to "walk."
Example:
In this example, the stemming algorithm identifies and reduces all the different forms of "Consult" to
their base form, "consult," despite the addition of different suffixes and prefixes.
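One widely available implementation is NLTK's Porter stemmer; a short usage sketch follows (assuming the nltk package is installed). Exact outputs depend on the stemmer's rules, but the cases below match the examples given in these notes.

```python
# Sketch using NLTK's Porter stemmer (assumes `pip install nltk`).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connections", "connected", "connects", "trouble", "troubled", "troubles"]:
    print(word, "->", stemmer.stem(word))
# connections / connected / connects -> connect
# trouble / troubled / troubles      -> troubl  (not a dictionary word, but a valid stem)
```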
Stemming can introduce errors due to the complexity and variability of language. Two main types of
errors are associated with stemming:
1. Overstemming:
○ Definition: This error occurs when the stemming algorithm removes too much of a
word, resulting in words with different meanings being reduced to the same stem.
○ Problem: The algorithm mistakenly groups words that have different meanings under
the same root, even though they should not be considered equivalent in context.
○ Example: Consider the words "university," "universities," "universal," and "universe." If
a stemmer reduces all these words to the stem "univers," it’s an example of
overstemming. While "university" and "universities" belong together, "universe" and
"universal" carry different meanings and should not share the same stem.
○ Overstemming can lead to nonsensical results and affect the quality of information
retrieval or text analysis.
2. Understemming:
○ Definition: This error happens when the stemming algorithm fails to reduce a set of
related words to the same stem, treating them as separate words instead.
○ Problem: It occurs when the algorithm does not perform aggressive stemming, leaving
related words as different stems, thus failing to group them effectively.
○ Example: The words "alumnus," "alumni," "alumna," and "alumnae" are all related but
may not be reduced to a common stem, causing them to be treated as distinct entities.
Lemmatization
Lemmatization is the process of reducing words to their base or root form, known as a lemma, by
grouping together inflected forms of a word that share the same meaning. Unlike stemming, which
simply removes prefixes and suffixes, lemmatization involves a more comprehensive approach by
taking the context into account and converting words to their dictionary form.
Example:
● "leaf" → "leaves"
● "studying" → "study"
● "ran" → "run"
The term "leafs" would be lemmatized to "leaf" and "studying" to "study," helping in understanding
the intended meaning rather than just reducing the word form.
Example:
● NLTK provides a WordNet Lemmatizer, which uses the morphy() function from the WordNet
corpus to find the lemma of words.
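A short usage sketch of NLTK's WordNet lemmatizer (assuming NLTK and the WordNet data are available); note that supplying a part-of-speech hint changes the result for verbs.

```python
# Sketch with NLTK's WordNet lemmatizer (requires nltk and its WordNet data:
# run nltk.download('wordnet') once). Without a pos hint, verbs may be left unchanged.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("leaves"))             # leaf
print(lemmatizer.lemmatize("studying", pos="v"))  # study
print(lemmatizer.lemmatize("ran", pos="v"))       # run
```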
Importance of Lemmatization
1. Vital for NLU and NLP: Lemmatization plays a key role in Natural Language Understanding
(NLU) and Natural Language Processing (NLP), where accurately processing and
interpreting words is crucial.
2. Artificial Intelligence & Big Data: It's significant in both AI and big data analysis as it helps to
normalize words, improving data processing efficiency.
3. Accuracy: Lemmatization is more accurate than stemming as it ensures that words are
reduced to meaningful forms, making it more suitable for understanding user input in
applications like chatbots.
4. Slower than Stemming: While lemmatization provides higher accuracy, it is computationally
more expensive and slower than stemming due to its reliance on vocabulary and
morphological analysis.
Advantages:
1. More Accurate: Lemmatization is more accurate than stemming because it reduces words to
their root form based on context, ensuring that words with the same meaning are grouped
together, even if their inflections differ.
2. Uses Dictionary Forms: Unlike stemming, which just cuts off prefixes or suffixes,
lemmatization retrieves the root word from a dictionary, ensuring the result has meaning. For
example, "running" becomes "run," which is a valid dictionary word.
3. Better Context Recognition: Lemmatization is particularly beneficial for chatbots, as it
considers the exact and contextual meaning of words, improving the understanding of user
input and generating more accurate responses.
Disadvantages:
1. Time-Consuming and Slow: Lemmatization can be slower than stemming due to the need for
morphological analysis and vocabulary lookup, making it less efficient in real-time
applications.
2. Slower Algorithms: Since lemmatization requires a deeper analysis (e.g., checking the word in
a dictionary or corpus), the algorithms tend to be slower compared to stemming algorithms,
which simply trim the words.
ENGLISH MORPHOLOGY
Morphology is the study of the internal structure of words, focusing on how the components within a
word (such as stems, prefixes, and suffixes) are arranged or modified to convey different meanings. In
English, morphology plays a crucial role in modifying words to express various grammatical aspects
like tense, number, or class.
Key Points about English Morphology:
1. Morphemes: The smallest units of meaning in a language. For instance, in the word cats, "cat"
is the root morpheme, and "s" is a morpheme indicating plurality.
2. Affixes: English morphology frequently involves adding affixes (prefixes, suffixes) to root
words to form new words or alter their meaning. Examples include:
○ Plurality: Adding "s" or "es" to a noun to indicate plurality (e.g., cat → cats).
○ Past Tense: Adding "ed" to a verb to indicate past tense (e.g., walk → walked).
○ Adjective to Adverb: Adding "ly" to an adjective to form an adverb (e.g., happy →
happily).
3. Morphological Analysis in NLP: In Natural Language Processing (NLP), morphological analysis
helps computers understand the internal structure of words and their roles in sentences. This
understanding is essential for tasks like part-of-speech tagging and syntactic parsing.
4. Morphology in English vs. Other Languages: English is considered a "moderate" morphology
language compared to languages like Latin or Russian, which have complex inflection systems.
English relies more on word order than inflections to convey grammatical relationships (e.g.,
subject-verb-object order).
Kinds of Morphology
Morphology in linguistics is divided into two main categories: Inflectional Morphology and
Derivational Morphology. These categories help in understanding how words change in form and
meaning.
1. Inflectional Morphology
● Definition: Inflectional morphology involves changes to a word to express grammatical
features, such as tense, number, case, gender, or person, but it does not change the core
meaning or the part of speech of the word.
● Characteristics:
○ Regular: Inflectional morphemes apply to most or all words within a category. For
example, all countable nouns have a singular and plural form, and all verbs can be
conjugated to indicate different tenses.
○ Productivity: Inflectional rules are productive, meaning they can be applied to new
words that fit the category. For example:
■ Count nouns: dog → dogs (plural).
■ Verbs: talk → talked (past tense), run → running (present participle).
● Conveys Grammatical Information: Inflectional morphology provides crucial grammatical
details like number, tense, person, gender, and case. For example:
○ Number: "cat" (singular) → "cats" (plural).
● Meaning and Category Do Not Change: Unlike derivational morphology, inflection does not
change the basic meaning of the word or its part of speech.
○ For instance, the noun "cat" remains a noun even when it is inflected to "cats" (plural).
● Inflection of Root Word: The root word (or stem) can be inflected to form different
grammatical variations, but it stays within the same word class. For example:
○ Nouns: "dog" → "dogs" (plural), "fox" → "foxes" (irregular plural).
● Creation of Different Forms: Inflection produces different forms of the same word, keeping
the word's meaning intact but altering its grammatical properties. For example:
○ "work" (present) → "works" (third-person singular present).
● Examples:
○ Nouns: "cat" → "cats" (plural), "child" → "children" (irregular plural).
○ Verbs: "walk" → "walks" (third-person singular present), "talked" (past tense).
2. Derivational Morphology
● Definition: Derivational morphology changes a word’s form and often alters its part of speech
(form class). It can create new words or change the meaning of existing ones by adding
prefixes or suffixes.
● Characteristics:
○ Changes Part of Speech: Derivational morphemes often change the grammatical
category of a word, such as turning a noun into a verb, an adjective into a noun, etc.
○ Not Always Regular: Derivational morphology is not always as productive as
inflectional morphology. It can be irregular or less commonly applied, especially in
specific contexts or more specialized vocabulary.
○ Useful in Specialized Domains: Derivational morphemes are especially useful for
creating abstract nouns, forming technical terms, or developing scientific registers.
● Creating New Words: Derivation involves combining affixes with root words to form new
words. These new words can then act as roots for further derivations.
○ Example: Adding the suffix "-ness" to the adjective "happy" forms the noun "happiness."
● Derived from Root Words: In derivational morphology, new words are directly derived from
existing root words. The meaning of the derived word can differ significantly from the original
word.
○ For example, "perform" (verb) can be derived into "performance" (noun).
● Complexity in English Derivation: English derivation is complex due to several reasons:
○ Less Productive: Some affixes can only be applied to specific types of words. Not all
verbs or nouns can accept any given derivational affix.
○ Example: The verb "summarize" can combine with the suffix "-ation" to form
"summarization," but not all verbs can take the "-ation" suffix.
● Complex Meaning Differences : Some derivational suffixes can create words with
significantly different meanings, even when derived from the same root.
○ "Conformation" and "conformity" are both derived from the root word "conform," but
they have different meanings:
■ Conformation refers to the shape or structure of something.
■ Conformity refers to the act of adhering to rules, standards, or laws.
● Examples:
○ Noun to Adjective:
■ photograph (noun) → photographic (adjective).
○ Adjective to Noun:
■ clear (adjective) + -ance → clearance (noun),
■ clear (adjective) + -ity → clarity (noun).
○ Noun to Verb:
■ nation (noun) + -al (adjective) → national (adjective),
■ national (adjective) + -ize → nationalize (verb),
■ nationalize (verb) + -ation → nationalization (noun).
○ Complex Derivations:
■ denationalization (noun) (process of reversing the nationalization of something).
● Productivity: Some derivational morphemes are highly productive, like -ize, which can be
added to many base words to form verbs (e.g., maximize, minimize, modernize).
Dictionary Lookup in NLP
In Natural Language Processing (NLP), dictionary lookup refers to the process of referencing a
pre-compiled list of unique words (or terms) that appear in a given corpus. A dictionary in NLP
contains not just individual words, but can also include multi-word terms that represent a single
concept. These terms are mapped to their corresponding linguistic representations and annotations,
which can help in further text analysis tasks.
Dictionary Definition:
● A dictionary in NLP is a collection of unique words or terms that occur in the text corpus.
Words are listed only once, even if they appear multiple times across different documents.
● Each term in the dictionary is associated with a term ID, which is a unique identifier.
Types of Terms:
● The dictionary may contain single words or multi-word terms that represent a single concept
(e.g., a list of country names to extract the concept of "country").
● For example, terms like "United States" or "New York" may be included as multi-word terms in
the dictionary for better concept extraction.
Variants of Terms:
● A dictionary can include different forms of a base term, like the plural form of a noun, or
different tenses of a verb. This helps capture variations in how terms are used in different
contexts.
Morphological Parsing:
● Morphological parsing involves associating word forms with their linguistic descriptions. A
dictionary-based approach to this parsing process directly links words to their precomputed
analyses.
● The dictionary or word list is typically structured to enable fast lookups of word forms, allowing
for efficient analysis and retrieval of linguistic features (e.g., tense, number, etc.).
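A toy dictionary-based morphological lookup sketched under the assumption of a small hand-built lexicon; the entries and annotation fields are illustrative, not from a real resource.

```python
# Toy dictionary-based morphological lookup; entries and annotations are illustrative.
ANALYSES = {
    "cats":   {"lemma": "cat",  "pos": "NOUN", "number": "plural"},
    "cat":    {"lemma": "cat",  "pos": "NOUN", "number": "singular"},
    "walked": {"lemma": "walk", "pos": "VERB", "tense": "past"},
}

def analyze(word_form):
    # Constant-time lookup of a precomputed analysis; None signals an unknown form.
    return ANALYSES.get(word_form.lower())

print(analyze("Cats"))     # {'lemma': 'cat', 'pos': 'NOUN', 'number': 'plural'}
print(analyze("walking"))  # None -> would need rule-based or FST analysis instead
```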
Detailed Explanation
Morphological parsing is an important task in language processing, where word forms are associated
with their corresponding linguistic properties. A dictionary-based approach to this process works by
having an extensive list of word forms and their corresponding linguistic descriptions.
● Finite-State Transducers (FSTs): FSTs extend the power of finite-state automata. They consist of
a finite set of states connected by arcs (edges), with each arc labeled with pairs of input and
output symbols.
● The transducer processes an input sequence (e.g., a word form), navigating through states and
producing an output sequence (e.g., the word's lemma or another morphological form).
● The transducer defines a regular relation between input and output languages. For example,
it can translate words like vnuk to grandson, pravnuk to great-grandson, etc.
● In morphological analysis, surface strings represent the observed forms of words, while
lexical strings (lemmas) represent their underlying or base forms.
● For instance, the surface form "bigger" has the lexical form "big + Adj + comp", indicating that
"bigger" is the comparative form of the adjective "big".
● FSTs are used to define relations between surface forms and their corresponding lemmas (e.g.,
the relationship between running and run).
● In these transducers, a path from the initial state to a final state corresponds to a mapping
between a surface form and its lemma. The transducer is constructed by defining regular
expressions to describe these relations, which are then compiled into the transducer.
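Building a real FST requires a finite-state toolkit; as a stand-in, the surface-to-lexical relation described above can be sketched as a simple lookup table (the mappings shown are assumed examples).

```python
# Stand-in for an FST's surface-to-lexical relation, shown as a lookup table.
# A real transducer would compute these mappings from compiled regular expressions.
SURFACE_TO_LEXICAL = {
    "bigger":  "big +Adj +Comp",
    "running": "run +Verb +PresPart",
    "foxes":   "fox +Noun +Pl",
}

def to_lexical(surface):
    return SURFACE_TO_LEXICAL.get(surface, surface + " +Unknown")

print(to_lexical("bigger"))  # big +Adj +Comp
print(to_lexical("foxes"))   # fox +Noun +Pl
```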
Two Key Challenges in Morphology:
1. Morphotactics:
○ Morphotactics refers to the rules that govern how morphemes (the smallest units of
meaning) are ordered and combined to form words. For example, in English the suffix -less
attaches to the noun pity to form pitiless, and -ness can then attach to pitiless to form
"pitilessness"; but -ness cannot attach directly to pity (so "pityness" is not a word).
○ Some languages exhibit non-concatenative processes such as interdigitation
(intercalating morphemes) or reduplication, in addition to simple concatenation.
2. Morphological Alternations :
○ Morphological alternations refer to changes in the shape of morphemes depending on
their environment. For instance, the stem of the verb "die" changes to "dy-" when the
suffix -ing is added ("dying"), a morphophonemic alternation that needs to be captured in
the model.
Orthographic Rules:
● These are general rules used for word decomposition. They govern how words are
transformed in written form, such as how fox becomes foxes in the plural form.
● An example is the rule that singular English words ending in -y change to -ies when pluralized
(e.g., city becomes cities).
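A minimal sketch of the two orthographic rules just mentioned; irregular forms (e.g., child → children) would still need a separate exception list.

```python
# Sketch of two orthographic pluralization rules; irregular nouns need an exception list.
def pluralize(noun):
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"       # city -> cities
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"             # fox -> foxes
    return noun + "s"                  # cat -> cats

for n in ["city", "fox", "cat"]:
    print(n, "->", pluralize(n))
```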
Morphological Rules:
● Morphological rules refer to exceptions to orthographic rules and are necessary when parsing
more complex word forms.
● These rules account for non-standard word transformations, such as irregular plural forms like
child to children or mouse to mice.
● With the rise of neural networks in natural language processing (NLP), the use of FSTs has
become less common, especially for languages with abundant training data. Neural networks
can perform morphological analysis with high accuracy and handle the complexity of
morphological rules more flexibly.
Applications of Morphological Parsing:
1. Machine Translation:
○ Morphological parsing aids in translating words accurately by identifying the correct
base forms and inflections across languages.
2. Spell Checkers:
○ Morphological analysis helps spell checkers by identifying not only correct spellings but
also valid morphemes, enabling more sophisticated error detection and correction.
3. Information Retrieval:
○ In information retrieval, understanding the morphology of a word helps improve search
queries by recognizing variations of words and retrieving relevant results.
Module 3: Syntax Analysis
Rule-Based POS Tagging
1. Dictionary/Lexicon Lookup:
○ The tagger first consults a dictionary or lexicon to assign possible POS tags to each
word in the sentence.
○ A word can have multiple possible tags if it has different meanings or usages in the
language (e.g., run can be a noun or a verb).
2. Disambiguation Using Rules:
○ If a word has multiple potential tags, the tagger uses a set of hand-written rules to
choose the most likely correct tag based on the context.
○ These rules analyze linguistic features such as the preceding and following words to
handle ambiguity.
● If a word is preceded by an article (e.g., the) or an adjective (e.g., beautiful), then the word is
likely to be a noun.
● Such rules are encoded in a tagger to resolve tagging ambiguities.
1. First Stage:
○ Uses a dictionary to assign a list of potential POS tags to each word.
2. Second Stage:
○ Applies a series of manually created disambiguation rules to narrow down the list to a
single POS tag for each word.
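A toy two-stage rule-based tagger in the spirit of the description above; the lexicon, tag names, and the single disambiguation rule are illustrative assumptions, not from a real tagger.

```python
# Toy two-stage rule-based tagger: a lexicon proposes candidate tags, then a
# hand-written context rule picks one. Lexicon and rule are illustrative.
LEXICON = {
    "the": ["DET"],
    "beautiful": ["ADJ"],
    "run": ["NOUN", "VERB"],
    "fast": ["ADV", "ADJ"],
}

def tag(sentence):
    words = sentence.lower().split()
    tags = []
    for i, w in enumerate(words):
        candidates = LEXICON.get(w, ["NOUN"])        # default guess for unknown words
        if len(candidates) > 1 and i > 0 and tags[i - 1] in ("DET", "ADJ"):
            # Rule: after an article or adjective, prefer the noun reading.
            chosen = "NOUN" if "NOUN" in candidates else candidates[0]
        else:
            chosen = candidates[0]
        tags.append(chosen)
    return list(zip(words, tags))

print(tag("The beautiful run"))  # [('the', 'DET'), ('beautiful', 'ADJ'), ('run', 'NOUN')]
```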
Properties of Rule-Based POS Tagging:
1. Knowledge-driven Taggers:
○ Rule-based POS taggers rely on expert knowledge to manually define the rules, making
them knowledge-driven.
2. Manual Rule Creation:
○ The rules are created by linguists or experts who understand the language's structure
and grammar.
3. Large Set of Rules:
○ Rule-based taggers typically require a substantial number of rules (around 1000 rules)
to cover various linguistic cases and handle exceptions.
4. Explicit Smoothing and Language Modelling:
○ Rule-based taggers explicitly define smoothing techniques to handle words not found in
the lexicon (out-of-vocabulary words) and ensure proper language modeling.
Advantages of Rule-Based POS Tagging:
● Accuracy for Well-Defined Languages: Highly accurate for languages with well-defined
grammar and syntax.
● Interpretability: The rules are interpretable, allowing linguists to understand why a word was
tagged a certain way.
● Consistency: Provides consistent tagging if the rules are comprehensive.
Stochastic POS Tagging
1. Word-Frequency Approach:
○ This approach disambiguates words by looking at how frequently a word appears with
each possible tag in the training data.
○ The tag that appears most frequently with the word is chosen when tagging.
○ Example: If the word "bank" is tagged as a noun (N) 70% of the time and as a verb (V)
30% of the time in the training data, the tagger will choose Noun if it encounters the
word "bank" again.
○ Limitation: This method can produce inappropriate sequences of tags, as it does not
consider the context of the entire sentence, leading to errors in complex scenarios.
2. Tag Sequence Probability Approach:
● Instead of just looking at individual word frequencies, this method calculates the probability
of sequences of tags occurring together.
● It assigns the best tag for a word based on the probability of that word appearing with the
preceding tags in the sentence.
● N-gram Approach:
○ Unigram: Consider each word individually.
○ Bigram: Consider the probability of a tag given the previous tag.
○ Trigram: Consider the probability of a tag given the two preceding tags.
● This approach is more context-aware and often more accurate than the Word-Frequency
approach.
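A sketch of the word-frequency (unigram) baseline described above, using an assumed toy tagged corpus; a bigram or trigram tagger would additionally condition on the preceding tag(s).

```python
# Sketch of the word-frequency (unigram) baseline: pick each word's most frequent
# tag in a (toy, assumed) tagged training corpus.
from collections import Counter, defaultdict

training = [("bank", "NOUN"), ("bank", "NOUN"), ("bank", "VERB"),
            ("run", "VERB"), ("run", "NOUN"), ("run", "VERB")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NOUN"):
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else default

print(most_frequent_tag("bank"))  # NOUN (2 of 3 occurrences)
print(most_frequent_tag("run"))   # VERB (2 of 3 occurrences)
```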
Transformation-Based Learning (TBL) Tagging
● Rule-Based: Like traditional rule-based taggers, TBL utilizes rules to determine which tags to
assign.
● Machine Learning: Similar to stochastic taggers, it incorporates machine learning by
automatically learning rules from a training dataset.
● Readable Rules: TBL maintains the linguistic knowledge in a human-readable form, making it
easy to understand why certain decisions are made.
● Initialization: It starts with an initial tagging of the text. This can be a simple method, such as
assigning the most frequent tag from the training data for each word.
● Refinement: The initial tags are refined using transformation rules, which specify how to
change the current tag based on the context. The tagger iteratively applies the most beneficial
transformation.
● Iteration: The process continues in cycles until no further transformations improve the
tagging accuracy.
1. Begin with an Initial Solution : TBL starts with a basic tagging solution. This initial state might
involve assigning the most common tag for each word based on a training corpus.
2. Selecting the Most Beneficial Transformation: In each cycle, the system evaluates multiple
potential transformations. It selects the transformation rule that results in the most
significant improvement in tagging accuracy. A transformation rule could be: Change a tag from
X to Y if the preceding word is Z.
3. Applying the Transformation: The selected transformation is applied to the text, modifying
the tags accordingly.
4. Stopping Condition: The process repeats until no more beneficial transformations can be
found, indicating that the tagging is as accurate as possible.
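A minimal sketch of applying one TBL-style transformation rule to an initial tagging; the initial tags and the rule itself are assumed for illustration, not learned from data here.

```python
# Sketch of one TBL-style transformation: start from most-frequent tags, then apply
# a rule "change VERB to NOUN if the previous word is a determiner".
initial = [("the", "DET"), ("race", "VERB"), ("ended", "VERB")]

def apply_rule(tagged):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "VERB" and out[i - 1][1] == "DET":
            out[i] = (word, "NOUN")     # transformation: VERB -> NOUN after a determiner
    return out

print(apply_rule(initial))
# [('the', 'DET'), ('race', 'NOUN'), ('ended', 'VERB')]
```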
Advantages of Transformation-Based Learning (TBL):
1. Small and Simple Rule Set : Only a small number of transformation rules are needed to
achieve effective tagging. These rules are typically straightforward and easy to manage.
2. Ease of Development and Debugging : The rules are human-readable, making it easier to
understand and modify them. Debugging the model becomes simpler since the impact of each
rule is transparent.
3. Reduced Complexity : By combining machine-learned rules with manually written ones, TBL
simplifies the tagging process without sacrificing accuracy.
4. Efficiency : TBL is generally faster than probabilistic models like Markov-Model taggers due to
its simpler rule application.
Disadvantages of Transformation-Based Learning (TBL):
1. No Probability Estimation : TBL does not assign probabilities to the tags. This means it lacks
the statistical foundation found in stochastic models, which makes probabilistic reasoning
impossible.
2. Slow Training Time with Large Corpora : When dealing with large datasets, the training
phase in TBL can be slow, as it involves evaluating numerous transformations over many
cycles.
Challenges in POS Tagging
● Definition: The primary challenge in POS tagging is handling ambiguity. Many words in English
can serve multiple functions, leading to uncertainty in tagging.
● Example: The word "shot" can be tagged as a noun (He took a shot) or a verb (He shot the ball).
Disambiguating the correct POS requires understanding the context in which the word
appears.
● In English, common words often have several meanings, each associated with a different POS.
This can complicate the tagging process since the correct tag is context-dependent.
● Impact: Inaccurate tagging leads to downstream error propagation, affecting subsequent
NLP tasks like parsing, named entity recognition, or machine translation.
● To enhance tagging accuracy, POS tagging can be integrated with other processes, such as
dependency parsing. Joint approaches can provide better results than treating POS tagging as
an isolated task.
Context Dependency:
● The POS tag for a word is not solely determined by the word itself but is often influenced by
the neighboring words. The surrounding context, such as the preceding and following words,
plays a significant role in disambiguating POS.
Word Probabilities:
● The likelihood of a word being a certain part of speech can help resolve ambiguity. For
instance, "man" is more frequently used as a noun than a verb, making the noun tag more
probable in the absence of strong contextual clues.
Generative Models
1. Capability to Generate Data : A generative model has the ability to create new data instances
that resemble real examples. For instance, it could generate images of animals that look
convincingly real based on learned patterns.
2. Joint Probability:
○ Given a set of data instances X and labels Y, generative models are concerned with
capturing the joint probability P(X,Y). This means they can represent the probability of
both the data and the associated labels occurring together.
○ If there are no labels, generative models focus on the probability P(X), which represents
the likelihood of the data itself.
3. Understanding Data Distribution: Generative models aim to learn the underlying
distribution of data, allowing them to assign probabilities to new instances. For example,
models predicting the next word in a sequence are generative because they estimate the
likelihood of a particular word sequence appearing.
HMM for POS Tagging
States : In POS tagging, each possible part-of-speech tag (like noun, verb, adjective) is a state. These
states are "hidden" because the true tag sequence is not directly observed.
Observations : The words in the input sentence are considered as observable events. Based on these
words, the HMM infers the sequence of hidden states (tags).
Transition Probability : This represents the probability of moving from one tag to another. For
example, the probability that a noun is followed by a verb.
Emission Probability : This measures the likelihood of observing a specific word given a particular
tag. For instance, how likely the word "run" is tagged as a verb.
Viterbi Algorithm : This is a dynamic programming technique used with HMMs to find the most
probable sequence of states (tags) for a given sequence of observations (words). It computes the
optimal path through a sequence by maximizing probabilities.
Markov Models
Markov models are probabilistic models used to describe a sequence of possible events, where the
probability of each event depends only on the state attained in the previous event. There are two
types:
1. Observable Markov Model (Markov Chain) : Each state is directly visible to the observer, and
there are no hidden variables. An example is predicting weather conditions (sunny, rainy)
where transitions depend only on the current state.
2. Hidden Markov Model (HMM) : The states are not directly visible, and instead, observations
provide indirect evidence about the states. HMM is particularly useful in cases where the
sequence of events is partially hidden.
Markov Chains
A Markov Chain is a way to predict a sequence of events where each event depends only on the
event right before it. In simple terms, it’s a system that moves from one state to another, and the
future state depends only on the present state, not on the entire past history. Imagine you're playing
a simple board game where you roll a dice and move to different spaces. The number you roll decides
where you go next, but it doesn't matter where you started or what you rolled before—only the
current roll matters. That's how a Markov Chain works!
States: These are the different situations you can be in. In the board game example, each space on
the board is a state.
Transition: Moving from one state to another. In the game, each dice roll is a transition from one
space to another.
Probability: Each transition has a probability. For example, if you’re in a certain space, you might
have a 50% chance to go to one space and a 50% chance to go to another, based on your dice roll.
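A minimal Markov chain sketch in Python with assumed weather-transition probabilities, illustrating that the next state depends only on the current one.

```python
# Minimal Markov chain sketch: weather states with assumed transition probabilities.
import random

transitions = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

def next_state(state):
    # The next state depends only on the current state (Markov property).
    states, probs = zip(*transitions[state].items())
    return random.choices(states, weights=probs)[0]

state = "Sunny"
for _ in range(5):
    state = next_state(state)
    print(state, end=" ")
```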
Hidden Markov Models (HMM)
How It Works:
1. Hidden States : These are the variables you cannot directly observe. For example, in weather
prediction, the hidden states might be "Rainy" or "Sunny." Although you can't see these states
directly, you can make educated guesses about them based on observable data.
2. Observations : These are the events or data you can see. For instance, someone carrying an
umbrella might be an observable event, which can give a hint about the hidden weather state.
3. Markov Assumption: HMMs rely on the assumption that each hidden state only depends on
the previous hidden state (memoryless property).
4. Components of an HMM:
○ Initial Probability Distribution: This tells you the starting likelihood of each hidden
state.
○ Transition Probability Distribution: The probability of moving from one hidden state
to another. For example, the chance of going from "Rainy" to "Sunny."
○ Emission Probabilities: These define the likelihood of an observable event given a
hidden state. For example, the probability of seeing someone shopping if the weather is
"Sunny."
○ Sequence of Observations: The series of visible events that you use to make guesses
about the hidden states.
Imagine you want to predict what someone is doing based on the weather, but you can’t see the
weather directly. Instead, you can observe activities like "shopping," "walking," or "cleaning."
How It Works:
● HMMs use the current hidden state to predict future observations and hidden states.
● The hidden states help make predictions, but you only get clues about them through
observable events.
● For example, if you notice someone frequently walking outside, it might hint that it's "Sunny"
rather than "Rainy."
Viterbi Algorithm
The Viterbi Algorithm is a dynamic programming technique used to find the most probable
sequence of hidden states in a Hidden Markov Model (HMM), given a sequence of observed events.
It’s often used in applications like speech recognition, part-of-speech tagging, and bioinformatics.
1. Initialization: Start by setting up the initial probabilities for each hidden state based on the
first observation.
2. Recursion: For each subsequent observation, calculate the probability of each hidden state
using the previous states. This involves choosing the path that maximizes the likelihood.
3. Backtracking: Once all observations are processed, trace back the sequence of hidden states
that led to the highest probability.
The Viterbi Algorithm ensures that you get the optimal hidden state sequence efficiently, even for
complex data sequences, by narrowing down to the most likely paths as it processes the
observations.
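A compact Viterbi sketch for the weather/activity example above; the start, transition, and emission probabilities are assumed toy values, not estimated from data.

```python
# Minimal Viterbi sketch for the weather example; all probabilities are toy values.
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p) for p in states
            )
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    state = max(V[-1], key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = V[t][state][1]
        path.insert(0, state)
    return path

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3}, "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p))
# ['Sunny', 'Rainy', 'Rainy']
```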
Issues in HMM
● The main problem with HMM POS tagging is ambiguity.
● The tagging is based on the probability of a tag occurring, so there is no probability available for
words that do not exist in the training corpus.
● The testing corpus is typically different from the training corpus, so such unseen words do occur.
● In its simplest form, the tagger chooses the most frequent tag associated with a word in the
training corpus.
● An HMM is a doubly embedded stochastic model, in which the underlying stochastic process is
hidden.
● The hidden stochastic process can only be observed through another set of stochastic
processes that produces the sequence of observations.
Module 4: Semantic Analysis
Introduction to Semantic Analysis
Semantic Analysis is the process of interpreting and finding meaning in text. It helps computers
understand sentences, paragraphs, or documents by analyzing their grammatical structure and
identifying how individual words relate in a particular context. The primary goal is to derive the exact
or dictionary meaning from the text, checking if it makes logical sense.
For instance, consider the sentence, "Govind is great." The context is crucial to determine if "Govind"
refers to Lord Govind or a person named Govind. Semantic analysis aims to resolve such ambiguities.
Semantic analysis is used in applications such as:
● Machine translation
● Chatbots
● Search engines
● Text analysis
These applications extract significant information, ensuring the accurate meaning of a sentence.
While syntactic analysis considers word types, semantic analysis goes deeper into the meanings and
relationships between words.
In Natural Language Processing (NLP), semantic analysis plays a crucial role. It clarifies the context
and emotions behind a sentence, enabling computers to extract relevant information and perform
tasks with human-like accuracy.
Elements of Semantic Analysis
1. Entities: These are individual, specific items or names, like a person, place, or object.
Examples include "Haryana," "Kejriwal," and "Pune."
2. Concepts: These represent general categories or types to which entities belong, such as
"person," "city," or "country."
3. Relations: This defines the relationships between entities and concepts. For example, in the
sentence "Lata Mangeshkar was a singer," a relation exists between "Lata Mangeshkar" (entity)
and "singer" (concept).
4. Predicates: These are verb structures that define actions or states. Predicates specify roles
within a sentence, such as the subject and object. Examples include case grammar and
semantic roles.
Approaches to Meaning Representation
1. First Order Predicate Logic (FOPL): A formal system used to describe the meaning of
sentences through predicates and quantifiers.
2. Frames: Structured representations of knowledge with slots and fillers, often used to describe
typical situations or objects.
3. Rule-based Architecture: Systems based on predefined rules to interpret the meaning of
text.
4. Conceptual Graphs: Graph structures that visually represent the relationships between
concepts.
5. Semantic Nets: Networks that use nodes to represent concepts and edges to show
relationships between them.
6. Conceptual Dependency (CD): A model that represents the meaning of sentences through
actions and states to describe events.
7. Case Grammar: An approach that focuses on the semantic roles of words, such as agent,
object, and instrument.
Need for Meaning Representations
Lexical Semantics
Lexical Semantics is a branch of semantic analysis that focuses on the meanings of individual words
and smaller components, such as prefixes, suffixes, and compound phrases. These components are
collectively referred to as lexical items. Lexical semantics helps in understanding the relationship
between these items, the meaning of sentences, and how they fit into the syntactic structure of a
sentence.
1. Lexical Items: These are the building blocks of language, including words, parts of words (like
prefixes and suffixes), and phrases.
2. Relationship Between Lexical Items: Lexical semantics studies how these items interact with
each other and contribute to the overall meaning of a sentence.
Lexical semantics typically involves the following steps:
1. Classification of Lexical Items: This involves organizing words, sub-words, and affixes based
on their characteristics, such as part of speech (noun, verb, adjective, etc.) or word structure.
2. Decomposition of Lexical Items: Breaking down words into smaller parts to understand their
root meanings, prefixes, suffixes, and how they contribute to the overall word meaning.
3. Analyzing Differences and Similarities: Comparing various words and phrases to explore
differences in their meanings or identify similarities in their structure or usage.
Lexical Characteristics
Lexical Characteristics focus on understanding language through the analysis of lexical
units—words, phrases, and their patterns—rather than emphasizing grammatical structures. This
method, known as the Lexical Approach, centers on the idea that meaning in language is primarily
carried by vocabulary rather than syntax.
These units are crucial because they reflect how words are naturally used together in a language.
While the lexical approach helps learners quickly grasp useful phrases, it has some drawbacks:
● It may limit creativity because learners rely on fixed expressions rather than constructing
sentences from scratch.
● There is less emphasis on understanding the deeper, intricate structures of the language,
which can affect fluency in novel situations.
In the lexical approach, then, vocabulary is at the heart of conveying meaning, while grammar acts as a supporting structure that manages and organizes these words. In essence, learning vocabulary is seen as more fundamental than mastering grammar for effective communication.
Corpus Study
Corpus Study, also known as corpus linguistics, is a research methodology that involves the
statistical analysis of large collections of written or spoken texts to investigate linguistic phenomena.
A corpus refers to a structured set of "real-world" texts, reflecting how language is used in natural
contexts. This method is crucial for uncovering the rules and patterns of a language by analyzing
authentic data instead of relying solely on theoretical constructs.
Corpus studies have broad applications, including linguistic research, creating dictionaries, and
crafting grammar guides. For example, the American Heritage Dictionary of the English Language
(1969) and A Comprehensive Grammar of the English Language (1985) were developed using
corpus data.
Corpus study typically proceeds in three stages:
1. Annotation: Involves tagging texts with relevant information, like part-of-speech (POS) tagging, parsing, and other structural details. This helps in organizing the data for further study (a small tagging sketch follows this list).
2. Abstraction: Involves translating annotated data into a theoretically driven model or dataset.
This can include linguist-directed searches or automated rule-learning.
3. Analysis: Focuses on statistically analyzing the data to identify trends, optimize rules, or
discover new insights. This stage may involve statistical evaluations, data manipulation, and
generalization.
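As a small illustration of the annotation stage (item 1 above), the following sketch tags raw text with NLTK; it assumes NLTK is installed and the tokenizer/tagger resources have been downloaded (resource names can vary slightly across NLTK versions).

```python
# Minimal sketch of corpus annotation: POS tagging raw text with NLTK.
# Assumes: pip install nltk, plus one-time downloads such as
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# (exact resource names may differ between NLTK versions).
import nltk

raw_text = "The American Heritage Dictionary was developed using corpus data."

tokens = nltk.word_tokenize(raw_text)   # split the raw text into word tokens
tagged = nltk.pos_tag(tokens)           # annotate each token with a POS tag

print(tagged)
# e.g. [('The', 'DT'), ('American', 'JJ'), ('Heritage', 'NNP'), ...]
```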
Annotated corpora offer the advantage of allowing other researchers to perform further experiments, facilitating shared linguistic debates and studies.
Corpus Approach
The Corpus Approach is a method that relies on a comprehensive collection of naturally occurring
texts for analysis. These collections can vary by type, such as written, spoken, or specialized academic
texts. The emphasis is on using naturally occurring language to understand its patterns and
variations.
These techniques help linguists uncover language use patterns and discourse practices.
Dictionaries vary in scope and structure, and some do not fit neatly into general or specialized categories. The main types are described below.
Specialized Dictionaries
Specialized dictionaries, also known as technical dictionaries, focus on terminology within a specific
field. Lexicographers divide them into three main categories:
1. Multi-field Dictionary:
○ Covers several subject areas. Example: A business dictionary covering finance,
marketing, and management.
○ Example: Inter-Active Terminology for Europe (covers 23 languages).
2. Single-field Dictionary:
○ Focuses on one domain. Example: A legal dictionary.
○ Example: American National Biography (focused on biographical entries).
3. Sub-field Dictionary:
○ Even more specialized, covering niche areas within a domain. Example: Constitutional
law.
○ Example: African American National Biography (focusing on African American figures).
An alternative to these is a glossary, an alphabetical list of specialized terms, often seen in fields like
medicine.
Defining Dictionaries
A defining dictionary provides the simplest and most fundamental meanings of basic concepts:
● It includes a core glossary—the simplest definitions for the most commonly used words.
● In English, defining dictionaries usually limit their entries to around 2000 basic words, allowing
them to define about 4000 common idioms and metaphors.
Key elements in the analysis of semantic relationships among lexemes (words) include the following (a short WordNet-based sketch follows this list):
1. Hyponymy
○ Definition: A relationship between a general category (hypernym) and its specific
instances (hyponyms).
○ Example: "Colour" is a hypernym, while "red" and "green" are its hyponyms.
2. Homonymy
○ Definition: Words that have the same spelling or pronunciation but different and
unrelated meanings.
○ Example: The word "bat" can refer to both a piece of sports equipment and a flying
mammal.
3. Polysemy
○ Definition: A single word that has multiple meanings that are related by extension.
○ Example: The word "bank" can refer to:
■ (i) A financial institution.
■ (ii) The building that houses such an institution.
■ (iii) A verb meaning "to rely on."
4. Difference Between Polysemy and Homonymy
○ Polysemy involves meanings that are related to each other, even if distinct. For
example, different senses of "bank" are connected by the concept of "reliability" or
"holding."
○ Homonymy deals with meanings that are completely unrelated, such as the "bat" that
flies and the "bat" used in sports, which share no semantic connection apart from the
word form itself.
5. Synonymy
○ Definition: The relationship between two lexical items that have different forms but
express the same or very similar meanings.
○ Examples: "Author" and "writer," "fate" and "destiny."
6. Antonymy
○ Definition: The relationship between two lexical items that possess opposing meanings
relative to a certain axis.
○ Scope of Antonymy:
■ (i) Binary Opposition (Property or Not): Reflects a direct opposition, such as
"life/death" or "certitude/incertitude."
■ (ii) Gradable Opposition (Scalable Property): Involves a spectrum of opposites
where degrees exist, such as "rich/poor" or "hot/cold."
■ (iii) Relational Opposition (Usage-Based): A type of antonymy where the items
are defined by their relationship, such as "father/son" or "moon/sun."
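These relations can also be explored programmatically. The sketch below is a minimal example using NLTK's WordNet interface, assuming NLTK and its WordNet corpus are installed; the exact senses printed depend on the WordNet version.

```python
# Minimal sketch: inspecting polysemy, hyponymy, and synonymy with NLTK's WordNet.
# Assumes: pip install nltk and a one-time nltk.download('wordnet').
from nltk.corpus import wordnet as wn

# Polysemy/homonymy: "bank" has several synsets (senses).
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "->", synset.definition())

# Hyponymy: specific instances of a more general category (hypernym).
color = wn.synsets("color")[0]
print([h.name() for h in color.hyponyms()][:5])   # senses of particular colours

# Synonymy: lemmas sharing a synset express the same or very similar meanings.
print(wn.synsets("author")[0].lemma_names())      # lemmas such as 'writer'/'author'
```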
Lexical Ambiguity
● Definition: The ambiguity arising from a single word that can have multiple meanings.
● Example: The word "walk" can be interpreted as a noun ("I went for a walk") or as a verb ("I
walk every morning").
Syntactic Ambiguity
● Definition: Occurs when a sentence can be parsed in multiple ways due to its structure.
● Example: The sentence "The man saw the girl with the camera" can be interpreted in two
ways:
○ The man saw a girl who had a camera.
○ The man saw the girl through a camera.
Semantic Ambiguity
● Definition: Ambiguity that arises when the meaning of a word or phrase in a sentence can be
misinterpreted.
● Example: The sentence "The bike hit the pole when it was moving" can mean:
○ The bike, while moving, hit the pole.
○ The bike hit the pole while the pole was moving.
Anaphoric Ambiguity
● Definition: Ambiguity that occurs when the use of anaphoric entities (e.g., pronouns) leads to
unclear references.
● Example: "The horse ran up the hill. It was very steep. It soon got tired." The pronoun "it"
could ambiguously refer to the hill or the horse in both instances.
Pragmatic Ambiguity
● Definition: Ambiguity that arises when the context allows for multiple interpretations of a
situation.
● Example: The phrase "I like you too" can have different meanings depending on context:
○ "I like you (just as you like me)."
○ "I like you (just like someone else does)."
Word Sense Disambiguation (WSD), the task of identifying which sense of an ambiguous word is intended in a given context, is applicable across several NLP fields, aiding in the accurate interpretation and processing of language data.
Relevance of WSD
WSD is closely related to Part-of-Speech (POS) tagging, a fundamental component of NLP. However,
unlike POS tagging, WSD involves understanding the semantic content of a word, not just its
grammatical category.
● The challenge lies in the contextual and non-binary nature of word meanings. Unlike numerical
quantities, word senses are fluid and depend heavily on context.
● Lexicography, which generalizes language data, may not always provide definitions applicable
to algorithmic processes or data sets, emphasizing the need for adaptable and context-aware
WSD methods.
WSD is vital for achieving higher accuracy in NLP applications, allowing systems to parse and
understand language closer to how humans interpret it.
Knowledge-Based Approach
A knowledge-based system (KBS) refers to a computer system that uses knowledge stored in a
database to reason and solve problems. The behavior of such systems can be designed using the
following approaches:
Declarative Approach
In the declarative approach, an agent begins with an empty knowledge base and progressively adds
information. The agent "Tells" or inserts sentences (facts or rules) one after another until it has
enough knowledge to perform tasks and interact with its environment effectively. This approach
focuses on what the system knows rather than how it processes that knowledge. The agent doesn't
specify the steps or procedures for solving problems explicitly; instead, it describes the necessary
facts and rules in a declarative manner.
For example, a rule like "if it rains, then the ground gets wet" would be added to the knowledge base,
and the system would use that to infer consequences when needed.
Procedural Approach
The procedural approach is quite different. Instead of merely storing facts and rules, this method
focuses on encoding the required behavior directly into the program code. In this approach, the
system specifies how the task is to be performed by translating knowledge into explicit instructions
(procedures or algorithms).
While the declarative approach emphasizes the knowledge itself, the procedural approach focuses on
the process or procedure for handling knowledge and solving problems. This can involve writing
step-by-step instructions or algorithms that define how the system operates in different situations.
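To make the contrast concrete, the sketch below encodes the rain example in both styles; the knowledge-base structures and function names are illustrative only, not a standard API.

```python
# Declarative approach: the agent is "Told" facts and rules, and a generic
# inference routine (here, simple forward chaining) derives consequences.
facts = {"it_rains"}
rules = [({"it_rains"}, "ground_is_wet")]   # if it rains, then the ground gets wet

def infer(facts, rules):
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(infer(set(facts), rules))             # {'it_rains', 'ground_is_wet'}

# Procedural approach: the same knowledge is hard-coded as explicit steps.
def ground_state(it_rains: bool) -> str:
    if it_rains:
        return "wet"
    return "dry"

print(ground_state(True))                   # 'wet'
```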
Lesk Algorithm
The Lesk Algorithm is a method used in Word Sense Disambiguation (WSD) to determine the
meaning of an ambiguous word based on its context. The core idea of the algorithm is that words
within a given context or "neighborhood" tend to share a common topic or theme, and the dictionary
definition of the word in question can be compared with these neighboring words to help identify the
correct sense.
1. Dictionary Sense Comparison: For each possible sense (meaning) of the ambiguous word,
the algorithm compares its dictionary definition with the surrounding words in the context (i.e.,
its "neighborhood").
2. Counting Overlaps: It counts how many words from the neighborhood appear in the
dictionary definition of the sense being considered.
3. Selecting the Best Sense: The sense with the highest overlap count is chosen as the correct
meaning for the word in that particular context.
Classic example: disambiguating the phrase "pine cone" given these dictionary senses:
● Pine:
1. A kind of evergreen tree with needle-shaped leaves.
2. Waste away through sorrow or illness.
● Cone:
1. Solid body which narrows to a point.
2. Something of this shape whether solid or hollow.
3. Fruit of certain evergreen trees.
In this case:
● The best intersection of senses would be pine #1 (evergreen tree) and cone #3 (fruit of certain
evergreen trees), which gives an overlap count of 2. Therefore, this combination of senses
would be selected as the correct interpretation.
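A minimal sketch of this overlap counting, using the simplified glosses listed above; the stop-word list and crude singularisation are deliberately naive and only meant to reproduce the toy example.

```python
# Minimal Lesk-style sketch for the "pine cone" example above (toy glosses).
pine_senses = {
    "pine#1": "kind of evergreen tree with needle-shaped leaves",
    "pine#2": "waste away through sorrow or illness",
}
cone_senses = {
    "cone#1": "solid body which narrows to a point",
    "cone#2": "something of this shape whether solid or hollow",
    "cone#3": "fruit of certain evergreen trees",
}

STOP = {"of", "the", "a", "an", "with", "which", "to", "or", "this", "through"}

def gloss_words(gloss: str) -> set:
    # Lowercase, drop a few stop words, and crudely singularise so "tree"/"trees" match.
    words = [w for w in gloss.lower().split() if w not in STOP]
    return {w[:-1] if w.endswith("s") else w for w in words}

def overlap(gloss_a: str, gloss_b: str) -> int:
    return len(gloss_words(gloss_a) & gloss_words(gloss_b))

best = max(((p, c, overlap(pg, cg))
            for p, pg in pine_senses.items()
            for c, cg in cone_senses.items()),
           key=lambda triple: triple[2])
print(best)   # ('pine#1', 'cone#3', 2): the glosses share "evergreen" and "tree(s)"
```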
The Simplified Lesk Algorithm is a modified version of the original Lesk algorithm, with an emphasis
on efficiency and precision.
How it Works:
● In the simplified version, the sense of each word is determined individually, based on how
much overlap there is between its dictionary definition and the surrounding context.
● Unlike the original Lesk algorithm, which attempts to disambiguate all the words in a given
context together, the simplified approach treats each word independently.
Performance:
● A comparative evaluation of the algorithm on the Senseval-2 English all-words dataset showed
that the simplified Lesk algorithm outperforms the original version in terms of both precision
and efficiency.
● The simplified version achieved 58% precision, while the original version only achieved 42%
precision.
While Lesk-based methods are useful for WSD, they come with certain limitations:
1. Sensitivity to Exact Wording: Lesk’s approach is highly sensitive to the exact wording of
dictionary definitions. Small changes in the phrasing can significantly alter the disambiguation
results.
2. Absence of Certain Words: If a word is missing from a definition, the overlap count may be
greatly reduced, affecting the accuracy of the algorithm.
3. Limited Glosses: Lesk’s algorithm determines overlaps only among the glosses (brief
definitions) of the senses being considered. These glosses are often short and may not provide
enough vocabulary to distinguish between different senses effectively.
4. Insufficient Vocabulary in Glosses: Since dictionary glosses tend to be very concise, they may
lack enough context to clearly differentiate between multiple senses of a word, especially
when senses are subtle or nuanced.
Modifications and Improvements:
To overcome these limitations, various modifications to the Lesk algorithm have been proposed:
● Synonym Dictionaries: Using synonyms or additional words found in the glosses of senses to
improve the disambiguation process.
● Morphological and Syntactic Models: Incorporating morphological or syntactic analysis to
better understand the context and enhance sense disambiguation.
● Derivatives and Related Words: Using derivatives of words or related terms from the
definitions to find better overlaps.
Module 5 - Pragmatic and Discourse Analysis
REFERENCE RESOLUTION
Reference resolution is the process through which we determine the relationships between referring
expressions and their referents in discourse. For a computer or an automated system, understanding
how pronouns and other referring expressions like "he" or "it" relate to entities previously mentioned
in the text is a challenging task. This section discusses how reference resolution works and introduces
several key terms related to the process.
Corefer: When two referring expressions refer to the same entity, they are said to corefer.
● Example: In "John went to Bill's car dealership to check out an Acura Integra. He looked at it for about an hour," "John" and "he" corefer because both refer to the same person, John.
Antecedent: The antecedent of a referring expression is the referring expression that enables the
use of another. In other words, the antecedent is the first mention that allows a subsequent pronoun
or referring expression to be used.
● Example: In the sentence, "John went to Bill's car dealership," "John" is the antecedent of the
pronoun "he" that follows.
● Anaphora refers to the use of a referring expression to refer to an entity that has already
been introduced into the discourse.
● A referring expression that does this is called anaphoric.
Example: In the sentence "He looked at it for about an hour," both "he" and "it" are anaphoric as they
refer back to previously introduced entities ("John" and "Acura Integra," respectively).
Reference Phenomena
In natural languages, reference is a key aspect of communication. Different types of referring
expressions and complex referent categories help navigate the relationships between terms and
entities in discourse. The following sections discuss the various types of referring expressions and
challenges in reference resolution.
Definite Noun Phrases
● Examples:
○ "I saw an Acura Integra today. The Integra was white and needed to be washed."
○ "The Indianapolis 500 is the most popular car race in the US."
● Here, the Integra refers to a previously mentioned car, and the Indianapolis 500 is unique
enough to be identified by the listener.
Pronouns
Pronouns simplify reference by replacing noun phrases. They usually refer to entities recently
introduced or activated in the discourse model. Pronouns can be restricted by the salience or
immediacy of the referent.
● Example: "I saw an Acura Integra today. It was white and needed to be washed."
Pronouns often have to be close to their antecedents in the text (e.g., he, she, it referring to entities
mentioned recently). They can also appear before their referent (cataphora).
● Cataphora Example: "Before he bought it, John checked over the Integra very carefully."
In some cases, pronouns appear in quantified contexts and are bound to variables (e.g., Every woman
bought her Acura).
Demonstratives
Demonstrative pronouns and determiners like this and that show proximity and distance. They signal
spatial or temporal distance depending on context.
● Spatial Example: "I like this better than that."
● Temporal Example: "I bought an Integra yesterday. It’s similar to the one I bought five years
ago. That one was really nice, but I like this one even better."
Names
Names refer to specific entities, such as people, places, or organizations. They can refer to both
known and new entities in discourse.
Three kinds of referents complicate reference resolution:
1. Inferrables: Entities that the listener can infer from the discourse context even though they aren't explicitly mentioned. For example, the listener might infer that a person referred to by a pronoun (e.g., he or she) is a specific person without their name being repeated.
2. Discontinuous Sets: A set of related entities that aren't mentioned in one continuous sequence, for example, several cars that were not discussed together but still belong to the same discourse.
3. Generics: References to types or categories of things rather than specific instances, for example, using "cars" or "Acura Integras" in a general sense without pointing to any particular one.
Number Agreement
Pronouns must match their antecedents in number (singular or plural). For instance, "it" can refer back to "an Acura," while a plural antecedent such as "three Acuras" requires "they."
Pronouns must also match their antecedents in person (first, second, third) and case (nominative, accusative, genitive).
Gender Agreement
Gender in English third-person pronouns (he, she, it) should match the gender of the noun they refer
to. For example:
● "John has an Acura. 'He' is attractive." Here "he" must refer to John, since it agrees with John in gender.
● "John has an Acura. 'It' is attractive." Here "it" must refer to the Acura rather than to John, because "it" does not agree with John in gender and animacy.
Syntactic Constraints
Syntactic constraints govern how pronouns and their potential antecedents interact within sentence structure. Reflexive pronouns, for instance, must refer to the subject of the most immediate clause. For example, in "John bought himself a new Acura," "himself" can only refer to John.
Syntactic rules can also prevent certain pronouns from referring to certain subjects. For example, in "John wanted a new car. Bill bought him a new Acura," the non-reflexive "him" cannot refer to Bill (the subject of its own clause) but can refer to John.
Selectional Restrictions
Some verbs impose constraints on the types of objects they can take, and these restrictions help narrow down possible referents:
● Example: "John parked his Acura in the garage. He had driven it around for hours," The
pronoun "it" clearly refers to the Acura, since "drive" is a verb associated with vehicles, not a
garage.
● Example: "John bought a new Acura. It drinks gasoline like you would not believe." (Here,
"drink" is used metaphorically for the car.)
In addition to syntactic and selectional constraints, semantic knowledge about the world helps
determine which referent is most likely. For instance:
● Example: "John parked his Acura in the garage. It is incredibly messy, with old bike and car
parts lying around everywhere."
● The garage is likely the intended referent for "it" because garages typically contain bike and car
parts, unlike a car.
Anaphora Resolution
The Hobbs Algorithm is a syntactic method for resolving pronouns. It operates by constructing a
parse tree of the sentence and then searching for potential antecedents (referents) to the pronoun.
Consider the example: "Jack and Jill went up the hill, to fetch a pail of water. Jack fell down and broke 'his' crown, and Jill came tumbling after." The task is to resolve the pronoun 'his'.
Resolution Process:
● The algorithm's primary strategy involves searching left of the target word, restricting the
search to elements that have appeared before the pronoun. In this case, it eliminates 'crown'
as a possible referent because it appears after the pronoun 'his'.
● Next, it applies gender agreement. Since 'his' is a masculine pronoun, Jill (a feminine noun) is
ruled out. Additionally, inanimate objects like hill and water are unsuitable since 'his' typically
refers to animate entities.
● With the recency property, entities closest to the pronoun take precedence. This leaves Jack
as the most likely antecedent, matching both gender and recency constraints.
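The sketch below is not the full Hobbs parse-tree walk; it only illustrates the filtering constraints just described (search left of the pronoun, gender and animacy agreement, recency) on hand-labelled candidates.

```python
# Heuristic pronoun-resolution sketch for "... Jack fell down and broke 'his' crown ..."
# Candidates are hand-labelled with gender/animacy and position in the text; this only
# mimics the constraints above, not the actual Hobbs tree search.
candidates = [
    {"text": "Jack",  "gender": "masc", "animate": True,  "position": 0},
    {"text": "Jill",  "gender": "fem",  "animate": True,  "position": 1},
    {"text": "hill",  "gender": None,   "animate": False, "position": 2},
    {"text": "water", "gender": None,   "animate": False, "position": 3},
    {"text": "crown", "gender": None,   "animate": False, "position": 5},
]
pronoun = {"text": "his", "gender": "masc", "animate": True, "position": 4}

def resolve(pronoun, candidates):
    viable = [c for c in candidates
              if c["position"] < pronoun["position"]      # search left of the pronoun
              and c["animate"] == pronoun["animate"]      # 'his' needs an animate referent
              and c["gender"] == pronoun["gender"]]       # gender agreement
    # Recency: prefer the viable candidate closest to the pronoun.
    return max(viable, key=lambda c: c["position"]) if viable else None

print(resolve(pronoun, candidates)["text"])   # 'Jack'
```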
Algorithm Steps:
Generative AI
Generative models learn the underlying patterns and structures in the input data and use this understanding to generate similar new data. Their applications are vast, spanning areas such as text, image, and audio generation.
Variational AutoEncoders (VAEs) are a type of neural network architecture used for generating new data samples. They are an
extension of the traditional AutoEncoders, with a probabilistic twist. VAEs assume that the input data
can be modeled by a latent probability distribution, and they learn to map the input data to this
distribution.
Architecture
● Encoder: Maps the input data to a latent space, producing a mean and variance for the latent
variables.
● Decoder: Generates new data by sampling from the latent space and reconstructing the
original input.
Instead of learning a single latent representation, the VAE learns a probability distribution over the
latent space. This allows for better generalization and the ability to generate diverse samples.
How it Works
1. The input data is passed through the Encoder, which outputs a mean and variance.
2. The latent vector is sampled from this distribution using the reparameterization trick: z = μ + σ ⊙ ε, where ε is sampled from a standard normal distribution (see the sketch after this list).
3. The sampled latent vector z is fed into the Decoder, which reconstructs the original data.
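A minimal NumPy sketch of these three steps; the encoder and decoder here are random linear maps standing in for trained networks.

```python
# Minimal sketch of the VAE reparameterization trick: z = mu + sigma * eps.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))                  # one toy input vector

# Stand-in "encoder": random linear maps producing a mean and a log-variance.
W_mu, W_logvar = rng.normal(size=(8, 2)), rng.normal(size=(8, 2))
mu, logvar = x @ W_mu, x @ W_logvar

# Reparameterization: sample eps from a standard normal, then shift and scale.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * logvar) * eps          # z = mu + sigma * eps

# Stand-in "decoder": map the latent sample back to input space.
W_dec = rng.normal(size=(2, 8))
x_reconstructed = z @ W_dec

print(z.shape, x_reconstructed.shape)        # (1, 2) (1, 8)
```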
Applications
Advantages
● Provides a continuous latent space, allowing for smooth interpolation between generated
samples.
● Offers better regularization due to the probabilistic approach.
GANs are a type of neural network architecture designed to generate realistic data by using two
competing neural networks: a Generator and a Discriminator. The two networks are trained in a
zero-sum game, where the Generator tries to produce realistic data, and the Discriminator tries to
distinguish between real and generated data.
Architecture
How it Works
Advantages
Challenges: GANs can be difficult to train due to instability and mode collapse, where the Generator produces limited varieties of outputs.
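The adversarial loop described above can be sketched in a few lines of PyTorch; the network sizes and the one-dimensional "real" data distribution are toy placeholders, not a practical GAN setup.

```python
# Minimal GAN training-loop sketch (toy 1-D data), illustrating the two-player game.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))                 # Generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # Discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(200):
    real = torch.randn(32, 1) * 0.5 + 2.0     # "real" samples from a toy distribution
    noise = torch.randn(32, 4)
    fake = G(noise)

    # Discriminator step: label real samples as 1, generated samples as 0.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the Discriminator label fakes as real.
    g_loss = bce(D(G(noise)), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(float(d_loss), float(g_loss))
```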
Limitations
1. Training Complexity
○ Generative models, especially GANs, require substantial computational resources and
expertise to train effectively. Issues like mode collapse and vanishing gradients can
make training unstable.
2. Data Quality Dependence
○ The performance of generative AI is heavily dependent on the quality and diversity of
the training data. If the data is biased, the generated content may also reflect these
biases.
3. Ethical and Privacy Concerns
○ Generative models can create highly realistic fake content, such as deepfakes, which
can be used for malicious purposes like misinformation or identity theft.
○ Using private or sensitive data for training generative models can lead to privacy
violations.
4. Lack of Control and Interpretability
○ It can be challenging to control the specific output of a generative model. For instance,
in text generation, the model might produce incorrect, inappropriate, or biased
responses.
5. Overfitting and Poor Generalization
○ Generative models may overfit the training data, making them less capable of
producing novel and diverse samples. This is a common issue when the training data is
limited.
What is ChatGPT?
ChatGPT is an advanced conversational AI model developed by OpenAI, based on the GPT
(Generative Pre-trained Transformer) architecture. It belongs to a class of large language models
(LLMs) that utilize deep learning techniques to understand and generate human-like text.
Advantages of ChatGPT
Limitations of ChatGPT
● Lacks True Understanding: Despite its impressive capabilities, ChatGPT does not have true
comprehension or reasoning. It generates text based on patterns in the data it was trained on.
● May Generate Incorrect Information: ChatGPT can confidently provide responses that are
factually incorrect or misleading.
● Sensitivity to Input Phrasing: The quality of responses can vary depending on how the input
query is phrased.
● Risk of Bias: The model may reflect biases present in the training data, leading to biased or
inappropriate responses.
Well-designed prompts can help the model produce accurate, relevant, and coherent responses,
while poorly constructed prompts may lead to incorrect, vague, or biased outputs.
Types of Prompts
Instruction-Based Prompts
● These prompts give direct instructions to the model, specifying the task clearly.
● Example: "Summarize the following text in one sentence: [text]."
Context-Based Prompts
● These prompts provide context before asking the main question, helping the model
understand the background better.
● Example: "You are an expert in climate science. Explain the impact of greenhouse gases on
global warming."
Completion Prompts
● The model is given a starting text and asked to continue or complete it.
● Example: "In a world where artificial intelligence has taken over human tasks, the greatest
challenge is..."
Question-Based Prompts
● These prompts pose a direct question for the model to answer.
● Example: "What are the main causes of climate change?"
Role-Based Prompts
● The prompt assigns a role or persona to the model to tailor the response style.
● Example: "Act as a software development mentor and explain how to use version control in
Git."
Prompt Templates
Prompt Templates are pre-designed structures that can be reused to interact with LLMs for different
tasks. They help maintain consistency and ensure the prompt is clear and effective.
1. Summarization Template:
○ "Summarize the following content in a concise paragraph: [insert content here]."
○ Use Case: Quickly getting a summary of articles, research papers, or long text inputs.
2. Q&A Template:
○ "Based on the given context, answer the following question: [context]. Question: [insert
question here]."
○ Use Case: Helps when extracting information from specific contexts or datasets.
3. Code Assistance Template:
○ "You are a Python expert. Given the code snippet below, provide a detailed explanation
and suggest improvements: [insert code here]."
○ Use Case: Code review and debugging support.
4. Creative Writing Template:
○ "Write a short story about [topic or theme], focusing on the characters [insert character
names]."
○ Use Case: Generating creative content for storytelling or brainstorming ideas.
5. Structured Output Template:
○ "Generate a response in JSON format with fields for 'Summary', 'Key Points', and
'Recommendations': [insert input text here]."
○ Use Case: Obtaining structured data outputs for further processing or integration.
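A minimal Python sketch of reusing such templates; the template strings follow the examples above, and llm_call is a hypothetical stand-in for whatever model API is actually used.

```python
# Minimal prompt-template sketch: reusable templates filled in per task.
TEMPLATES = {
    "summarize": "Summarize the following content in a concise paragraph: {content}",
    "qa": "Based on the given context, answer the following question: {context} Question: {question}",
    "structured": ("Generate a response in JSON format with fields for 'Summary', "
                   "'Key Points', and 'Recommendations': {content}"),
}

def build_prompt(name: str, **fields) -> str:
    return TEMPLATES[name].format(**fields)

def llm_call(prompt: str) -> str:
    # Hypothetical stand-in for a real model/API call.
    return f"[model response to: {prompt[:60]}...]"

prompt = build_prompt("qa",
                      context="The article discusses climate change and polar bears.",
                      question="What threatens polar bear populations?")
print(llm_call(prompt))
```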
b) Assign a Role or Persona
● Assigning a role or persona to the model can help it adopt a specific tone or expertise level, making the responses more targeted.
● Example:
○ "You are a financial analyst. Analyze the impact of inflation on the stock market."
c) Provide Context
● Including context before the main query can help the model understand the background,
leading to better responses.
● Example:
○ Context: "The article discusses climate change and its effects on polar bears."
○ Query: "Summarize the impact of climate change on polar bear populations."
d) Provide Examples
● Providing examples can help the model understand the expected format or style of the response.
● Example:
○ "Translate the following sentences into French. Example: 'Good morning' → 'Bonjour'.
Sentence: 'How are you?'"
e) Specify the Desired Length
● Specify the desired length of the response if you need a short answer or a detailed explanation.
● Example:
○ "In two sentences, explain why neural networks are used in deep learning."
f) Request a Specific Format
● Asking the model to format the response in a specific way can help in extracting structured data.
● Example:
○ "List the advantages of using GANs in bullet points."
g) Iterate and Refine
● If the response is not satisfactory, refine the prompt iteratively by clarifying the task or rephrasing the question.
● Example:
○ Initial Prompt: "Explain convolutional neural networks."
○ Refined Prompt: "Explain convolutional neural networks, focusing on their architecture
and applications in image processing."
Zero-Shot Learning
What is Zero-Shot Learning?
Zero-shot learning refers to a scenario where the model is expected to perform a task without having
seen any specific examples of it during training. Instead, the model relies on its general
understanding and the information provided in the prompt to infer what is required.
How it Works
In zero-shot learning, the prompt is designed to be clear and self-explanatory, containing all the
necessary instructions for the model to understand the task. The model uses its vast knowledge base,
acquired during pre-training, to interpret the task and generate an appropriate response.
1. Text Summarization:
○ Prompt: "Summarize the following article in one sentence: [Insert article text]."
○ The model is not given any specific examples of summaries but is expected to produce
one based on its understanding.
2. Sentiment Analysis:
○ Prompt: "Analyze the sentiment of this review: 'I absolutely loved this product. It
exceeded my expectations.'"
○ The model infers that it needs to classify the sentiment without being given labeled
examples.
3. Translation:
○ Prompt: "Translate this sentence into Spanish: 'Where is the nearest hospital?'"
○ The model performs the translation task without explicit training on this specific
sentence.
Advantages:
● Versatility: It allows the model to handle a wide variety of tasks without additional training
data.
● Ease of Use: No need to provide examples or fine-tune the model for specific tasks.
Challenges:
● Limited Accuracy: The model may not always produce accurate results, especially for
complex tasks or tasks requiring domain-specific knowledge.
● Ambiguity: The model might misinterpret the task if the prompt is not clear enough.
Few-Shot Learning
What is Few-Shot Learning?
Few-shot learning is a technique where the model is provided with a few examples of the task within
the prompt. These examples serve as demonstrations, helping the model understand the expected
format, style, and requirements of the task.
How it Works
The prompt includes a few input-output pairs as examples before asking the model to complete a
similar task. This approach helps the model generalize better because it can learn from the provided
examples and apply the learned pattern to new inputs.
1. Text Classification:
○ The prompt first shows a few labelled reviews (for example, one marked Positive and one marked Negative), and then asks: "Analyze the sentiment of this review: 'The plot was dull and predictable.'"
○ The model uses the provided examples to determine the sentiment of the new review.
2. Named Entity Recognition:
○ Example 1: "Barack Obama was born in Hawaii." → Person: Barack Obama, Location: Hawaii
○ Then: "Extract entities from this sentence: 'Elon Musk founded SpaceX in California.'"
3. Code Completion:
Advantages:
● Improved Accuracy: Providing examples helps the model understand the task better and
increases the likelihood of generating correct responses.
● Flexibility: It can adapt to new tasks without requiring full retraining.
Challenges:
● Prompt Length: Including many examples can make the prompt longer, which may be
inefficient for very large tasks.
● Overfitting to Examples: The model might rely too heavily on the provided examples, limiting
its ability to generalize.
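As a concrete illustration, the sketch below assembles a few-shot sentiment prompt along the lines of the text-classification example above; the labelled example reviews are hypothetical, and the resulting string would be sent to the model.

```python
# Few-shot prompt construction: a few labelled examples, then the new input.
examples = [
    ("I absolutely loved this film, the acting was superb.", "Positive"),  # hypothetical
    ("A complete waste of two hours.", "Negative"),                        # hypothetical
]
new_review = "The plot was dull and predictable."

lines = ["Classify the sentiment of each review as Positive or Negative.", ""]
for text, label in examples:
    lines.append(f'Review: "{text}"\nSentiment: {label}\n')
lines.append(f'Review: "{new_review}"\nSentiment:')
prompt = "\n".join(lines)

print(prompt)   # send this string to the model; it should continue with a label
```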
Transformer Architecture
The transformer architecture, central to many modern LLMs like GPT, uses self-attention mechanisms
and multi-layer processing to understand and generate human-like text effectively. It was designed to
address issues in sequence modeling, like those found in recurrent neural networks (RNNs) and long
short-term memory networks (LSTMs), which struggled with long-range dependencies and
parallelization.
1. Self-Attention Mechanism: This is the core feature allowing the model to weigh the importance of different words in a sentence regardless of their position. By computing attention scores, the transformer can identify relevant relationships within the input sequence (a minimal sketch follows this list).
2. Encoder-Decoder Structure: The original transformer model consists of an encoder and a
decoder. However, in LLMs like GPT, only the decoder part is used for autoregressive text
generation, making it efficient for tasks like text completion and dialogue.
3. Positional Encoding: Since transformers do not have a natural sense of word order like RNNs,
positional encoding is added to provide a sense of sequence, helping the model to distinguish
between different positions of words.
4. Multi-Head Attention: Instead of focusing on a single attention score, multi-head attention
allows the model to look at different parts of the sequence simultaneously, capturing various
aspects of the word relationships and context.
5. Feed-Forward Neural Network: Each encoder and decoder layer also includes a feed-forward
neural network for additional nonlinearity and complexity, followed by normalization to
stabilize training.
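The self-attention sketch referenced in point 1 is shown below: a minimal single-head scaled dot-product attention in NumPy, with random matrices standing in for learned projection weights.

```python
# Minimal single-head scaled dot-product self-attention (random toy weights).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))              # embeddings for a 5-token sequence

W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v                  # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)                      # attention scores for every token pair
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over keys
output = weights @ V                                 # each token: weighted mix of all values

print(weights.shape, output.shape)                   # (5, 5) (5, 8)
```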
1. Pre-training
Pre-training is the first phase of LLM development, where the model is exposed to vast amounts of text data to learn general language patterns, grammar, facts, and context.
Process:
Advantages:
● Knowledge Base: The model develops a foundational understanding of language, facts, and
general knowledge.
● Transferability: The general language skills learned during pre-training can be adapted to a
variety of specific tasks with fine-tuning.
Challenges:
2. Fine-tuning
Fine-tuning is the second phase, where the pre-trained model is adapted for specific tasks using a
smaller, task-specific dataset.
Process:
● Supervised Learning: Fine-tuning typically involves supervised learning, where the model is
trained with labeled examples for a specific task (e.g., sentiment analysis, text classification,
question answering).
● Task-specific Objectives: The objective changes from general language modeling to the
specific task at hand. For instance, during fine-tuning for sentiment analysis, the model learns
to classify text as positive or negative.
● Smaller Dataset: The dataset for fine-tuning is much smaller compared to pre-training, but it
is highly relevant to the specific task.
Advantages:
● Task Adaptation: Fine-tuning allows the model to specialize and perform well on targeted
tasks.
● Efficiency: Since the model has already learned a lot about the language during pre-training,
fine-tuning can be done faster with less data.
Challenges:
● Overfitting: The model can overfit the small fine-tuning dataset, especially if it is not diverse
or large enough.
● Catastrophic Forgetting: The model may lose some of its general knowledge gained during
pre-training if fine-tuning heavily focuses on a specific task.
N-gram Numericals and Theory